Opportunities for Pharmacogenomics And Personalized Medicine

MIKE: So today I'm happy to introduce Professor Russ Altman, who is a professor of genetics and medicine at Stanford University, and computer science, by courtesy. Russ is very involved in many biomedical research projects at Stanford, including being the director of the biomedical informatics program. The program, you are-- I forget your particular position-- heavily involved in the [UNINTELLIGIBLE] program, and the principal investigator for the pharmacogenomics knowledge base. This video is going to be made publicly available through Google video. So if you guys have any questions that are Google proprietary, make sure you save them to the very end when we're not live anymore. I'd like to introduce now, Professor Russ Altman.
RUSS B. ALTMAN: Thank you. Thanks very much. I will also add that one of the feathers in my cap is that I was Mike's PhD advisor, which he forgot to mention. But very proud of him, and I'm happy to be here. And today I just want to tell you about what's colloquially known as personalized medicine. And one of the forms it's taking is in pharmacogenetics. And I'll tell you a little bit about what we're doing, really to get a conversation started and see if there's any interest in what we're doing and potential further interactions. So I want to start out real basic, because I know that this is a very eclectic group with different backgrounds. The human genome is made out of DNA, which has four bases, A, T, C and G. The famous four bases. And three billion of those bases make a human, plus a few other things. But basically, the genetic plan for each human is in three billion bases, 99.7% of which are the same across all humans. So obviously, that remaining 0.3% makes a huge difference, and is responsible, in addition to the environmental factors, for why we all don't look like clones of each other right now.
Now, there are about a million positions-- so if you think of that literally as a long string of As, Ts, Cs and Gs, out of those three billion, there are about a million that change across all humans at a significant frequency. A mutation would be something that might be very rare. You might be the only one who has a certain DNA change from like A to G at a certain position. But there are other changes that are quite common in the population, and actually come from the fact that we share a common heritage, kind of the out of Africa hypothesis. And a few tens of thousands of humans, about 100,000 years ago, who are basically our ancestors. Those humans generated some diversity. And then there was a huge population explosion, similar to the growth curves you see for Google, actually. And that population explosion meant that a few of the variations that were in the population when we were only 100,000, have now fixed themselves in 10%, 5%, 50% of the population. So T turning into G is one example. And there's about a million such examples, depending on how you define it, in the human genome. One of the consequences of all this, by the way, is that there are many possible human genomes. Even if we only differed at a million positions, with four choices at each position, that would basically be four to the million possible humans. So we haven't even begun to sample human diversity. That's good news. Because if we have a lot of problems to solve, you can be sure that there's a lot of humans still left to make, before we have to recycle and start making the same ones over again. So we sequenced the human genome. It was announced a couple years ago that the human genome, the average human genome, had been sequenced. That was the 99.7% that was shared. Then it became a great interest to understand for humans, where don't we share the genome, and what are the consequences of that?
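As a rough check on the "four to the million" figure above, the following Python sketch counts how many decimal digits that number has. It follows the talk's simplification of four choices at each of a million positions; the two-choice variant is an added assumption, included only because most common SNPs have two alleles. The function name is illustrative.

```python
import math

VARIABLE_POSITIONS = 1_000_000  # common variable sites, as quoted in the talk
CHOICES_PER_POSITION = 4        # A, T, C, G (the talk's simplification)


def decimal_digits_of_power(base: int, exponent: int) -> int:
    """Number of decimal digits in base**exponent, computed via logarithms
    so the huge integer itself never has to be built or printed."""
    return math.floor(exponent * math.log10(base)) + 1


digits_four = decimal_digits_of_power(CHOICES_PER_POSITION, VARIABLE_POSITIONS)
# Assumption (not from the talk): only two alleles per position.
digits_two = decimal_digits_of_power(2, VARIABLE_POSITIONS)

print(f"4^1,000,000 has {digits_four:,} decimal digits")
print(f"2^1,000,000 has {digits_two:,} decimal digits")
```

Either way the count has hundreds of thousands of digits, which is the sense in which human diversity has barely been sampled.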
So the last five years in particular, there's been a lot of activity in what they call re-sequencing, or genotyping, where, OK, we know the 99.7% that's all shared. Let's look at that small percent that's not shared. And let's characterize the ways in which it's not shared. What are the choices, T and A, for example? Which populations-- African, Asian, Caucasian-- which populations show what frequency of those variations? And so that's the second bullet point, characterizing the variation in the genotype. But we're not just doing that because it's fun to do. We're doing that because we believe that with that genotypic information, we can, if not predict perfectly, we can adjust probabilities of things like probability of disease, or probability of responding well to a medication, which is what I'm interested in. And those are what I call phenotypes, which may be a word that you're familiar with or not. But basically, a phenotype is any measurable feature of an organism that's not its DNA sequence. So the DNA sequence is the genotype. And how its molecules, cells, organs, and organisms respond to various stimuli would be the phenotype. And what we'd really like to do now is understand the relationship between this very easily measured genotype and phenotypes that we care about. Like risk for disease, or likelihood of responding well to a medication, and a variety of other health related outcomes. You could also have non-health related outcomes having to do with, well-- it's the Olympics-- athletic performance. And though there's a lot of interest in-- the phenotype would be, I skate fast. And the question is, what's the genotype? That's not of particular interest to me. There are people in the world who care about that deeply. So I'm an informatician. I do informatics. I guess many of you, in some form or another, do something like that.
And the challenges in this field of post-genome biomedical research, first of all it's to exchange clear information. I'm not going to say much more about that. You all know that that's hard and boring. But very hard. And so we work on that. We also want to understand the statistical relationship between genotype and phenotype. If I give you a genotype, can you predict the phenotype, even if you don't fully understand why that genotype and phenotype are correlated. Kind of a machine learning, data mining approach. And for some people in the world, that's fine. For example, the FDA approves drugs because they work statistically. It doesn't approve drugs because there's a really good story about how the drug works. I'm a physician, I practice medicine on Friday afternoons. And there are a lot of drugs that I use that we don't have a good story for how they work. But we know that they do work, and that's all, really, that physicians and patients care about. So you don't want to dis the statistical story, even though many scientists want to have a much more profound mechanistic understanding. Tell me the story about how this DNA variation leads to changes in the cell that leads to a different phenotype. A perfectly legitimate question of interest to a different set of people. We're interested in both. And then, I think you can imagine that, as we begin to be able to predict phenotypes, based on genotypes, plus some environmental variables like, do you smoke, do you drink benzene at night, stuff like that, that we will be able to do a better job at making a prognosis for disease, which is basically the anticipated time course of the disease, diagnosing the disease in the first place, and treating the disease. So let me just make this a little bit more concrete. Here I have two fragments of two genomes. So that's about 25 bases. Imagine three billion of those. It's very finite, right? So three billion is a big number, but it's not a big number here at Google.
It's not even a big number to my kids, because it fits on all their iPods, right? Your genome fits on your iPod and you still have plenty of room for more music. It's three gigabases, and there's only four bases. So there's only a couple of bits you need to represent that information. But here we have two individuals, and they're exactly the same except for that one position, where the one guy at the top has a T, and the guy at the bottom has a G. That's called a SNP-- single nucleotide polymorphism. Single because it's one position, nucleotide because that's what they call the As, Gs and Cs and Ts, and polymorphism just is a fancy word for a difference in the population. So that's a SNP because there's a T in some people and a G in other people. What we'd like to do, at least from a machine learning point of view, is relate that by some complicated, probably nonlinear, function to observable phenotypes that we care about. Like whether you respond well to a cholesterol medication or not. That's essentially what pharmacogenomics is. But I'm going to tell you a little bit more detail about this. So pharmacogenomics and pharmacogenetics are the study of how genetic variation leads to variation in the response to drugs. If I were to give you all a drug right now, which, in the state of California, I'm allowed to do, since I'm licensed, you would all probably have a slightly different, or maybe a very-- depending on the drug-- a very different response. Some of you might get a headache, some of you might get nauseous, some of you might actually have the desired effect of calming you down. Others, it might make you anxious. And we'd like to understand that, because as physicians, we don't practice, particularly, personally informed medicine. Of course, we talk to the patient, we understand what's important to them. You know, doc it would be bad for me to be sedated tomorrow because I'm giving a big talk at work, and I would like to be awake for it.
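The storage estimate and the two-fragment comparison above can be sketched in a few lines of Python. This is a minimal illustration, not anything from the talk's slides: the two 25-base fragments are made up, the function names are hypothetical, and the size calculation assumes a plain 2-bit encoding with no compression or metadata.

```python
BITS_PER_BASE = 2              # four symbols (A, C, G, T) fit in 2 bits
GENOME_LENGTH = 3_000_000_000  # roughly three billion bases, per the talk


def genome_megabytes(n_bases: int, bits_per_base: int = BITS_PER_BASE) -> float:
    """Size in megabytes (10**6 bytes) of a sequence stored at a fixed bit width."""
    return n_bases * bits_per_base / 8 / 1_000_000


def differing_positions(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (index, base_in_a, base_in_b) wherever two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Invented 25-base fragments that differ at a single position (T versus G).
person_1 = "ACGTTAGCCATTGACCGTAGGCTAA"
person_2 = "ACGTTAGCCATGGACCGTAGGCTAA"

print(genome_megabytes(GENOME_LENGTH))          # 750.0
print(differing_positions(person_1, person_2))  # [(11, 'T', 'G')]
```

At 2 bits per base the whole genome comes to about 750 MB, which is why it fits on a music player with room to spare.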
So of course, doctors do personalized medicine at that level. We're talking about a personalized-- when you see personalized medicine in the press, it often means genome informed. And in fact, I'm starting to use that phrase more, because I don't want to insult my fellow doctors, to imply that they haven't been doing personalized medicine for the last couple of hundred years. But they have certainly not been doing genome informed medicine. Which is, if I have your genotypes, I can make better decisions. This is one of the promises of the genome project. We got the Congressmen to pay for this by promising them that we would address important health issues, like heart disease and prostate disease, which is what congressmen get. And so, we want to deliver on that. Let me give you a couple of examples. So codeine is a pain medication. Many of you probably have been prescribed codeine. It's in Tylenol number three, if you've ever gotten Tylenol number three, it's the thing that makes the number three necessary, and why you have to get a prescription for it, instead of just getting it over the counter. And it's an opioid. It's in the morphine family. But when you take codeine, it is not active. There is an enzyme in your liver that turns it into morphine, actually. It does a little chemical reaction that changes codeine to morphine. And then of course, you get all the benefits of morphine, which are quite considerable, especially if you had a tooth extracted or surgery recently. The name of that protein is-- it's an ugly name. I apologize, I didn't name it. It's CYP2D6. There's an elaborate gene naming system, which is a whole 'nother topic. But there's a gene called CYP2D6, and it makes a protein, it encodes for a protein in the genome, that performs this metabolism. 
And as you can see in the third point, 7% of Caucasians have a version that is a SNP-- and there's actually a number of different SNPs that can cause this-- that renders CYP2D6 totally inactive in their liver. Which means that, for 7% of Caucasian patients, codeine is like a placebo. It does absolutely nothing. Now, I don't know if there's anybody, probably statistically looking at the crowd, there is probably somebody here who has been given codeine and they said, geez, it really didn't help my pain very much. If that happened, this is one of the most likely reasons. I assume you took the medication. As a physician, whether the patient took the medication is actually probably more important, in terms of assessing whether the medication worked. But once you know that they took it, the genetics plays a role. So codeine is a very common question.
AUDIENCE: You said it's the cause of this, or was it just a marker?
RUSS B. ALTMAN: CYP2D6 is the enzyme that does the transformation. So there is this idea of markers, where they correlate with things that we either know or don't know. In this case, this SNP-- and there's a bunch of them that can do the same thing-- causes the inactivity of CYP2D6. So in this case we know a mechanistic story. That's a good point. Because I was talking about, it could be a correlative story, where I just genotyped a bunch of people, saw that there was this SNP that, whenever I saw it, codeine didn't work. And that would have been a perfectly useful thing for clinical practice. But we actually know a little bit more in this case. One more example that's been in the press recently is a medication called BiDil. It's actually a combination pill. Two different medications that we use separately, but some company probably was running out of patent time and needed something to patent, and created a pill that had the two medications grouped together.
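Going back to the codeine example, the remark that "statistically" someone in the room probably carries an inactive CYP2D6 can be checked with a short calculation. The sketch below assumes each audience member independently has the 7% chance quoted in the talk (a figure given there for Caucasian patients, so applying it to a mixed audience is a simplification), and the audience size of 50 is a made-up number.

```python
INACTIVE_FRACTION = 0.07  # share with inactive CYP2D6, as quoted in the talk


def prob_at_least_one(n_people: int, fraction: float = INACTIVE_FRACTION) -> float:
    """Chance that at least one of n independent people carries the variant."""
    return 1.0 - (1.0 - fraction) ** n_people


def smallest_group_for(target: float, fraction: float = INACTIVE_FRACTION) -> int:
    """Smallest group size whose chance of including a carrier exceeds target."""
    n = 1
    while prob_at_least_one(n, fraction) <= target:
        n += 1
    return n


audience_size = 50  # hypothetical room size
print(f"{prob_at_least_one(audience_size):.3f}")
print(smallest_group_for(0.5))
```

Under these assumptions a group of about ten people is already more likely than not to include someone for whom codeine does nothing.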
And the bad news was, they did a trial on the general population, and it showed no overall benefit for heart disease. This is a pair of drugs that are used for heart failure. And they did a big trial, you know, 3,000 people get it, 3,000 people don't get it. And there was no statistical difference in the outcome. And they were bummed because, wow, this could have been a good drug, but it's not. Their actuarial statistics guys looked at the data, and they noticed that African descent patients, in the study with 3,000 and 3,000, showed a little bit of benefit. Now, you can't get FDA to approve a drug based on a subgroup analysis that's not the primary goal of the study. But it was enough for them to go back and do another study focused on African descent patients. And they showed that it actually improved outcome quite well. And so the FDA gave them approval-- the Food and Drug Administration in the United States gave them approval for using BiDil in African descent patients. Well, why is this genetic? I didn't mention any genetic test. Well, I have mixed feelings about this. The genetic test that they're using is looking at the color of the patient's skin. That's not a great genetic test, for a number of reasons which I can't go into. Some day, and I hope soon, we will do the work to find the genetic variables that are actually responsible for predicting the success of this drug. And then what we'll do is we'll test all patients, African or otherwise, and if they have this SNP-- then we'll know that they should benefit from BiDil. And if they don't, they won't. Right now, the genetic test is staring at the color of the patient's skin. And it's been approved, and it's a very interesting story. You can imagine that bioethicists are having a field day writing about this, and what its implications might be. But the fact is, it's on the market and it's helping some people. And so that's good. So I think you can see the clinical promise of pharmacogenomics. 
Focused treatment by finding people who are likely to respond. Not giving it to people who are going to have bad adverse reactions. That's something I'd like to come back to. And then, for the drug companies, as in the case of BiDil, it's a way to save drugs that they were hoping would be everybody, one size fits all, and instead, it's one size fits some people. And if you're a drug company trying to make some money, it's not bad to get a drug that at least you can sell to half the population, or whatever. And then, from a scientific point of view, we'd like to understand how drugs work better, and understand the genetic basis for drug response. Is it in routine use? Not really. There are a few cases. There's a breast cancer medication that, before you take it, you need to get a genetic test. I do not test CYP2D6 before I give patients codeine. What do I do? I give them codeine. I say, if it doesn't work, give me a call. And if it doesn't, I give them like Vicodin, or Percodan, or whatever. But for whatever reason, it has not entered practice. Even though, in principle, you could check CYP2D6 levels once, early in life, and then tell the patient forever, don't do codeine, it's not going to work for you. There's a little downside for the drug companies, because they don't really like splitting their markets if they don't have to. So they're a little bit schizophrenic about this. Do we like it or not? Can't quite decide. There are lots of SNPs in the genome that make no difference at all. So we have still science to do to try to figure out which ones really matter. Genotyping, the cost of testing for DNA sequence is not expensive, but it's still not cheap enough, and the QC, quality control, hasn't been done for routine clinical applications. That is going to go away very fast. There are a lot of companies working very hard to get accurate genotypes very cheap.
I think you can look in the next five years for your genotype being available for the same amount that you pay for a head MRI, or a head CT. On the order of hundreds or thousands of dollars, not tens of thousands of dollars. There are companies-- Perlegen, right down the street, will do it for $5,000 if you do it in bulk. But that's for research purposes, and there's a certain challenge of QC for clinical purposes. There are ethical issues in testing individual patients. What do you do with the data? Where do you put it? Who gets to see it? Do insurance companies get to see it? Do employers get to see it? And what are they allowed to act on? What happens if you have genetic background that increases the risk of certain diseases that will make you an insurance liability to yourself or others? It's also, in America today-- and this might be a challenge for Google-- totally unclear how to deliver this information to a practitioner. When I'm practicing medicine, 12 to 15 minutes per patient, that includes saying hello, talking, examining, about one to two minutes for making drug prescription decisions. I have 30 medications in my head that I use a lot, that I know really well, and I can make prescribing decisions. If you tell me that the world is a much bigger place, and every patient is going to have a genetically informed prescribing decision, I need to have a terminal next to me, telling me what the right answer is, and giving me the information to both agree or disagree, and to explain to the patient what the plan is. We don't have a health information infrastructure in the United States today, as you may be all aware. And so how to deliver that, in the United States today, is entirely unclear. It's much clearer on how to do that in the UK, Canada, Estonia, Iceland, because they all have some sort of centralized effort, where you can imagine putting in a new app for drug prescriptions, or integrating it, really, with the existing apps. That's not happening.
So there's a big challenge. And I know that Google is looking at stuff in that area. We think about it a lot as well. OK, so we're building the PharmGKB, the pharmacogenetics knowledge base. This is not meant for you to really read. We tried not to innovate too much on user interface because you guys and Amazon and eBay are doing all the user training for us. And so, we try to-- tabs like Amazon, search boxes like you, the real meat of our stuff is in the upper left. External links, or I want to buy this information-- you don't really buy it-- but the download buttons are always in the upper right, just where you have the add to cart type buttons, et cetera. And this is a database. And what's our mission? I'll tell you our mission in a second. I'm going to skip that. PharmGKB was funded by the National Institutes of Health, because they wanted there to be a public pharmacogenetics effort. Drug companies are doing a lot of this work, but they are not charged with sharing the data. Perfectly understandable. That means that the public, if they want to get data in the public, so that universities and nonprofits can do research in this area, they're going to have to pay for it. And so NIH kind of ponied up a bunch of money, funded a bunch of centers to do pharmacogenomics research, about 12 national centers. And they funded one database, that was us, to kind of hold the data. So produce a public repository with broad applicability. This is the NIGMS at NIH, which is who funds this. That's not really important right now. This gives you a sense for the things we're looking at in our network of researchers. A lot of heart disease and lung disease along the left. Cancer in blue. And a lot of drug metabolism on the right. And then we're in red in the middle, because we're the database. And this is really a national network.
Our goals are to create a national data resource with high quality data linking genotype data with phenotype data, both laboratory phenotypes, as well as clinical phenotypes. We have to figure out how to represent the data, the data model. Tough problem. Provide analytic functionality so that our users can do stuff with the data. And link with all the complementary databases that we don't want to do, but we need to integrate with. You know, we're excited about this, because it's an incredibly complex domain. We have the core data that we have to worry about in green. Genotype data, molecular phenotypes, clinical phenotypes. In order to make sense of that, our system has to know about individuals, on the far right there, the environment in which they operate, the drugs they take, and the molecules in their body that deal with those drugs. So from an informatics point of view, it's kind of a fun representational challenge. Our mission can be simply and graphically summarized as these three little balls that intersect. We have to curate data and knowledge. So the data that comes in is raw data, but we also curate it and kind of represent it for the community as digested knowledge. We have PhD-level curators who do that. We have our user interface and functionality, the website, the database. And then we have outreach and dissemination, which we're doing right now. And both to users and to people who submit data. So this is the site again. Again, I don't want to really go through the details. It's a site. It has lots of information, you type in words, you get hits. I just want to point out kind of some of the key features. This is a gene page. A gene page would have genotype data. What are all the variations in the human genome, in that gene, that we know about? I'll show you a little bit more detail if you press on that link. And we have phenotype data. What are measured phenotypes known to be related to that gene that we have in the database? So that's data.
We also have pathway, I would call it knowledge. Which is, we've put together pictures-- and I'll show you one in a minute-- of how genes work together to metabolize a drug, or to respond to a drug. So it's more of a systems level view, not a gene by gene view. That's in our pathways. And then finally, we have at the bottom, if you scroll down, we have annotated literature data, where our curators are constantly looking at the literature, and hand annotating for important articles that are giving pharmacogenomic information. So two types of data, genotype and phenotype. Two types of knowledge, pathways and literature curations. When you look at a gene page, this is kind of scientific detail, which I don't think most of you care about. But that's the genome in that big thick bar. And all those little tick marks that look like the skyline, those are all the locations in the gene where there's known variations in humans. And there's a table at the bottom that tells you the percentages. So for example, that first SNP is a TA SNP, you can see at the bottom left, which 52% of the humans we've looked at have a T, 58% have an A. And then we gather that data for the entire genome, provide browsing. And links, which I'm not going to talk about. I should say that it's very important, though, to have the population breakdowns, because as I said before, most populations share all these variations, but their frequencies can be very different. So the right average drug in Asia might be different from the right average drug in Africa. But of course, we don't do average drugs. But if you're making decisions about formularies and what drugs our country should buy, then the frequencies are going to matter. The most exciting thing about PharmGKB is when we have a genotype for which we also have one of these little green phis. And that's third to last column. That green phi means that we've measured phenotypes on people for whom we've also measured that genotype for that row.
And that's what it's all about. If this whole effort is about relating genotypes to phenotypes, then you need to have a big data set of genotypes that you measured, and phenotypes that you've measured on the same people. When you see those green phis, it means if you press that, you basically get a huge Excel or SAS spreadsheet, which is going to have-- each row is going to be a human, which has a bunch of genotypes and then a bunch of phenotypes, and then you can go to town trying to figure out what's the functional relationship between those. I should add, environment is also important. So if you want to put it very succinctly, phenotype is a function of genotype plus environment. And what that F is, is really what the name of the game is for the next 10, 20, 50 years. The functional form of that F, and then which genotypes matter, which environmental variables matter. So this is a phenotype file, which is not particularly-- it's a bunch of gobbledygook about pharmacology. But this is the exciting table I was telling you about, where each row is a human, de-identified of course. To get to this data, I should say, you have to register and log in, which I hate. Because we want to disseminate as much information to the world as possible. But we have to protect the privacy of these patients. Even though they're de-identified, we want to make sure we know who sees this data. So we have a lot of information on the site that you can look at anonymously. And that the Google-- in fact, the Googlebots are the number one hits on many of these pages. But then there's a certain level of data where we have to ask people to log in. And of course, we lose a lot of people at that point. We lose most of the drug companies, for example. Because they don't want anybody to know what they're looking at. So if a drug company does log in, they download the entire database, so that they can look at what they're really interested in, but nobody can tell what it was.
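The "each row is a human" table and the idea that phenotype is a function of genotype plus environment can be made concrete with a toy example. Everything in this Python sketch is invented for illustration: the six rows, the column names (`snp1`, `smoker`, `responded`) and the helper functions are assumptions, not PharmGKB's actual data or schema. It only shows the simplest possible look at such a table: allele frequency, and response rate broken down by one genotype or one environmental variable.

```python
from collections import defaultdict

# Invented, de-identified toy rows: one individual per row, with one genotype
# column, one environment column, and one measured phenotype.
rows = [
    {"id": "S1", "snp1": "TT", "smoker": False, "responded": True},
    {"id": "S2", "snp1": "TT", "smoker": True,  "responded": True},
    {"id": "S3", "snp1": "TG", "smoker": False, "responded": True},
    {"id": "S4", "snp1": "TG", "smoker": True,  "responded": False},
    {"id": "S5", "snp1": "GG", "smoker": False, "responded": False},
    {"id": "S6", "snp1": "GG", "smoker": True,  "responded": False},
]


def allele_frequency(table, snp, allele):
    """Fraction of all chromosome copies in the table carrying the allele."""
    alleles = "".join(row[snp] for row in table)
    return alleles.count(allele) / len(alleles)


def response_rate_by(table, key):
    """Fraction of responders within each value of one column."""
    totals = defaultdict(lambda: [0, 0])  # value -> [responders, people]
    for row in table:
        counts = totals[row[key]]
        counts[0] += int(row["responded"])
        counts[1] += 1
    return {value: hits / n for value, (hits, n) in totals.items()}


print(allele_frequency(rows, "snp1", "T"))
print(response_rate_by(rows, "snp1"))
print(response_rate_by(rows, "smoker"))
```

A real analysis would need far more people, many genotype columns, and a proper statistical model for the unknown function; grouping by one column at a time is only the first thing one might try on a table of this shape.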
So this is just an example of a bunch of phenotypes. Each column is a phenotype that's been measured, and each row is a person. And behind that hot link is the full genotype collection for that person. That's a person circled. And that would be their genotypes, which I don't need to show you. Of course, the other side of pharma-- yes, question?
AUDIENCE: [INAUDIBLE]?
RUSS B. ALTMAN: We have some gene expression data, and we have a road map to kind of integrate that more tightly. In the pharmacogenomic field, there's been remarkably little data created in gene expression. And a lot of times we're driven by the data that's submitted to us. But we are seeing an increase in that. And so we're building up for that. So right now we have some. I would say modest, but-- and the woman four ahead of you and to the left, my left, your right, is in charge of the gene expression effort. Tina. So we have drug pages as well, because it's genes and drugs. So kind of a very parallel universe. We have the phenotype data sets, pathways, very similar situation. Links to outside, like the PubChem, which you may have heard of, which is a national effort to give drug information and chemical information. Very valuable, because it saves us a lot of time of having to download a lot of drug structures. And then we have the annotations. This is the converse list of genes that are known to be involved in that drug. Before, I showed you drugs that were known to be involved with a gene. This is a pathway. It's highly curated, made by scientists who have conference calls and argue about it until our curators say enough is enough, let's stop, and let's put it up on the web. It's a great user interface for us, because instead of having to type things in, you can look at pictures and click on genes or drugs, and go right into the database. So we have a fairly big effort building pathways, curating them, making sure they're high quality, and providing links to the supporting data.
We have templated queries, and kind of all the things you would expect so that users can fill in a blank, and we can give them formatted output meant especially for them. Since we now know what question they're asking. As opposed to the Google-type text box, where we basically give them a list of everything that was hit that is scored, but we don't really know what they're thinking. Whereas here, we kind of know what they're thinking, because they've chosen one of the templated queries. We do all kinds of search features, which nobody ever uses, except like me and the project director. Boosting and spelling, and people just type in words. And I think you guys know that better than me. We have almost 2,000 registered users. And I'll show you the data for total users, but this is a small fraction, and that's why I hate it. But these are people who actually want to download the de-identified individual data. And that's why we have to keep an audit trail. We have about 1,300-- yes?
AUDIENCE: [INAUDIBLE]?
RUSS B. ALTMAN: I'll show you, but we opened in about-- well, we started in 2000, and had no hits for about four years because we had no data. And then I'll show you kind of the hit rate. We're getting about 100 per month, and that's kind of growing. So we're happy about that. We have about 1,300 manually curated articles by our curators. We have primary data on about 613 out of 25,000 human genes. So this is very curated, I would say, especially if you're thinking from a Google-type perspective. 407 drugs that we have primary data on. A total of 74 phenotype data sets, but that's growing pretty significantly. 16 pathways, that's growing. And we have about 21,000 people in the database for whom we have genotype data. And about 9,000 of those also have phenotype. These are our unique IP visits per month. You know, this is the most intimidating slide to give to this crowd, I have to say. So we're getting about 30,000 to 40,000 unique IP addresses per month.
And you guys get what, 500 million a day, I don't know. So, you know, give me a break. But you can see that when we started getting data, in late '03, we had a big rise in hits. And then, there was a drop. But then we released a new set of features, and we're hopeful that we're now back on a growth curve, or at least up around 30,000, 35,000. If you look at other biological databases, the most successful is GenBank, which holds the raw DNA sequence data. They are used by all biologists everywhere, every day, and they get 400,000 unique visitors a day. And so, I think that's an upper bound on what we could ever do, because we're a subset of the biologists who care about drugs and stuff. So you can do your own calculation. But an upper bound is at least about 400,000 a day. And I would say, probably 40,000 a day, which means we have at least an order of magnitude of growth that I could expect to happen, and not be surprised. I would be disappointed if we stayed constant here. And I would be not surprised if we went up by an order of magnitude in the next four or five years, two years, whatever. This is our user profiles. As you expect, since it's a public effort, it's mostly educational. The other is all educational from non-US sites, so it's really-- 50 plus 20-- it's really 70% educational. Maybe 65%. And then 23% dot com. As you know, in fact I learned on my last-- my last visit to Google was in that other place down on 101. It was a cozy little shack. And this is much bigger. So congratulations. But one of the things that we can do just like you do is, we can look at what people actually type into the search boxes to figure out what they care about. And so it's marketing research for us. And for one three month period, or whatever-- I don't even remember what the period was-- these are exact text matches. And of course, you could get much better statistics if you do spelling errors and stuff. But this gives us a sense of what the community of users wants. 
And that's great for us, because then we can say, how are we doing on searches for these drugs, these diseases, these genes? We are entering a new phase now where people can now measure all the SNPs in an entire genome. Whereas, for the last five years, it was much more focused. So we're having to build kind of chromosome level browsing capabilities. Question? AUDIENCE: What are the people you do have the genomes for now, the 21,000, how many SNPs have you [INAUDIBLE]? RUSS B. ALTMAN: So for those 21,000 people, we probably have an average of only 20 to 50 SNPs each. But in the next three months, we're going to be starting to get 300,000, 400,000 SNPs per individual. So there's going to literally be this qualitative and quantitative jump, because of the technologies that Affymetrix, Illumina, Perlegen, which is right down the street, there's this quantum leap. And we hope to be the first, or if not one of the first, I'm hoping the first database, that can accept that data and vend it out to the public. And our team is working very hard writing code to parse those binaries, display them. In fact, that's exactly what I'm showing here. You'll be able to look at a chromosome. We will label all the regions of the chromosome for which we have data. Today it's only three, because we have low throughput data. But then we'll show segments of the genome with little tick marks showing all the SNPs, where they are, what the frequencies are. And so we're about to enter a very exciting time. I have a little blog. This is in beta. But I made my own little blog, which is news items. And the amazing thing about Google is, you guys found it and made it my number one hit on my name in like two days, before I even announced it. So way to go. So in return, we use Google Groups as our key way for technical support with our various users and stuff. And so there's a beta-- Google Groups, I know, is in beta. And we're also using it for our own beta technical discussion group. 
Just set that up last week. Now, I just want to finish up talking about some opportunities in a vague way. I don't have a plan, but I wanted this group to hear about this. So first of all, one of the things that drives pharmacogenomics is two things, getting drugs that will work better, and not using drugs that will cause adverse events. And there's been a lot of press about adverse events recently. There was a study out of a blue ribbon panel that said 60,000 preventable diseases in the United States a year because of drug errors. If you look at the most common drug-- ADR is adverse drug reaction-- so it's like an adverse event due to taking a drug. The three most common are heart arrhythmias, and QT prolongation is a technical-- it's basically arrhythmias from drugs that mess with the conduction system of your heart. Liver responses that are fulminant and bad. And severe dermatological rashes. Those are the three most common. I'll tell you how I know that in a second. And there are many other minor ones. Some of you may have had some of these when you took your favorite drug, rash, stomach ache, dry mouth, drowsiness, headache. The United States Food and Drug Administration is charged with tracking these, because they are charged with then going back to the drug companies and saying, hey, we've got a problem. We're finding a lot of reports of your drugs causing trouble. And they have the AERS, the adverse event reporting system, where they get 400-- in 2004, they had 420,000 adverse events reported, which they estimated a, are totally biased by the people who actually like making these reports. And there's lots of people who don't do it. And this is at no more than 4% or 5% of total adverse reactions, because it's kind of a pain. A health care worker, a pharmacist, a doctor, or nurse, may say, oh, you have a bad effect. I think it's from the drug. Let me fill out this form and fax it to the FDA. They don't even have an online system yet, OK? 
Based on this, FDA scientists look at these 420,000 adverse event reports. I have no idea how they do this. They have a way to do it and I don't know what it is. But they generate about 600 studies a year saying, we looked at this drug and we think we better think about modifying the label of this drug, because we're seeing adverse events here that are not listed, and that the world needs to know about. And then these 600 reports are done by 40 scientists working at the FDA doing these studies. And you can even find it by using Google. Now, the great thing is, this is all de-identified, and so they make it absolutely public. So you can go to their website and you can download all 420,000 reports from 2004, and for a number of other years back. And what you get is patient demography, basic like gender, age, I don't know if they have ethnic background, some of the drugs they were taking, what happened that was bad, why they were taking the drug in the first place, whatever happened to the patient, the outcome, some dates, and a few other things. So pretty minimal report, but very, very valuable. I'm mentioning this because I'm just going to leave it out there that I think there are ways for FDA, maybe PharmGKB, and Google to partner to do a much better job at tracking drug events, adverse drug events, in the population. And potentially really accelerate the time to when we see the event in a statistical fashion, come up with a research plan for addressing it, and then hopefully changing the world in terms of how that drug is used and the events that it causes. So I'm just throwing that out there because I was thinking, boy, they're not doing a great job at capturing these. And think about how many-- you guys probably know-- how many people type in drug names because they've just vomited, they just took their first dose of a drug, and they say, huh, let me type it into Google and see if this is a known effect. 
And there's a moment there when, in addition to giving them information, we could potentially do stuff that would help the population. So there's the idea. We have done some-- this is just to finish up-- we are interested in data mining ourselves. We're an informatics lab. And one of the things we recently published was a scan of the medical literature for pharmacogenetics articles. So which articles should I tell my curators, or ask-- I don't tell my curators anything-- should I ask my curators to annotate? Right now, they have a very large palette of 15 million articles to kind of consider. So we wrote a machine learning algorithm that was actually quite accurate at picking out articles based on word uses. So it was a statistical, natural language processing. Didn't do any language models or anything like that. And we picked out about 5,000 pharmacogenomics articles, that when we did a subset of them manually, kind of checked, we had about a 92% agreement with our curators. So pretty good, definitely as a screening method for finding articles. And we're just updating that now. So we'll have an updated list. And that allowed us to kind of do things like create a web resource, where you type in a drug, and we tell you all the genes that have ever been mentioned in the context of that drug, in an article that we think is about pharmacogenomics. And I'll let you read that article if you find that useful. So I want to take some questions and stuff. So I want to stop, thank the team. Teri Klein is here today, and she's the director. We have some of the curators here. Some of the technical staff. It's nicer to look at their pictures, so I'll stop and we can take some questions. Hope I gave you a flavor for what we're trying to do. Thanks. [APPLAUSE] RUSS B. ALTMAN: Yes? AUDIENCE: How clean are your data [INAUDIBLE]? RUSS B. ALTMAN: Question is how clean is our data. 
The genotype data, which is the variation position of the DNA, is extremely clean, because we have written a lot of quality control, both automatic code for like-- we take in an XML, and we do a lot of syntactic, as well as semantic validation. And our curators look at it. AUDIENCE: [INAUDIBLE]. RUSS B. ALTMAN: Yeah. AUDIENCE: The person who generated that data [INAUDIBLE]? RUSS B. ALTMAN: The people who generate the data are working pretty hard to have high quality data. Because, since we're relatively low throughput, or have been, there's been a premium on becoming the site known for good data. And so, they have social mechanisms and social pressures on them to do a good job, because they use this data then to publish papers that are peer reviewed. And the peers take a look at it very closely. And so, the best source of data for quality control is the fact that sometimes people measure the same thing in two different labs, and then the concordance is quite good. And that gives us confidence that these people are collecting quality data. AUDIENCE: Is it one SNP, is it 10,000 SNPs, a million SNPs, how much is it? Do people measure the same [INAUDIBLE]? RUSS B. ALTMAN: I would say that 5% of our data has been collected by two different groups. So it's not the majority of it, but it's enough to get a statistically reasonable sample for doing concordance analysis. Now, phenotypes are much harder, because everybody measures them differently. And the only quality control there, again, is the peer review process. Getting the paper published, and having people say, I don't like the way you measured blood pressure, I reject this paper. So that's the primary pressure there, again. There's nothing we can do technically. We're relying on these other sources for quality control. 
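[Editor's sketch, not part of the talk: the concordance analysis mentioned above, where two labs measure the same SNP in the same subject, could look roughly like the following. The data layout (a dictionary keyed by subject and SNP identifier), the two-letter genotype strings, and all names here are illustrative assumptions, not PharmGKB's actual format or code.]

```python
from typing import Dict, Tuple

# Hypothetical key: (subject_id, snp_id). Hypothetical value: a genotype
# written as two allele letters, e.g. "AG". Allele order is not meaningful.
Key = Tuple[str, str]


def normalize(genotype: str) -> str:
    """Make "GA" and "AG" compare equal by sorting the two alleles."""
    return "".join(sorted(genotype.upper()))


def concordance(lab_a: Dict[Key, str], lab_b: Dict[Key, str]) -> Tuple[int, int, float]:
    """Return (agreeing calls, shared calls, agreement rate) over the
    measurements that both labs reported."""
    shared = lab_a.keys() & lab_b.keys()
    if not shared:
        raise ValueError("the two labs have no measurements in common")
    agree = sum(1 for k in shared if normalize(lab_a[k]) == normalize(lab_b[k]))
    return agree, len(shared), agree / len(shared)


# Made-up example data, only to show the call.
lab_a = {("s1", "rs1"): "AG", ("s1", "rs2"): "CC", ("s2", "rs1"): "AA", ("s3", "rs1"): "GG"}
lab_b = {("s1", "rs1"): "GA", ("s1", "rs2"): "CT", ("s2", "rs1"): "AA", ("s4", "rs1"): "AG"}
agree, total, rate = concordance(lab_a, lab_b)
print(f"{agree} of {total} shared genotype calls agree ({rate:.0%})")
```

[Only the overlapping measurements enter the rate, which is why a small doubly-measured fraction of the data can still give a usable estimate.]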
AUDIENCE: So, this is kind of tongue in cheek, but how many-- given all the data that already exists, do you have any intuition for how many, like, PhD theses are out there that would require no more work in the wet labs? It's just a matter of integrating data from all these different sources to make a new discovery that we didn't know before? RUSS B. ALTMAN: Well, there are probably on the order of 1,000 PhD candidates in biomedical informatics around the country right now, most of whom don't generate primary data themselves, but have collaborations or use publicly available data. So I think there's a sense that there's a lot of data and a lot of discoveries to be made. On the other hand, clever biologists, every day, generate new experiments that are smaller and faster. So it's even better than the scenario that you're painting, because in addition to a lot of undiscovered stuff for mining, I think that we're really on the beginning exponential of clever biologists generating large, large data sets. There's now an ethic in biology that, if you can do it, why not do it fast and small, and in high throughput. And that puts people who do informatics in a very good mood. So I'm bullish on informatics. AUDIENCE: You have people's genome, even though it's depersonalized in some sense. Is there any concern that that's like a fingerprint that you could find the person, because it says whether they have seizures and everything else? RUSS B. ALTMAN: Yes. So thank you very much. This is one of those beautiful moments when your next slide-- So, you only need 60 to 100 SNPs. I told you, we have about a million SNPs each. You only need about 60 or 100 of them to be a unique fingerprint. So this DNA, even though I've removed your name and address and social security number, if I get a piece of your DNA, I can very cheaply figure out if you're in the PharmGKB or not. So there's 20,000 people who, when they signed a disclosure, basically took that risk. And we're very grateful for that. 
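[Editor's sketch, not part of the talk: a back-of-the-envelope calculation of why a few dozen SNPs can act as a fingerprint. It assumes the SNPs are independent of each other, that genotype frequencies follow Hardy-Weinberg proportions, that every SNP has the same minor allele frequency, and that the people compared are unrelated. These are simplifying assumptions of this sketch, and the function names are illustrative.]

```python
import math


def match_probability(minor_allele_freq: float) -> float:
    """Chance that two unrelated people carry the same genotype at one
    biallelic SNP, assuming Hardy-Weinberg genotype frequencies."""
    p = minor_allele_freq
    q = 1.0 - p
    genotype_freqs = (q * q, 2 * p * q, p * p)
    return sum(f * f for f in genotype_freqs)


def snps_needed(population: float, minor_allele_freq: float) -> int:
    """Smallest number of independent SNPs for which the expected number
    of other people sharing one person's full profile drops below 1."""
    per_snp = match_probability(minor_allele_freq)
    return math.floor(math.log(population) / -math.log(per_snp)) + 1


for maf in (0.5, 0.1):
    print(f"minor allele frequency {maf}: {snps_needed(6e9, maf)} SNPs")
```

[Under these idealized assumptions, about two dozen very common SNPs already separate six billion people, and roughly sixty are needed when the minor allele is at 10%. Real SNPs are correlated, people have relatives, and measurements contain errors, so a working figure in the 60 to 100 range leaves a margin.]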
But there is a worry about genetic databases in general. And it's one of the reasons we have that password step for getting to that identifiable-- it's not identified, and it might not even be identifiable-- but it's risky, personalized data. There's definitely some risk there. And I'll just tell you a story. And I've written about this. Basically, there's a tradeoff between what you need to do research and what you need to guarantee privacy and their attention. But you may have heard about a month ago, two months ago, a teenager in the UK who-- so here's the story. A father donates sperm anonymously 15 years ago. Teen, 15 years later, is interested in his heritage. Teen in UK uses a commercial kit to check variations in his Y chromosome. OK, Y chromosome, you get that from your father. Mom doesn't have any say in the Y chromosome. There's a company that has a website that if you say you'll make your Y chromosome information available, they'll let you see other people who've also made their Y chromosome available. So the kid is smart, and he says, you know what, there's two things I inherited from my father. His Y chromosome, and I would have inherited his last name if I knew what it was. So he goes into this database and he finds two men who have a very similar Y chromosome. And they both have the same last name. He doesn't know who these guys are, but he says, I bet my father had that last name, or a version of that last name. He knew his father's age, and he knew his father's-- the part of the country where he came from. So he combined this with some voter records, or some DMV equivalent records, calls his father up. He says, hi, Dad, I'm your sperm donor child. 15 years old, this happens in the news on November 3. So I used to talk about this and have to make up these hypotheticals. And this kid saved me a lot of time. So there is no doubt that you can link multiple databases, if you're clever. And if you have any access to genetic information. 
And then you can do some re-identification. Re-identification can be going down to a single person, but it can also be narrowing it down to a relatively small group of people, which could be just as damaging. So this is a concern. And we've wound up writing a lot about this, because if we ignore this problem, it could bring down the entire genomic research. All you need is one bad thing to happen, worse than this, and then Congress says, no more genetic databases, and then I'm out of business. And I'm applying to Google for a job. So thank you for the question. AUDIENCE: I'm sorry, so this is a risk, but this is an acceptable risk? Is that sort of your answer? RUSS B. ALTMAN: Yes. So we do our best to de-identify. When people register, we make sure that they have bona fide research questions. They can be industrial, industry is no problem, academic. But like Hotmail accounts. Uh-uh. Like, why do you want to see this data? Then we, of course, have the web log. So we know who has seen, at least at the IP level and, oh, since they log in, we actually know who they are, who saw data. And so, if we get into trouble, we could roll back some of that data and figure out, you know, if the judge asks us, we could figure out. And then we could find ourselves in the exact same position that you guys have found yourself in recently. So we have written about it. We've publicly declared it. We have a trillion usage policies that, when you register, you have to agree to. And there's a sense that we've done due diligence. But I would never claim that we're bulletproof. Yes? AUDIENCE: With the frequency of SNPs, is it higher in genes than junk DNA, or is it similar across both? RUSS B. ALTMAN: So you actually see SNPs throughout both coding DNA and the so-called junk DNA. In fact, the junk DNA has a little bit more, because it's less sensitive in some sense. If it's not coding for a protein, it can tolerate these variations. 
You know, a random cosmic ray hits your ovum or your sperm, a DNA base flips, and nobody cares, because it's not a critical part of the genome. Whereas, if you hit a critical part, you might be very sensitive. So the rate of SNPs tends to go down in regions that are very important, and that have been conserved over many billions of years, or millions of years. AUDIENCE: Do you ever find surprising correlations with SNPs that you thought were in junk DNA? RUSS B. ALTMAN: Yes. There are SNPs that correlate very beautifully with phenotypes, and those SNPs are hanging out in the middle of nowhere. We have no idea why that SNP is correlating with the phenotype we care about. It just shows you how much biology we still need to figure out. It probably isn't junk, is the answer. Question in the back? AUDIENCE: [INAUDIBLE]? RUSS B. ALTMAN: One more, yeah. AUDIENCE: So you maybe alluded to this with your adverse event reporting, but can you think of ways, I mean, to expand the client base from even 400,000 per week-- like, what's the ultimate number of users who would-- RUSS B. ALTMAN: So the ultimate number of users is all the people on the face of the earth are genotyped. And that information is used responsibly by their various health care systems to do optimal medical decision. Right? So six billion people on earth, six billion data files with all of their genotypes, and enough phenotype information to make predictions for all of them and say, use these drugs, use those drugs. And then, of course, you need a delivery mechanism. I kind of alluded to that. It's a challenge in America. It would certainly be a challenge in Zimbabwe, or in South Africa. And so that's the big challenge. AUDIENCE: [UNINTELLIGIBLE] or just the health care providers? RUSS B. ALTMAN: There's an issue of who carries around the genotype, is it centralized? Certainly the health care providers need access to the data. But does it get stuck on a chip on your tooth? Do you carry it around on a card? 
Is it in a central government database? These are, as you can imagine, are very excit-- is it on Google with a password? As long as you don't give your Google password to anybody, your genotype is safe. So these are very interesting kind of sociological discussions. Society hasn't made a decision. The societies that have centralized medical facilities have an obvious default answer, which is, let's put it in with the rest of the stuff. We don't have that in the United States. And so, I've written a little bit about a distributed, patient controlled system, where the genotype is measured, but then you are in control of it. It's, you know, there's public private key infrastructures in place. And for any individual health care provider you can say, go ahead and use my genotype. And for a researcher like me, you can say, you know what, I think PharmGKB is a good thing. You may have my genotype and phenotype data for research. Please use it responsibly. And I trust you will, because I've read about you and I trust you. That kind of infrastructure, it really is technically not that far-- you know, that's within reach. And it's really a lot of sociological agreement that we need to get that to happen. Or somebody to do it and just do it well and say, look, this is a good thing, any questions. Yes. AUDIENCE: When you get cancer, the cancer cell genotype is different from yours. Do you track that? RUSS B. ALTMAN: So the question-- I'm sorry, I haven't been repeating the question. The question was, the genotypes of cancer cells can be different from the genotype of the host. That's absolutely the case. In fact, one of the causes of cancer is a rearrangement or a messing up of the DNA, which creates multiple copies of genes, missing copies of genes, or mutations or polymorphisms, SNPs, in those genes. And so, it is of great interest to sequence those. 
And in fact, yes, the National Cancer Institute has a project called the cancer genome project, where they're sequencing, not humans, but cancers from humans. We also are interested in that data. You can use all the same measurement technology. You just have to remember that it's not from a normal blood cell. It's from an abnormal cancer cell. A big issue there is the number of copies of a gene. Turns out that's one of the big things that happens in cancer, is you get extra copies. And so the fine balance that's established in a normal cell is messed up, because you have 10 copies of one gene. And so it's over-eager, and it just causes the cell problems. So let me stop here since it's noon, and thank you very much.