Well, thanks, Steve, and thanks for the chance to be part of this historic symposium celebrating
this quarter century of GenBank's contributions to biomedical research, which are indeed substantial
and it's hard to imagine how we would be anywhere near where we are right now without the dedication
of the folks that have been engaged in this for all these 25 years. The challenges that lie ahead
are at least as daunting - to be approached in a somewhat anxious way - as what we've traveled through before,
because clearly the pace of discovery is accelerating exponentially,
and that's going to, I think, put a great deal of stress upon David and Jim and the rest of the
very capable GenBank and NCBI staff. But I am sure, based on past performance, that they're up to it.
Since this is a historic occasion, I thought back a little bit on the ways in which the use of
sequence data has played out in my own personal experience, and I recalled my first real major
involvement in trying to determine sequence at reasonable scale - at least it seemed
really reasonable at the time - and that was in fact the publication of a paper in 1984 that had this
as one page of a very long appendix. This was an effort that I conducted while working as a post-doc
in Sherm Weissman's lab in Yale to pull together all of the DNA sequence that had been derived
on the human beta globin cluster as of that point. So this was just about 25 years ago, and in fact,
what you're looking at here is an output; I actually had to travel across campus to find this
fancy piece of equipment called a laser printer in order to make this very nice output.
The output itself was painstakingly put together by me, basically by taking sequence papers that
had been produced by various groups. The Weissman lab - partly myself, and partly others - then
tried to fill in the gaps, because people hadn't actually tried to do this in a contiguous way.
And I pretty much took all of the parts that were not already in electronic form, which was most of it,
entered it by hand, and had my then 14-year-old daughter do the proofreading.
And she got paid two dollars an hour, which was pretty good, although she got pretty tired of it
before this was over. And then I annotated it as you can see there with some very clever indications
about what's present. This happens to be a stretch of DNA just upstream of the beta globin locus.
You'll notice from the numbers over there on the far side that we're already up in this huge
stretch of DNA, more than 40,000 base pairs. At this point this was, I think, the largest
contiguous stretch of human DNA that had been assembled. And, yes, I did notice there's this
run of Gs and Ts. I had no idea that would someday be called a microsatellite and would
become a major force for genetic mapping, because that hadn't really been realized in 1984.
There's also actually a repeat down here, a pentanucleotide repeat, which I knew from my own
sequencing work was actually variable between individuals, but I wasn't clever enough to realize
that might be a really useful kind of genetic marker. So that was one page of it, and then here comes
the other page, next page along. This is the beta globin gene. I'm sure you recognized it because
it's labeled there in this elegant annotation. You can see where the transcript starts. You can see
where the initiation codon is and where the introns are. And, oh, yeah, there's an EcoRI site there
in case you care. So this was my exposure to what I was quite sure I didn't ever want to do again,
which was to try to put together this kind of sequence and then display it in some fashion
where people might be able to use it - and I'm sure nobody used this unless they were really desperate.
We had to do better. Well, five years later, I found myself participating in the construction of
this particular diagram for a published paper in Science. This, again, looks like an awful lot of
sequence data - in this case annotated with the amino acids and with some domains that are marked off.
You might wonder what is this. Well, that little triangle right there is the common mutation
that causes cystic fibrosis, so this was in fact the description of the cDNA sequence of the CFTR gene.
But again not much benefited by a lot of the things that we now take for granted, because it was early days.
Although, I will say in this case, this was much easier than it had been five years earlier for the
hemoglobin conditions. So clearly we've come a long way, and a lot of that is, I think, a credit
to GenBank and to the things we're celebrating here at this meeting. I looked back at the timeline
that was put together in 2003 to celebrate the completion of the goals of the Human Genome Project
and, of course, there was much debate of what belonged on the timeline, what were the really
major events of the last 150 years, and you can see the ones that were chosen here for the first part of it,
and then you go to the bottom part of it. And I was happy to notice - I don't quite remember how we decided this.
There's GenBank, ranking right up there with Mendel in terms of making it to the timeline.
And also the formation of the International Nucleotide Sequence Database Consortium in 1986,
most of which were nicely talked about in presentations yesterday by those who were
deeply involved in these remarkable moments where really important plans were made
by thoughtful people to try to catalyze where we were heading. Of course, where we were headed
then morphed into the Human Genome Project, which had its own series of milestones,
all of which, I think, really could not have happened without being undergirded by the databases
and the wonderful work that NCBI in particular was doing. I want to just highlight - and it was
mentioned a minute ago by Steve in the very nice introduction - what a critical thing happened in 1996
which was the agreement amongst the International DNA Sequencing Consortium for the human
genome, that all the data was going to be made available, and this is actually a photograph of the
white board, which was written upon at that point in 1996. I think this is John Sulston's handwriting.
It's either John's or Bob Waterston's, and this is what the group agreed to, gathered there at that
meeting in Bermuda, which sounds elegant, but I don't remember anything about it except for
it had a very dingy conference room, and we gathered there for two and a half days to try to
figure out how we were going to actually tackle this problem of sequencing all of the human DNA
at the point where it just began to look like it was possible to ramp that up. And this is what we
all agreed to - the conditions about release and immediate submission - aimed at having all sequence
freely available and in the public domain for both research and development, in order to maximize
its benefits to society. And that was radical. And again you can see here automatic release of
sequence assemblies greater than one kilobase, preferably daily, and that was what was agreed to.
That really set in motion a series of increasingly broadly accepted ideas about public access to data
that continue to spin out today and were nicely represented by Betsy Nabel's presentation about
what we are doing about genome-wide association studies and making those available to investigators
long before publications can be generated in a way that I think really has produced a whole new
approach to try and maximize benefit to society by making access immediate for investigators
who would like to get their hands on the data and start to work with it. The second part of
the Genome Project then continued apace. Again, I might say, we couldn't have made those
Bermuda rules if there wasn't a database ready to receive all that data, and so the existence of
GenBank was critical for that to happen. The draft genome in 2000, the published draft in 2001,
and then in 2003, the essential completion of the human genome sequence with now much more to go.
So that's about the past, and I should say those who contributed to that sequencing of the human genome
are a remarkable group, more than 2,000 of them; pictured here are primarily the leaders of
those 20 centers that gathered together, with strong support for those immediate-access data rules,
to make all of this happen in record time, ultimately producing the genome two and a half years early
and, happily for taxpayers, at a price tag about $400,000,000 less than what had been anticipated.
I just want to recognize one person - since we're here talking about NCBI - who is a big part of this
and who I don't think has been mentioned in this symposium, and that's Greg Schuler, who was
a very important part of the effort to deal with the assembly of the sequence and the display of
that sequence for the world to look at. So I've had the great fortune of being able to be part of
these efforts, both the sequencing of the human and many other large-scale projects, and I very much
know what Wilson is talking about here in terms of the way in which these things can either
succeed or fail - that by borrowing the brains of a lot of smart people, you can make
amazing things happen, and I've been fortunate to be able to do that. So now let me turn to where we are
at this point, to give you some notes from the front lines, as I see it, of the genomic revolution.
I'll go through about six points, and this is going to be a little arbitrary because there's so much
going on right now that I had to do a bit of picking and choosing. Of course, one of them is
comparative genomics, the ability to look not only at our own genome but at those of many other species,
now amounting to more than 30 vertebrates that have had their genome sequences determined
either in draft or in finished form: the mouse, of course, the chimpanzee, the dog, the honeybee,
the sea urchin - obviously those last two aren't vertebrates, but this one is - the macaque. I just picked
the ones that had covers, and I guess I left out the rat because it had such a horrible cover
on Nature, but ultimately it should be in this list as well. I think what we've learned
from this is an enormous amount about evolution, basically being able to look at evolution's lab notebook
which is what genomics allows you to do, and identify which parts of the genome continue to be under selection.
And we have a lot more to learn from that, and one of the things we hope to do now is to focus
specifically on primates and try to learn more about recent evolution in terms of how it has affected
our own species. One of the things that has been talked about a little bit at this meeting - and
maybe David will talk about it this afternoon - is the way in which there's some serious interest
in trying to be sure that the data is curated in GenBank so that when you go there, you can be
pretty sure that what you're looking at has high quality attached to it. Obviously the RefSeq approach,
which was mentioned in a really wonderful talk yesterday by Jim, is a big part of that.
But there is still this question about what to do with the large production genomic projects
including the human, the mouse, and others, and the idea of having a community go in and
annotate these in a true Wikipedia fashion doesn't seem to be a good idea, and I think David
makes that case very strongly in this recent News and Views coming from Science Magazine.
Just as a point of information, this is something we at the Genome Institute are certainly
concerned about: having put all this effort into generating these reference sequences of various species,
we would not want them to fail to keep up with corrections and additions that make them
even better, because of course none of these were absolutely perfect. The human genome sequence,
when we called it essentially complete, still had about 300 gaps in it that couldn't be closed by any
known technology, but are now gradually getting attacked. So there is now a Genome Reference Consortium
which has as its specific mission to correct a small number of regions in the human genome reference
that are currently misrepresented, and to serve as a central point of contact for community members
who think they've found something wrong or have fixed something that previously was ambiguous.
That involves particular leadership from Wash U and the Sanger, and from the bioinformatics groups
at EBI and NCBI, represented in particular by Deanna Church. So there is a serious intention here
to continue to improve those sequences as more information becomes available, but not in a way
that generates chaos, hopefully in a way where any change that's made in the reference sequence
is well backed up by experimental data. Meanwhile, of course, DNA sequencing - which for most
of the last 25 years has been in a fairly stable state, depending upon Sanger dideoxy sequencing
and gel-based separation of the products - is, as all of you know, undergoing
revolutionary technical advances that allow us to proceed in a massively parallel fashion and
generate amounts of data that are truly breathtaking. Some of those are coming from this instrument
already mentioned, the 454, which carries out this kind of parallel sequencing on beads;
some other data coming from the Illumina Solexa instrument where the parallelism is done in clusters
on a solid surface - here shown in a microscopic picture, each one of these representing a clonal
population of DNA molecules that then can be sequenced in place. And, of course, we're not done yet.
If you saw this paper in PNAS, suggesting really dramatic acceleration might be possible from
this company called Pacific Biosciences, using zero-mode waveguides and single-molecule sequencing -
truly single-molecule sequencing, perhaps with very long reads - if this in fact reduces to practice
in the space of the next couple of years, this could really open up the ability to collect sequence
data at a dramatic level. As could this one, a paper just published last week, coming
in this instance from Helicos, showing that they have reduced to practice the ability to do single
molecule DNA sequencing, and they applied this in a fashion that is pretty impressive,
to a particular viral genome, namely M13, as a proof of principle - neither of these probably
going to make a big difference in sequence output in the next 12 months, but wait until we see
what happens in the next year or two. The disruptive innovations that are characterizing this field
are coming along hard and fast, and if any of these succeed at the rate that might be projected,
we may get to that thousand-dollar genome a lot sooner than the seven or eight years that many people
have been predicting, and of course the challenges for GenBank and NCBI, if this comes to pass, are only
going to accelerate on a log scale - and it's a good thing we have capable leadership prepared for that.
I'm sure we do. Let me go on to a particular application of high-throughput sequencing
that I think is turning out to be very exciting, and this is a joint effort of the Genome Institute
and the Cancer Institute. Of course, cancer is a disease of the genome. We all understand that now,
although we didn't really know that until about 30 years ago. It's now quite clear,
and this particular quote is an interesting one, because it comes from Renato Dulbecco in 1986
in a piece written for Science, which was alluded to in yesterday's presentations. So one of the
very first calls - the very first published call - to do the genome was on the basis of understanding
cancer, and Renato, I think, made a very strong and effective case for that. And that has resulted,
in the past year or so, in the formation of the pilot effort for The Cancer Genome Atlas,
which is trying to bring together all of these components to apply high-throughput sequencing
and other means of genome characterization to cancer, and particularly we have chosen three tumors
to start with - glioblastoma multiforme, ovarian cancer, and squamous cell carcinoma of the lung
for each of those attempting to collect hundreds of tumors for which there is also DNA available
from the same individual, so you can tell the difference between somatic and heritable changes.
There's technology development also associated with this enterprise, to try to push the agenda
forward for doing this in a comprehensive and cost-effective way. I'll show you some very new data
from the Glioblastoma Project, the first one that has come out of this - and this is only about a week old
at the meeting of the TCGA analysis group that was held here just very recently - and basically
what one needs to do if you're going to analyze a tumor of this sort is collect lots of samples,
and this is actually taken from the glioblastomas. Let me explain what this picture is telling you.
Here are the human chromosomes 1 through 22; each column here - and you can see there are a lot
have a gain or a loss - with a loss being in blue and a gain being in red - and what you can see here
is some patterns are appearing. This is using a statistical method called GISTIC, to be able to try
to assess what's going on, and you can see both there are whole chromosome changes or
chromosome arm changes, but in a few places like that thin blue line there there's something
more specific going on in a narrower way. One can then move to the more careful analysis
using GISTIC, which is an effort that Gaddy Getz at the Broad has put together along with others.
And what you come up with, in terms of recurrent deletions and amplifications, is summarized here
with a significance value attached, in order to be sure that you're not just looking at noise.
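To make that concrete: below is a toy Python sketch, with entirely invented data and thresholds, of the core idea behind this kind of recurrence analysis - score each locus by how often and how strongly it is altered across tumors, then compare against a permutation null. It is emphatically not the published GISTIC algorithm, whose scoring and false-discovery machinery are far more sophisticated.

```python
# Toy sketch of recurrent copy-number scoring in the spirit of (but not
# identical to) GISTIC: score each genomic bin by frequency x amplitude
# of gains across tumors, then ask which bins exceed a permutation null.
import numpy as np

rng = np.random.default_rng(0)

# Invented input: log2 copy-number ratios, 100 tumors x 5,000 genomic bins,
# with a simulated focal amplification recurring in 40 tumors.
cn = rng.normal(0.0, 0.3, size=(100, 5000))
cn[:40, 2000:2010] += 1.5

def g_score(matrix, threshold=0.5):
    """Sum of above-threshold gain amplitudes at each bin (frequency x amplitude)."""
    gains = np.where(matrix > threshold, matrix, 0.0)
    return gains.sum(axis=0)

observed = g_score(cn)

# Null model: permute each tumor's bins independently, preserving its
# overall aberration burden while randomizing genomic position.
null_max = np.array([
    g_score(np.apply_along_axis(rng.permutation, 1, cn)).max()
    for _ in range(200)
])

# Bins whose observed score beats 95% of the null maxima are candidates.
cutoff = np.quantile(null_max, 0.95)
print("candidate recurrent bins:", np.flatnonzero(observed > cutoff))
```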
Many of the things that are deleted and amplified in these glioblastomas are in fact recognizable
as genes that had been previously implicated in this particularly devastating cancer.
But some of them certainly are not - strong peaks that have no particular assigned gene, as you can see here,
are in fact of great interest. Now, on top of doing the copy number analysis, which one can do
quite readily now at high throughput, this program has also begun to involve high-throughput sequencing,
and some 600 genes have now been sequenced - the exons of them, that is - in about 90 tumors.
That's just a summary of the deletions and amplifications. And here's what's been found -
and again this is very new data, although the mutations that have been discovered have been
validated by an independent platform, so we believe that they are not false positives from the sequencing.
Here is the number of mutations found in these 86 glioblastomas after sequencing about 600 genes.
It's interesting that the positive controls, which are in purple - those known to be mutated in glioblastoma -
all turn up at the sequence level. But then there are all these other things in blue, and the two
that are most frequent so far in this group include an old friend of mine - neurofibromatosis type 1,
a gene that my lab identified in 1990, a gene which people have said plays absolutely no role
in glioblastoma multiforme, apparently having missed the mutations because it's a large,
complicated gene with a lot of non-processed pseudogenes that make it very hard to work with.
ERBB2, the gene which when amplified is associated with response to Herceptin, also turns up in glioblastoma -
not as an amplified gene, but actually as one with point mutations in both the extracellular
and the kinase domains - and this has very interesting connotations for possible therapeutics, as does this.
So this is just an early peek at what is likely to come out of these analyses and as TCGA scales up
there's going to be lots more of this data, again all being deposited immediately into a database
where authorized investigators can go and see the information long before publication,
following essentially the same model that you heard about from Betsy for GWA studies.
So we're pretty excited about this, and Renato Dulbecco, writing a little essay for
Scientific American, reflects upon just what a good thing it is that his dream is now finally coming true.
Furthermore, not only sequence information but other approaches to function are scalable,
at least some of them are, and a lot of work is being done there mostly to empower investigators
to be able to go faster than they could if they were dependent upon having to do all these things
in their own laboratories. One of the things that we've been particularly doing is to try to define
the parts list of the human genome, and to attach those parts to particular functions,
and the ENCODE Project, which published last summer the results of its pilot, focused on a
carefully chosen 30 megabases of the genome, brought together more than 30 groups that had
different approaches to try to understand genome function and produced a very interesting intersection
of those datasets, because it includes transcript identification and analysis, a lot of comparative
genomics, all kinds of information about histone modifications that are associated with active or
not-so-active chromatin, DNase hypersensitivity, and even origins of replication and
transcription factor binding sites. Accompanying papers in Genome Research filled out a whole
special issue at the same time - just a wealth of information that we've never really had all together,
focused on the same stretch of DNA. And this is now being scaled up to attack the entire genome.
As of last fall a number of groups have been funded to do this and are hard at work to try to do
what we can to decorate the genome with this functional information that many people would love
to get their hands on. We also have very recently initiated a similar ENCODE project focused
on flies and worms, the modENCODE Project, recognizing that in that situation - because these
organisms are so well understood and so readily manipulated - we have a much better chance
to do an even more thorough job of functional analysis in a coordinated way. A number of
highly regarded groups came forward and applied for the chance to participate in this,
and it is all getting underway in a very exciting fashion that should be well worth watching.
In addition, of course, the mouse continues to be an animal of huge interest amongst researchers
who are trying to understand gene function, and we do now have this international
Mouse Knock-Out Consortium, which aims to knock out every one of the mouse protein coding genes
in the course of the next three or four years, as an international collaboration between the NIH
effort, which is called KOMP, the Knock-Out Mouse Project, the European effort called
EUCOMM, and the Canadian effort called NorCOMM. If you're interested in actually making sure that
your favorite mouse gene gets put on the list for Knock-Out, sooner rather than later, you can do so.
Simply type K-O-M-P into Google. The first thing you'll come to is a radio
station in Los Angeles. Don't pay any attention to that. Go to the second entry, which is the NIH
Knock-Out Mouse Project, and it will direct you how to put in a request to have a particular
mouse locus targeted by an approach which is pretty sophisticated. It's based on homologous
recombination. It will generate a conditional-ready null allele, which can then be distributed
at very reasonable cost as an ES cell, which you can then turn into a mouse
and figure out what the phenotype might be. In addition, there's been a considerable need
over the course of several years, to have a set of full length cDNAs available for individuals
who are interested in studying either a single gene's function or sometimes a whole suite of genes
and we all know - those of us who have had the experience ourselves or watched our students
or post-docs trying to do it - that it is really not a very exciting thing to try to get that last
5' exon of a gene that you're interested in that just seems to be resisting you. And so the idea of
doing this as a comprehensive resource was generated in fact quite a number of years ago,
and has been jointly managed by the Genome Institute, the NCI, and the NCBI. And the NCBI
has been critical to this effort, most recently with Lukas Wagner playing a very helpful role
in the curation of this database. We're getting pretty close to having what you could call
a complete set of both human and mouse genes. As you can see by these numbers, the last roughly
1,000 or 1,500 of each of those are now actually being synthesized because it turns out when you
get down to the last dregs, it's more efficient to do that by synthesizing them from scratch
as opposed to trying to pull them out of a cDNA library where you can't seem to find them
or by rescuing them by RT-PCR. Many of these are large cDNAs, greater than six kilobases,
which are very hard to find in full length in other places, but you can synthesize them quite readily.
Well, not quite readily, this is still a bit of a challenge. I must say this experience has taught me
that we aren't quite at a point where you can just dial in the sequence you want and get it tomorrow.
The DNA synthesis capabilities still run into trouble with certain types of sequences.
But at any rate, this resource, which again is completely available from numerous distributors
for only the cost of sending out the clone, should be an empowering one for people who are
trying to understand function. And for those who are used to using mouse knockouts or siRNAs
to look at the performance of a particular protein in a particular circumstance, or looking at a pathway,
or even looking at a cellular phenotype and trying to understand how to modify it, the availability
of small molecules as an additional perturbation for these kinds of experiments is a really
empowering new appearance on the scene - one that many people, I think, have still not completely
become aware of, and which you could certainly benefit by learning more about if you have not
taken advantage of it.
So this is a resource supplied through the NIH Roadmap: investigators who wish to develop
a small molecule - an organic compound that will act as an agonist or an antagonist or some other
kind of perturbing force on their favorite phenotype - need to develop an assay that will perform well
in a high-throughput setting, primarily in 1536-well plates. That can then be peer reviewed
and, if accepted, put into one of the high-throughput screening centers. Pictured here is the one
up in Gaithersburg, the NIH Chemical Genomics Center. They will screen a library of now
300,000 compounds. Hits are generally found. Those can then be optimized a bit with some
medicinal chemistry, making sure that they have appropriate solubility, specificity, and so on,
and then the compound goes back to the investigator and importantly, the compound also
goes into PubChem, a database which NCBI has founded, which is the first time that we
have had a public database of small molecules, because most of this information has remained
out of view, behind the curtain, or in a subscription database that many people did not have the
resources to buy into, so PubChem has really opened up once again a whole new territory
in terms of biomedical research that has been very empowering, and much credit there
to Steve Bryant who has curated that, and who I'm happy to hear is recovering from a very
serious accident earlier this year, and we all wish him the best after that very difficult time
that he's been through. So these are just some of the resources that I think are driving the
forward motion of experimental determination of function, and so, yes, high throughput doesn't
have to mean no output. High throughput can actually mean a lot of exciting functional results,
and we're seeing those happen all around us as a consequence of these available approaches.
You've already heard from Betsy just what a dramatic set of things have happened in cardiovascular
disease as far as genetic factors in common disease being revealed, and this has in fact been a
glorious 18 months or so of such discoveries. And it was a good thing, because we weren't doing
very well until fairly recently. This graph from 2002 shows you how dramatically well
we did in terms of discovering genes for Mendelian traits in the 1990s, empowered particularly
by Genome Project tools, and how dismally we had done in terms of identifying genes for
common disorders, with only seven such human complex trait loci being accepted by these authors.
And, boy, has that changed, especially in the last year or so, and it's changed because of a couple of things.
One being the HapMap Project, which enabled us to understand how genetic variation
is organized across the genome, so that one didn't have to test as many individual variants as you
thought you would, because they're traveling in neighborhoods, and you can choose a small subset
as a proxy for all the rest. That makes it possible, with SNP chips that have only half a
million SNPs on them, to have a pretty good chance of representing the rest of the SNPs in the genome.
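For readers who want the mechanics of that proxy idea, here is a minimal sketch of greedy tag-SNP selection on simulated genotypes - repeatedly pick the SNP that covers the most still-untagged neighbors at r-squared of at least 0.8. Real tag-selection methods are considerably more refined; everything here is illustrative.

```python
# Minimal greedy tag-SNP selection: one SNP per correlated "neighborhood"
# proxies for its neighbors at high r^2. Simulated genotypes, toy method.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical genotypes coded 0/1/2 (minor-allele count),
# 200 individuals x 50 SNPs, built as 10 blocks of 5 correlated SNPs.
base = rng.binomial(2, 0.3, size=(200, 10))
geno = np.repeat(base, 5, axis=1)
geno = np.clip(geno + rng.binomial(1, 0.05, geno.shape), 0, 2)   # add noise

r2 = np.corrcoef(geno.T) ** 2        # pairwise LD as squared correlation

def greedy_tags(r2, threshold=0.8):
    untagged = set(range(r2.shape[0]))
    tags = []
    while untagged:
        # pick the SNP that proxies the most still-untagged SNPs
        best = max(untagged, key=lambda i: sum(r2[i, j] >= threshold for j in untagged))
        tags.append(best)
        untagged -= {j for j in untagged if r2[best, j] >= threshold}
    return tags

tags = greedy_tags(r2)
print(f"{len(tags)} tag SNPs cover all {r2.shape[0]} SNPs at r^2 >= 0.8")
```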
The other thing that's helped hugely is this profound drop in genotyping costs over the course
of the last few years. When a genotype cost 50 cents, as it did back in 2002, the idea of doing a
genome-wide association study was just prohibitive. Now that it costs about a tenth of a penny,
things have really come around.
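The arithmetic behind that is worth a moment; with illustrative round numbers (a 500,000-SNP chip, 1,000 cases and 1,000 controls):

```python
# Back-of-the-envelope cost of one GWAS at the two per-genotype prices
# quoted in the talk (illustrative study size, not a real budget).
snps_per_person = 500_000           # a typical genome-wide SNP chip
people = 2_000                      # e.g. 1,000 cases + 1,000 controls

for year, cost_per_genotype in [("2002", 0.50), ("now", 0.001)]:
    total = snps_per_person * people * cost_per_genotype
    print(f"{year}: ${total:,.0f}")
# 2002: $500,000,000  -> prohibitive
# now:  $1,000,000    -> feasible
```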
And the results of that, as you can see in this diagram that was put together by Teri Manolio
and Lisa Brooks and myself for a review that's coming out next month in JCI, are really pretty
amazing when you look at the decoration of the human karyotype with newly discovered variants
associated with common disease. The first HapMap success, in 2005, was macular degeneration
and complement factor H. In 2006, three more of these were discovered.
There's that NOS1AP that Betsy mentioned for QT interval prolongation. And in 2007 everything
just broke wide open. This is just the first quarter of 2007 - the second quarter -
the third quarter, the fourth quarter, and this carries us up to about February 1st, and I need to
update this slide because it's already out of date. Every one of those banners represents
the discovery of a variant in a locus that plays a role in a common disease or a quantitative trait
in humans, and each one of those reveals a remarkable story, because most of them are really
quite unexpected in terms of which genes turned out to be involved in conferring risk to disease.
Let me say, however, that while this is a glorious kind of moment of discovery, what you're
looking at there explains only a tiny fraction of heritability for most of these conditions.
So a big question mark is: where is the rest of the heritability? Is it in those rare variants, like
the PCSK9 example that Betsy mentioned, which Helen Hobbs has shown us for LDL? Is it in those
copy number variants - large stretches of DNA which may be deleted or duplicated, which are
more difficult to test but which are increasingly coming within our grasp, especially with
high-throughput sequencing using paired-end reads? Is it that there are gene-gene interactions
that we have not been able to assess adequately? Are there gene-environment interactions such
that we're missing out on the goodies there? Or have we overestimated heritability for some
of these conditions? We don't know the answers to that. My own bet is that there's a whole
spectrum of rare variants undergirding the common ones, and now that we have sequencing ability
we should be able to begin to look at those as well. Will they be in the same loci as the common
variants? I don't know. We should be able to find that out. But I can tell you, as somebody who
has worked on diabetes for the last 15 years in my own lab over here in Building 50, that we have really
moved into totally new territory in the last year through the availability of these kinds of studies,
which have uncovered some 16 loci that are associated with that disease and which are shedding
entirely new light on our understanding of what the really fundamental molecular basis of the
condition might be. We still have a lot to learn, but we finally, I think, have blown away a
little bit of the fog. And all of this was recognized by Science, which named human genetic variation
the breakthrough of the year. As Betsy has already ably demonstrated - and again I would also
like to give great credit to Jim Ostell and Steve Sherry - the existence of the dbGaP database is
making all of this information rapidly accessible to qualified investigators, to begin to do the
hard work of understanding how this all works. Because, after all, what does a GWA success tell you?
If you achieve genome-wide significance - and you've got to be really careful about the statistics -
and if you've ruled out population stratification or some technical problem, then you can say
there is a common variant located somewhere in a segment - and the segment might be small or large,
depending on whether linkage disequilibrium is weak or strong in that area - that's associated
with modestly increased risk of disease. Modestly - I mean the odds ratios are often 1.2 to 1.3
in the particular population that you've studied - and obviously you don't want to stop there.
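To spell out what such a result looks like numerically, here is a small sketch with made-up allele counts: the odds ratio from a 2x2 allele table, checked against the conventional genome-wide significance threshold of 5e-8 (roughly a 0.05 Bonferroni correction for a million effective independent tests).

```python
# Odds ratio and significance for one hypothetical GWAS SNP.
from scipy.stats import chi2_contingency

#                risk allele  other allele   (allele counts, invented)
table = [[2400, 1600],   # cases:    2,000 people -> 4,000 chromosomes
         [2170, 1830]]   # controls: 2,000 people -> 4,000 chromosomes

(a, b), (c, d) = table
odds_ratio = (a * d) / (b * c)              # ~1.27, the "modest" range

chi2, p, _, _ = chi2_contingency(table)
print(f"OR = {odds_ratio:.2f}, p = {p:.1e}")
print("genome-wide significant" if p < 5e-8 else "suggestive, not genome-wide significant")
```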
That is a clue, and now you want to go forward. Betsy showed a different version of a very
similar diagram in terms of what one then wants to do - move down this diagram to try to
get both to the truth of understanding how this makes sense biologically and ultimately to translation.
Notice that one of the things that everybody wants to do almost immediately after they have
made such an association is to say, okay, what are all of the DNA sequence variants that might
have driven that association, because I want to know that whole set. I know there's something
here in this stretch of DNA that differentiates cases and controls, but what is it? Because they're
all traveling together, and the database of human variation - dbSNP - is not yet complete.
And so, unfortunately, many people are then forced to go and do their own sequencing through
such an interval, in a fashion that is often inefficient, sometimes error-prone, and certainly
more expensive than it should be. So in order to try to deal with that issue and supply a catalogue
that would speed up this process and eliminate the need for every group to do this all by themselves
we have just begun an effort, with major participation from the Wellcome Trust and from the
Beijing Genomics Institute, to try to deepen that catalogue of human variation by sequencing
roughly 1,000 genomes at draft coverage, 2 to 4X, in order to be able to say that we have
essentially discovered the variants in the human genome that have a frequency of one percent
or greater. That would be very helpful in terms of accelerating the ability to follow up on these GWA studies.
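The rough feasibility math behind that design - why 1,000 genomes at low coverage suffices for variants at one percent frequency - can be sketched as follows, with illustrative numbers rather than the project's actual power calculations:

```python
# Why 1,000 low-coverage genomes can catch essentially all 1% variants.
import math

def prob_allele_in_sample(freq, n_diploid):
    """Chance at least one copy of the allele appears among 2N chromosomes."""
    return 1 - (1 - freq) ** (2 * n_diploid)

for freq in (0.05, 0.01, 0.001):
    print(f"freq {freq:>5}: present in 1,000 genomes with prob "
          f"{prob_allele_in_sample(freq, 1000):.4f}")

def prob_carrier_shows_variant(depth=3.0, min_reads=2):
    """Poisson chance that >= min_reads reads sample a heterozygote's variant allele."""
    lam = depth / 2            # half the reads come from the variant chromosome
    return 1 - sum(math.exp(-lam) * lam**k / math.factorial(k)
                   for k in range(min_reads))

print(f"per-carrier detection at 3X: {prob_carrier_shows_variant():.2f}")
# Any single carrier may be missed at 2-4X, but a 1% allele has ~20 expected
# carriers among 1,000 people, so the chance of missing all of them is tiny.
```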
There are no phenotypes associated with the DNAs in this situation, and a major reason for that is
that we wouldn't have enough power with any particular condition to draw any conclusions about
genotype/phenotype association in a set of 1,000. Furthermore, these are samples that have
been adequately consented, so that the data can be put completely on the Web without any
barriers at all. This is one of those rare examples where the information will be completely in
an open database, but the absence of phenotypes adds to the confidence that these individuals
are not facing a risk by having agreed to that. This will be happening over the course of the next
year and a half. This will involve not just looking for SNPs, but also looking for copy number
variation - again by doing sequencing with paired-end reads, to be able to identify that in a
systematic way.
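The paired-end logic is simple enough to sketch: read pairs come from DNA fragments of roughly known length, so pairs whose mapped ends sit much farther apart than expected span a deletion, and much closer pairs suggest an insertion. A toy illustration with invented numbers:

```python
# Flagging discordant read pairs as copy-number/structural-variant evidence.
mean_insert, sd = 500, 50       # hypothetical sequencing-library statistics (bp)
mapped_distances = [498, 510, 2450, 505, 492, 2480, 2465, 501]

for i, d in enumerate(mapped_distances):
    z = (d - mean_insert) / sd
    if z > 3:
        print(f"pair {i}: {d} bp apart -> likely spans a deletion")
    elif z < -3:
        print(f"pair {i}: {d} bp apart -> likely spans an insertion")

# A real caller requires a cluster of discordant pairs at the same locus,
# plus read-depth corroboration, before declaring a CNV.
```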
But of course that's just part of the problem. The big issue is going to be functional analysis,
and I think here is a place where individuals who are particularly motivated to try to understand
a particular locus, and who can bring the full power of every kind of functional assay to bear on it,
are going to make most of the major insights.
I do think there are one or two places, though, where we could accelerate that. One is, of course,
having access to tools like mouse knockouts and siRNAs and small molecules.
Another would be to try to have a resource that better enables us to detect the correlation
between gene expression and genotype, which at the present time has been done for
lymphoblastoid cells, but not so much for other tissues. Imagine that you have done a
GWA study and you're trying to figure out, well, okay, I've got a signal here. I know there's
something in this region that is associated with disease risk because here is my interval
and here are the SNPs that are giving me a statistically significant result, and that one's a little
better than the others. This is a pretty typical result, although this is a cartoon. You then go
and look at this interval to say, okay, what locus might be driving this, and fairly often
you find that you are landing in a region that's fairly gene-rich, and so you
have multiple loci that might in fact be involved. Well, any one of those could in fact be
driving your association. Oftentimes, you might have a hunch based on a biological argument
that one makes more sense than the other, but you should be worried about those hunches
because they often turn out to be wrong. So what would help you here, it seems, would be to be able
to say something about whether this SNP actually has a cis-acting effect on expression of one
of these genes, because if it does, that's going to increase the likelihood you've got the right
locus, and it may even tell you whether the effect of the association is over-expression or
under-expression of the locus at hand. So here is, in fact - and this is basically what was done
in that very nice study that Betsy mentioned from Cookson's lab on asthma, where they did this
using lymphoblastoid cells and discovered that ORMDL3 was the right locus - here is the idea.
You would need to have an appropriate tissue in order to be able to assess the effect of this
SNP on cis-acting gene expression. You would then look at the level of expression for the
three genotypes at that SNP, and you'd look to see whether there's any relationship or not.
You would not expect to see a dramatic difference, but here in this instance you can see that
one of these genes in the cartoon does have a relationship to the genotype at that strongest SNP,
and the other four don't. Now, in this situation I drew this as though you were actually testing
tissue from diseased individuals, but frankly it doesn't matter, because what you're really looking for
is whether there's an effect of that particular genetic variant on expression of this gene in a cis-acting way.
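At heart, that test is a regression of expression level on allele dosage. Here is a minimal sketch on simulated data - gene names, effect size, and sample size are all invented - in which one of five genes carries a cis effect at the SNP:

```python
# Minimal cis-eQTL test: regress expression on genotype dosage (0/1/2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 300
genotype = rng.binomial(2, 0.3, size=n)        # copies of the associated allele

# One gene tracks genotype (a cis effect); four nearby genes do not.
expression = {"gene_A": 10 + 0.8 * genotype + rng.normal(0, 1, n)}
expression.update({f"gene_{g}": 10 + rng.normal(0, 1, n) for g in "BCDE"})

for gene, expr in expression.items():
    slope, _, _, p, _ = stats.linregress(genotype, expr)
    flag = "  <- candidate cis-eQTL" if p < 0.05 / len(expression) else ""
    print(f"{gene}: slope={slope:+.2f}, p={p:.1e}{flag}")
```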
And so if one were to look at normal tissue - and this was done in that Cookson paper -
you get the same answer. So what does that say? It says we should in fact generate, as soon as
we can, a database of human tissues: pick maybe our 50 favorite organ sites, sample perhaps
a thousand tissues from different individuals for each of those organ sites, then do your most
careful quantitative gene expression analysis - which these days will be done by sequence tags,
not by hybridization - and then also do genotyping across the genome for each sample.
And in fact that is a proposal which is being floated around as part of the NIH Roadmap competition.
Obviously, there are huge challenges in terms of sample acquisition, but based upon experiences that
we're learning about from other groups that have tried to do this on a somewhat smaller scale,
this sounds like it might be possible through very rapid autopsies, which could harvest
many tissues at one time from the same individual, cutting back the amount of genotyping you
would have to do. I am sure this is a database that would teach us a lot about cis-acting
regulation of gene expression, and it would be particularly valuable for this purpose.
It would probably also tell us about trans-acting effects, because if you have a variation in a
transcription factor, you might expect to see that resulting in numerous target genes
being up or down a little bit, depending upon the abundance or the timing of expression
of that transcription factor, mediated by its cis-regulatory signals. That, of course, is going to be
a more statistically challenging problem, because you have to look across the whole genome,
as opposed to the local environment for a cis-acting effect, which is one reason one probably
needs a thousand different human donors for each tissue if you're going to have sufficient power.
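A crude power calculation illustrates why the trans search is so much hungrier for donors than the cis one; this uses the standard Fisher-z approximation for detecting a correlation, with an invented effect size and a Bonferroni correction for the number of tests:

```python
# Donors needed to detect a correlation r after Bonferroni correction.
import math
from scipy.stats import norm

def donors_needed(r=0.15, alpha=0.05, n_tests=1, power=0.8):
    """Fisher-z approximation for required sample size."""
    z_alpha = norm.ppf(1 - (alpha / n_tests) / 2)    # two-sided, corrected
    z_power = norm.ppf(power)
    z_r = 0.5 * math.log((1 + r) / (1 - r))          # Fisher transform of r
    return math.ceil(((z_alpha + z_power) / z_r) ** 2 + 3)

print("cis test  (~10 local genes):  ", donors_needed(n_tests=10), "donors")
print("trans scan (~20,000 genes):   ", donors_needed(n_tests=20_000), "donors")
```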
So this kind of thing would help us a lot, I think. Well, let me move on quickly - because I see
the yellow light has gone on - to some other applications in the clinical arena, which I think
are expanding rapidly, and here is perhaps a familiar diagram of the way in which all of this is going
to play out. I think we're doing pretty well in terms of finding common genetic risk variants
for common disease. Obviously a lot of heritability yet to be discovered, but as we get into the
position of being able to figure out who's at risk in circumstances where we have preventive
medicine strategies, that will be clinically appealing, and that of course is already possible now
in some cases of cancer. The pharmacogenomics approach, very nicely outlined by Betsy
in her description of what NHLBI is getting ready to do with warfarin - to try to demonstrate
in a rigorous, controlled fashion whether this does in fact improve outcomes - is potentially
going to be a very exciting application of genomics as well. And ultimately, the therapeutic
developments that we all hope for may be the biggest benefit, but they will be the longest in coming,
because of that long pipeline between getting an idea and having the FDA approve your drug.
One of the things, though, that I would like to quickly touch on is this common set of misconceptions
that I regularly hear about whether these GWAS discoveries are going to have any value
in terms of identifying new drug targets that we didn't know about and starting things into
the pipeline that might be truly novel approaches to common disease. Because people look at the
results and say, well, you know, you found a variant, but it only has an odds ratio of 1.2.
How could that possibly matter in terms of providing you information about a new drug target?
And similarly, people will say, yeah, but if you develop a drug based on that particular
discovery of a GWA signal, then it will only work for people who have the risk allele.
Both of those suppositions are in fact false, but they do immediately sort of attract attention.
And let me just point out a good example of how this can't be right, and this comes from my
own most familiar disease of type II diabetes. When we got together - my group, the group
at the Wellcome Trust, and the group doing this at the Broad, the DGI; my group's effort was
led by Mike Boehnke, the statistical geneticist at Michigan - these were the ten genes that
we together published about nine months ago in Science, and there were in fact a few of them
known before, but most were not. And interestingly, of these ten genes derived from this
genome-wide association study approach - these are not candidates; they're what came out of
the search - two of them, KCNJ11 and PPARgamma, are well-known targets of the mainstays
of diabetic therapy if you want to go beyond insulin: KCNJ11 codes for one of the subunits of
the sulfonylurea receptor, and PPARgamma codes for the target of the thiazolidinediones.
So here we have a proof of principle that if you hadn't already known that, you would have
discovered in this first pass at least two very exciting drug targets, and maybe there are others
here as well. Similarly, recent studies that we've been involved in, looking at lipid levels,
have come up with a whole bunch of newly discovered loci for LDL and HDL and triglycerides.
Interestingly, one of the ones that turns up for cholesterol is HMG-CoA reductase, of course,
the classic target of statin drugs, which hadn't previously appeared to have variants that played
much of a role in lipid levels, but when you have very large studies, you have sufficient power
to see that. So there's another one that would be awfully valuable if we didn't already know about it,
and doesn't it seem likely, therefore, that there are others on these lists that will also make
very exciting drug targets once we sift through, try to understand their function, and
figure out how to take that approach? Finally, let me finish by saying that all of these exciting
developments in the science - and they really are exciting - mean that we need to pay even
more attention than we ever have to the ethical, legal, and social issues, because otherwise
the benefits that we all hope the public will enjoy may end up being truncated by misuse of
the information that causes people to be injured or to just stay away from the opportunity.
And this is only accentuated by the fact that personalized medicine and genomics has suddenly
become a market opportunity for direct-to-consumer companies like the three that you see here.
All three of these will offer you, for between $1,000 and $2,500, an analysis of your genome
covering something in the neighborhood of half a million to a million SNPs, and then will
give you feedback in terms of what your predicted risks are for a long list of conditions,
as well as information about ancestry. And this is an interesting development and actually one
that I think causes many of us to be a bit anxious, because there is great potential for confusion,
great potential here also for health behaviors to be modified in irrational ways, and potential
for discrimination if people are not careful about how this information finds its way into
health insurers' hands or into the hands of workplace deciders who may think that this is
something you ought to know about before you decide whether to hire, fire, or promote.
That would at this point be unjustified on scientific grounds, and it would
certainly be unjustified on the basis of principles of equity and justice. A recent piece from
last Friday's Science, from Kathy Hudson and colleagues, makes a strong case that this is not
a stable situation and deserves a higher level of oversight than is currently being offered,
and goes through the case for that using CYP450 testing for SSRIs. And, of course, in terms of
the policies that would be most beneficial to at least eliminate the genetic discrimination risk
we are still waiting for a successful outcome after more than a dozen years of hoping to see
federal action on this. Here is a bill, introduced in 2003, that passed the Senate unanimously
in the 108th Congress, but the House took no action. That bill was then re-introduced in the 109th.
The Senate passed it unanimously. The House did not bring it to a vote. We're now in the 110th
and both Senate and House have taken it much more seriously and the House passed the bill
this time 420 to 3 just about a year ago. The Senate has not acted upon it. It is currently tied up
as a result of a hold that's been placed on the bill by a single senator, and as time is clicking by
the likelihood that this bill will get passed this year seems to be growing fainter by the day.
And this is truly a frustrating experience when you can see on a research basis how much harm
the absence of such protection is doing to our ability to do studies and how the solution actually
at this point seems to be fairly clear and actually embraced by many people. So there you have it.
A romp through a whole lot of components of what has happened and what is happening now
and what we hope will happen in the future with this approach to the genome, much of it
catalyzed by the wonderful folks at NCBI who had all of these remarkable abilities to archive
and display the data and in fact think about it. So Sidney's comment from this morning - which
I think was the phrase that was worth a thousand PowerPoints - was that it is better
to combine human intelligence with artificial stupidity than to do it the other way around;
he was referring to computers, but I'd like to put Congress on that list, too. And I guess the credit
that we should all give to David and to Jim and to all of their wonderful colleagues down
through these years as the overseers of NCBI and of GenBank, is that they have in fact provided
an awful lot of human intelligence to make this whole process go as well as it has. So my image
seems to be somewhat similar to Betsy's, although we didn't collaborate on this - fireworks
are appropriate. Happy birthday to GenBank. Congratulations to David and Jim. Thank you all.
[Applause]
Be glad to take any questions.
Can you go to the microphone? Because otherwise people on the Web can't hear.
A question about your closing comment. Is this a product in Congress of intelligent something?
Since I'm a federal employee, I should be very careful about how I answer that. First of all
I guess we should give Congress the credit for actually having started the Genome Project,
so there's a pretty important thing to point out. In this particular instance of genetic discrimination
I'm afraid that the wise-heads have not necessarily carried the day, but we all hope that they still will. Yeah.
My question may be somewhat naive, being a young scientist, but I was in a discussion recently
and we were discussing the difference that you mentioned about full sequencing versus
genome-wide scanning and the benefits of each, and I was wondering, for example, if there was
a new mutation that brought about a disease, how long would it take before genome-wide scanning
would be able to pick that up because of linkage disequilibrium?
Collins: So when you say scanning, I assume you mean sort of SNP-genotyping, GWAS kinds of studies.
Well, no, it's a very appropriate question, because GWA studies really can only detect a variant
that has a reasonably high frequency, that is at least three or four percent, so if you have a rare
variant that's less common than that, GWAS will utterly fail you; you won't be able to discover that.
You're basically depending upon common variants to play some role in common disease,
and I think we can now say they do, but we can also say they by no means explain more than
a small fraction of the overall heritability, and maybe the rest of that is then hiding down there
in those rare variants. Obviously, if we had the ability to sequence the entire genome, instead of
doing GWA studies, we would do so, and we'd be happy to give up GWAS for all time,
if we could afford to go after the entire sequence. And we're moving in on that, and certainly
for some diseases already, there are efforts underway to start sequencing all the exons,
figuring that that's probably a pretty good place to go and look if you're looking for particularly
severe conditions. But, ultimately, we'll want the whole genome, and let me say that means
we're going to be doing complete sequencing of hundreds of thousands of human genomes
in the course of the next few years, especially if these new technologies make that possible.
And not only is that going to be a huge problem in terms of storage of the data and display of the data,
it's going to be a huge problem in terms of the analytical capability of those wise computational
statistical folks who are going to have to make sense out of all this, because there's going to be
lots and lots of rare variants discovered that have nothing to do with the phenotype you're
interested in, but they're along for the ride in that genome, and if you're not careful, you could
make a very serious mistake in terms of deciding what's cause and effect and what's just the noise.
So we have, I think, more than ever the need for a generation of computational biologists to also
be human geneticists to help us through this next very exciting phase of really getting the whole
spectrum of how heredity plays a role in health and disease. Thank you. [Applause]