Trent Lecture - Genomic Information - Eric Lander

[applause] Dr. Eric Lander: Oh, wow. I want to say a couple of thank yous and a couple of things. First to Jeff Trent, it is a tremendous honor to come and give the Trent Lecture. I think it's great naming lectures after people while they're still alive. It's better than coming and giving the Trent Memorial Lecture, to give the Trent Lecture while there's still a Trent to enjoy it, and so I salute you for honoring Jeff for the wonderful thing he did in starting the intramural program in NHGRI. I very much want to also salute the scientists of NISC, Eric Green and all of the people who have worked with NISC. At the beginning of the day, they stood up and we saluted them, but many more people have come and gone over the course of the day and there's a bunch of new people, so if it's okay, I would like to ask the fantastic scientists who created NISC, who continue to run NISC, and who have done this amazing thing by ensuring that the world's best biomedical campus, the NIH, has a world-class sequencing center. So if I could ask all the people from NISC who are here today, because many of them are, to please stand. I think we want to salute you again. [applause] We have had the pleasure of working with NISC on many, many projects, admiring many other projects, their role in the mammalian genome projects and ENCODE and the mouse genome and cancer genome anatomy, and in brainstorming many projects still on the drawing boards and soon to get underway. So well done, happy 10th birthday, and we look forward to many, many more and watching the impact that you have on the NIH and the impact you have on the world continue to grow.
I just want to thank all the preceding speakers who are good and close friends from the world of genomes, Claire and Richard and Rick and Wylie and Rick and Andy and Evan and David, and many of the things I'll touch on are things they have already touched on, because, in fact, we're all interested in this broad common world of what you can learn from genomes. So, since you'll have heard bits and pieces of all of these amazing ideas from people over the course of the day, what I'm going to try to do is draw it together, run a thread through it, and really address the question of genomic information and what we can learn from it. Because I think the single greatest change over the course of the last 20 years or so in biology is the recognition that biology is, yes, the study of organisms and, yes, the study of molecules and things, but that at its very core, it is about information, and that there is genomic information. By genomic I don't necessarily mean the DNA, I mean genome-scale, comprehensive, complete information about all the components of the cell, DNA, RNA, proteins, modifications thereof, and that by laying out all of that information, we can transform the sort of questions that we can address. All of the speakers today have shown beautiful examples of that and I'm particularly delighted to see so many young people, postdocs, graduate students in the audience because this is the world you guys are inheriting. This world where it is not just about the experiments on your own bench but the experiments of the entire world laid out before you to pick through and figure out how to extract the information from. So that's the theme. And I'm going to touch on many different forms of genomic information, if I can. But the granddaddy of all genomic information projects, of course, was this Human Genome Project. It taught us some very important things. It taught us it wasn't a bad idea to lay out some clear goals.
Goal directed science had a bad name originally but the idea that if we thought clearly and had some things we had to get and we could define some goals, wouldn't be bad to lay out those goals and try to go for them and hold ourselves toward them. It also taught us that if we're making a project about information, it was absolutely crucial that that information be completely and freely and immediately available to anybody because it was simply absurd that the people who were producing the projects were the only ones who could use it well. We needed to enlist the ideas and the creativity of everybody around the world in any country, in academia or in industry. And so that was an important lesson that emerged from it. We learned the importance of laying out concrete plans, timelines. There was a plan and timeline laid out for the Human Genome Project over the course of 15 years and actually pretty much worked according to plan. There were, you know, lots of innovations along the way, but there was a sensible plan and we learned how to plan together, including planning in the face sometimes of huge uncertainty. And we learned the importance of collaboration. The importance of international collaboration. The Genome Project, again, as a kind of granddaddy here involving six countries, 20 centers but every project that we talk about has been an international project involving many groups in the United States and many groups in other countries in this ever-changing mix of centers helping one another to stay at the edge. In the case of the Human Genome Project, as you all know, a rough draft sequence came out in 2001, a finished sequence came out in 2003. There was another little lesson there, finished. Finished is a technical term in the world of genomics. It means the vast majority of, but there's still 300 gaps and that's okay, we're aware of it. Absolute completion shouldn't be the enemy of getting the vast majority of the information out. 
And there are many things we can state and have stated where we can't quite get the last little bit out, but we can get the first 95, 98 percent out and we should get it out in the hands of as many scientists as possible. And of course, what's been the impact of it? Well, it's laid out before us the landscape of a human genome. It's a beautiful landscape with all of these interesting mountains and valleys, gene-dense regions, gene-poor regions, all sorts of these striking things. But the real test has been its impact on medicine. When the Human Genome Project started, there had only been about 70 diseases that had been identified molecularly, single monogenic Mendelian disorders that had been identified before the Human Genome Project. With the tools that have emerged during the course of the Human Genome Project we're now up to some 2,600 Mendelian conditions for which we know the guilty gene, and people can study them in great detail. So that was all fun but that's past history, it was the Human Genome Project. What about beyond the Human Genome Project? What is the agenda today? What are the sorts of things that genome centers, that people around the world are trying to ensure that we have and have freely available on the Web for everyone? Well, the Human Genome Project had a goal, know all the sequence in the human genome. All is in italics because it means, you know, the vast majority, and don't give me a hard time about the last percentage or so. Here's some other things. We need to know all the genetic variation in the human population and its relationship to disease. We need to know all the functional elements in the human genome. We have been hearing about these things already from the speakers today. We need to know all the signatures of cellular responses. Cells only know how to do a limited number of things.
I don't know if it's 500 or 5,000, but there's a limited number and we're going to be able to recognize what those things are by some reduced signatures of cellular responses. We need to be able to modulate all the genes in the genome. We need to know all the mechanisms of cancer and we need to know similar information about the genomes of all the major infectious agents. That's a good to-do list, and that is the to-do list not for the 21st century, for goodness sakes, this is the to-do list for the next ten years. And indeed, for those of us involved in this, we know that on more than half the stuff on this list there's already been great progress and we can begin to start putting checkmarks next to things on this list, because we're quite far along on them, and there's nothing on this list, I think, that should take us more than the next decade or so with the appropriate interpretation of the word “all.” There will still be things to discover 30 years from now and all that, but to get the vast majority of it out there. It is helped by, as is one of the themes in this symposium, the continuing innovation in technologies. The Human Genome Project was helped greatly by the appearance of first fluorescent sequencing and then capillary sequencing, and then we've had the appearance of all sorts of next generation and next generation and next generation sequencers. These 454s and Solexas and SOLiDs and Helicoses and others, and I won't fuss over what their throughputs and read lengths are because they're changing every day as people are continuing to improve these machines. But one is getting up to points of gigabases per run or perhaps two gigabases per run. I've heard of four gigabases per run on some of these platforms, and there seems to be no reason why those things can't be achieved. So I want to turn to the topics I was talking about, human genetic variation.
Let me take that one first and just describe what has been just a remarkable, remarkable period since the Human Genome Project. Now, as various speakers have referred to, there is a fair amount of polymorphism in the human population. It's actually not that large compared to most mammalian species, they are more polymorphic than we are, but we have about one heterozygous base per thousand bases or so, or 1300 bases, in the human genome. And if I take a random heterozygous base in you, the probability is greater than 90 percent that it's shared with other people in this room. That is, the vast majority of the variation in you is common genetic variation. It's not these rare Mendelian things that are private mutations, the vast majority of what you have got is common genetic variation. And what does it do? Well, we know some examples. As has already been referred to, apolipoprotein E has a common genetic variant, widely referred to, that confers risk of Alzheimer's disease. We have got some other examples of a common genetic variant, in CCR5, that confers protection against HIV. But we really had no systematic way of looking at what might be the medical implications of common genetic variation. So in 1996, several folks, myself included, began to get very interested in the idea, even before we had the sequence of the human genome all tidied up, in fact, before we even had most of it, in the idea that we needed more than a sequence. We really needed to understand all the common genetic variation in the human population. Well, simple back of the envelope calculations could tell you that there are about 12 million common genetic variants, and the hallucination was this, that one might be able to simply write down all the genetic variants along the top of an Excel spreadsheet, write down all the diseases along the side of the Excel spreadsheet, and human genetics might reduce simply to saying which genetic variants were enriched in which diseases.
That would be very nice. It was also kind of a nutty thing ten years ago to think about that, because it implied having 12 million genetic variants, and we had nothing close to that. It implied being able to genotype these 12 million genetic variants in thousands and thousands of patients. And mind you, near completeness was necessary. If you only could do ten percent of it, well, you'd only catch ten percent of the things you were looking for. You really had to get the whole thing. But as these kinds of genomic information projects have taught us, put one foot in front of another and consistently you may be able to build to these goals. To indicate just how poor the information resources were when we started, one could publish, in fact, we did publish a paper in 1998 entitled "Large Scale Identification of SNPs" that could report 4,000 SNPs and call it large scale. That was just an indication of where we were at that point. But through efforts like this and others, the idea came along that we should be able to collect SNPs in a systematic fashion. A public/private consortium was put together, the SNP Consortium, in 1999, with what sounded like an ambitious goal, 300,000 SNPs across the genome. That proved quickly to be under-ambitious as the SNP Consortium within two years reached 1.4 million SNPs. And then as the Human Genome Project came rolling along, it was quickly increased to two million SNPs, three million SNPs, blah, blah, blah, eight million SNPs, something like 10 million SNPs now. The vast majority of the common genetic variation in the population is already in the public databases. If we find a heterozygous site in you, we know empirically that the odds are very good it is already in the databases. Now, the problem was still how are you going to type 10 million SNPs across each patient? Could you get away with less somehow, without sacrificing the information?
Well, here some of the ideas from Mendelian diseases became very helpful in organizing the thinking. Some of the Mendelian diseases that occurred in isolated populations with single founder chromosomes reminded us that every mutation occurs on a single ancestral chromosome that has a bunch of polymorphisms on it, and as it's passed down through the generations, recombination whittles away the markers at far distances, but nearby, you still have strong correlation amongst the markers that are there. You still have linkage disequilibrium. And you could use it for mapping, for example, in places like Finland without even families. Just looking across a population of Finns with a rare genetic disease, you could map it by linkage disequilibrium, that signature of ancestral chromosomes. A very important paper from Mark Daly showed that even in a general European population in Toronto, you could, if you were up close and personal, detect that linkage disequilibrium. Then he found in a population of patients with Crohn's disease that there was a highly stereotyped pattern of blocks of genetic markers that hung together so well that you only needed a couple of those genetic markers to serve as a proxy for the entire block. And so that gave rise to this notion that if we only knew that correlation structure across the genome, the haplotype structure across the genome, we'd be able to pick out a mere 300,000 or 400,000 genetic markers and trace inheritance this way. Well, from a random proposal there of "wouldn't it be good to do that," the community swung into action, and within a year a haplotype map project was launched. Again, the same pattern involving multiple countries, multiple centers, clear goals, free information sharing. And by 2006 it was largely completed and that nice correlation structure is quite evident in this correlogram here across a tiny region of the genome, but the slide goes on all the way across the NIH campus.
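The proxy-marker idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by the haplotype map project: the haplotypes are made-up toy data, the function names `r_squared` and `pick_tags` are hypothetical, and the 0.8 correlation threshold and the greedy selection rule are choices made for this example.

```python
from itertools import combinations

def r_squared(haplotypes, i, j):
    """Squared correlation (r^2) between biallelic markers i and j.

    Each haplotype is a sequence of 0/1 alleles, one entry per marker.
    """
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n          # allele-1 frequency at i
    p_j = sum(h[j] for h in haplotypes) / n          # allele-1 frequency at j
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n  # both alleles together
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # a monomorphic marker carries no correlation signal
    d = p_ij - p_i * p_j  # linkage disequilibrium coefficient D
    return d * d / denom

def pick_tags(haplotypes, threshold=0.8):
    """Greedily choose markers so every marker has a proxy with r^2 >= threshold."""
    m = len(haplotypes[0])
    proxies = {i: {i} for i in range(m)}  # each marker is a proxy for itself
    for i, j in combinations(range(m), 2):
        if r_squared(haplotypes, i, j) >= threshold:
            proxies[i].add(j)
            proxies[j].add(i)
    uncovered = set(range(m))
    tags = []
    while uncovered:
        # Take the marker that stands in for the most still-uncovered markers.
        best = max(sorted(uncovered), key=lambda s: len(proxies[s] & uncovered))
        tags.append(best)
        uncovered -= proxies[best]
    return tags

# Toy data (hypothetical): markers 0-2 form one block, markers 3-5 another.
haplotypes = [
    (0, 0, 0, 0, 0, 0),
    (1, 1, 1, 0, 0, 0),
    (0, 0, 0, 1, 1, 1),
    (1, 1, 1, 1, 1, 1),
]
tags = pick_tags(haplotypes)
print(tags)  # [0, 3]: one marker per block is enough
```

In this toy case six markers reduce to two, which is the same logic, at a vastly smaller scale, as reducing millions of common variants to a few hundred thousand genotyped markers.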
Then, you also needed technologies to genotype. Even 300,000 is a big number, but here a variety of different ideas in both the private sector and the public sector came together to allow multiplexing of one marker, ten markers, a thousand markers. By last year, half a million genetic markers were being simultaneously genotyped on DNA chips. It's up to a million this year. And so suddenly, one had to put up or shut up. One had to actually say, you had the genetic variation of the human population, you had the tools for genotyping across people, why not do it? And many groups around the world have been doing just that for the past year. And it has been an annus mirabilis, 2007, a year of miracles. Just to give you a graph here of the confirmed common variants involved in common disease: 2000, a single, very interesting report of PPAR gamma and type 2 diabetes. Crohn's disease, published in 2001. Another diabetes gene in 2003. Age-related macular degeneration in 2005. 2006, several more. 2007, through April, when the tools became available, through August, through September. I don't have October, I'm getting tired of continuing to remake this slide here. And it's going to have a lot of trouble fitting on by the end of December. But it's clear that there is an extraordinary explosion right now of disease gene associations, disease associations of common genetic variants. And why is that? It's because of the continued investment in infrastructure. In building the tools in human genome projects, SNP consortiums, HapMap projects, genotyping arrays, it's the NIH behind many of these things. It's the private sector behind many of these things. It's private/public partnerships behind these things. But it's the willingness to actually roll up sleeves and create that infrastructure and then make it broadly available to a community. What are we learning from these sorts of findings already, in what just has been about a year of this?
Well, with regard to the common disease, common genetic variant idea, we learned it works. You can find lots of them and the significance levels are extraordinary. Ten to the minus tenth is hardly impressing anybody these days. There are ten to the minus 60ths, ten to the minus 120ths that are significant. We're learning that the vast majority of the genes that play a role are not the genes that were prior candidates on anybody's list. It's perhaps no surprise, we knew this from the Mendelian diseases, we were bad guessers. We're bad guessers about the common diseases as well. And we're also learning that many of the risk factors are not in coding sequences. They are noncoding. They are probably regulatory sequences. So it's no shock: we have already heard from the speakers that a significant fraction of the functional stuff in the human genome is noncoding. Well, a significant fraction of the variation that affects disease is noncoding. We have our work cut out for us to understand it, but it's in the population, it does affect risk and it's probably going to be a very good handle into what these things do. It's revealing new pathways, the complement pathway in macular degeneration, autophagy involved with multiple loci in inflammatory bowel disease, beta cell function, and in particular, all sorts of new things, zinc transporters, et cetera, in type 2 diabetes. It's revealing connections between diseases, already referred to this morning, chromosome 9, this interesting region that has a myocardial infarction risk factor and a type 2 diabetes risk factor, very close to each other. What does that mean? They're not the same, they're a little bit apart but very, very close. We're learning that the effect sizes may be modest but they may be very important, PPAR gamma. It's only a 1.3 fold increase in your risk but it happens to be a drug target for a drug that's useful in type 2 diabetes.
We're learning that some of these markers, for example, in type 2 diabetes, again, can be very useful in a clinical sense of identifying which prediabetic patients will benefit most from early interventions. We're learning about ethnic variation and health disparities, about 8q24, a risk factor for prostate cancer that is present in all populations but in higher frequency in African Americans and may explain the somewhat higher frequency of prostate cancer in African Americans. We're learning that it's often hard to find the specific gene, the specific allele, and a lot of work is going to be needed for that. We'll come back to them. We're learning that more is more. Larger sample sizes will yield even more. I can tell you stories from inflammatory bowel disease that Mark Daly tells me, that the first thousand or so patients identified six loci, but when three different groups pooled their data to get 3,000 or 4,000 patients, they're now up to something like 30 highly significant loci that come with larger sample sizes. We are learning that there's still much more of the genetic variance to explain. We've explained maybe 50 percent of the variation for macular degeneration but perhaps five percent of the variation for type 2 diabetes. Why? Is it we're missing the genes? Is it epistasis between them? Is it environment? Well, it's only been a year, nobody knows. The dust hasn't come close to settling. These are the sorts of questions. So what do we need? Well, what we've really learned is we've barely scratched the surface of this. We've scratched the surface probably of the genes and barely scratched the surface of the biology. What do we need? Well, three things. Larger samples, and more diverse populations. Most of the work has gone on in European-derived populations. We know that different alleles are at different frequencies and you'll spot different things, you'll have more power to spot different things if the allele frequencies are somewhat different.
And so African American populations will reveal different loci, not because there's fundamental differences but because the allele frequency fluctuations between populations make it easier to spot some things, and likewise Asian populations, Hispanic populations. This is essential to really being able to do the biology, as well as being able to investigate health disparities. Beyond that, as several of the speakers, notably Richard Gibbs, referred to this morning, we've only examined some of the range of genetic variation. We have looked only really with these genome-wide association studies at the genetic variants with frequencies between 50 percent and five percent. Polymorphism in the human population, the word technically means down to about one percent. Common variation in the human population, segregating variation, that is to say, variation common enough that if you got a thousand patients you'd see it multiple times, enough times to recognize that it was an increased risk factor, runs down another log below that five percent to at least half a percent, and yet the studies now are not powered to do that. We don't have catalogs even that run down there. And yet we know there's important stuff. Helen Hobbs' beautiful work on PCSK9 with variants in the range of two to three percent, common genetic variation but not yet assayed by the types of maps we're using. We need to have genome-wide projects. There are whole-genome discussions of thousand-genome projects to collect all that genetic variation so in this HapMap-type fashion we can exploit all of that to do common variation studies. For now, as regions come up, people are extremely interested in sequencing those regions to find the lower frequency variants. But here, since they are, in fact, common enough that we could collect them all, as Richard referred to, let's collect them all. And then of course, there are rare mutations. There are mutations that are private mutations and they can be very revealing, too.
Helen Hobbs has beautifully shown in a population of patients with low HDL that a couple of genes have just too many rare singleton mutations and that, too, is a signature. A signature that can't be caught by the common genetic variation, and we need the tools for that. And I'll take for granted, but Evan Eichler has made a very good point about that, that human genome also has much more than SNPs, it has copy number variation in these interesting repeated regions, and we need to be able to put all of that into this pipeline as well and look at the copy number variation across the genome. And for all of it there's a tremendous amount of sequencing that's going to have to go on in the next couple of years, but like with these other projects, I think it's guaranteed to give us the kinds of catalogs and tools we need to drive this problem home. At least to drive it home with regard to finding genetic variants. What do they mean? Well, we need tools to connect these genetic variants to physiology. We can't forget that by piling up 20 things that might be involved in inflammatory bowel disease, 20 things that might be involved in type 2 diabetes, that's of course, just the start. How are we going to keep up with that pace in the laboratory? Well, I want to turn to some of the things we need for that. So let's put aside all this human genetic variation and collecting it. I'm confident that can happen. What about breathing functional meaning into the genome so we can make sense of this human genetic variation, so we can connect it with disease? So I want to turn to a little bit about talking about all the functional elements in the genome. Well, there are two different ways that one can approach them that I'll at least mention, probably some others. Conservation maps, looking at the portions of the human genome that evolution has voted on as really mattering. 
And David Haussler has referred to this quite beautifully, that looking at the patterns of conservation across the genome one can learn a lot about what matters in the genome, even if the mouse knockout doesn't show a phenotype. If evolution tells you it's not willing to change that base, I go with evolution, it knows what it's doing. And then I also want to talk about chromatin state maps, a new kind of map that I think we want to collect a lot of and put on the Web. So let's turn to ways of annotating the human genome so we'll be able to make sense of some of these disease loci. So conservation maps. Clearly the first thing after the Human Genome Project was to get the mouse genome done, and many of the people in this room played crucial roles in that, including folks at NISC, in getting the mouse genome done. And then using that mouse genome by lining up the mouse genome with the human genome and with a few other genomes, the dog genome, the rat genome. Lining up just that first handful of genomes has revealed a number of important things. Genomic comparison has already revealed that the human gene catalog is very different than we thought. It's not the hundred thousand that was in the textbooks a decade ago. It's not even the 30,000 or 40,000 that we all wrote in the human genome paper back in 2001. It's not even, I think, the 25,000 protein-coding genes that are in the current catalog, that were in the current catalogs last year. In fact, comparative work from the handful of mammalian species, which Michele Clamp has very nicely shown in a paper coming out very shortly, indicates that the human protein-coding gene count is probably really in the neighborhood of about 20,000 to 21,000. But the current databases probably only have about 20,400 real protein-coding genes and much of the rest of the stuff is simply open reading frames that are spurious. And I don't have time to go into the arguments. And you can pick that out by comparison.
And the number of really primate-specific things is modest, measured in the hundreds, and they are the sort of things that Evan Eichler talked about, these very exciting gene families that are getting born. There is new stuff, but for the most part the story with protein-coding genes is paring them down and whittling them away. But even as the coding things are getting pared away, the noncoding things in the genome are really crying out for our attention. They're burgeoning. As you look across the genome, as various speakers referred to, we find that there are patches of conservation, clear conservation, ranging from these ultra-conserved elements to smaller binding sites that evolution has lovingly preserved, and that something like two thirds of all the stuff evolution has preserved is this noncoding stuff, covering about five percent of the human genome. We know in a few cases that they are regulatory elements because when you knock them out of a mouse you're able to see that it dysregulates genes nearby, but that's a pretty tough thing to do to annotate half a million elements. Half a million mouse knockouts is daunting even for me to contemplate, which is big. So the best way to really home in and clean this up is to increase the power of the data, first. With just the human and a mouse or a dog there's a limit to how much you could get, but evolution kindly made many mammals and by comparing more and more genomes, we're able to refine those signals, get rid of the noise, pull up the signal. And so various groups came together, but here I particularly want to credit the folks at NISC, collaborating with some folks at the Broad, for proposing a concrete program to sequence a large number, about two dozen, mammalian genomes. And that program the NIH launched, involving all of the sequencing centers, with elephants and armadillos and rabbits and bats and cats and hedgehogs and all that, and the project is essentially complete.
There are aspects of it still being tidied up, but the vast majority of these data are already freely available on the Web. David Haussler has referred to some of this already, and groups around the world are putting together all these two dozen sequences and saying, can we get down not just to 200 base pair conserved elements but 150, ten, can we pick out ten base pair elements, et cetera, and there's just been an explosion of interest in folks who are comfortable with both genomes and bioinformatics in squeezing out all of the information that evolution was kind enough to leave us from the experiment that's called the mammalian radiation. So, I'll give you some examples of things that come out if we're looking at genomes. Here's one. I'm fond of this one. If you line up many genomes and you start looking at what's conserved, you find a funny little site here, well, it's not that little, a funny site here that's present about 5,000 times across the human genome, and when it occurs it's very well-conserved. What in the world does it mean? So we used that. We took a biotinylated version of that piece of DNA and pulled down protein with it. We took cellular extract and bound it to the biotinylated sequence that contains that motif there, and found, when we pulled it down and flew it on a mass spec, the CTCF insulator protein, an insulator protein that blocks the spreading of gene expression. Only about three insulator sites in the human had ever been characterized, but suddenly, maybe the genome has given us 5,000 candidate insulator sites. How are you going to prove that they're really insulators? Are you going to go knock them all out? That's a lot of work. Turns out, again, genomic information can give you a very good clue, right away. Just take all the genes in the genome that are divergently transcribed.
If they're divergently transcribed and this thing is an insulator sequence, when there's an insulator sequence in the middle, those genes should have uncorrelated gene expression. If there's no insulator sequence, they should have correlated gene expression. Get the public databases, look at their gene expression patterns, it works. The guys who have this tend to be uncorrelated, the guys who don't tend to be correlated. So you can take that out of the information. Obviously, you want to go do biochemistry after that, but it's very nice to be able to do this because you can do this in an afternoon. Other things can come out. You can take the things that David refers to, these ultra, ultra-conserved sequences way out at the end, or a little less ultra-conserved, maybe super conserved or very conserved or something. Take the five percent most conserved sequences across the genome and see where they are across the genome. And when you do that, you find the following curious fact, that the most conserved noncoding sequences across the genome are not near genes. They're in gene deserts, gene-poor regions. But not no genes, just gene-poor. What genes are in those gene-poor regions? Developmentally important transcription factors. Almost every one of those 200 regions that have peaks of highly conserved noncoding elements is enriched for developmentally important transcription factors or axon guidance receptors. Half of that very conserved stuff is focused around these regions. They must be very interesting. What do they do? So we were curious about understanding what special was going on at these regions, and that led us into the second part of the work, chromatin state maps. Because we took a guess that maybe chromatin would be one way in which those loci were special. And so, we began to explore the chromatin structure of these funny regions, and I'll tell you about that now. Chromatin structure is enormously complex.
Histones have these tails that are decorated with all sorts of modifications, but for the moment I'll keep it simple and refer to only two histone modifications. One, lysine 4 trimethylation, which I'll color green because it's associated with active genes; and lysine 27 trimethylation, which I'll color red, because it's been historically associated with inactive genes. One can then go look, and what we did was use chromatin immunoprecipitation on a microarray, a DNA microarray for just these special regions of the genome. We began to explore the chromatin structure of those regions and we found that in mature cells sometimes they had the green mark, sometimes they had the red mark, sometimes they didn't have any mark, but you never see both together, which was consistent with the literature, that it was either an on or an off, until we looked at embryonic stem cells. And in ES cells we found a very curious phenomenon. Right around those developmentally important genes in these regions, we found that in embryonic stem cells they were marked with both red and green, both an on and an off mark, and yet were silent, as if they were poised for either activation or repression, according to which lineage they might go down into. At least that was our hallucination, there. Well, to really look at that in a serious way, one's got to expand to more cell types and expand to the genome. And as Rick Myers has already referred to, the idea of doing chromatin immunoprecipitation and hybridizing it to a DNA array is something that's so 2006, it's really not at all au courant. The right way to do it now is through chromatin immunoprecipitation, get the DNA and run it on one of these ultra high throughput sequencers that give you little reads, and you map them back to the genome. So we did that using a Solexa and the data are, as they would say, comparable. The top line is sequencing, the bottom line is a microarray, they look pretty much the same. 
And so we could do this across various cell types and for a variety of different chromatin marks, and I'll summarize a bunch of data for the following sort of questions. The question we really want to know deeply, we want to know, how does a cell decide to take up a career? When a cell decides to go from being an ES cell to a fully differentiated cell, it makes a variety of career decisions along the way. It loses potential. It makes commitment. We say that in developmental biology, but what do we mean by it? What are the molecular correlates of a cell being committed to do something, or having the potential, still, to do something? We don't really in developmental biology have a clear, crisp way to read out what career decisions have been made, and which lie ahead. So what we have been trying to do is study that with chromatin. And I'll give you a brief summary of where we're at, at the moment, and this will be slightly oversimplifying the data, but it's not a bad description of it. In embryonic stem cells, genes break up into three different categories. There are some AT-rich promoters and they're fickle. They come on, they come off in different cell types. They're very fickle and my sense is these guys here come on or off depending on whether there's a transcription factor to turn them on or off. Very fickle. 70 percent of the genes have CpG-rich islands and they're housekeeping genes and they're on all the time. 15 percent of the genes, somewhat more than just in those special regions, but highly enriched in those special regions, are these bivalent genes that start off in ES cells in this bipotential state of red and green and then in different lineages may go green or red, but we're finding now, sometimes, stay bipotential in some of those lineages. In which lineages do they stay bipotential? Stay bivalent? Roughly speaking, in those lineages that still have choices ahead involving that gene. 
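As a rough illustration of the red/green reading just described, here is a minimal Python sketch that turns the two marks into a state label for a promoter. It is not the analysis used in the work discussed; the function names, the enrichment inputs, and the cutoff of 2.0 are all hypothetical, and a real pipeline would call enriched regions from mapped read densities rather than from a single number per mark.

```python
def chromatin_state(has_k4me3: bool, has_k27me3: bool) -> str:
    """Label a promoter from the two marks discussed in the talk.

    K4 trimethylation is the "green" (active) mark and K27 trimethylation
    is the "red" (repressive) mark; both together is the bivalent state.
    """
    if has_k4me3 and has_k27me3:
        return "bivalent"   # poised: decisions still ahead
    if has_k4me3:
        return "active"     # green only
    if has_k27me3:
        return "repressed"  # red only
    return "unmarked"       # neither mark detected


def call_state(k4_enrichment: float, k27_enrichment: float,
               cutoff: float = 2.0) -> str:
    """Hypothetical wrapper: treat a mark as present when its enrichment
    over background reaches an assumed cutoff."""
    return chromatin_state(k4_enrichment >= cutoff, k27_enrichment >= cutoff)


if __name__ == "__main__":
    # Made-up enrichment values for one promoter in two cell types.
    print(call_state(5.1, 4.3))  # both marks present
    print(call_state(6.0, 0.4))  # green mark only
```

Comparing the label for the same promoter across cell types is then a simple way to see which genes have resolved to green or red and which remain bivalent.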
So if we're looking at myoblasts, neural cells and fibroblasts, and we're talking about a gene that's involved in hematopoietic cells, there are no more decisions to be made, it's made a final decision. But a gene involved in differentiation of some neurons still is bipotential here in a neuronal precursor, and a gene involved in differentiation of adipocytes but not other descendants of fibroblasts is still bipotential there. And so very roughly, and this is the happy thing, when you only have a limited amount of data you can make a very simple happy model. So the very simple happy model right now is this bivalent mark is an indication of decisions still ahead. As we collect more data the model will surely become more complicated, but happily, I don't know enough yet to complicate you with it. But that's kind of the picture. These chromatin state maps are very interesting. They're revealing all sorts of things. Here's a gene in embryonic stem cells. The coding region here is the protocadherin gene that has a zillion different promoters. And you can see in embryonic stem cells, every one of these promoters is marked as a bivalent promoter, independently with a green and a red, except that one, which is just green, and it's the one that's used in embryonic stem cells. Oh, we also put CTCF, that insulator, on this, and it nicely insulates between each promoter. You can pick out the microRNA genes. Here's a microRNA. It's very hard to figure out what the primary transcript is for a microRNA, but in fact, here is this green mark of activation and this other mark, K36, that identifies transcribed regions, and it's very easy to pick out that this must be the transcript that results in this mature microRNA. Similarly you can find new promoters for genes, FOXP1 instead of FOXP2 that was talked about before. Here's a little promoter here, here's the transcript. 
But in embryonic fibroblasts, there's another promoter being used and you can clearly read off the transcript there. You can read off which allele is being used, because you're sequencing, so you can tell polymorphisms between the little reads, and you can tell that in hybrid mice, F1 hybrid mice, the green mark is on one parental chromosome and a different red mark called K9 is on a different parental chromosome. This is imprinted. This is active, that's the imprinted chromosome, you can read it off right away from the chromatin state map. And you can also tell the different alleles here. All of the transcription is occurring here off the castaneus allele, not the 129 allele. And so you can pick these out, and you can do this with humans as well. And finally, going back to this human genetic variation. We have begun to look at marks, the K4 mark, not trimethylation, but dimethylation and monomethylation. These marks, I don't want to confuse you with too many marks, but these marks are marks that seem to indicate open chromatin and enhancers in particular. They're associated with hypersensitive sites in DNA and you can kind of read these off as at least protoenhancer marks. And I put this region up for one reason, which is, remember I said chromosome 9 had this funny bit that was noncoding that was associated with both myocardial infarction and type 2 diabetes? It's there. And it's got all sorts of interesting enhancer things over it. Now, I know these enhancers are in a totally irrelevant cell type, they're in HL-60 cancer cells and they're in human umbilical vein cells here. But nonetheless, one can get cell types now and mark up those enhancer structures in more relevant cell types. And my guess is there's a lot of interesting action going on over here in terms of enhancers, and maybe that will help guide us in. Anyway, I'll just say quickly, and won't really talk about it, we have been doing the same thing now with methylation. 
We have been taking the DNA and studying its chromatin structure, its epigenomic structure, with regard to methylation. And you can do this by, you know, some genes have CpG islands which sometimes could become methylated and turn the genes off. And you can study this by treating the DNA with bisulfite, and you can then shotgun sequence. Problem is, it's a lot of DNA, and so we've come up with, and I'll just mention, some interesting tricks where you can slice out one percent of the genome on a gel that contains just the MspI fragments of a certain size, and since MspI cuts at CpGs, these things are highly enriched for CpG islands and you can assay about 90 percent of the CpG islands in the genome by sequencing about one percent of the genome. And you can pick out those regions that have, for example, become highly methylated in developed cells. I'll mention the following fact, which is, when you begin to measure methylation changes as cells develop, you take embryonic stem cells and you develop them into Sox1 positive cells and then to neural precursor cells and astrocytes. There's a huge change of methylation that occurs here. Very unmethylated, huge change to guys becoming methylated in this change, and then they stay the same past there. This got Alex Meissner, who did this work, beautiful work, very excited. I mention it because Alex Meissner is also very careful. We now think this is a very interesting artifact. Now that we look at actual cells from tissue, in vivo tissue as opposed to cells being differentiated in cell culture, we don't see this methylation. In fact, it looks like there are some very important changes in methylation that occur in cell culture in the same cell types but are not occurring in vivo. And this is of interest because the one place where you do see this methylation is in cancer. There's something very funny going on with regard to methylation. 
I mention this because there's been some talk about using bisulfite sequencing, and we're very excited about to go describe all this and now it's very clear. There are some very interesting artifacts and I think at the end will tell us more about cancer than development, with regard to methylation. I mention it. Anyway, all right. So those are those things. But those are annotating the genome. What about functional tools? What about the kind of genomic information that's going to shed light on cellular circuitry? I want to take a little bit of time and talk about tools for doing that. Not for marking up the genome anymore with variation or marking up with conservation or marking it up with chromatin state maps, although I think all of those things are very important and we’ve got to keep generating them and getting them out on the web, but the tools for somewhat more high throughput biology to explore pathways. And so here, I want to describe work of a student, Piyush Gupta, to indicate that even the very sensitive cell biological experiments of a type that you might not think would yield to genomic approaches, can be made to yield to genomic approaches. So, I'll describe briefly. Piyush Gupta, who came to our lab from Bob Weinberg's lab, he's a cancer person, Piyush is and was extremely interested in deploying the tools of RNAi screening. So RNAi is, of course, a fabulous technology for knocking out the gene of your choice, and with a couple of groups, including our own, we have built genome wide RNAi libraries. You can at least imagine the idea of doing genome wide screens with RNAi’s, to find all the genes that might matter in a process. Well, the process Piyush cared about was to understand the signaling of the ErbB2 receptor. He cared a lot about this problem because he was very interested in breast cancer, and breast cancer comes in five basic groups, as defined by gene expression patterns. 
Two of them, these first two, have very poor prognosis and we need much better therapies for them. And this first class here has prominent signaling through the ErbB2 receptor, and we need much better therapies for this class. So Piyush said, could I use high throughput RNAi screening as a genomic information tool to tease apart the pathway? Now, here's the problem. This phenotype is very subtle. When you add heregulin to cells, breast cancer cells, they start off clustered next to each other, and when you add heregulin, they move apart a little bit and they get, like, spiky, they put out filopodia marked by F-actin, they separate a little bit. You can see it, but imagine trying to screen hundreds of thousands of wells for that phenotype. That's not going to be an easy thing to do, but that's what Piyush wanted to do. He wanted to use a genomic approach to screen for a very subtle cellular phenotype. And here, happily, we had some colleagues who also think genomically but with regard to image analysis, David Sabatini and particularly Anne Carpenter. So, I think the image takes a long time to come up. Did I get it? Yep, there we go. You can see the cells here without heregulin, and with heregulin they have moved apart a little bit and have got a little blotchy with F-actin. This is not a friendly thing to imagine doing a high throughput screen for, but Piyush was an optimist. So he took Anne Carpenter's software that's very good at detecting all sorts of objects, shapes of cells, cell boundaries here and other funny things, and used it to analyze lots of images and got all sorts of dimensions, counting F-actin puncta, the nearest neighbor, these cell shape metrics, et cetera, et cetera. Got all of these different readouts of cells and then went away, being very smart and mathematical, and attempted to build a classifier. And after three months, this is the negative control here, he was unable to do it. 
Then he went back to Anne Carpenter and said, got any other tricks? And Anne said, well, we have been working on something called cell classifier, and it works like this. Cell classifier gives you 50 pictures. With your mouse you drag the ones that you think are in category A over to the left and the ones that are in category B over to the right, and it goes off and makes up its own rules. Based on its rules, it gives you 50 more pictures, but this time it’s divided them and said, I think these are As and these are Bs, is that what you mean? And you move around the ones it got wrong. It goes away, gives you back. After a couple hours with CellProfiler, it's doing a mighty fine job. And in fact, it was able to accomplish in one such sitting a pretty good classification of cells as either looking like they had been activated by heregulin or not. Anyway, to make a long story short, with this he took a high throughput screen involving about a thousand genes in this case, with multiple replicates, many hairpins per whatever, and found a number of established genes, lots of new genes, but most interestingly, they fall into very sensible pathways. Three pathways that had been known to be involved in ErbB2 signaling come out right away, the PI3 kinase, NF-kappaB, JAK-STAT, and one entirely new pathway, JNK3, not previously known to be involved, and it's an interesting pathway because there are inhibitors that have been developed against JNK3, but for neurodegeneration, maybe they'll have a use here. In addition, recurrent functions come up in neurite extension, cell migration, ligand induced receptor endocytosis. The vast majority of those genes sort out nicely into different pathways and provide great sense for it. So I bring this up to say that even when you're talking about subtle cellular phenotypes, the genomic approaches can be quite handy and are quite tractable. 
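The sort-and-correct loop described for the cell classifier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nearest-centroid rule below is a stand-in chosen for brevity, not the method the actual software uses, and every name here (`train_centroids`, `predict`, `interactive_rounds`, `ask_user`) is hypothetical. Each cell is assumed to be a tuple of numeric image features, such as a puncta count and a nearest-neighbor distance.

```python
from statistics import mean


def train_centroids(examples):
    """examples: list of (feature_tuple, label). Returns label -> centroid."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    # Average each feature column within a label to get that label's centroid.
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in by_label.items()}


def predict(centroids, features):
    """Assign the label whose centroid is closest (squared Euclidean)."""
    def sqdist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sqdist(centroids[label]))


def interactive_rounds(seed_examples, batches, ask_user):
    """Retrain after each batch using the person's corrected labels.

    seed_examples plays the role of the first hand-sorted set of pictures;
    ask_user(cell, guess) returns the label the person settles on.
    """
    labelled = list(seed_examples)
    for batch in batches:
        centroids = train_centroids(labelled)
        for cell in batch:
            guess = predict(centroids, cell)          # "is that what you mean?"
            labelled.append((cell, ask_user(cell, guess)))  # person fixes errors
    return train_centroids(labelled)


if __name__ == "__main__":
    # Made-up features; the "user" here is simulated by a simple rule.
    seed = [((1.0, 1.0), "untreated"), ((9.0, 9.0), "treated")]
    batches = [[(2.0, 1.5), (8.0, 7.5)], [(4.0, 3.0), (6.5, 7.0)]]
    user = lambda cell, guess: "treated" if cell[0] > 5 else "untreated"
    model = interactive_rounds(seed, batches, user)
    print(predict(model, (8.0, 7.0)))
```

The design point is that the person never writes rules; they only correct guesses, and the rules are re-derived from the growing labelled set after each round.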
And these are the sort of things I, at least, am on record as having advised Piyush would be a terrible screen, but in fact, it turned out to be quite a reasonable screen and you can get a lot of really good pathways emerging out of that. I'll talk about another kind of way to recognize cellular signatures and I'll just, yeah, refer to that, which is ways of recognizing cellular signatures based on gene expression. And I just want to describe what's a beautiful project that's been continuing to grow, of Todd Golub and Justin Lamb at the Broad, whose idea is, we basically want to take any subtle process we're studying, whether it's a disease, the action of a drug, the action of a gene, and put them all in one common language, one lingua franca, that whatever we're working on, the way to talk about it is, what is its effect on perturbing RNA expression? And if we were to make a big database of that, we would pick up all sorts of connections by putting it in this common language that we would never otherwise have seen. And they have demonstrated very beautifully that one can do this. They have put together now a database of response signatures to a number of human drugs, a couple hundred human drugs now, against numbers of human cell lines, and their idea is this. For any biological signature you want, take your biological signature, run it against the database, kind of Googling it, and out will pop the things that are similar to it. Any disease state, any other state, any gene inhibition, see if there are any drugs or other perturbations that are similar. I'll just show you examples of this. Treat rats with estrogen. A paper in the literature treats rats with estrogen, looks at gene expression changes in uterus. Take those genes that go up and down in response straight out of the paper in the literature, run it against this connectivity map database, out pops all the known estrogen analogs. 
Out pops something that wasn't known to be an estrogen analog but was proven to be an estrogen analog. If you put in the minus of that signature, down when it should be up, up when it should be down, you get the estrogen inhibition here, you get tamoxifens, you get the selective estrogen receptor modulators. So you can read this stuff right out. A beautiful example is they took the signature of leukemia cells that are sensitive to dexamethasone treatment, some are, and leukemia cells that are not sensitive to dexamethasone treatment, some are not, and you get the differential gene signature. Toss it into the database, say, “Ever seen a drug that looks like it induces the signature of being sensitive to dexamethasone?” And the database pops back and says, the immune suppressant rapamycin does that. And then you say, “Wow, I wonder if rapamycin doesn't just induce the signature of sensitivity to dexamethasone, but maybe it will make cells sensitive to dexamethasone.” And you do the experiment and it does. But who would have thought of using rapamycin? We're certainly not smart enough, but a genomic information database is smart enough, that if you simply ask it the question, it will tell you it's the best fit. And similarly, I'm going to skip through this to simply say, in a screening experiment to find small molecules that could block androgen signaling, Todd and his colleagues found these two natural products from these two plants that block androgen signaling. They had no idea what they did, but of course, you don't need to know anything, you just toss the signature into the connectivity map and the connectivity map replies, “Boy, that signature looks an awful lot like Hsp90 inhibitors.” Even though your molecules don't resemble any known Hsp90 inhibitors, they clearly must be blocking that pathway, and they have gone on to show it is blocking that pathway. What we need, I would say, is, again, genomic information databases. 
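The "toss a signature into the database" idea can be illustrated with a small Python sketch. It is a deliberately simplified stand-in, not the scoring used by the connectivity map itself: here each stored profile is assumed to be a mapping from gene to a signed expression change, and the score is just the average agreement with the query's up and down gene lists. The gene names, perturbation names, and values are all made up for illustration.

```python
def connectivity_score(up_genes, down_genes, profile):
    """Average signed agreement between a query signature and one profile.

    profile maps gene -> signed expression change (missing genes count as 0).
    Positive means the perturbation mimics the query; negative means it
    pushes expression the opposite way.
    """
    n = len(up_genes) + len(down_genes)
    if n == 0:
        return 0.0
    up_total = sum(profile.get(g, 0.0) for g in up_genes)
    down_total = sum(profile.get(g, 0.0) for g in down_genes)
    return (up_total - down_total) / n


def query(up_genes, down_genes, database):
    """Rank every perturbation in the database, most similar first."""
    scores = {name: connectivity_score(up_genes, down_genes, profile)
              for name, profile in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy database: three perturbations, four placeholder genes.
    database = {
        "mimic":     {"G1": 2.0, "G2": 1.5, "G3": -1.0},
        "opposer":   {"G1": -2.0, "G2": -1.0, "G3": 1.5},
        "unrelated": {"G4": 3.0},
    }
    # Query: G1 and G2 go up, G3 goes down.
    for name, score in query(["G1", "G2"], ["G3"], database):
        print(name, round(score, 2))
```

Swapping the up and down lists corresponds to putting in "the minus of that signature": the perturbation that ranked last for the original query ranks first for the reversed one.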
We need to have signatures of all the FDA-approved drugs, of all the RNAis, of all the bioactive compounds freely available on the Web. How are we going to get that cheap enough? Well, we have begun to realize that if we're going to do lots of this, even doing it on microarrays for gene expression is too expensive, but Todd is coming up with ways to do this by sequencing and it may be the new sequencing technologies make this affordable. Oh, well, those are ways of doing cellular circuitry. I'll briefly mention, because it was referred to by Rick this morning, we still got to know all the mechanisms of cancer. That's the next thing on the list there. Very briefly, mapping the cancer genome is going to be one of the most important things over the next several years. These chips that let us track polymorphism in the human population, also lets you track deletions and amplifications in cancers. And this, this has become a very important and active thing. And sequencing, it's already been referred to by Rick that finding individual mutations like EGFR mutations in lung cancer has pointed out that there are subsets of lung cancer that have a distinct form of the disease that are responsive to particular drugs like Tarceva and Iressa. And so a task force at the NCI recommended a couple of years ago, I got to serve on this task force, that there ought to be a significant cancer genome project and that has morphed into this pilot project, the Cancer Genome Atlas Project that is now underway with groups around the country and I think is increasingly involving groups around the world, as it must. The concerns that have sometimes been expressed about this are either, we already know all the cancer genes, or cancer is hopelessly complicated. I don't think either of those positions is justified by the data. 
I just put up a list of the 21st century cancer genes that have been discovered in major cancers here, and what's really striking is that virtually all of them have come out of genomic approaches, not prior candidates, that of the druggable genes in common cancers, all have emerged in the 21st century from genomic approaches. That the genomic approaches have pointed us to new kinds of oncogenes we didn't know before, lineage-specific factors like MITF and TITF, translocations in epithelial cancers that used to be thought to be confined to blood cancers. And that this is all, as Rick Wilson said, from screens that have been highly limited to really phosphatases, kinases, et cetera. And what we really need are unbiased genomic screens of the sort that have been talked about today. What is the future of cancer genomics? It will be, get a tumor, get RNA and DNA from the tumor and sequence. Sequence what? Well, in the first instance, by sequencing in limited ways you can get whole genomic copy number and rearrangement. You can sequence all the exomes, as Richard Gibbs has referred to. You can sequence from cDNA, as Rick Wilson has referred to. You can make chromatin and methylation maps. And all of that, all told, the bill is less than probably 100 million short reads. And 100 million short reads is not such a big deal anymore, or won't be such a big deal anymore in the next couple of years. This isn't re-sequencing the entire cancer genome. The entire cancer genome is probably 3,000 million short reads, which is still unthinkable for the next 12 to 24 months or so, but probably in the not-so-distant future. Nobody will fuss over the first couple of lines, we'll go to the latter, but, you know, those of us who are highly practical say the first four lines, there will be the focus for the next five years, and then it will be more and more focused on probably being able to do the whole genome. Anyway, genomic information. There are so many kinds of genomic information. 
There's, of course, all the sequence in the genome, there's all the genetic variation of the population and its relation to disease. All these functional maps emerging from conservation, from chromatin state. These signature maps like connectivity maps that let you look things up. Or these tools like RNAi inhibitions and databases that are built of the effects of RNAi inhibition. All of the cancer mutations, we're just barely at the starting point to that, but I predict we are going to see an explosion of that over the next five years or so. I haven't talked about it, but Claire Fraser has referred very much to the genomes of all major infectious organisms and really being able to detail those as well. For the young people in the audience, this isn't what biology looked like two decades ago. It really was a world where what you did on your bench was primarily the data you were looking at. Now what you do on your bench is the starting point, but of course, it's comparison to everything out there, all the genomic information out there in the world is at your disposal. We are by no means done. The Human Genome Project, good start, but there's a lot more still to do. There are many projects here and there are many more still to go, and I encourage all of you to be thinking, whenever you do any experiment, ask, if I'm going to do it more than three times, what's the genomic resource that would have been helpful for me to have? It is a remarkable, remarkable period we're living through. It still is very much unclear where and when it will end. We keep thinking maybe it's going to top off, but I see no sign of it topping off for quite some time to come. Well, I want to close by acknowledging the obvious, which is, this is the work of an extraordinary community. I want to acknowledge my own colleagues at the Broad Institute, many of them working in many of these areas, whom it's been fabulous to work with. 
And I can't say enough about what a friendly and collaborative spirit there is there in Boston amongst MIT and Harvard scientists and Harvard hospital scientists. But I also want to acknowledge something you often don't acknowledge, which is the extraordinary role of consortia. So much of what I have talked about was not the result of any one lab, not any one institute, not any one city, but it was the result of being willing to put together consortia to get things done. And there has been this floating group of consortia, I just put down some of the ones whose data I have referred to here. Human Genome Project, SNP consortia, RNAi consortia, all sorts of consortia that have emerged over the years, and this has become such a powerful way to do science in the age of genomic information. And then lastly, I want to make a special acknowledgement to the sequencing centers. Over the course of now almost 18 years, the sequencing centers have worked together in all sorts of combinations to help try to bring about this revolution and get data out rapidly, and I think we all feel an enormous bond to each other. I want to acknowledge WashU and Baylor and TIGR and Sanger, the Joint Genome Institute, the Stanford Genome Center and others, and I particularly want to acknowledge, because it's a birthday party, NISC, for the extraordinary role it has played in making sure that this genomic revolution and genomic information that is happening all over the world is happening in spades here on the campus of the NIH. Anyway, this has been a great day, a great birthday party. The great thing about celebrating a first decade, in this case, is that one can be sure that the next decade is going to be vastly more exciting. So thanks for the opportunity to kind of tie it all up today and hats off to everybody here for what they're doing. Happy birthday. [applause] Dr. Francis Collins: So, we have time for a couple of questions before we adjourn to a reception. 
While people are finding their way, Eric, clearly the ability to generate vast amounts of data is outstripping, I think, most people's expectations, although I suppose it shouldn’t be said that we weren't sort of warned about this. Are we going to keep up in terms of the analysis capabilities that we have to put together to make sense out of all this or are we facing a mismatch in terms of algorithms, in terms of trainees? Are we in trouble or is everything just nicely dovetailed? Dr. Eric Lander: Oh, golly. Well, I have enormous faith, over the long term, in young people. I think it's clear that the next generation has already figured out that there's no distinction between being a wet scientist and a dry scientist. They're all recognizing they're damp. That they are, they are both. And we're seeing many more people going into biology now who consider it [unintelligible] bioinformatics training and such. So if you say, over the course of the next 15 years, will the young people lead us into the promised land by virtue of their understanding this new world, you know, us old generation may not fully enter that promised land, but the new generation will, and they understand it. Now, will they all fully show up in full force within the next 24 months to deal with the data, or will there be this deluge of data beyond what the existing training base is? Oh yeah, we're going to be just overburdened with tons and tons of data. But that's okay. I mean, you know, we'll, we'll manage to extract the most interesting things that we see in the data so far, and then as more and more people come in more things will be extracted. The thing we've got to do is make sure that the training programs are there. We've got to make sure I mean, I hardly need to say it, because this is something I think NIH believes deeply in. 
But NIH is the leader in training in the world here and we've got to make sure that essentially everybody going into biology, even if they think they're going to be a cell biologist, they need some cellular process, understands how to connect to this world, and also that we bring in large, large numbers of people who have real training in mathematics and computer science, etcetera. So in a 15 year time horizon, I think the whole notion of what it means to be a biologist will change, and the young people here will solve it. In the short term, well, we're just going to do all the paddling we can do to stay afloat. Dr. Francis Collins: Okay. [end of transcript]