[applause]
Dr. Eric Lander: Oh, wow. I want to say a couple of thank yous
and a couple of things. First to Jeff Trent, it is a tremendous honor to come and give
the Trent Lecture. I think it's great naming lectures after people while they're still
alive. It's better than coming and giving the Trent Memorial Lecture, to give the Trent
Lecture while there's still a Trent to enjoy it, and so I salute you for honoring Jeff
for the wonderful thing he did in starting the intramural program in NHGRI.
I very much want to also salute the scientists of NISC, Eric Green and all of the people
who have worked with NISC. At the beginning of the day, they stood up and we saluted them
but many more people have come and gone over the course of the day and there's a bunch
of new people so if it's okay, I would like to ask the fantastic scientists who created
NISC, who continue to run NISC, and who have done this amazing thing by ensuring that the
world's best biomedical campus, the NIH, has a world-class sequencing center. So if I could
ask all the people from NISC who are here today, because many of them are, to please
stand. I think we want to salute you again.
[applause]
We have had the pleasure of working with NISC on many, many projects, admiring many other
projects, their role in the mammalian genome projects and ENCODE and the mouse genome and
cancer genome anatomy and in brainstorming many projects still on the drawing boards
and soon to get underway. So well done, happy 10th birthday and we look forward to many,
many more and watching the impact that you have on the NIH and the impact you have on
the world continue to grow.
I just want to thank all the preceding speakers who are good and close friends from the world
of genomes, Claire and Richard and Rick and Wylie and Rick and Andy and Evan and David,
and many of the things I'll touch on are things they have already touched on, because, in
fact, we're all interested in this broad common world of what you can learn from genomes.
So, since you'll have heard bits and pieces of all of these amazing ideas from people
over the course of the day, what I'm going to try to do is draw together, run a thread
through it, and really address the question of genomic information, what we can learn
from it. Because I think the single greatest change over the course of the last 20 years
or so in biology is the recognition that biology is, yes, it's the study of organisms and,
yes, it's the study of molecules and things, but that at its very core, it is about information,
and that there is genomic information. By genomic I don't necessarily mean the DNA,
I mean genome scale, comprehensive, complete information that all the components of the
cell, DNA, RNA, proteins, modifications thereof, and that by laying out all of that information,
we can transform the sort of questions that we can address.
All of the speakers today have shown beautiful examples of that and I'm particularly delighted
to see so many young people, post docs, graduate students in the audience because this is the
world you guys are inheriting. This world where it is not just about the experiments
on your own bench but the experiments of the entire world laid out before you to pick through
and figure out how to extract the information from. So that's the theme. And I'm going to
touch on many different forms of genomic information, if I can. But the granddaddy of all genomic
information projects, of course, was this Human Genome Project. It taught us some very
important things. It taught us it wasn't a bad idea to lay out some clear goals. Goal
directed science had a bad name originally but the idea that if we thought clearly and
had some things we had to get and we could define some goals, wouldn't be bad to lay
out those goals and try to go for them and hold ourselves toward them. It also taught
us that if we're making a project about information, it was absolutely crucial that that information
be completely and freely and immediately available to anybody because it was simply absurd that
the people who were producing the projects were the only ones who could use it well.
We needed to enlist the ideas and the creativity of everybody around the world in any country,
in academia or in industry. And so that was an important lesson that emerged from it.
We learned the importance of laying out concrete plans, timelines. There was a plan and timeline
laid out for the Human Genome Project over the course of 15 years and actually pretty
much worked according to plan. There were, you know, lots of innovations along the way,
but there was a sensible plan and we learned how to plan together, including planning in
the face sometimes of huge uncertainty.
And we learned the importance of collaboration. The importance of international collaboration.
The Genome Project, again, as a kind of granddaddy here involving six countries, 20 centers but
every project that we talk about has been an international project involving many groups
in the United States and many groups in other countries in this ever-changing mix of centers
helping one another to stay at the edge.
In the case of the Human Genome Project, as you all know, a rough draft sequence came
out in 2001, a finished sequence came out in 2003. There was another little lesson there,
finished. Finished is a technical term in the world of genomics. It means the vast majority
of, but there's still 300 gaps and that's okay, we're aware of it. Absolute completion
shouldn't be the enemy of getting the vast majority of the information out. And there
are many things we can state and have stated that we can't quite get the last little bit
out but we can get the first 95, 98 percent out and we should get out in the hands of
as many scientists as possible.
And of course, what's been the impact of it? Well, it's laid out before us, the landscape
of a human genome. It's a beautiful landscape with all of these interesting mountains and
valleys, gene-dense regions, gene-poor regions, all sorts of these striking things.
But the real test has been its impact on medicine. When the Human Genome Project started, there
had only been about 70 diseases that had been identified molecularly, single monogenic Mendelian
disorders that had been identified before the Human Genome Project. With the tools that
have emerged during the course of the Human Genome Project we're now up to some 2,600
Mendelian conditions, for which we know the guilty gene, and people can study them in
great detail.
So that was all fun but that's past history, it was the Human Genome Project. What about
beyond the Human Genome Project? What is the agenda today? What are the sorts of things
that genome centers, that people around the world are trying to ensure that we have and
have freely available on the Web for everyone? Well, Human Genome Project had a goal, know
all the sequence in the human genome. All is in italics because it means, you know,
the vast majority, and don't give me a hard time, at the last percentage or so.
Here's some other things. We need to know all the genetic variation in the human population
and its relationship to disease. We need to know all the functional elements in the human
genome. We have been hearing about these things already from the speakers today. We need to
know all the signatures of cellular responses. Cells only know how to do a limited number
of things. I don't know if it's 500 or 5,000, but there's a limited number and we're going
to be able to recognize what those things are by some reduced signatures of cellular
responses. We need to be able to modulate all the genes in the genome. We need to know
all the mechanisms of cancer and we need to know similar information about the genomes
of all the major infectious agents.
That's a good to do list, and that is the to do list for not the 21st century, for goodness
sakes, this is the to-do list for the next ten years. And indeed, for those of us involved
in this, we know that more than half the stuff on this, there's already been great progress
and we can begin to start putting checkmarks next to things on this list, because we're
quite far along on them and there's nothing on this list, I think, that should take us
more than the next decade or so with the appropriate interpretation of the word “all.” There
will still be things to discover 30 years from now and all that, but to get the vast
majority of it out there.
It is helped by one of the themes in this symposium, the continuing innovation
in technologies. The Human Genome Project was helped greatly by the appearance of first
fluorescent sequencing and then capillary sequencing, and then we've had the appearance of all sorts
of next generation and next generation and next generation sequencers. These 454s and
Solexas and SOLiDs and Helicoses and others, and I won't fuss
over what their throughputs and read lengths are because they're changing every day as people
are continuing to improve these machines. But one is getting up to points of gigabases
per run or perhaps two gigabases per run. I've heard of four gigabases per run on some
of these platforms, and there seems to be no reason why those things can't be achieved.
So I want to turn to the topics I was talking about, human genetic variation. Let me take
that one first and just describe what has been just the remarkable, remarkable period
since the Human Genome Project. Now, as various speakers have referred to, there is a fair
amount of polymorphism in the human population. It's actually not that large compared to most
mammalian species, they are more polymorphic than we are but we have about one heterozygous
base per thousand bases or so, or 1300 bases in the human genome. And if I take a random
heterozygous base in you, the probability is greater than 90 percent that it's shared
with other people in this room. That is, the vast majority of the variation in you is common
genetic variation. It's not these rare Mendelian things that are private mutations, the vast
majority of what you have got is common genetic variation.
And what does it do? Well, we know some examples, it's already been referred to apolipoprotein
E has a common genetic variant widely referred to that confers risk of Alzheimer's disease.
We have got some other examples of a common genetic variant, in CCR5,
that confers protection against HIV. But we really had no systematic way of looking at
what might be the medical implications of common genetic variation. So in 1996, several
folks, myself included, began to get very interested in the idea, even before we had
the sequence of human genome all tidied up, in fact, before we even had most of it, in
the idea that we needed more than a sequence. We really needed to understand all the common
genetic variation in the human population. Well, simple back of the envelope calculations
could tell you that there are about 12 million common genetic variants, and the hallucination
was this, that one might be able to simply write down all the genetic variants along
the top of an Excel spreadsheet, write down all the diseases along the side of the Excel
spreadsheet, and human genetics might reduce simply to saying which genetic variants were
enriched in which diseases. That would be very nice. It was also kind of a nutty thing
ten years ago to think about that, because it implied having 12 million genetic variants,
we had nothing close to that. It implied being able to genotype these 12 million genetic
variants in thousands and thousands of patients. And mind you, near completeness was necessary.
If you only could do ten percent of it, well, you'd only catch ten percent of the things
you were looking for. You really had to get the whole thing. But as these kind of genomic
information projects have taught us, put one foot in front of another and consistently
you may be able to build to these goals.
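The spreadsheet idea above, variants along the top, diseases down the side, and a test of which variants are enriched in which diseases, can be sketched for a single cell of that spreadsheet. This is only an illustration, not the method used in any of the studies mentioned: the function name `allele_association` and the allele counts are made up, and it uses a plain Pearson chi-square on a 2x2 table of allele counts with no correction for multiple testing or population structure.

```python
from math import erfc, sqrt

def allele_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Return (odds_ratio, chi_square, p_value) for a 2x2 allele-count table.

    Hypothetical helper for illustration. Assumes every row and column
    total is non-zero; otherwise the divisions below fail.
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    # Pearson chi-square for a 2x2 table (1 degree of freedom, no continuity correction)
    num = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
    den = ((case_alt + case_ref) * (ctrl_alt + ctrl_ref)
           * (case_alt + ctrl_alt) * (case_ref + ctrl_ref))
    chi2 = num / den
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    # With 1 degree of freedom the upper tail of chi-square is erfc(sqrt(chi2 / 2))
    p_value = erfc(sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value

# Made-up counts: 1,000 cases and 1,000 controls, two chromosomes each
odds_ratio, chi2, p_value = allele_association(600, 1400, 500, 1500)
print(f"odds ratio {odds_ratio:.2f}, chi-square {chi2:.1f}, p = {p_value:.1e}")
```

Repeating a test like this across millions of variants is why completeness of the catalog and very small significance thresholds both matter.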
To indicate just how poor the information resources were when we started, one could
publish, in fact, we did publish a paper in 1998 entitled "Large Scale Identification
of SNPs" that could report 4,000 SNPs and call it large scale. That was just an indication
of where we were at that point. But through efforts like this and others, the idea came
along that we should be able to collect SNPs in a systematic fashion. A public/private
consortium was put together, the SNP consortium in 1999, with what sounded like an ambitious
goal, 300,000 SNPs across the genome. That proved quickly to be under-ambitious as the
SNP consortium within two years reached 1.4 million SNPs. And then as the Human Genome
Project came rolling along, it was quickly increased to two million SNPs, three million
SNPs, blah, blah, blah, eight million SNPs, something like 10 million SNPs now. The vast
majority of the common genetic variation in the population is already in the public databases.
If we find the heterozygous site in you, we know empirically that the odds are very good
it is already in the databases.
Now, the problem was still how are you going to type 10 million SNPs across each
patient? Could you get away with less somehow, without sacrificing the information? Well,
here some of the ideas from Mendelian diseases became very helpful in organizing the thinking.
Some of the Mendelian diseases that occurred in isolated populations with single founder
chromosomes reminded us that every mutation occurs on a single ancestral chromosome that
has a bunch of polymorphisms on it, and as it's passed down through the generations, recombination
whittles away the markers at far distances, but nearby, you still have strong correlation
amongst the markers that are there. You still have linkage disequilibrium. And you could
use it for mapping, for example, in places like Finland without even families, just looking
across a population of Finns with a rare genetic disease, you could map it by linkage disequilibrium,
that signature of ancestral chromosomes.
A very important paper from Mark Daly showed that even in a general European population
in Toronto, you could, if you were up close and personal, detect that linkage disequilibrium.
Then he found in a population of patients with Crohn's disease that there was a highly
stereotyped pattern of blocks of genetic markers that hung together so well that you only needed
a couple of those genetic markers to be able to trace the proxy for the entire block. And
so that gave rise to this notion that if we only knew that correlation structure across
the genome, the haplotype structure across the genome, we'd be able to pick out a mere
3 or 400,000 genetic markers and trace inheritance this way.
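The idea of letting a couple of markers stand in for a whole block can be sketched with the usual r-squared measure of linkage disequilibrium and a simple greedy pass. This is a toy under stated assumptions, not the HapMap tag-selection procedure: the haplotypes are made up, the function names are hypothetical, the chromosomes are assumed to be phased, and the 0.8 threshold is an assumed cut-off.

```python
def r_squared(hap_a, hap_b):
    """r^2 between two biallelic markers, given 0/1 alleles on the same phased chromosomes."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    # Frequency of chromosomes carrying allele 1 at both markers
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # A marker with no variation carries no information; report zero correlation
    return 0.0 if denom == 0 else d * d / denom

def pick_tags(markers, threshold=0.8):
    """Greedy pass: keep a marker as a tag only if no earlier tag already predicts it."""
    tags = []
    for name, hap in markers.items():
        if all(r_squared(hap, markers[t]) < threshold for t in tags):
            tags.append(name)
    return tags

# Made-up haplotypes on eight chromosomes: snp1 to snp3 travel together, snp4 does not
markers = {
    "snp1": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp2": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp3": [1, 1, 1, 1, 0, 0, 0, 0],
    "snp4": [0, 1, 0, 1, 0, 1, 0, 1],
}
print(pick_tags(markers))
```

In this toy block, one marker is enough to stand in for the three correlated ones, which is the sense in which a few hundred thousand markers can serve as proxies for millions.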
Well, from a random proposal there of wouldn't it be good to do that, the community swung
into action within a year, a haplotype map project was launched. Again, the same pattern
involving multiple countries, multiple centers, clear goals, free information sharing. And
by 2006 it was largely completed and that nice correlation structure is quite evident
in this correlation gram here across the tiny region of the genome, but the slide goes on
all the way across the NIH campus.
Then, you also needed technologies to genotype.
Even 300,000 is a big number, but here a variety of different ideas in both the private sector
and the public sector came together to allow multiplexing of one marker, ten markers, a
thousand markers. By last year, half a million genetic markers were being simultaneously
genotyped on DNA chips. It's up to a million this year. And so suddenly, one had to put
up or shut up. One had to actually say, you had the genetic variation of the human population,
you had the tools for genotyping across people, why not do it? And many groups around the
world have been doing just that for the past year. And it has been an annus mirabilis,
2007, a year of miracles.
Just to give you a graph here of the confirmed common variants involved in
common disease, 2000, a single, very interesting report of PPAR gamma and type 2 diabetes.
Crohn's disease, published in 2001. Another diabetes gene in 2003. Age related macular
degeneration in 2005. 2006, several more. 2007, through April, when the tools became
available, through August, through September. I don't have October, I'm getting tired continuing
to remake this slide here. And it's, it's going to have a lot of trouble fitting on
by the end of December. But it's clear that there is an extraordinary explosion right
now of disease associations of common genetic variants. And why is that?
It's because of the continued investment in infrastructure. In building the tools in human
genome projects, SNP consortiums, HapMap projects, genotyping arrays, it's the NIH behind many
of these things. It's the private sector behind many of these things. It's private/public
partnerships behind these things. But it's the willingness to actually roll up sleeves
and create that infrastructure and then make it broadly available to a community.
What are we learning from these sorts of findings already, in what just has been about a year
of this? Well, with regards to the common disease, common genetic variant idea, we learned
it works. You can find lots of them and the significance levels are extraordinary. Ten
to the minus tenth is hardly impressing anybody any day. There are ten to the minus 60th,
ten to the minus 120ths that are significant. We're learning that the vast majority of the
genes that play a role are not the genes that were prior candidates on anybody's list. It's
perhaps no surprise, we knew this from the Mendelian diseases, we were bad guessers.
We're bad guessers about the common diseases as well. And we're also learning that many
of the risk factors are not in coding sequences. They are noncoding. They are probably regulatory
sequences. So it's no shock: we have already heard from the speakers that a significant
fraction of the functional stuff in the human genome is noncoding. Well,
a significant fraction of the variation that affects disease is noncoding. We have our
work cut out for us to understand it, but it's in the population, it does affect risk
and it's probably going to be a very good handle into what these things do.
It's revealing new pathways, the complement pathway in macular degeneration, autophagy
involved with multiple loci and inflammatory bowel disease, beta cell function, and in
particular, all sorts of new things, zinc transporters, et cetera, and type 2 diabetes.
It's revealing connections between diseases, already referred to this morning, chromosome
9, this interesting region that has myocardial infarction risk factor and a type 2 diabetes
risk factor, very close to each other. What does that mean? They're not the same, they're
a little bit apart but very, very close. We're learning that the effect sizes may be
modest but they may be very important, PPAR gamma. It's only a 1.3 fold increase in your
risk but it happens to be a drug target for a drug that's useful in type 2 diabetes. We're
learning that some of these markers, for example, in type 2 diabetes, again, can be very useful
in a clinical sense of identifying which prediabetic patients will benefit most from early interventions.
We're learning about ethnic variation and health disparities, about 8q24, a risk factor
for prostate cancer that is present in all populations but in higher frequency in African
Americans and may explain the somewhat higher frequency of prostate cancer in African Americans.
We're learning that it's often hard to find the specific gene, the specific allele, a
lot of work is going to be needed for that. We'll come back to them. We're learning that
more is more. Larger sample sizes will yield even more. I can tell you stories from inflammatory
bowel disease that Mark Daly tells me that the first thousand or so patients identified
six loci, but when three different groups pooled their data to get 3 or 4,000 patients,
they're now up to something like 30 highly significant loci that come with larger sample
sizes.
We are learning that there's still much more of the genetic variance to explain. We've
explained maybe 50 percent of the variation for macular degeneration but perhaps five
percent of the variation for type 2 diabetes. Why? Is it we're missing the genes? Is it
epistasis between them? Is it environment? Well, it's only been a year, nobody knows.
The dust hasn't come close to settling. These are the sorts of questions.
So what do we need? Well, what we've really learned is we've barely scratched the surface
of this. We've scratched the surface probably of the genes and barely scratched the surface
of the biology. What do we need? Well, three things. Larger samples, and more diverse populations.
Most of the work has gone on in European derived populations. We know that different alleles
are at different frequencies and you'll spot different things, you'll have more power to
spot different things if the allele frequencies are somewhat different. And so African American
populations will reveal different loci, not because there's fundamental differences but
because the allele frequency fluctuations between populations make it easier to spot
some things, Asian populations, Hispanic populations. This is, this is essential to really being
able to do the biology, as well as being able to investigate health disparities.
Beyond that, as several of the speakers, notably Richard Gibbs referred to this morning, we've
only examined some of the range of genetic variation. We have looked only really with
these genome-wide association studies at the genetic variants with frequencies between 50 percent and five
percent. Polymorphism in the human population, the word technically means down to about one
percent. Common variation in the human population, segregating variation. That is to say, variation
common enough that if you got a thousand patients you'd see it multiple times, enough times
to recognize that it was an increased risk factor, runs down another log below that five
percent to at least half a percent, and yet the studies now are not powered to do that.
We don't have catalogs even that run down there. And yet we know there's important stuff.
Helen Hobbs' beautiful work on PCSK9 with variants in the range of two to three percent,
common genetic variation but not yet assayed by the types of maps we're using.
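The point about a variant being "common enough that if you got a thousand patients you'd see it multiple times" is simple arithmetic, sketched here. The assumptions are mine, not the speaker's: Hardy-Weinberg proportions, so a person carries at least one copy with probability 1 - (1 - f)^2, and no enrichment of the allele among patients. The function name is hypothetical.

```python
def expected_carriers(allele_freq, n_people):
    """Expected number of people carrying at least one copy of the allele.

    Assumes Hardy-Weinberg proportions: P(carrier) = 1 - (1 - f)^2.
    """
    carrier_prob = 1 - (1 - allele_freq) ** 2
    return n_people * carrier_prob

# Five percent, half a percent, and a tenth of that again
for freq in (0.05, 0.005, 0.0005):
    count = expected_carriers(freq, 1000)
    print(f"allele frequency {freq:.2%}: about {count:.1f} carriers per 1000 people")
```

At half a percent there are still roughly ten carriers in a thousand patients, enough to see the variant repeatedly; a log below that, it is down to about one.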
We need to have genome-wide projects. There are discussions of thousand-genome projects
to collect all that genetic variation so in this HapMap-type fashion we can exploit all
of that to do common variation studies. For now, as regions come up people are extremely
interested in sequencing those regions to find the lower frequency variants. But here,
since they are, in fact, common enough that we could collect them all, as Richard referred
to, let's collect them all.
And then of course, there are rare mutations. There are mutations that are private mutations
and they can be very revealing, too. Helen Hobbs has beautifully shown in a population
of patients with low HDL that a couple of genes have just too many rare singleton mutations
and that, too, is a signature. A signature that can't be caught by the common genetic
variation, and we need the tools for that.
And I'll take for granted, but Evan Eichler has made a very good point about that, that
the human genome also has much more than SNPs, it has copy number variation in these interesting
repeated regions, and we need to be able to put all of that into this pipeline as well
and look at the copy number variation across the genome. And for all of it there's a tremendous
amount of sequencing that's going to have to go on in the next couple of years, but
like with these other projects, I think it's guaranteed to give us the kinds of catalogs
and tools we need to drive this problem home. At least to drive it home with regard to finding
genetic variants.
What do they mean? Well, we need tools to connect these genetic variants to physiology.
We can't forget that by piling up 20 things that might be involved in inflammatory bowel
disease, 20 things that might be involved in type 2 diabetes, that's of course, just
the start. How are we going to keep up with that pace in the laboratory? Well, I want
to turn to some of the things we need for that. So let's put aside all this human genetic
variation and collecting it. I'm confident that can happen. What about breathing functional
meaning into the genome so we can make sense of this human genetic variation, so we can
connect it with disease? So I want to turn to a little bit about talking about all the
functional elements in the genome.
Well, there are two different ways that one can approach them that I'll at least mention,
probably some others. Conservation maps, looking at the portions of the human genome that evolution
has voted on as really mattering. And David Haussler has referred to this quite beautifully,
that looking at the patterns of conservation across the genome one can learn a lot about
what matters in the genome, even if the mouse knockout doesn't show a phenotype. If evolution
tells you it's not willing to change that base, I go with evolution, it knows what it's
doing. And then I also want to talk about chromatin state maps, a new kind of map that
I think we want to collect a lot of and put them on the Web. So let's turn to ways of
annotating the human genome so we'll be able to make sense of some of these disease loci.
So conservation maps, clearly the first thing after the Human Genome Project was to get
the mouse genome done, and many of the people in this room played crucial roles in that,
including folks at NISC, of getting the mouse genome done. And then using that mouse genome
by lining up the mouse genome with the human genome and with a few other genomes, the dog
genome, the rat genome, and lining up just that first handful of genomes has revealed
a number of important things. Genomic comparison has already revealed that the human gene catalog
is very different than we thought. It's not the hundred thousand that was in the textbooks
a decade ago. It's not even the 30 or 40,000 that we all wrote in the human genome paper
back in 2001. It's not even, I think, the 25,000 protein coding genes that are in the
current catalog that were in the current catalogs last year.
In fact, comparative work from the handful of mammalian species, as Michele Clamp has
very nicely shown in a paper coming out very shortly, says the human protein coding
gene count is probably really in the neighborhood of about 20 to 21,000. The current databases
probably only have about 20,400 real protein coding genes, and much of the rest of the stuff
is simply open reading frames that are spurious. And I don't have time to go into the arguments.
But you can pick that out by comparison. And the number of really primate-specific
things is modest, measured in the hundreds, and they are the sort of things that Evan
Eichler talked about, these very exciting gene families that are getting born. There
is new stuff, but for the most part the story with protein coding genes is paring them
down and whittling them away. But even as they're getting pared away, the coding things,
the noncoding things in the genome are really crying out for our attention. They're burgeoning.
As you look across the genome, as various speakers referred to, we find that there are
patches of conservation, clear conservation, ranging from these ultra conserved elements
to smaller binding sites that evolution has lovingly preserved and that something like
two thirds of all the stuff evolution has preserved is this noncoding stuff covering
about five percent of the human genome. We know in a few cases that there are regulatory
elements because when you, when you knock them out of a mouse you're able to see that
it dysregulates genes nearby, but that's a pretty tough thing to do to annotate half
a million elements. Half a million mouse knockouts is daunting even for me to contemplate, which
is big.
So the best way to really home in and clean this up is to increase the power of the data,
first. With just the human and a mouse or a dog there's a limit to how much you could
get, but evolution kindly made many mammals and by comparing more and more genomes, we're
able to refine those signals, get rid of the noise, pull up the signal. And so various
groups came together, but here I particularly want to credit the folks at NISC, collaborating
with some folks at the Broad, for proposing a concrete program to sequence a large number,
about two dozen mammalian genomes. And that program the NIH launched involving all of
the sequencing centers, and with elephants and armadillos and rabbits and bats and cats
and hedgehogs and all that, and the project is essentially complete. There are aspects
of it still being tidied up, but the vast majority of these data are already freely
available on the Web. David Haussler has referred to some of this already and groups around
the world are putting together all these two dozen sequences and saying, can we get down
not just to 200 base pair conserved elements but 150, ten, can we pick out ten base pair
elements, etcetera, and there's just been an explosion of interest in folks who are
comfortable with both genomes and bioinformatics and squeezing out all of the information that
evolution was kind enough to leave us from the experiment that's called the mammalian
radiation.
So, I'll give you some examples of things
that come out if we're looking at genomes. Here's one. I'm fond of this one. If you line
up many genomes and you start looking at what's conserved, you find a funny little site here
that's, well, it's not that little, a funny site here that's present about 5,000 times across
the human genome, and when it occurs it's very well-conserved. What in the world does
it mean? So we used that. We took a biotinylated version of that piece of DNA and pulled down
protein with it. We took cellular extract and bound it to the biotinylated sequence that
contains that motif there, and found that when we pulled it down and
flew it on a mass spec, it was the CTCF insulator protein, an insulator protein that blocks
the spreading of gene expression.
Only about three insulator sites in the human had ever been characterized, but suddenly,
maybe the genome has given us 5,000 candidate insulator sites. How are you going to prove
that they're really insulators? Are you going to go knock them all out? That's a lot of
work. Turns out, again, genomic information can give you a very good clue, right away.
Just take all the genes in the genome that are divergently transcribed. If they're divergently
transcribed and this thing is an insulator sequence, when there's an insulator sequence
in the middle, those genes should have uncorrelated gene expression. If there's no insulator sequence,
they should have correlated gene expression. Get the public databases, look at their gene
expression patterns, it works. The guys who have this tend to be uncorrelated, the guys
who don't tend to be correlated. So you can take that out of the information. Obviously,
you want to go do biochemistry after that, but it's very nice to be able to do this because
you can do this in an afternoon.
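The afternoon test described above can be sketched as follows: take divergently transcribed gene pairs, compute the correlation of their expression across samples, and compare pairs that have a candidate insulator site between them with pairs that do not. Everything here is illustrative: the expression values are made up, the function names are hypothetical, and Pearson correlation averaged within each group is my choice of summary, not necessarily what was done in the actual analysis.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes neither profile is constant; a constant profile would divide by zero.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_correlation_by_group(pairs):
    """pairs: (expr_gene1, expr_gene2, has_insulator) tuples; returns mean r per group."""
    groups = {True: [], False: []}
    for expr1, expr2, has_insulator in pairs:
        groups[has_insulator].append(pearson(expr1, expr2))
    return {flag: mean(rs) for flag, rs in groups.items() if rs}

# Made-up expression of divergent gene pairs across four samples
pairs = [
    ([1, 2, 3, 4], [2, 4, 6, 8], False),  # no site between the genes
    ([1, 2, 3, 4], [1, 3, 2, 4], False),
    ([1, 2, 3, 4], [2, 1, 2, 1], True),   # candidate insulator site between the genes
    ([1, 2, 3, 4], [3, 1, 4, 2], True),
]
result = mean_correlation_by_group(pairs)
print(result)
```

If the motif really marks insulators, the group with a site between the genes should show the lower average correlation, which is the pattern the toy data were built to show.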
Other things can come out, too. You can take the things that David refers to, these
ultra, ultra-conserved sequences way out at the end, or little less ultra conserved, maybe
super conserved or very conserved or something. Take the five percent most conserved sequences
across the genome and see where they are across the genome. And when you do that, you find
the following curious fact, that the most conserved noncoding sequences across the genome
are not near genes. They're in gene deserts, gene-poor regions. But not no genes, just
gene poor. What genes are in those gene poor regions? Developmentally important transcription
factors. Almost every one of those 200 regions that have peaks of highly conserved noncoding
elements are enriched for developmentally important transcription factors or axon guidance
receptors. Half of that very conserved stuff is focused around these regions. They must
be very interesting. What do they do?
So we were curious about understanding what was special about these regions, and
that led us into the second part of the work, chromatin state maps. Because we took a guess
that maybe chromatin would be one way in which those loci were special. And so, we began
to explore the chromatin structure of these funny regions, and I'll tell you about that
now.
Chromatin structure is enormously complex. Histones have these tails that are decorated
with all sorts of modifications but for the moment I'll keep it simple and refer to only
two histone modifications. One, lysine 4 trimethylation, which I'll color green because it's associated
with active genes; and lysine 27 trimethylation, which I'll color red, because it's been historically
associated with inactive genes. One can then go look, and what we did was use chromatin
immunoprecipitation on a microarray, a DNA microarray for just these special regions
of the genome. We began to explore the chromatin structure of those regions and we found that
in mature cells sometimes they had the green mark, sometimes they had the red mark, sometimes
they didn't have any mark, but you never see both together, which was consistent with the
literature that it was either an on or an off, until we looked at embryonic
stem cells. And in ES cells we found a very curious phenomenon. Right around those developmentally
important genes in these regions, we found that in embryonic stem cells they were marked
with both red and green, both an on and an off mark, and yet were silent, as if they
were poised for either activation or repression, according to which lineage they might go down
into. At least that was our hallucination, there.
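The two-mark picture above can be written down as a tiny lookup. This is only a sketch under the simplification used in the talk (two marks, present or absent at a promoter); the function name, the state labels and the toy promoter calls are assumptions for illustration.

```python
def chromatin_state(has_k4me3: bool, has_k27me3: bool) -> str:
    """Classify a promoter from the presence of the 'green' and 'red' marks."""
    if has_k4me3 and has_k27me3:
        return "bivalent"   # both marks: poised, as seen in ES cells
    if has_k4me3:
        return "active"     # green mark only
    if has_k27me3:
        return "repressed"  # red mark only
    return "unmarked"       # neither mark


# Hypothetical peak calls per promoter: (K4me3 present, K27me3 present).
toy_marks = {
    "promoter_1": (True, False),
    "promoter_2": (False, True),
    "promoter_3": (True, True),
    "promoter_4": (False, False),
}

states = {name: chromatin_state(*marks) for name, marks in toy_marks.items()}
print(states)
```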
Well, to really look at that in a serious way, one's got to expand to more cell types
and expand to the genome. And as Rick Myers has already referred to, the idea of doing
chromatin immunoprecipitation and hybridizing it to a DNA array is
something that's so 2006, it's really not at all au courant. The right way to do it
now is to do chromatin immunoprecipitation, get the DNA and run it on one of these ultra
high throughput sequencers that give you little reads and you map them back to the genome.
So we did that using a Solexa and the data are, as they would say, comparable. The top
line is sequencing, the bottom line is a microarray, they look pretty much the same.
And so we could do this across various cell types and for a variety of different chromatin
marks, and I'll summarize a bunch of data for the following sort of questions. The question
we really want to know deeply, we want to know, how does a cell decide to take up a
career? When a cell decides to go from being an ES cell to a fully differentiated cell,
it makes a variety of career decisions along the way. It loses potential. It makes commitment.
We say that in developmental biology, but what do we mean by it? What are the molecular
correlates of a cell being committed to do something, or having the potential, still,
to do something? We don't really in developmental biology have a clear, crisp way to read out
what career decisions have been made, and which lie ahead. So what we have been trying
to do is study that with chromatin. And I'll give you a brief summary of where we're at,
at the moment, and this will be slightly oversimplifying the data, but it's not a bad description of
it.
In embryonic stem cells, genes break up into three different categories. There are some
AT rich promoters and they're fickle. They come on, they come off in different cell types.
They're very fickle and my sense is these guys here come on or off depending on whether
there's a transcription factor to turn them on or off. Very fickle. 70 percent of the
genes are CpG rich islands and they're housekeeping genes and they're on all the time. 15 percent
of the genes, somewhat more than just in those special regions, but highly enriched in those
special regions, are these bivalent genes that start off in ES cells in this bipotential
state of red and green and then in different lineages may go green or red, but we're finding
now, sometimes, stay bipotential in some of those lineages.
In which lineages do they stay bipotential? Stay bivalent? Roughly speaking, in those
lineages that still have choices ahead involving that gene. So if we're looking at myoblasts,
neural cells and fibroblasts, and we're talking about a gene that's involved in hematopoietic
cells, there are no more decisions to be made, it's made a final decision. But a gene involved
in differentiation of some neurons still is bipotential here in a neuronal precursor,
and a gene involved in differentiation of adipocytes but not other descendants of fibroblasts
is still bipotential there.
And so very roughly, and this is the happy thing of when you only have a limited amount
of data you can make a very simple happy model. So the very simple happy model right now is
this bivalent mark is an indication of decisions still ahead. As we collect more data the model
will surely become more complicated, but happily, I don't know enough yet to complicate you
with it. But that's kind of the picture.
These chromatin state maps are very interesting. They're revealing all sorts of things. Here's
a gene in embryonic stem cells. The coding region here is the protocadherin gene that
has a zillion different promoters. And you can see in embryonic stem cells, every one
of these promoters is marked as a bivalent promoter, independently with a green and a
red, except that one, which is just green, and it's the one that's used in embryonic
stem cells. Oh, we also put CTCF, that insulator, on this, and it nicely insulates
between each promoter.
You can pick out the microRNA genes. Here's a microRNA. It's very hard to figure out what
the primary transcript is for a microRNA, but in fact, here is this green mark of activation
and this other mark, K36, that identifies transcribed regions, and it's very easy to
pick out this must be the transcript that results in this mature microRNA. Similarly
you can find new promoters for genes, FOXP1 instead of FOXP2 that was talked about before.
Here's a little promoter here, here's the transcript. But in embryonic fibroblast, there's
another promoter being used and you can clearly read off the transcript there.
You can read off which allele is being used because you're sequencing so you can tell
polymorphisms between the little reads, and you can tell that in hybrid mice, F1 hybrid
mice, you can tell that the green mark is on one parental chromosome and a different
red mark called K9 is on a different parental chromosome. This is imprinted. This is active,
that's the imprinted chromosome, you can read it off right away from the chromatin state
map. And you can also tell the different alleles here. All of the transcription is occurring
here off the castaneus allele, not the 129 allele. And so you can pick that out, and you can
do this with humans as well.
And finally, going back to this human genetic variation. We have begun to look at marks,
the K4 mark, not trimethylation, but dimethylation and monomethylation. These marks, I don't
want to confuse you with too many marks, but these marks are marks that seem to indicate
open chromatin and enhancers in particular. They're associated with hypersensitive sites
in DNA and you can kind of read these off as at least protoenhancer marks. And I put
this region up for one reason, which is remember I said chromosome 9 had this funny bit that
was noncoding that was associated with both myocardial infarction and type 2 diabetes?
It's there. And it's got all sorts of interesting enhancer things over it. Now, I know these
enhancers are in totally irrelevant cell types, they're in HL-60 cancer cells and they're
in human umbilical vein cells here. But nonetheless, one can get cell types now and mark up those
enhancer structures in more relevant cell types. And my guess is there's a lot of interesting
action going on over here in terms of enhancers, and maybe that will help guide us in.
Anyway, I'll just quickly say, and won't really talk about it, that we have been
doing the same thing now with methylation. We have been taking the DNA and studying its
chromatin structure, its epigenomic structure, with regard to methylation. And you can do
this by, you know, some genes have CpG islands which sometimes could become methylated and
turn the genes off. And you can study this by treating the DNA with bisulfite, and you
can then shotgun sequence. Problem is, it's a lot of DNA and so we've come up with, and
I'll just mention, some interesting tricks where you can slice out one percent
of the genome on a gel that contains just the MspI fragments of a certain size, and
since MspI cuts at CpGs, these things are highly enriched for CpG islands and you
can assay about 90 percent of the CpG islands in the genome by sequencing about one percent
of the genome. And you can pick out those regions that have, for example, become highly
methylated in developed cells.
I'll mention the following fact, which is, when you begin to measure methylation changes
as cells develop, you take embryonic stem cells and you develop them into Sox1 positive
cells and then to neural precursor cells and astrocytes. There's a huge change of methylation
that occurs here. Very unmethylated, huge change to guys becoming methylated in this
change, and then they stay the same past there. This got Alex Meissner, who did this work,
beautiful work, very excited. I mention it because Alex Meissner is also very careful.
We now think this is a very interesting artifact. Now that we look at actual cells
from tissue, in vivo tissue, as opposed to cells being differentiated in cell culture,
we don't see this methylation. In fact, it looks like there are some very important changes
in methylation that occur in cell culture in the same cell types but are not occurring
in vivo. And this is of interest because the one place where you do see this methylation
is in cancer. There's something very funny going on with regard to methylation. I mention
this because there's been some talk about using bisulfite sequencing, and we were very
excited to go describe all this, and now it's very clear there are some very interesting
artifacts, which I think in the end will tell us more about cancer than development, with
regard to methylation. I mention it.
Anyway, all right. So those are those things. But those are annotating the genome. What
about functional tools? What about the kind of genomic information that's going to shed
light on cellular circuitry? I want to take a little bit of time and talk about tools
for doing that. Not for marking up the genome anymore with variation or marking up with
conservation or marking it up with chromatin state maps, although I think all of those
things are very important and we’ve got to keep generating them and getting them out
on the web, but the tools for somewhat more high throughput biology to explore pathways.
And so here, I want to describe work of a student, Piyush Gupta, to indicate that even
the very sensitive cell biological experiments of a type that you might not think would yield
to genomic approaches, can be made to yield to genomic approaches.
So, I'll describe briefly. Piyush Gupta, who came to our lab from Bob Weinberg's lab, he's
a cancer person, Piyush is and was extremely interested in deploying the tools of RNAi
screening. So RNAi is, of course, a fabulous technology for knocking out the gene of your
choice, and with a couple of groups, including our own, we have built genome wide RNAi libraries.
You can at least imagine the idea of doing genome wide screens with RNAi’s, to find
all the genes that might matter in a process. Well, the process Piyush cared about was to
understand the signaling of the ErbB2 receptor. He cared a lot about this problem because
he was very interested in breast cancer, and breast cancer comes in five basic groups,
as defined by gene expression patterns. Two of them, these first two have very poor prognosis
and we need much better therapies for them. And this first class here has prominent
signaling through the ErbB2 receptor, and we need much better therapies for this class.
So Piyush said, could I use high throughput RNAi screening as a genomic information tool
to tease apart the pathway? Now, here's the problem. This phenotype is very subtle. When
you add heregulin to cells, breast cancer cells, they start off clustered next to each
other, and when you add heregulin, they move apart a little bit and they get, like, spiky,
they put out filopodia marked by F-actin, they separate a little bit. You can see it,
but imagine trying to screen hundreds of thousands of wells for that phenotype. That's not going
to be an easy thing to do, but that's what Piyush wanted to do. He wanted to say, use
a genomic approach to screen for a very subtle cellular phenotype. And here, happily, we
had some colleagues who also think genomically but with regard to image analysis, David Sabatini
and particularly Anne Carpenter. So, I think the image takes a long time to come
up. Did I get it? Yep, there we go. You can see the cells here without heregulin, with
heregulin, have moved apart a little bit and have got a little blotchy with F-actin.
This is not a friendly thing to imagine doing a high throughput screen for, but Piyush was
an optimist. So he took Anne Carpenter's software that's very good at detecting all sorts of
objects, shapes of cells, cell boundaries here and other funny things, and used it to
analyze lots of images and got all sorts of dimensions, counting F-actin puncta, the nearest
neighbor, cell shape metrics, et cetera, et cetera. Got all of these different
readouts of cells and then went away, being very smart and mathematical, and attempted
to build a classifier. And after three months, this is the negative control here, he was
unable to do it.
Then he went back to Anne Carpenter and said, got any other tricks? And Anne said, well,
we have been working on something called cell classifier, and it works like this. Cell classifier
gives you 50 pictures. With your mouse you drag the ones that you think are in category
A over to the left and the ones that are in category B over to the right, and it goes
off and makes up its own rules. Based on its rules, it gives you 50 more pictures, but
this time it’s divided them and said, I think these are As and these are Bs, is that
what you mean? And you move around the ones it got wrong. It goes away, gives you back.
After a couple hours with cell profiler, it's doing a mighty fine job. And in fact, it was
able to accomplish in one such sitting, a pretty good classification of cells as either
treated as looking like they had been activated by heregulin or not.
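The drag-and-correct loop described above can be sketched roughly as follows. This is not Anne Carpenter's software or its actual learning rule; a nearest-centroid classifier is swapped in here purely because it is short. The two features (cell spacing, puncta count), the toy values and the `ask_user` stand-in for the person with the mouse are all hypothetical.

```python
def train_centroids(labeled):
    """Average the feature vectors of each label into one centroid per label."""
    sums, counts = {}, {}
    for vec, label in labeled:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))
    return min(centroids, key=squared_distance)


def interactive_training(cells, ask_user, batch_size):
    """First batch is sorted by hand; later batches are guessed, then corrected."""
    labeled = [(cell, ask_user(cell)) for cell in cells[:batch_size]]
    centroids = train_centroids(labeled)
    corrections = 0
    for start in range(batch_size, len(cells), batch_size):
        for cell in cells[start:start + batch_size]:
            guess = predict(centroids, cell)
            truth = ask_user(cell)  # the user only has to move the wrong ones
            if guess != truth:
                corrections += 1
            labeled.append((cell, truth))
        centroids = train_centroids(labeled)  # rules are rebuilt after each round
    return centroids, corrections


# Toy cells as (spacing, puncta count); the first batch must contain both classes.
cells = [(1.0, 2.0), (3.0, 6.0), (1.2, 1.8), (3.2, 5.5), (0.9, 2.2), (2.8, 6.1)]
ask_user = lambda cell: "treated" if cell[0] > 2.0 else "untreated"

centroids, corrections = interactive_training(cells, ask_user, batch_size=2)
print(predict(centroids, (3.1, 5.8)), corrections)
```

The point of the design is that the person never has to articulate the rule; they only supply labels and corrections, and the rule is re-derived from the growing labeled set each round.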
Anyway, to make a long story short, with this he took a high throughput screen involving
about a thousand genes in this case with multiple Rep5 replicates, many hairpins per whatever,
and found a number of established genes, lots of new genes, but most interestingly, they
fall into very sensible pathways. Three pathways that had been known to be involved in ErbB2
signaling come out right away, the PI3 kinase, NF-kappaB, JAK-STAT, and one entirely new
pathway, JNK3, not previously known to be involved, and it's an interesting pathway
because there are inhibitors that have been developed against
JNK3, but for neurodegeneration, maybe they'll have a use here. In addition, recurrent
functions come up in neurite extension, cell migration, ligand-induced receptor endocytosis.
The vast majority of those genes sort out nicely into different pathways and provide
great sense for it.
So I bring this up to say that even when you're
talking about subtle cellular phenotypes, the genomic approaches can be quite handy
and are quite tractable. And this is the sort of thing that I, at least, am on record
as having advised Piyush would be a terrible screen, but in fact, turned out to be quite
a reasonable screen and you can get a lot of really good pathways emerging out of that.
I'll talk about another kind of way to recognize cellular signatures and I'll just, yeah, refer
to that, which is ways of recognizing cellular signatures based on gene expression. And I
just want to describe what's a beautiful project that's been continuing to grow of Todd Golub
and Justin Lamb at the Broad whose idea is, we basically want to take any subtle process
we're studying, whether it's a disease, the action of a drug, the action of a gene and
put them all in one common language, one lingua franca, that whatever we're working on, the
way to talk about it is what is its effect on perturbing RNA expression? And if we were
to make a big database of that, we would pick up all sorts of connections by putting it
in this common language that we would never otherwise have seen.
And they have demonstrated very beautifully that one can do this. They have put together
now a database of response signatures to a number of human drugs, a couple hundred human drugs
now, against numbers of human cell lines, and their idea is this. For any biological
signature you want, take your biological signature, run it against the database, kind of Googling
it, and out will pop the things that are similar to it. Any diseased state, any other state,
any gene inhibition, see if there are any drugs or other perturbations that are similar.
Just show you examples of this. Treat rats with estrogen. Paper in the literature treats
rats with estrogen, looks at gene expression changes in uterus, take those genes that go
up and down in response straight out of the paper in the literature, run it against this
connectivity map database, out pops all the known estrogen analogs. Out pops something
that wasn't known to be an estrogen analog but was proven to be an estrogen analog. If
you put in the minus of that signature, down when it should be up, up when it should be
down, you get the estrogen inhibition here, you get tamoxifens, you get the selective estrogen
receptor modulators. So you can read this stuff right out.
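A minimal sketch of the "Googling a signature" idea described above. The real connectivity map uses its own rank-based scoring; here a much simpler score is swapped in (sum of a profile's changes over the query's up genes, minus the sum over its down genes) just to show the shape of the lookup. The perturbation names, gene names and values are toy data, not real profiles.

```python
def connectivity_score(up_genes, down_genes, profile):
    """Positive when the profile moves the query's genes in the same directions."""
    up = sum(profile.get(gene, 0.0) for gene in up_genes)
    down = sum(profile.get(gene, 0.0) for gene in down_genes)
    return up - down


def query(up_genes, down_genes, database):
    """Rank every perturbation in the database, most similar first."""
    scores = {
        name: connectivity_score(up_genes, down_genes, profile)
        for name, profile in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy database: perturbation -> expression change per gene (hypothetical values).
database = {
    "mimic": {"g1": 2.0, "g2": 1.5, "g3": -1.0},
    "antagonist": {"g1": -1.8, "g2": -1.2, "g3": 0.9},
    "unrelated": {"g1": 0.1, "g3": 0.1},
}

# Query signature: g1 and g2 go up, g3 goes down.
ranking = query(["g1", "g2"], ["g3"], database)

# The "minus" of the signature: swap the up and down gene lists.
reversed_ranking = query(["g3"], ["g1", "g2"], database)

print(ranking[0][0], reversed_ranking[0][0])
```

Swapping the up and down lists is the toy equivalent of putting in the minus of the signature: the perturbation that opposes the original query rises to the top.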
A beautiful example is they took the signature of leukemia cells that are sensitive to dexamethasone
treatment, some are, and leukemia cells that are not sensitive to dexamethasone treatment,
some are not, and you get the differential gene signature. Toss it into the database,
say, “Ever seen a drug that looks like it induces the signature of being sensitive to
dexamethasone?” And the database pops back and says, the immune suppressant rapamycin
does that. And then you say, “Wow, I wonder if rapamycin doesn't just induce the signature of
sensitivity to dexamethasone, but maybe it will make cells sensitive to dexamethasone.”
And you do the experiment and it does. But who would have thought of using rapamycin?
We're certainly not smart enough but a genomic information database is smart enough, that
if you simply ask it the question, it will tell you it's the best fit.
And similarly, I'm going to skip through this to simply say, in a screening experiment to
find small molecules that could block androgen signaling, Todd and his colleagues found these
two natural products from these two plants that block androgen signaling. Had no idea
what they did, but of course, you don't need to know anything, you just toss its signature
into the connectivity map and the connectivity map replies, “Boy, that signature looks
an awful lot like Hsp90 inhibitors.” Even though your molecules don't resemble any known
Hsp90 inhibitors, they clearly must be blocking that pathway and they have gone on to show
it is blocking that pathway.
What we need, I would say, is, again, genomic
information databases. We need to have signatures of all the FDA-approved drugs, of all the
RNAis, of all the bioactive compounds freely available on the Web. How are we going to
get that cheap enough? Well, we have begun to realize that if we're going to do lots
of this, even doing it on microarrays for gene expression is too expensive, but Todd
is coming up with ways to do this by sequencing and it may be the new sequencing technologies
make this affordable.
Oh, well, those are ways of doing cellular circuitry. I'll briefly mention, because it
was referred to by Rick this morning, we still got to know all the mechanisms of cancer.
That's the next thing on the list there. Very briefly, mapping the cancer genome is going
to be one of the most important things over the next several years. These chips that let
us track polymorphism in the human population, also let you track deletions and amplifications
in cancers. And this, this has become a very important and active thing. And sequencing,
it's already been referred to by Rick that finding individual mutations like EGFR mutations
in lung cancer has pointed out that there are subsets of lung cancer that have a distinct
form of the disease that are responsive to particular drugs like Tarceva and Iressa.
And so a task force at the NCI recommended a couple of years ago, I got to serve on this
task force, that there ought to be a significant cancer genome project and that has morphed
into this pilot project, the Cancer Genome Atlas Project that is now underway with groups
around the country and I think is increasingly involving groups around the world, as it must.
The concerns that have sometimes been expressed about this are either, we already know all
the cancer genes, or cancer is hopelessly complicated. I don't think either of those
positions is justified by the data. I just put up a list of the 21st century cancer genes
that have been discovered in major cancers here, and what's really striking is that virtually
all of them have come out of genomic approaches, not prior candidates, that of the druggable
genes in common cancers, all have emerged in the 21st century from genomic approaches.
That the genomic approaches have pointed us to new kinds of oncogenes we didn't know before,
lineage-specific factors like MITF and TITF, translocations in epithelial cancers that
used to be thought to be confined to blood cancers. And that this is all, as Rick Wilson said,
from screens that have been highly limited to really phosphatases, kinases, et cetera.
And what we really need are unbiased genomic screens of the sort that have been talked
about today.
What is the future of cancer genomics? It will be, get a tumor, get RNA and DNA from
the tumor and sequence. Sequence what? Well, in the first instance, by sequencing in limited
ways you can get whole genomic copy number and rearrangement. You can sequence all the
exomes, as Richard Gibbs has referred to. You can sequence from cDNA, as Rick Wilson has
referred to. You can make chromatin and methylation maps. And all of that, all told, the bill
is less than probably 100 million short reads. And 100 million short reads is not such a
big deal anymore, or won't be such a big deal anymore in the next couple of years.
This isn't re-sequencing the entire cancer genome. The entire cancer genome is probably
3,000 million short reads, which is still unthinkable for the next 12 to 14 24 months
or so, but probably in the not-so-distant future. Nobody will fuss over the first couple
of lines, we'll go to the latter but, you know, those of us who are highly practical
say the first four lines, there will be the focus for the next five years, and then it
will be more and more focused on probably being able to do the whole genome.
Anyway, genomic information. There are so many kinds of genomic information. There's,
of course, all the sequence in the genome, there's all the genetic variation of the population
and its relation to disease. All these functional maps emerging from conservation, from chromatin
state. These signature maps like connectivity maps that let you look things up. Or these
tools like RNAi inhibitions and databases that are built of the effects of RNAi inhibition.
All of the cancer mutations, we're just barely at the starting point to that but I predict
we are going to see an explosion of that over the next five years or so. I haven't talked
about it, but Claire Fraser has referred very much to the genomes of all major infectious
organisms and really being able to detail those as well.
For the young people in the audience, this isn't what biology looked like two decades
ago. It really was a world where what you did on your bench was primarily the data you
were looking at. Now what you do on your bench is the starting point, but of course, it's
comparison to everything out there, all the genomic information out there in the world
is at your disposal. We are by no means done. The Human Genome Project, good start, but
there's a lot more still to do. There are many projects here and there are many more
still to go, and I encourage all of you to be thinking whenever you do any experiment,
ask if I'm going to do it more than three times what's the genomic resource that would
have been helpful for me to have? It is a remarkable, remarkable period we're living
through. It still is very much unclear where and when it will end. We keep thinking maybe
it's going to top off, but I see no sign of it topping off for quite some time to come.
Well, I want to close by acknowledging the obvious, which is, this is the work of an
extraordinary community. I want to acknowledge my own colleagues at the Broad Institute,
many of them working in many of these areas, whom it's been fabulous to work with.
And I can't say enough about what a friendly and collaborative spirit there is there in
Boston amongst MIT and Harvard scientists and Harvard hospital scientists. But I also
want to acknowledge something you often don't acknowledge, which is the extraordinary role
of consortia. So much of what I have talked about was not the result of any one lab, not
any one institute, not any one city but it was the result of being willing to put together
consortia to get things done. And there has been this floating group of consortia, I just
put down some of the ones whose data I have referred to here. Human Genome Project, SNP
consortia, RNAi consortia, all sorts of consortia that have emerged over the years, and this
has become such a powerful way to do science in the age of genomic information.
And then lastly, I want to make a special acknowledgement to the sequencing centers.
Over the course of now almost 18 years, the sequencing centers have worked together in
all sorts of combinations to help try to bring about this revolution and get data out rapidly,
and I think we all feel an enormous bond to each other. I want to acknowledge WashU and
Baylor and TIGR and Sanger and the Joint Genome Institute, the Stanford Genome Center and
others, and I particularly want to acknowledge, because it's a birthday party, NISC, for the
extraordinary role it has played in making sure that this genomic revolution and genomic
information that is happening all over the world is happening in spades here on the campus
of the NIH.
Anyway, this has been a great day, a great birthday party. The great thing about celebrating
a first decade, in this case, is that one can be sure that the next decade is going
to be vastly more exciting. So thanks for the opportunity to kind of tie it all up today
and hats off to everybody here for what they're doing. Happy birthday.
[applause]
Dr. Francis Collins: So, we have time for a couple of questions
before we adjourn to a reception. While people are finding their way, Eric, clearly the ability
to generate vast amounts of data is outstripping, I think, most people's expectations, although
I suppose it shouldn’t be said that we weren't sort of warned about this. Are we going to
keep up in terms of the analysis capabilities that we have to put together to make sense
out of all this or are we facing a mismatch in terms of algorithms, in terms of trainees?
Are we in trouble or is everything just nicely dovetailed?
Dr. Eric Lander: Oh, golly. Well, I have enormous faith, over
the long term, in young people. I think it's clear that the next generation has already
figured out that there's no distinction between being a wet scientist and a dry scientist.
They're all recognizing they're damp. That they are, they are both. And we're seeing
many more people going into biology now who consider it [unintelligible] bioinformatics
training and such. So if you say, over the course of the next 15 years, will the young
people lead us into the promised land by virtue of their understanding this new world, you
know, us old generation may not fully enter that promised land, but the new generation
will, and they understand it. Now, will they all fully show up in full force within the
next 24 months to deal with the data, or will there be this deluge of data beyond what the
existing training base is?
Oh yeah, we're going to be just overburdened with tons and tons of data. But that's okay.
I mean, you know, we'll, we'll manage to extract the most interesting things that we see in
the data so far, and then as more and more people come in more things will be extracted.
The thing we've got to do is make sure that the training programs are there. We've got
to make sure, I mean, I hardly need to say it, because this is something I think NIH
believes deeply in. But NIH is the leader in training in the world here and we've got
to make sure that essentially everybody going into biology, even if they think they're going
to be a cell biologist, they need some cellular process, understands how to connect to this
world, and also that we bring in large, large numbers of people who have real training in
mathematics and computer science, et cetera. So in a 15 year time horizon, I think the
whole notion of what it means to be a biologist will change, and the young people here will
solve it. In the short term, well, we're just going to do all the paddling we can do to
stay afloat.
Dr. Francis Collins: Okay.
[end of transcript]