Well, thanks, Steve, and thanks for the chance to be part of this historic symposium celebrating
this quarter century of GenBank's contributions to biomedical research, which are indeed substantial
and it's hard to imagine how we would be anywhere near where we are right now without the dedication
of the folks that have been engaged in this for all these 25 years. The challenges that lie ahead
are at least as daunting - to be approached in a somewhat anxious way - as what we've traveled through before,
because clearly the pace of discovery is accelerating exponentially,
and that's going to, I think, put a great deal of stress upon David and Jim and the rest of the
very capable GenBank and NCBI staff. But I am sure, based on past performance, that they're up to it.
Since this is a historic occasion, I thought back a little bit on the ways in which the use of
sequence data has played out in my own personal experience, and I recalled my first real major
involvement in trying to determine sequence at reasonable scale - at least it seemed
really reasonable at the time - and that was in fact the publication of a paper in 1984 that had this
as one page of a very long appendix. This was an effort that I conducted while working as a post-doc
in Sherm Weissman's lab in Yale to pull together all of the DNA sequence that had been derived
on the human beta globin cluster as of that point. So this was just about 25 years ago, and in fact,
what you're looking at here is an output; I actually had to travel across campus to find this
fancy piece of equipment called a laser printer in order to make this very nice output.
The output itself was painstakingly put together by me, basically by taking sequence papers that
had been produced by various groups. The Weissman lab - partly myself, and partly others - then
tried to fill in the gaps, because people hadn't actually tried to do this in a contiguous way.
And I pretty much took all of the parts that were not already in electronic form, which was most of it,
entered it by hand, and had my then 14-year-old daughter do the proofreading.
And she got paid two dollars an hour, which was pretty good, although she got pretty tired of it
before this was over. And then I annotated it as you can see there with some very clever indications
about what's present. This happens to be a stretch of DNA just upstream of the beta globin locus.
You'll notice from the numbers over there on the far side that we're already up in this huge
stretch of DNA, more than 40,000 base pairs. At this point this was, I think, the largest
contiguous stretch of human DNA that had been assembled. And, yes, I did notice there's this
run of Gs and Ts. I had no idea that would someday be called a microsatellite and would
become a major force for genetic mapping, because that hadn't really been realized in 1984.
There's also actually a repeat down here, a pentanucleotide repeat, which I knew from my own
sequencing work was actually variable between individuals, but I wasn't clever enough to realize
that might be a really useful kind of genetic marker. So that was one page of it, and then here comes
the other page, next page along. This is the beta globin gene. I'm sure you recognized it because
it's labeled there in this elegant annotation. You can see where the transcript starts. You can see
where the initiation codon is and where the introns are. And, oh, yeah, there's an EcoRI site there
in case you care. So this was my exposure to what I was quite sure I didn't ever want to do again,
which was to try to put together this kind of sequence and then display it in some fashion
where people might be able to use it - and I'm sure nobody used this unless they were really desperate.
We had to do better. Well, five years later, I found myself participating in the construction of
this particular diagram for a published paper in Science. This, again, looks like an awful lot of
sequence data - in this case annotated with the amino acids and with some domains that are marked off.
You might wonder what is this. Well, that little triangle right there is the common mutation
that causes cystic fibrosis, so this was in fact the description of the cDNA sequence of the CFTR gene.
But again not much benefited by a lot of the things that we now take for granted, because it was early days.
Although, I will say in this case, this was much easier than it had been five years earlier for the
hemoglobin conditions. So clearly we've come a long way, and a lot of that is, I think, a credit
to GenBank and to the things we're celebrating here at this meeting. I looked back at the timeline
that was put together in 2003 to celebrate the completion of the goals of the Human Genome Project
and, of course, there was much debate of what belonged on the timeline, what were the really
major events of the last 150 years, and you can see the ones that were chosen here for the first part of it,
and then you go to the bottom part of it. And I was happy to notice - I don't quite remember how we decided this.
There's GenBank, ranking right up there with Mendel in terms of making it to the timeline.
And also the formation of the International Nucleotide Sequence Database Consortium in 1986,
most of which were nicely talked about in presentations yesterday by those who were
deeply involved in these remarkable moments where really important plans were made
by thoughtful people to try to catalyze where we were heading. Of course, where we were headed
then morphed into the Human Genome Project, which had its own series of milestones,
all of which, I think, really could not have happened without being undergirded by the databases
and the wonderful work that NCBI in particular was doing. I want to just highlight - and it was
mentioned a minute ago by Steve in the very nice introduction - what a critical thing happened in 1996
which was the agreement amongst the International DNA Sequencing Consortium for the human
genome, that all the data was going to be made available, and this is actually a photograph of the
white board, which was written upon at that point in 1996. I think this is John Sulston's handwriting.
It's either John's or Bob Waterston's, and this is what the group agreed to, gathered there at that
meeting in Bermuda, which sounds elegant, but I don't remember anything about it except for
it had a very dingy conference room, and we gathered there for two and a half days to try to
figure out how we were going to actually tackle this problem of sequencing all of the human DNA
at the point where it just began to look like it was possible to ramp that up. And this is what we
all agreed to - the conditions about release and immediate submission - aimed at having all sequence
freely available and in the public domain for both research and development, in order to maximize
its benefits to society. And that was radical. And again you can see here automatic release of
sequence assemblies greater than one kilobase, preferably daily, and that was what was agreed to.
That really set in motion a series of increasingly broadly accepted ideas about public access to data
that continue to spin out today and were nicely represented by Betsy Nabel's presentation about
what we are doing about genome-wide association studies and making those available to investigators
long before publications can be generated in a way that I think really has produced a whole new
approach to try and maximize benefit to society by making access immediate for investigators
who would like to get their hands on the data and start to work with it. The second part of
the Genome Project then continued apace. Again, I might say, we couldn't have made those
Bermuda rules if there wasn't a database ready to receive all that data, and so the existence of
GenBank was critical for that to happen. The draft genome in 2000, the published draft in 2001,
and then in 2003, the essential completion of the human genome sequence with now much more to go.
So that's about the past, and I should say those who contributed to that sequencing of the human genome
are a remarkable group, more than 2,000 of them; pictured here are primarily the leaders of
those 20 centers that gathered together, with strong support for those immediate-access data rules,
to make all of this happen in record time, ultimately producing the genome two and a half years early
and, happily for taxpayers, at a price tag about $400,000,000 less than what had been anticipated.
I just want to recognize one person - since we're here talking about NCBI - who is a big part of this
and who I don't think has been mentioned in this symposium, and that's Greg Schuler, who was
a very important part of the effort to deal with the assembly of the sequence and the display of
that sequence for the world to look at. So I've had the great fortune of being able to be part of
these efforts, both the sequencing of the human and many other large-scale projects, and I very much
know what Wilson is talking about here in terms of the way in which these things can either
succeed or fail - that by borrowing the brains of a lot of smart people, you can make
amazing things happen, and I've been fortunate to be able to do that. So now let me turn to where we are
at this point, to give you some notes from the front lines, as I see it, of the genomic revolution.
I'll go through about six points, and this is going to be a little arbitrary because there's so much
going on right now that I had to do a bit of picking and choosing. Of course, one of them is
comparative genomics, the ability to look not only at our own genome but at those of many other species,
now amounting to more than 30 vertebrates that have had their genome sequences determined
either in draft or in finished form: the mouse, of course, the chimpanzee, the dog, the honeybee,
the sea urchin - obviously those last two aren't vertebrates, but this one is - the macaque. I just picked
the ones that had covers, and I guess I left out the rat because it had such a horrible cover
on Nature, but ultimately it should be in this list as well. I think what we've learned
from this is an enormous amount about evolution, basically being able to look at evolution's lab notebook
which is what genomics allows you to do, and identify which parts of the genome continue to be under selection.
And we have a lot more to learn from that, and one of the things we hope to do now is to focus
specifically on primates and try to learn more about recent evolution in terms of how it has affected
our own species. One of the things that has been talked about a little bit at this meeting - and
maybe David will talk about it this afternoon - is the way in which there's some serious interest
in trying to be sure that the data is curated in GenBank so that when you go there, you can be
pretty sure that what you're looking at has high quality attached to it. Obviously the RefSeq approach,
which was mentioned in a really wonderful talk yesterday by Jim, is a big part of that.
But there is still this question about what to do with the large production genomic projects
including the human, the mouse, and others, and the idea of having a community go in and
annotate these in a true Wikipedia fashion doesn't seem to be a good idea, and I think David
makes that case very strongly in this recent News and Views coming from Science Magazine.
Just as a point of information, this is something we at the Genome Institute are certainly
concerned about: having put all this effort into generating these reference sequences of various species,
we would not want them to fail to keep up with corrections and additions that make them
even better, because of course none of these were absolutely perfect. The human genome sequence,
when we called it essentially complete, still had about 300 gaps in it that couldn't be closed by any
known technology, but are now gradually getting attacked. So there is now a Genome Reference Consortium
which has as its specific mission to correct a small number of regions in the human genome reference
that are currently misrepresented, and to serve as a central point of contact for community members
who think they've found something wrong or have fixed something that previously was ambiguous.
That involves particular leadership from Wash U and the Sanger, and from the bioinformatics groups
at EBI and NCBI, represented in particular by Deanna Church. So there is a serious intention here
to continue to improve those sequences as more information becomes available, but not in a way
that generates chaos, hopefully in a way where any change that's made in the reference sequence
is well backed up by experimental data. Meanwhile, of course, DNA sequencing - which for most
of the last 25 years has been in a fairly stable state, depending upon Sanger dideoxy sequencing
and gel-based separation of the products - is, as all of you know, undergoing
revolutionary technical advances that allow us to proceed in a massively parallel fashion and
generate amounts of data that are truly breathtaking. Some of those are coming from this instrument
already mentioned, the 454, which carries out this kind of parallel sequencing on beads;
some other data coming from the Illumina Solexa instrument where the parallelism is done in clusters
on a solid surface - here shown in a microscopic picture, each one of these representing a clonal
population of DNA molecules that then can be sequenced in place. And, of course, we're not done yet.
If you saw this paper in PNAS, suggesting really dramatic acceleration might be possible from
this company called Pacific Biosciences, using zero-mode waveguides and single-molecule sequencing -
truly single-molecule sequencing, perhaps with very long reads - if this in fact reduces to practice
in the space of the next couple of years, this could really open up the ability to collect sequence
data at a dramatic level. As could this one, a paper just published last week, coming
in this instance from Helicos, showing that they have reduced to practice the ability to do single
molecule DNA sequencing, and they applied this in a fashion that is pretty impressive,
to a particular viral genome, namely M13, as a proof of principle - neither of these probably
going to make a big difference in sequence output in the next 12 months, but wait until we see
what happens in the next year or two. The disruptive innovations that are characterizing this field
are coming along hard and fast, and if any of these succeed at the rate that might be projected,
we may get to that thousand-dollar genome a lot sooner than the seven or eight years that many people
have been predicting, and of course the challenges for GenBank and NCBI, if this comes to pass, are only
going to accelerate on a log scale - and it's a good thing we have capable leadership prepared for that.
I'm sure we do. Let me go on to a particular application of high-throughput sequencing
that I think is turning out to be very exciting, and this is a joint effort of the Genome Institute
and the Cancer Institute. Of course, cancer is a disease of the genome. We all understand that now,
although we didn't really know that until about 30 years ago. It's now quite clear,
and this particular quote is an interesting one, because it comes from Renato Dulbecco in 1986
in a piece written for Science, which was alluded to in yesterday's presentations. So one of the
very first calls - the very first published call - to do the genome was on the basis of understanding
cancer, and Renato, I think, made a very strong and effective case for that. And that has resulted,
in the past year or so, in the formation of the pilot effort for The Cancer Genome Atlas,
which is trying to bring together all of these components to apply high-throughput sequencing
and other means of genome characterization to cancer, and particularly we have chosen three tumors
to start with - glioblastoma multiforme, ovarian cancer, and squamous cell carcinoma of the lung
for each of those attempting to collect hundreds of tumors for which there is also DNA available
from the same individual, so you can tell the difference between somatic and heritable changes.
There's technology development also associated with this enterprise, to try to push the agenda
forward for doing this in a comprehensive and cost-effective way. I'll show you some very new data
from the Glioblastoma Project, the first one that has come out of this - and this is only about a week old
at the meeting of the TCGA analysis group that was held here just very recently - and basically
what one needs to do if you're going to analyze a tumor of this sort is collect lots of samples,
and this is actually taken from the glioblastomas. Let me explain what this picture is telling you.
Here are the human chromosomes 1 through 22; each column here - and you can see there are a lot
have a gain or a loss - with a loss being in blue and a gain being in red - and what you can see here
is some patterns are appearing. This is using a statistical method called GISTIC, to be able to try
to assess what's going on, and you can see both there are whole chromosome changes or
chromosome arm changes, but in a few places like that thin blue line there there's something
more specific going on in a narrower way. One can then move to the more careful analysis
using GISTIC, which is an effort that Gaddy Getz at the Broad has put together along with others.
And what you come up with, in terms of recurrent deletions and amplifications, is summarized here
with a significance value attached, in order to be sure that you're not just looking at noise.
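To make that concrete: below is a toy Python sketch, with entirely invented data and thresholds, of the core idea behind this kind of recurrence analysis - score each locus by how often and how strongly it is altered across tumors, then compare against a permutation null. It is emphatically not the published GISTIC algorithm, whose scoring and false-discovery machinery are far more sophisticated.

```python
# Toy sketch of recurrent copy-number scoring in the spirit of (but not
# identical to) GISTIC: score each genomic bin by frequency x amplitude
# of gains across tumors, then ask which bins exceed a permutation null.
import numpy as np

rng = np.random.default_rng(0)

# Invented input: log2 copy-number ratios, 100 tumors x 5,000 genomic bins,
# with a simulated focal amplification recurring in 40 tumors.
cn = rng.normal(0.0, 0.3, size=(100, 5000))
cn[:40, 2000:2010] += 1.5

def g_score(matrix, threshold=0.5):
    """Sum of above-threshold gain amplitudes at each bin (frequency x amplitude)."""
    gains = np.where(matrix > threshold, matrix, 0.0)
    return gains.sum(axis=0)

observed = g_score(cn)

# Null model: permute each tumor's bins independently, preserving its
# overall aberration burden while randomizing genomic position.
null_max = np.array([
    g_score(np.apply_along_axis(rng.permutation, 1, cn)).max()
    for _ in range(200)
])

# Bins whose observed score beats 95% of the null maxima are candidates.
cutoff = np.quantile(null_max, 0.95)
print("candidate recurrent bins:", np.flatnonzero(observed > cutoff))
```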
Many of the things that are deleted and amplified in these glioblastomas are in fact recognizable
as genes that had been previously implicated in this particularly devastating cancer.
But some of them certainly are not - strong peaks that have no particular assigned gene, as you can see here,
are in fact of great interest. Now, on top of doing the copy number analysis, which one can do
quite readily now at high throughput, this program has also begun to involve high-throughput sequencing,
and some 600 genes have now been sequenced - the exons of them, that is - in about 90 tumors.
That's just a summary of the deletions and amplifications. And here's what's been found -
and again this is very new data, although the mutations that have been discovered have been
validated by an independent platform, so we believe that they are not false positives from the sequencing.
Here is the number of mutations found in these 86 glioblastomas after sequencing about 600 genes.
It's interesting that the positive controls, which are in purple - those known to be mutated in glioblastoma -
all turn up at the sequence level. But then there are all these other things in blue, and the two
that are most frequent so far in this group include an old friend of mine - neurofibromatosis type 1,
a gene that my lab identified in 1990, a gene which people have said plays absolutely no role
in glioblastoma multiforme, apparently having missed the mutations because it's a large,
complicated gene with a lot of non-processed pseudogenes that make it very hard to work with.
ERBB2, the gene which when amplified is associated with response to Herceptin, also turns up in glioblastoma -
not as an amplified gene, but actually as one with point mutations in both the extracellular
and the kinase domains - and this has very interesting connotations for possible therapeutics, as does this.
So this is just an early peek at what is likely to come out of these analyses and as TCGA scales up
there's going to be lots more of this data, again all being deposited immediately into a database
where authorized investigators can go and see the information long before publication,
following essentially the same model that you heard about from Betsy for GWA studies.
So we're pretty excited about this, and Renato Dulbecco, writing a little essay for
Scientific American, reflects upon just what a good thing it is that his dream is now finally coming true.
Furthermore, not only sequence information but other approaches to function are scalable,
at least some of them are, and a lot of work is being done there mostly to empower investigators
to be able to go faster than they could if they were dependent upon having to do all these things
in their own laboratories. One of the things that we've been particularly doing is to try to define
the parts list of the human genome, and to attach those parts to particular functions,
and the ENCODE Project, which published last summer the results of its pilot, focused on a
carefully chosen 30 megabases of the genome, brought together more than 30 groups that had
different approaches to try to understand genome function and produced a very interesting intersection
of those datasets, because it includes transcript identification and analysis, a lot of comparative
genomics, all kinds of information about histone modifications that are associated with active or
not-so-active chromatin, DNase hypersensitivity, and even origins of replication and
transcription factor binding sites. Accompanying papers in Genome Research filled out a whole
special issue at the same time - just a wealth of information that we've never really had all together,
focused on the same stretch of DNA. And this is now being scaled up to attack the entire genome.
As of last fall a number of groups have been funded to do this and are hard at work to try to do
what we can to decorate the genome with this functional information that many people would love
to get their hands on. We also have very recently initiated a similar ENCODE project focused
on flies and worms, the modENCODE Project, recognizing that in that situation - because these
organisms are so well understood and so readily manipulated - we have a much better chance
to do an even more thorough job of functional analysis in a coordinated way. A number of
highly regarded groups came forward and applied for the chance to participate in this,
and it is all getting underway in a very exciting fashion that should be well worth watching.
In addition, of course, the mouse continues to be an animal of huge interest amongst researchers
who are trying to understand gene function, and we do now have this international
Mouse Knock-Out Consortium, which aims to knock out every one of the mouse protein coding genes
in the course of the next three or four years, as an international collaboration between the NIH
effort, which is called KOMP, the Knock-Out Mouse Project, the European effort called
EUCOMM, and the Canadian effort called NorCOMM. If you're interested in actually making sure that
your favorite mouse gene gets put on the list for Knock-Out, sooner rather than later, you can do so.
Simply type K-O-M-P into Google. The first thing you'll come to is a radio
station in Los Angeles. Don't pay any attention to that. Go to the second entry, which is the NIH
Knock-Out Mouse Project, and it will direct you how to put in a request to have a particular
mouse locus targeted by an approach which is pretty sophisticated. It's based on homologous
recombination. It will generate a conditional-ready null allele, which can then be distributed
at very reasonable cost as an ES cell, which you can then turn into a mouse
and figure out what the phenotype might be. In addition, there's been a considerable need
over the course of several years, to have a set of full length cDNAs available for individuals
who are interested in studying either a single gene's function or sometimes a whole suite of genes
and we all know - those of us who have had the experience ourselves or watched our students
or post-docs trying to do it - that it is really not a very exciting thing to try to get that last
5' exon of a gene that you're interested in that just seems to be resisting you. And so the idea of
doing this as a comprehensive resource was generated in fact quite a number of years ago,
and has been jointly managed by the Genome Institute, the NCI, and the NCBI. And the NCBI
has been critical to this effort, most recently with Lukas Wagner playing a very helpful role
in the curation of this database. We're getting pretty close to having what you could call
a complete set of both human and mouse genes. As you can see by these numbers, the last roughly
1,000 or 1,500 of each of those are now actually being synthesized because it turns out when you
get down to the last dregs, it's more efficient to do that by synthesizing them from scratch
as opposed to trying to pull them out of a cDNA library where you can't seem to find them
or by rescuing them by RT-PCR. Many of these are large cDNAs, greater than six kilobases,
which are very hard to find in full length in other places, but you can synthesize them quite readily.
Well, not quite readily, this is still a bit of a challenge. I must say this experience has taught me
that we aren't quite at a point where you can just dial in the sequence you want and get it tomorrow.
The DNA synthesis capabilities still run into trouble with certain types of sequences.
But at any rate, this resource, which again is completely available from numerous distributors
for only the cost of sending out the clone, should be an empowering one for people who are
trying to understand function. And for those who are used to using mouse knockouts or siRNAs
to look at the performance of a particular protein in a particular circumstance, or looking at a pathway,
or even looking at a cellular phenotype and trying to understand how to modify it, the availability
of small molecules as an additional perturbation for these kinds of experiments is a really
empowering new appearance on the scene - one that many people, I think, have still not completely
become aware of, and which you could certainly benefit by learning more about if you have not
taken advantage of it.
So this is a resource supplied through the NIH Roadmap: investigators who wish to develop
a small molecule - an organic compound that will act as an agonist or an antagonist or some other
kind of perturbing force on their favorite phenotype - need to develop an assay that will perform well
in a high-throughput setting, primarily in 1536-well plates. That can then be peer reviewed
and, if accepted, put into one of the high-throughput screening centers. Pictured here is the one
up in Gaithersburg, the NIH Chemical Genomics Center. They will screen a library of now
300,000 compounds. Hits are generally found. Those can then be optimized a bit with some
medicinal chemistry, making sure that they have appropriate solubility, specificity, and so on,
and then the compound goes back to the investigator and importantly, the compound also
goes into PubChem, a database which NCBI has founded, which is the first time that we
have had a public database of small molecules, because most of this information has remained
out of view, behind the curtain, or in a subscription database that many people did not have the
resources to buy into, so PubChem has really opened up once again a whole new territory
in terms of biomedical research that has been very empowering, and much credit there
to Steve Bryant who has curated that, and who I'm happy to hear is recovering from a very
serious accident earlier this year, and we all wish him the best after that very difficult time
that he's been through. So these are just some of the resources that I think are driving the
forward motion of experimental determination of function, and so, yes, high throughput doesn't
have to mean no output. High throughput can actually mean a lot of exciting functional results,
and we're seeing those happen all around us as a consequence of these available approaches.
You've already heard from Betsy just what a dramatic set of things have happened in cardiovascular
disease as far as genetic factors in common disease being revealed, and this has in fact been a
glorious 18 months or so of such discoveries. And it was a good thing, because we weren't doing
very well until fairly recently. This graph from 2002 shows you how dramatically well
we did in terms of discovering genes for Mendelian traits in the 1990s, empowered particularly
by Genome Project tools, and how dismally we had done in terms of identifying genes for
common disorders, with only seven such human complex trait loci being accepted by these authors.
And, boy, has that changed, especially in the last year or so, and it's changed because of a couple of things.
One being the HapMap Project, which enabled us to understand how genetic variation
is organized across the genome, so that one didn't have to test as many individual variants as you
thought you would, because they're traveling in neighborhoods, and you can choose a small subset
as a proxy for all the rest. That makes it possible, with SNP chips that have only half a
million SNPs on them, to have a pretty good chance of representing the rest of the SNPs in the genome.
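For readers who want the mechanics of that proxy idea, here is a minimal sketch of greedy tag-SNP selection on simulated genotypes - repeatedly pick the SNP that covers the most still-untagged neighbors at r-squared of at least 0.8. Real tag-selection methods are considerably more refined; everything here is illustrative.

```python
# Minimal greedy tag-SNP selection: one SNP per correlated "neighborhood"
# proxies for its neighbors at high r^2. Simulated genotypes, toy method.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical genotypes coded 0/1/2 (minor-allele count),
# 200 individuals x 50 SNPs, built as 10 blocks of 5 correlated SNPs.
base = rng.binomial(2, 0.3, size=(200, 10))
geno = np.repeat(base, 5, axis=1)
geno = np.clip(geno + rng.binomial(1, 0.05, geno.shape), 0, 2)   # add noise

r2 = np.corrcoef(geno.T) ** 2        # pairwise LD as squared correlation

def greedy_tags(r2, threshold=0.8):
    untagged = set(range(r2.shape[0]))
    tags = []
    while untagged:
        # pick the SNP that proxies the most still-untagged SNPs
        best = max(untagged, key=lambda i: sum(r2[i, j] >= threshold for j in untagged))
        tags.append(best)
        untagged -= {j for j in untagged if r2[best, j] >= threshold}
    return tags

tags = greedy_tags(r2)
print(f"{len(tags)} tag SNPs cover all {r2.shape[0]} SNPs at r^2 >= 0.8")
```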
The other thing that's helped hugely is this profound drop in genotyping costs over the course
of the last few years. When a genotype cost 50 cents, as it did back in 2002, the idea of doing a
genome-wide association study was just prohibitive. Now that it costs about a tenth of a penny,
things have really come around.
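The arithmetic behind that is worth a moment; with illustrative round numbers (a 500,000-SNP chip, 1,000 cases and 1,000 controls):

```python
# Back-of-the-envelope cost of one GWAS at the two per-genotype prices
# quoted in the talk (illustrative study size, not a real budget).
snps_per_person = 500_000           # a typical genome-wide SNP chip
people = 2_000                      # e.g. 1,000 cases + 1,000 controls

for year, cost_per_genotype in [("2002", 0.50), ("now", 0.001)]:
    total = snps_per_person * people * cost_per_genotype
    print(f"{year}: ${total:,.0f}")
# 2002: $500,000,000  -> prohibitive
# now:  $1,000,000    -> feasible
```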
And the results of that, as you can see in this diagram that was put together by Teri Manolio
and Lisa Brooks and myself for a review that's coming out next month in JCI, are really pretty
amazing when you look at the decoration of the human karyotype with newly discovered variants
associated with common disease. The first HapMap success, in 2005, was macular degeneration
and complement factor H. In 2006, three more of these were discovered.
There's that NOS1AP that Betsy mentioned for QT interval prolongation. And in 2007 everything
just broke wide open. This is just the first quarter of 2007 - the second quarter -
the third quarter, the fourth quarter, and this carries us up to about February 1st, and I need to
update this slide because it's already out of date. Every one of those banners represents
the discovery of a variant in a locus that plays a role in a common disease or a quantitative trait
in humans, and each one of those reveals a remarkable story, because most of them are really
quite unexpected in terms of which genes turned out to be involved in conferring risk to disease.
Let me say, however, that while this is a glorious kind of moment of discovery, what you're
looking at there explains only a tiny fraction of heritability for most of these conditions.
So a big question mark is: where is the rest of the heritability? Is it in those rare variants, like
the PCSK9 example that Betsy mentioned, which Helen Hobbs has shown us for LDL? Is it in those
copy number variants - large stretches of DNA which may be deleted or duplicated, which are
more difficult to test but which are increasingly coming within our grasp, especially with
high-throughput sequencing using paired-end reads? Is it that there are gene-gene interactions
that we have not been able to assess adequately? Are there gene-environment interactions such
that we're missing out on the goodies there? Or have we overestimated heritability for some
of these conditions? We don't know the answers to that. My own bet is that there's a whole
spectrum of rare variants undergirding the common ones, and now that we have sequencing ability
we should be able to begin to look at those as well. Will they be in the same loci as the common
variants? I don't know. We should be able to find that out. But I can tell you, as somebody who
has worked on diabetes for the last 15 years in my own lab over here in Building 50, that we have really
moved into totally new territory in the last year through the availability of these kinds of studies,
which have uncovered some 16 loci that are associated with that disease and which are shedding
entirely new light on our understanding of what the really fundamental molecular basis of the
condition might be. We still have a lot to learn, but we finally, I think, have blown away a
little bit of the fog. And all of this was recognized by Science, which named human genetic variation
the breakthrough of the year. As Betsy has already ably demonstrated - and again I would also
like to give great credit to Jim Ostell and Steve Sherry - the existence of the dbGaP database is
making all of this information rapidly accessible to qualified investigators, to begin to do the
hard work of understanding how this all works. Because, after all, what does a GWA success tell you?
If you achieve genome-wide significance - and you've got to be really careful about the statistics -
and if you've ruled out population stratification or some technical problem, then you can say
there is a common variant located somewhere in a segment - and the segment might be small or large,
depending on whether linkage disequilibrium is weak or strong in that area - that's associated
with modestly increased risk of disease. Modestly - I mean the odds ratios are often 1.2 to 1.3
in the particular population that you've studied - and obviously you don't want to stop there.
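To spell out what such a result looks like numerically, here is a small sketch with made-up allele counts: the odds ratio from a 2x2 allele table, checked against the conventional genome-wide significance threshold of 5e-8 (roughly a 0.05 Bonferroni correction for a million effective independent tests).

```python
# Odds ratio and significance for one hypothetical GWAS SNP.
from scipy.stats import chi2_contingency

#                risk allele  other allele   (allele counts, invented)
table = [[2400, 1600],   # cases:    2,000 people -> 4,000 chromosomes
         [2170, 1830]]   # controls: 2,000 people -> 4,000 chromosomes

(a, b), (c, d) = table
odds_ratio = (a * d) / (b * c)              # ~1.27, the "modest" range

chi2, p, _, _ = chi2_contingency(table)
print(f"OR = {odds_ratio:.2f}, p = {p:.1e}")
print("genome-wide significant" if p < 5e-8 else "suggestive, not genome-wide significant")
```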
That is a clue, and now you want to go forward. Betsy showed a different version of a very
similar diagram in terms of what one then wants to do - move down this diagram to try to
get both to the truth of understanding how this makes sense biologically and ultimately to translation.
Notice that one of the things that everybody wants to do almost immediately after they have
made such an association is to say, okay, what are all of the DNA sequence variants that might
have driven that association, because I want to know that whole set. I know there's something
here in this stretch of DNA that differentiates cases and controls, but what is it? Because they're
all traveling together, and the database of human variation - dbSNP - is not yet complete.
And so, unfortunately, many people are then forced to go and do their own sequencing through
such an interval, in a fashion that is often inefficient, sometimes error-prone, and certainly
more expensive than it should be. So in order to try to deal with that issue and supply a catalogue
that would speed up this process and eliminate the need for every group to do this all by themselves
we have just begun an effort, with major participation from the Wellcome Trust and from the
Beijing Genomics Institute, to try to deepen that catalogue of human variation by sequencing
roughly 1,000 genomes at draft coverage, 2 to 4X, in order to be able to say that we have
essentially discovered the variants in the human genome that have a frequency of one percent
or greater. That would be very helpful in terms of accelerating the ability to follow up on these GWA studies.
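The rough feasibility math behind that design - why 1,000 genomes at low coverage suffices for variants at one percent frequency - can be sketched as follows, with illustrative numbers rather than the project's actual power calculations:

```python
# Why 1,000 low-coverage genomes can catch essentially all 1% variants.
import math

def prob_allele_in_sample(freq, n_diploid):
    """Chance at least one copy of the allele appears among 2N chromosomes."""
    return 1 - (1 - freq) ** (2 * n_diploid)

for freq in (0.05, 0.01, 0.001):
    print(f"freq {freq:>5}: present in 1,000 genomes with prob "
          f"{prob_allele_in_sample(freq, 1000):.4f}")

def prob_carrier_shows_variant(depth=3.0, min_reads=2):
    """Poisson chance that >= min_reads reads sample a heterozygote's variant allele."""
    lam = depth / 2            # half the reads come from the variant chromosome
    return 1 - sum(math.exp(-lam) * lam**k / math.factorial(k)
                   for k in range(min_reads))

print(f"per-carrier detection at 3X: {prob_carrier_shows_variant():.2f}")
# Any single carrier may be missed at 2-4X, but a 1% allele has ~20 expected
# carriers among 1,000 people, so the chance of missing all of them is tiny.
```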
There are no phenotypes associated with the DNAs in this situation, and a major reason for that is
that we wouldn't have enough power with any particular condition to draw any conclusions about
genotype/phenotype association in a set of 1,000. Furthermore, these are samples that have
been adequately consented, so that the data can be put completely on the Web without any
barriers at all. This is one of those rare examples where the information will be completely in
an open database, but the absence of phenotypes adds to the confidence that these individuals
are not facing a risk by having agreed to that. This will be happening over the course of the next
year and a half. This will involve not just looking for SNPs, but also looking for copy number
variation - again by doing sequencing with paired-end reads, to be able to identify that in a
systematic way.
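The paired-end logic is simple enough to sketch: read pairs come from DNA fragments of roughly known length, so pairs whose mapped ends sit much farther apart than expected span a deletion, and much closer pairs suggest an insertion. A toy illustration with invented numbers:

```python
# Flagging discordant read pairs as copy-number/structural-variant evidence.
mean_insert, sd = 500, 50       # hypothetical sequencing-library statistics (bp)
mapped_distances = [498, 510, 2450, 505, 492, 2480, 2465, 501]

for i, d in enumerate(mapped_distances):
    z = (d - mean_insert) / sd
    if z > 3:
        print(f"pair {i}: {d} bp apart -> likely spans a deletion")
    elif z < -3:
        print(f"pair {i}: {d} bp apart -> likely spans an insertion")

# A real caller requires a cluster of discordant pairs at the same locus,
# plus read-depth corroboration, before declaring a CNV.
```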
But of course that's just part of the problem. The big issue is going to be functional analysis,
and I think here is a place where individuals who are particularly motivated to try to understand
a particular locus, and who can bring the full power of every kind of functional assay to bear on it,
are going to make most of the major insights.
I do think there are one or two places, though, where we could accelerate that. One is, of course,
having access to tools like mouse knockouts and siRNAs and small molecules.
Another would be to try to have a resource that better enables us to detect the correlation
between gene expression and genotype, which at the present time has been done for
lymphoblastoid cells, but not so much for other tissues. Imagine that you have done a
GWA study and you're trying to figure out, well, okay, I've got a signal here. I know there's
something in this region that is associated with disease risk because here is my interval
and here are the SNPs that are giving me a statistically significant result, and that one's a little
better than the others. This is a pretty typical result, although this is a cartoon. You then go
and look at this interval to say, okay, what locus might be driving this, and fairly often
you find that you are landing in a region that's fairly gene-rich, and so you
have multiple loci that might in fact be involved. Well, any one of those could in fact be
driving your association. Oftentimes, you might have a hunch based on a biological argument
that one makes more sense than the other, but you should be worried about those hunches
because they often turn out to be wrong. So what would help you here, it seems, would be to be able
to say something about whether this SNP actually has a cis-acting effect on expression of one
of these genes, because if it does, that's going to increase the likelihood you've got the right
locus, and it may even tell you whether the effect of the association is over-expression or
under-expression of the locus at hand. So here is, in fact - and this is basically what was done
in that very nice study that Betsy mentioned from Cookson's lab on asthma, where they did this
using lymphoblastoid cells and discovered that ORMDL3 was the right locus - here is the idea.
You would need to have an appropriate tissue in order to be able to assess the effect of this
SNP on cis-acting gene expression. You would then look at the level of expression for the
three genotypes at that SNP, and you'd look to see whether there's any relationship or not.
You would not expect to see a dramatic difference, but here in this instance you can see that
one of these genes in the cartoon does have a relationship to the genotype at that strongest SNP,
and the other four don't. Now, in this situation I drew this as though you were actually testing
tissue from diseased individuals, but frankly it doesn't matter, because what you're really looking for
is whether there's an effect of that particular genetic variant on expression of this gene in a cis-acting way.
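At heart, that test is a regression of expression level on allele dosage. Here is a minimal sketch on simulated data - gene names, effect size, and sample size are all invented - in which one of five genes carries a cis effect at the SNP:

```python
# Minimal cis-eQTL test: regress expression on genotype dosage (0/1/2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 300
genotype = rng.binomial(2, 0.3, size=n)        # copies of the associated allele

# One gene tracks genotype (a cis effect); four nearby genes do not.
expression = {"gene_A": 10 + 0.8 * genotype + rng.normal(0, 1, n)}
expression.update({f"gene_{g}": 10 + rng.normal(0, 1, n) for g in "BCDE"})

for gene, expr in expression.items():
    slope, _, _, p, _ = stats.linregress(genotype, expr)
    flag = "  <- candidate cis-eQTL" if p < 0.05 / len(expression) else ""
    print(f"{gene}: slope={slope:+.2f}, p={p:.1e}{flag}")
```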
And so if one were to look at normal tissue - and this was done in that Cookson paper -
you get the same answer. So what does that say? It says we should in fact generate, as soon as
we can, a database of human tissues: pick maybe our 50 favorite organ sites, sample perhaps
a thousand tissues from different individuals for each of those organ sites, then do your most
careful quantitative gene expression analysis - which these days will be done by sequence tags,
not by hybridization - and then also do genotyping across the genome for each sample.
And in fact that is a proposal which is being floated around as part of the NIH Roadmap competition.
Obviously, there are huge challenges in terms of sample acquisition, but based upon experiences that
we're learning about from other groups that have tried to do this on a somewhat smaller scale,
this sounds like it might be possible through very rapid autopsies, which could harvest
many tissues at one time from the same individual, cutting back the amount of genotyping you
would have to do. I am sure this is a database that would teach us a lot about cis-acting
regulation of gene expression, and it would be particularly valuable for this purpose.
It would probably also tell us about trans-acting effects, because if you have a variation in a
transcription factor, you might expect to see that resulting in numerous target genes
being up or down a little bit, depending upon the abundance or the timing of expression
of that transcription factor, mediated by its cis-regulatory signals. That, of course, is going to be
a more statistically challenging problem, because you have to look across the whole genome,
as opposed to the local environment for a cis-acting effect, which is one reason one probably
needs a thousand different human donors for each tissue if you're going to have sufficient power.
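A crude power calculation illustrates why the trans search is so much hungrier for donors than the cis one; this uses the standard Fisher-z approximation for detecting a correlation, with an invented effect size and a Bonferroni correction for the number of tests:

```python
# Donors needed to detect a correlation r after Bonferroni correction.
import math
from scipy.stats import norm

def donors_needed(r=0.15, alpha=0.05, n_tests=1, power=0.8):
    """Fisher-z approximation for required sample size."""
    z_alpha = norm.ppf(1 - (alpha / n_tests) / 2)    # two-sided, corrected
    z_power = norm.ppf(power)
    z_r = 0.5 * math.log((1 + r) / (1 - r))          # Fisher transform of r
    return math.ceil(((z_alpha + z_power) / z_r) ** 2 + 3)

print("cis test  (~10 local genes):  ", donors_needed(n_tests=10), "donors")
print("trans scan (~20,000 genes):   ", donors_needed(n_tests=20_000), "donors")
```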
So this kind of thing would help us a lot, I think. Well, let me move on quickly - because I see
the yellow light has gone on - to some other applications in the clinical arena, which I think
are expanding rapidly, and here is perhaps a familiar diagram of the way in which all of this is going
to play out. I think we're doing pretty well in terms of finding common genetic risk variants
for common disease. Obviously a lot of heritability yet to be discovered, but as we get into the
position of being able to figure out who's at risk in circumstances where we have preventive
medicine strategies, that will be clinically appealing, and that of course is already possible now
in some cases of cancer. The pharmacogenomics approach, very nicely outlined by Betsy
in her description of what NHLBI is getting ready to do with warfarin - to try to demonstrate
in a rigorous, controlled fashion whether this does in fact improve outcomes - is potentially
going to be a very exciting application of genomics as well. And ultimately, the therapeutic
developments that we all hope for may be the biggest benefit, but they will be the longest in coming,
because of that long pipeline between getting an idea and having the FDA approve your drug.
One of the things, though, that I would like to quickly touch on is this common set of misconceptions
that I regularly hear about whether these GWAS discoveries are going to have any value
in terms of identifying new drug targets that we didn't know about and starting things into
the pipeline that might be truly novel approaches to common disease. Because people look at the
results and say, well, you know, you found a variant, but it only has an odds ratio of 1.2.
How could that possibly matter in terms of providing you information about a new drug target?
And similarly, people will say, yeah, but if you develop a drug based on that particular
discovery of a GWA signal, then it will only work for people who have the risk allele.
Both of those suppositions are in fact false, but they do immediately sort of attract attention.
And let me just point out a good example of how this can't be right, and this comes from my
own most familiar disease of type II diabetes. When we got together - my group, the group
at the Wellcome Trust, and the group doing this at the Broad, the DGI; my group's effort was
led by Mike Boehnke, the statistical geneticist at Michigan - these were the ten genes that
we together published about nine months ago in Science, and there were in fact a few of them
known before, but most were not. And interestingly, of these ten genes derived from this
genome-wide association study approach - these are not candidates; they're what came out of
the search - two of them, KCNJ11 and PPARgamma, are well-known targets of the mainstays
of diabetic therapy if you want to go beyond insulin: KCNJ11 codes for one of the subunits of
the sulfonylurea receptor, and PPARgamma codes for the target of the thiazolidinediones.
So here we have a proof of principle that if you hadn't already known that, you would have
discovered in this first pass at least two very exciting drug targets, and maybe there are others
here as well. Similarly, recent studies that we've been involved in, looking at lipid levels,
have come up with a whole bunch of newly discovered loci for LDL and HDL and triglycerides.
Interestingly, one of the ones that turns up for cholesterol is HMG-CoA reductase, of course,
the classic target of statin drugs, which hadn't previously appeared to have variants that played
much of a role in lipid levels, but when you have very large studies, you have sufficient power
to see that. So there's another one that would be awfully valuable if we didn't already know about it,
and doesn't it seem likely, therefore, that there are others on these lists that will also make
very exciting drug targets once we sift through, try to understand their function, and
figure out how to take that approach? Finally, let me finish by saying that all of these exciting
developments in the science - and they really are exciting - mean that we need to pay even
more attention than we ever have to the ethical, legal, and social issues, because otherwise
the benefits that we all hope the public will enjoy may end up being truncated by misuse of
the information that causes people to be injured or to just stay away from the opportunity.
And this is only accentuated by the fact that personalized medicine and genomics has suddenly
become a market opportunity for direct-to-consumer companies like the three that you see here.
All three of these will offer you, for between $1,000 and $2,500, an analysis of your genome
covering something in the neighborhood of half a million to a million SNPs, and then will
give you feedback in terms of what your predicted risks are for a long list of conditions,
as well as information about ancestry. And this is an interesting development and actually one
that I think causes many of us to be a bit anxious, because there is great potential for confusion,
great potential here also for health behaviors to be modified in irrational ways, and potential
for discrimination if people are not careful about how this information finds its way into
health insurers' hands or into the hands of workplace deciders who may think that this is
something you ought to know about before you decide whether to hire, fire, or promote.
That would at this point be unjustified on scientific grounds, and it would
certainly be unjustified on the basis of principles of equity and justice. A recent piece from
last Friday's Science, from Kathy Hudson and colleagues, makes a strong case that this is not
a stable situation and deserves a higher level of oversight than is currently being offered,
and goes through the case for that using CYP450 testing for SSRIs. And, of course, in terms of
the policies that would be most beneficial to at least eliminate the genetic discrimination risk
we are still waiting for a successful outcome after more than a dozen years of hoping to see
federal action on this. Here is a bill, introduced in 2003, that passed the Senate unanimously
in the 108th Congress, but the House took no action. That bill was then re-introduced in the 109th.
The Senate passed it unanimously. The House did not bring it to a vote. We're now in the 110th
and both Senate and House have taken it much more seriously and the House passed the bill
this time 420 to 3 just about a year ago. The Senate has not acted upon it. It is currently tied up
as a result of a hold that's been placed on the bill by a single senator, and as time is clicking by
the likelihood that this bill will get passed this year seems to be growing fainter by the day.
And this is truly a frustrating experience when you can see on a research basis how much harm
the absence of such protection is doing to our ability to do studies and how the solution actually
at this point seems to be fairly clear and actually embraced by many people. So there you have it.
A romp through a whole lot of components of what has happened and what is happening now
and what we hope will happen in the future with this approach to the genome, much of it
catalyzed by the wonderful folks at NCBI who had all of these remarkable abilities to archive
and display the data and in fact think about it. So Sidney's comment from this morning - which
I think was the phrase that was worth a thousand PowerPoints - was that it is better
to combine human intelligence with artificial stupidity than to do it the other way around;
he was referring to computers, but I'd like to put Congress on that list, too. And I guess the credit
that we should all give to David and to Jim and to all of their wonderful colleagues down
through these years as the overseers of NCBI and of GenBank, is that they have in fact provided
an awful lot of human intelligence to make this whole process go as well as it has. So my image
seems to be somewhat similar to Betsy's, although we didn't collaborate on this - fireworks
are appropriate. Happy birthday to GenBank. Congratulations to David and Jim. Thank you all.
[Applause]
Be glad to take any questions.
Can you go to the microphone? Because otherwise people on the Web can't hear.
A question about your closing comment. Is this a product in Congress of intelligent something?
Since I'm a federal employee, I should be very careful about how I answer that. First of all
I guess we should give Congress the credit for actually having started the Genome Project,
so there's a pretty important thing to point out. In this particular instance of genetic discrimination
I'm afraid that the wise-heads have not necessarily carried the day, but we all hope that they still will. Yeah.
My question may be somewhat naive, being a young scientist, but I was in a discussion recently
and we were discussing the difference that you mentioned about full sequencing versus
genome-wide scanning and the benefits of each, and I was wondering, for example, if there was
a new mutation that brought about a disease, how long would it take before genome-wide scanning
would be able to pick that up because of linkage disequilibrium?
Collins: So when you say scanning, I assume you mean sort of SNP-genotyping, GWAS kinds of studies.
Well, no, it's a very appropriate question, because GWA studies really can only detect a variant
that has a reasonably high frequency, that is at least three or four percent, so if you have a rare
variant that's less common than that, GWAS will utterly fail you; you won't be able to discover that.
You're basically depending upon common variants to play some role in common disease,
and I think we can now say they do, but we can also say they by no means explain more than
a small fraction of the overall heritability, and maybe the rest of that is then hiding down there
in those rare variants. Obviously, if we had the ability to sequence the entire genome, instead of
doing GWA studies, we would do so, and we'd be happy to give up GWAS for all time,
if we could afford to go after the entire sequence. And we're moving in on that, and certainly
for some diseases already, there are efforts underway to start sequencing all the exons,
figuring that that's probably a pretty good place to go and look if you're looking for particularly
severe conditions. But, ultimately, we'll want the whole genome, and let me say that means
we're going to be doing complete sequencing of hundreds of thousands of human genomes
in the course of the next few years, especially if these new technologies make that possible.
And not only is that going to be a huge problem in terms of storage of the data and display of the data,
it's going to be a huge problem in terms of the analytical capability of those wise computational
statistical folks who are going to have to make sense out of all this, because there's going to be
lots and lots of rare variants discovered that have nothing to do with the phenotype you're
interested in, but they're along for the ride in that genome, and if you're not careful, you could
make a very serious mistake in terms of deciding what's cause and effect and what's just the noise.
So we have, I think, more than ever the need for a generation of computational biologists to also
be human geneticists to help us through this next very exciting phase of really getting the whole
spectrum of how heredity plays a role in health and disease. Thank you. [Applause]