Dr. Richard Wilson: Thanks very much, Bob. It’s a pleasure to
be here so I’d like to thank Eric and his colleagues for including me among this group.
And it’s a pleasure to be introduced by Bob. We’ve had a number of interactions
over the years, mainly focused on finishing genomes. In fact, Bob, I think it was you
that actually compiled the list of the most difficult genomes to finish, and so we’re
now very excited to have in review, I guess, a manuscript describing the platypus genome,
which really was, as you told us, the most difficult to sequence.
So, as I said, it’s a pleasure to be here. I want to talk a little bit about some of
the next generation sequencing technologies. And you heard a bit of this from Richard so
I’ll try to be complementary rather than redundant. I do want to say first, excuse me, little
frog in the throat this morning, congratulations to NISC. You know, genome centers are all
about accomplishments and milestones and firsts, so I have to give you guys credit. I’m pretty
sure you’re the first genome center to have an anniversary party, at least, so, nice going.
But if you think about this, for those of you who really know Eric, this is not at all
surprising, because Eric loves to celebrate.
[laughter]
So I’ve known Eric for a long time, since 1990 when I moved from Caltech to Washington
University and he ended up just down the hallway from me and back then we had next generation
sequencing, at least the first iteration of next generation sequencing. We actually started
getting rid of radioactivity and the autorads that we used to, you know, manually read into
the gel. And so, we had a lab where we had a couple of these boxes that are shown over
there on the left. This is the old ABI 373 and Eric actually got one for his lab just
down the hall before he moved up to Bethesda. So this was exciting back in those days, 36
lanes, woo-hoo.
Of course we now have additional next generation sequencing. This is the next round and shown
on this slide are the three current platforms that we’re all trying very hard to understand
how they work and what the impact might be and what sorts of experiments that we can
do, that even just a few years ago we could only dream about doing. So this is the Solexa
instrument, now manufactured by Illumina, on the lower left, the 454 instrument up in the
center, and the new ABI SOLiD instrument down on the lower right.
So I want to try to talk a little bit about what we’re trying to do with some of these
platforms. I want to go back a little bit because, you know, the promise of the human
genome sequence was always that it was going to revolutionize biological research, going
to revolutionize medical research, and the way that we look at ourselves. And even just
having conversations among ourselves about how many people even in the auditorium would
actually like to have their Genome sequenced is pretty amazing when you think back on this.
But the reference sequence, version 1.0 of the genome, now allows us to ask amazing questions.
What’s come along with the version 1.0 sequence is an amazing amount of technology, software
tools and infrastructure, such as we have at all of the genome centers that are represented
here today, to allow you to actually use this reference sequence to ask interesting questions.
It’s buttressed by ancillary genome sequences, the mouse, the chimp and others, and all this
really allows us, as I said, to do experiments that we only previously dreamed about doing.
Now what we want to do is start applying all of this infrastructure and these resources to cancer
and other diseases. And really now with these next generation sequencing tools coming along,
it allows us to do this in new and exciting ways.
So if you think back on how we approached cancer initially with sequencing, you heard
Richard talk just a little bit about this; we used PCR, and we’d been using PCR-based sequencing
for many years. But one of the things we tried to do maybe five, six years ago was to try
to understand how to do it in very large scale, high throughput PCR-based sequencing. So we
came up with all kinds of computer tools to pick primers and to keep track of things but
the most difficult thing was then going through and looking at all of the data and trying to
find where we were actually seeing synonymous changes or non-synonymous changes and so forth.
But this was the paradigm a few years ago.
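As an aside, the synonymous versus non-synonymous call just described is mechanical once the data are in hand. A minimal, illustrative sketch (the table construction and function name here are my own, not part of any pipeline described in the talk):

```python
# Illustrative sketch: classify a single-base coding change as synonymous
# or non-synonymous using the standard genetic code.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def classify_change(ref_codon, position, alt_base):
    """Return 'synonymous' or 'non-synonymous' for a substitution at
    the given position (0-2) within a reference codon."""
    alt_codon = ref_codon[:position] + alt_base + ref_codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    return "synonymous" if ref_aa == alt_aa else "non-synonymous"

# CTT -> CTC is Leu -> Leu; GAT -> GTT is Asp -> Val
print(classify_change("CTT", 2, "C"))  # synonymous
print(classify_change("GAT", 1, "T"))  # non-synonymous
```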
When we approached a cancer, we’d have a list of candidate genes that we thought might be involved
in a particular cancer that we were interested in, and then we’d get a large collection of
patient samples; large, a few years ago, was about 50, maybe a hundred. And it actually
worked, in some cases. This is one of the poster children for this type of approach.
A number of studies were done. We actually did one from our Genome center in collaboration
with Harold Varmus and colleagues at Memorial Sloan-Kettering Cancer Center, where we looked
at a number of kinase genes in non-small cell lung cancer patients and in the epidermal
growth factor receptor gene. Voila, we found quite a few mutations, all focused in the
tyrosine kinase domain of this protein and this coupled with some very nice phenotype
information. One of the things that we found was that quite a few patients with non-small cell
lung cancer, typically non-smokers, who were treated with tyrosine kinase inhibitor drugs
such as Iressa and Tarceva, had mutations in their EGFR genes more often than not. So
this was very exciting and sort of seemed to be a promise of things to come from additional
sequencing.
The work that we did there in lung cancer coupled with the work that was going on at
both the Baylor and Broad Genome centers, led all three of us to sort of put our heads
together on a little project we called the Tumor Sequencing Project, or TSP. And in this
Project we expanded our candidate gene list to about a thousand genes, it actually turned
out to be closer to 600 by the time we were finished, with about 200 very high-quality
adenocarcinoma samples. And just briefly, the TSP was organized late 2005. The focus,
as I said, was on lung adeno. We had a target list of about 600 genes, three sequencing
centers working together. We enlisted several cancer centers to mainly help us collect and
characterize lung adenocarcinoma tissue samples. Out of an initial set of about 800, we focused
down to about 200 that we thought were of sufficient quality for sequencing. We divided up the labor:
each of the three centers did about 3000 amplicons, roughly 300 genes, and we had a common set
of roughly a hundred genes that we could use as sort of a cross-center comparison. We now
have all the sequencing done, and we’re currently in the process of going through the data
analysis.
Some interesting things have come out of this already. This just shows you some mutations
that we’ve seen. If you then stratify all of these samples and you look at smokers versus
never smokers, you can find some differences. For example, you’re more apt to find mutations
in EGFR, GRB2 and GRB7 in never smokers, whereas KRAS and MET mutations are more common in
smokers. Likewise, you can use the mutations you find in some genes, notably TP53, ERBB3
and AKT3, to sort of get some idea of tumor grades. So as you can see over here, in grade
1 tumors we didn’t find any mutations in these three genes. In grade 2 we start to
find TP53 gene mutations popping up, and then in grade 3 we start to see more TP53, with
ERBB3 coming up as well as some mutations in AKT3. This is early data but interesting
trends, nonetheless.
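For a stratification like smokers versus never smokers, one straightforward way to ask whether a gene’s mutations are enriched in one group is a Fisher’s exact test on the 2x2 count table. A small self-contained sketch (one-sided hypergeometric tail; the counts below are made up purely for illustration, not data from the study):

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table
        [[a, b],   a = never-smokers with a mutation, b = without,
         [c, d]]   c = smokers with a mutation,       d = without.
    Returns P(at least `a` mutated never-smokers) under the null."""
    n, row1, mutated = a + b + c + d, a + b, a + c
    def pmf(k):  # hypergeometric probability of exactly k in row 1
        return comb(row1, k) * comb(n - row1, mutated - k) / comb(n, mutated)
    return sum(pmf(k) for k in range(a, min(row1, mutated) + 1))

# Made-up counts: 3 of 4 never-smokers mutated vs 1 of 4 smokers
print(round(fisher_one_sided(3, 1, 1, 3), 3))  # 0.243
```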
The other thing that we found that’s really exciting, and there’s now a paper in press
describing this work: early on, we used Affymetrix SNP arrays to sort of
qualify the DNAs that we wanted to sequence, get some idea of what we actually had in hand,
make sure these were of sufficient quantity and quality for sequencing and we actually
found several interesting amplifications that represented potentially new targets that we
could add to our sequencing list. So we did see through these amplifications some that
had oncogenes in the intervals, and this wasn’t necessarily surprising. But there were quite
a few; TITF1, for example, was exciting, and this is a lineage gene for the development
of lung tissue.
So the idea here was to start to plug together the two technologies: sort of a whole genome,
array-based approach with the PCR-focused sequencing. So this has sort of led to
our second paradigm. Rather than simply focusing on sort of hypothesis-driven gene lists, which
are somewhat biased in terms of the expectations that you have of what’s going on in the
genome in a particular type of cancer, we are now moving more to including data-driven
targets that come from using such orthogonal technologies as array-based RNA profiling,
CGH, and SNP genotyping. And this has been quite exciting.
The next sort of large-scale collaborative phase of cancer genomics is underway now,
known as the TCGA. This is a project that involves the same three sequencing centers
but now quite a few CGCCs or Cancer Genome Characterization Centers, and the idea is
the same. The sequencing centers have gotten started off on hypothesis-driven gene target
lists, and then using data that comes from various types of analyses that these centers
are doing, we add additional targets that the sequencing power can be focused on.
Well, I want to go back and tell a little story, because I want to take you into some
of this next generation sequencing technology. We started another cancer project in 2002,
as a collaboration between the Genome Sequencing Center and our colleague at Washington University,
Tim Ley, who was interested in acute myelogenous leukemia, which is a very nasty adult leukemia.
And when we started, it was the same sort of paradigm. It was an attempt to use high
throughput PCR-based sequencing, and focus on a hypothesis-driven list.
A couple of features of this project: we used primary tumors rather than cell lines as our
substrate for sequencing. In all cases, we would use matched normal tissue along with the
tumor tissue, so that we could get a quick idea as to what we were seeing in germ line
versus somatic variations. We had a discovery set, a small set of samples of 96 matched
tumor-normals, and then any time we would find what looked to be a mutation in that
discovery set we’d then go and resequence it in an additional validation set of 94 tumors.
We had a gene list of about 450 genes, and then we also used this sort of orthogonal approach
with CGH arrays, expression profiling, et cetera, to contribute additional targets to
the list. And this worked quite well, and guess what? We found mutations in genes, and
none of the genes in which we found mutations were all that surprising, considering we were
looking at leukemia.
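The matched tumor-normal comparison just described is simple in principle: a variant present in both samples is germ line, a tumor-only variant is a somatic candidate. A toy sketch of that call logic (the function name and set-based representation are mine, purely illustrative):

```python
def classify_call(tumor_alleles, normal_alleles, ref_base):
    """Label a variant position from a matched tumor-normal pair.
    Each allele argument is the set of bases confidently called there."""
    tumor_variants = set(tumor_alleles) - {ref_base}
    if not tumor_variants:
        return "reference"
    if tumor_variants & set(normal_alleles):
        return "germline"          # variant also present in normal tissue
    return "putative somatic"      # tumor-only: send on for validation

print(classify_call({"A", "G"}, {"A"}, "A"))       # putative somatic
print(classify_call({"A", "G"}, {"A", "G"}, "A"))  # germline
```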
So we asked ourselves a lot of questions and one of the things that certainly kept us up
at night was: what are we missing? We’re focused only on exons, mainly exons of genes
that we expect might have mutations; what about all the other genes?
So in terms of sort of moving away from hypothesis-driven gene sequencing, there are a lot of ways that
can go. The Hopkins study that came out, with Velculescu as the senior author,
looked at 13,000 genes using, again, a PCR-based approach, in 11 colon and 11 breast
cancer cell lines, and found interesting mutations. But this is just a start. This is a relatively
small number of samples, you know, how do we scale this up? The problem with PCR-based
resequencing, it’s relatively expensive, it’s diploid at best -- some tumors you’re
going to have many, many, many more copies of particular alleles -- and it’s low coverage.
So how can we improve and also, again, what are we missing outside of the exons?
Well, we decided to try to take the next step with our AML project, and we have started on
sequencing a whole genome using the Solexa technology. This is our case, referred to as
933124. This was a 57-year-old Caucasian female who presented with de novo M1 AML. At her
initial diagnosis, she presented with 100 percent myeloblasts in her bone marrow sample
and this is what we’ve used for our studies. The patient relapsed and died 11 months later.
In doing quite a few different types of analyses on her genome, we found that she has completely
normal cytogenetics, as best as one can tell. Using NimbleGen arrays, the 2.1 million array,
we found one tiny amplification on chromosome 7, about 7 kb. There was no loss of heterozygosity
detected on the Affymetrix 500K SNP array. We did find, through our PCR-based sequencing,
two mutations: an internal tandem duplication in the FLT3 gene, and then a
point mutation in her NPM1 gene. So we have informed consent for whole genome sequencing
and eventually data release, and off we went.
So this just shows you the histology sample -- this was 100 percent blasts on the slide
-- so we had a really clean tumor here for sequencing. There really weren’t any worries.
This is a liquid tumor so there weren’t concerns about stromal contamination. As I
said, she presented at 100 percent blasts, and she also has, as it appears, a completely
diploid genome.
Well, one of the things that we always ask ourselves when we start sequencing a genome
is what kind of coverage do we need? We had some idea of this for the old sort of
ABI-based sequencing. You could say that when we got to 8X or 10X coverage, we had a pretty good
idea of what the utility of that sequence might be. Well, using this new sequencing
platform where we’re getting much shorter reads, we had some questions as to what coverage
would we really need, and we spent a lot of time keeping our statisticians busy trying
to come up with coverage models, theoretical coverage models, that maybe were right
and maybe weren’t. One of the measures that we thought we could come up with perhaps
is that we’ve collected all of these polymorphisms using various types of arrays. Could we simply
use those as a way to measure coverage? So as we generate sequence, can we go and look
for all of these SNPs that we found with arrays and as we find 90, 95, closer to 100 percent,
can that then give us some sort of metric of how close to finished we are with the sequencing?
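That array-concordance idea is easy to state concretely: of the heterozygous positions already known from the genotyping arrays, what fraction has the shotgun sequence rediscovered? A toy version (names and positions are made up for illustration):

```python
def snp_recovery(array_snps, sequence_snps):
    """Fraction of SNP positions known from genotyping arrays that have
    been rediscovered in the shotgun sequence data -- an empirical
    stand-in for diploid coverage."""
    recovered = array_snps & sequence_snps
    return len(recovered) / len(array_snps)

# Positions are (chromosome, coordinate) pairs; values are made up.
array_positions = {("7", 101), ("7", 250), ("12", 90), ("12", 400)}
sequenced_positions = {("7", 101), ("12", 400), ("3", 77)}
print(snp_recovery(array_positions, sequenced_positions))  # 0.5
```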
So here’s just a quick update in terms of the numbers, as of last week. We had done
55 runs of the Solexa/Illumina instrument, collecting 32-base reads in each of these
runs: about 44 billion base pairs, which calculates out to about 14.5X haploid coverage. We detected
210,000 SNPs in this genome; 83 percent of these are present in dbSNP. And then, here’s
our coverage metric: out of the 481,000 SNPs that we saw on arrays, we’ve now identified
about 183,000, or roughly 38 percent. So this is just in terms of going through this
“how close are we to being finished” exercise, right? Using this metric, 14X haploid coverage represents
about 40 percent diploid coverage. Our theoretical calculations said that we were going to need
25-30X coverage with these short reads to reach a goal of about 99 percent of the sequence
covered and variants detected. So it looks like we’re on the right slope here, and it
also seems to converge nicely with what some of the other centers that are starting to use
this technology are seeing with regard to coverage.
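For what it’s worth, the simplest theoretical coverage models here are Poisson: each allele of a heterozygous site is sampled at half the haploid depth, and you require some minimum number of reads per allele before calling it. A back-of-the-envelope sketch (my own, not the statisticians’ actual model; idealized models like this tend to be optimistic relative to real short-read data, which is part of why an empirical metric like the array-SNP recovery is useful):

```python
from math import exp, factorial

def p_allele_seen(haploid_cov, min_reads=3):
    """P(one allele of a het site is sampled >= min_reads times),
    with reads landing Poisson at mean haploid_cov / 2 per allele."""
    lam = haploid_cov / 2.0
    p_too_few = sum(exp(-lam) * lam ** i / factorial(i) for i in range(min_reads))
    return 1.0 - p_too_few

def p_het_detected(haploid_cov, min_reads=3):
    """Both alleles each seen >= min_reads times (independent Poissons)."""
    return p_allele_seen(haploid_cov, min_reads) ** 2

print(round(p_het_detected(14.5), 3))   # ~0.95 under this idealized model
print(round(p_het_detected(27.5), 4))   # pushes past 0.999 at 25-30X
```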
We also use the new technology to do some cDNA sequencing. So cDNA sequencing is not
a technique that is dead and needs to be put away. This has actually been quite useful.
We’ve used a number of different cDNA library construction procedures and normalization
schemes that all fit very well with the idea of putting a little bit of DNA on a solid support,
as one needs to do for these new platforms. And these were sequenced on both 454 and Solexa.
So, I have our pipeline up here, and I think just the key point is that as these reads
from both genomic and cDNA libraries come through, and are checked here, looking for
SNPs and small indels and so forth, one of the key things that we do is at some point,
especially for non-synonymous and splice site putative variants, we then go back to the
old PCR-based sequencing pipeline to try and validate as well as look at the same sequence
variants in other AML patient samples.
So what have we done so far? We’ve really focused as a top priority on sequence variants
that appear to be non-synonymous, that are not in dbSNP, that are detected multiple times
in the cDNA sequencing effort and that are detected at least once in the tumor DNA.
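Those triage criteria amount to a simple filter over the candidate variants. A sketch with hypothetical field names and thresholds (not the actual pipeline’s data structures):

```python
def prioritize(variants, dbsnp_positions, min_cdna=2, min_tumor=1):
    """Keep candidate variants that are non-synonymous, not already in
    dbSNP, seen repeatedly in cDNA reads, and seen in tumor DNA reads."""
    return [
        v for v in variants
        if v["effect"] == "non-synonymous"
        and (v["chrom"], v["pos"]) not in dbsnp_positions
        and v["cdna_reads"] >= min_cdna
        and v["tumor_reads"] >= min_tumor
    ]

calls = [
    {"chrom": "13", "pos": 100, "effect": "non-synonymous",
     "cdna_reads": 5, "tumor_reads": 3},   # passes all four criteria
    {"chrom": "13", "pos": 200, "effect": "synonymous",
     "cdna_reads": 9, "tumor_reads": 4},   # fails: synonymous
    {"chrom": "5", "pos": 300, "effect": "non-synonymous",
     "cdna_reads": 6, "tumor_reads": 2},   # fails: known dbSNP site
]
print(len(prioritize(calls, {("5", 300)})))  # 1
```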
So this again is ongoing. We don’t have all the coverage that we’d like yet from
this particular patient’s genome but we’ve found 59 non-synonymous variants in 43 genes.
Most of these are likely rare SNPs. The two somatic mutations that we had found using
a PCR-based approach with this same genome were found again with the Solexa sequencing
method: the internal tandem duplication in FLT3 and the NPM1 mutation. One additional
somatic mutation was discovered in FLT3, a non-synonymous change here at amino acid
position 194. All of the other variants -- and these are all coding -- all of the other
variants were localized to genes that had not been previously implicated in AML pathogenesis
and hence were not on our original target list. And we have identified at least three
other putative somatic mutations, and these are currently going through our PCR-based sequencing
pipeline for confirmation.
This may be pretty hard to see from the back, but what I’m showing you here is just what
we can get out of the cDNA sequencing approach. So I’m showing you the one somatic mutation
that we found in FLT3, and what you have here is an alignment of Solexa reads from the tumor
genome, as well as from the cDNA sequencing effort. So we get a nice match of the Ts and the Cs,
reads with the Ts and reads with the Cs; instead of trying to figure out how high
a little peak is underneath another peak, you actually get a nice, almost digital
readout of the frequency of these two alleles, and we see the same thing over here.
This would suggest about a 50-50 match, 50-50 expression levels between the two alleles,
and we can then extrapolate that to quite a few other genes. So what you’re looking
at here is for several genes, listed here. These are the frequencies of the expressed
copies of the variants and of the germ line alleles. So, for example, in this one here,
PTPN11, we see about 200 to one, or perhaps to zero, for the variant allele as compared to the
germ line. Up here you get a little bit more of a mix, about four to one. In some cases
it’s a little bit more of a 50-50.
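The “digital readout” described above is literally just read counting at the variant position. A toy version (the pileup of base calls here is invented for illustration):

```python
def allele_counts(read_bases, variant_base, germline_base):
    """Count cDNA reads supporting each allele at a heterozygous site --
    a digital measure of allelic expression, in place of comparing
    peak heights on a sequencing trace."""
    v = sum(1 for b in read_bases if b == variant_base)
    g = sum(1 for b in read_bases if b == germline_base)
    return v, g

# Made-up pileup of base calls at one site: four T reads, two C reads
v, g = allele_counts("TTCTCT", "T", "C")
print(v, g)  # 4 2 -- roughly 2:1 expression of the variant allele
```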
We’ve also discovered several novel splice variants using this. Genes are listed over
here. For example, in RPAIM1 [spelled phonetically], five new splice variants have been detected
and characterized using this combined cDNA and genome sequencing approach, and quite a
few others as well.
So just to summarize a few points, so next generation sequencing is here, at least this
iteration of it. And we can clearly see already that it will have a substantial impact on
the study of the cancer genome as well as for other human diseases. Coverage models
for next generation whole genome sequencing are converging; we’re starting to better
understand exactly what we can do and how much work it takes. These ancillary or orthogonal
genome-based technologies are really crucial for understanding the target genome before
you actually start all this large genome sequencing. So the SNP arrays, I think,
still have some value. And then this transcriptome-based approach, using cDNAs either as a stand-alone
approach or in concert with a whole genome sequencing effort, represents a pretty powerful
adjunct for cancer genome analysis.
More is clearly needed. We’re at the very early days of all of these technologies, and
one of the things that I think you’ll get the message of here today is that all of us
are trying to figure out how to bring these to bear, what they can be used for, what sorts
of up-front strategies and tools and technologies we need to develop to make them even more
powerful.
So if my last slide will come up, which sometimes it does and sometimes it doesn’t, you can
see a list of some of my colleagues. The acknowledgments slide doesn’t want to happen. If anyone can explain
that bit of Macintosh-ology to me, that would be appreciated. So, thanks.
[applause]
Dr. Eric Green: Questions for Rick. Jeff?
[clears throat]
Jeff: So I’m trying to decide if your argument
to some extent might be for or against the Velculescu model, meaning
that ultimately, a year from now, three months from now, the whole genome association and
the whole genome sequencing will both be of sufficient depth and sufficient quality that
if today you wanted to define mutations at a level of quality at least based on
standard Sanger-based sequencing, you and others in this room have defined the value
proposition, which still seems to be weighted in that direction. So I almost, again, I’m
trying to understand the dichotomy between the short-term value of implementing ABI-based
sequencing in support of cancer sequencing today, across defined exons, versus waiting
for whole genome sequencing.
Dr. Richard Wilson: So I think, like I said, Jeff, I think
it’s still early days on a lot of these sorts of approaches that I described that we’re
trying to use on AML. I think there still is value in the PCR-based approach. I mean,
clearly, you find things that are of use in studying cancer. So I wouldn’t shut down
that pipeline quite yet. I think we can improve on it still, and then we can at some point,
I think, transition nicely to some of these next generation technologies. There’s still
a place for hypothesis-driven sequencing, and I think Richard’s example of the capture
array with the 454 sequencing is a nice sort of next place to go, if you will, for that
sort of targeted sequencing. And it’s cheaper, and allows us to do a lot more work.
Dr. Eric Green: Karen [spelled phonetically].
Karen: Hi, Rick. Nice talk. I was wondering how long
you think that you and the rest of the community will be tinkering with this iteration of next
generation sequencing before the next iteration of next gen sequencing comes along?
[laughter]
Dr. Richard Wilson: So, that’s a great question, Karen. I think
we have plenty to keep us busy at least for, what do you guys think, another ten years?
But we already see things that are right on the horizon -- I think in the next year or
two -- that will give all of these current platforms a run for their money. So that’s
exciting. And so it’s nice to have sort of a competitive situation, now. We all went
through the time when there was only one player and technology moved along as that player
allowed, so--
Dr. Eric Green: One more over here.
Female Speaker: Hi, Rick. This is actually something that
occurred to me during--
Dr. Richard Wilson: Uh-oh. You’re supposed to be on the mic, not just
shout.
Female Speaker: This is something that actually occurred to
me during Claire’s talk. I’m not sure you’re the best person to answer it but…
It seems to me that with the variety of microbial genomes that exist in humans and the collection
of variants that each individual would have, and then, again, the host genome of humans
and the difficulty we’ve had making associations between variants in the human genome with
disease, is there anybody who’s looking at associations between microbial, in particular
microbial variants people have, say in their colon or in their lungs, with the variants
in the human genome and the possible impact that has on cancer?
Dr. Richard Wilson: Yes.
[laughter]
And Claire talked a little bit about this microbiome initiative, which is just underway.
And I think it’s initially a cataloguing exercise, but the association, especially
between health and disease, is right around the corner, I think. I don’t know if Claire
wants to add to that.
Dr. Claire Fraser-Liggett: I agree.
Dr. Richard Wilson: She agreed.
Dr. Eric Green: We have concurrence. Okay, thank you, Rick.