Dr. Richard Wilson: Thanks very much, Bob. It’s a pleasure to
be here so I’d like to thank Eric and his colleagues for including me among this group.
And it’s a pleasure to be introduced by Bob. We’ve had a number of interactions
over the years, mainly focused on finishing genomes. In fact, Bob, I think it was you
that actually compiled the list of the most difficult genomes to finish, and so we’re
now very excited to have in review, I guess, a manuscript describing the platypus genome,
which really was, as you told us, the most difficult to sequence.
So, as I said, it’s a pleasure to be here. I want to talk a little bit about some of
the next generation sequencing technologies. And you heard a bit of this from Richard so
I’ll try to be complementary rather than redundant. I do want to say first, excuse me, little
frog in the throat this morning, congratulations to NISC. You know, genome centers are all
about accomplishments and milestones and firsts, so I have to give you guys credit. I’m pretty
sure you’re the first genome center to have an anniversary party, at least, so, nice going.
But if you think about this, for those of you who really know Eric, this is not at all
surprising, because Eric loves to celebrate.
[laughter]
So I’ve known Eric for a long time, since 1990 when I moved from Caltech to Washington
University and he ended up just down the hallway from me and back then we had next generation
sequencing, at least the first iteration of next generation sequencing. We actually started
getting rid of radioactivity and the autorads that we used to, you know, manually read into
the gel. And so, we had a lab where we had a couple of these boxes that are shown over
there on the left. This is the old ABI 373 and Eric actually got one for his lab just
down the hall before he moved up to Bethesda. So this was exciting back in those days, 36
lanes, woo-hoo.
Of course we now have additional next generation sequencing. This is the next round and shown
on this slide are the three current platforms that we’re all trying very hard to understand
how they work and what the impact might be and what sorts of experiments that we can
do, that even just a few years ago we could only dream about doing. So this is the Solexa
instrument, now manufactured by Illumina, on the lower left, the 454 instrument up in the
center, and the new ABI SOLiD instrument down on the lower right.
So I want to try to talk a little bit about what we’re trying to do with some of these
platforms. I want to go back a little bit because, you know, the promise of the human
genome sequence was always that it was going to revolutionize biological research, going
to revolutionize medical research, and the way that we look at ourselves. And even just
having conversations among ourselves about how many people even in the auditorium would
actually like to have their Genome sequenced is pretty amazing when you think back on this.
But the reference sequence, version 1.0 of the genome, now allows us to ask amazing questions.
What’s come along with the version 1.0 sequence is an amazing amount of technology, software
tools and infrastructure, such as we have at all of the genome centers that are represented
here today, to allow you to actually use this reference sequence to ask interesting questions.
It’s buttressed by ancillary genome sequences, the mouse, the chimp and others, and all this
really allows us, as I said, to do experiments that we only previously dreamed about doing.
Now what we want to do is start applying all of this infrastructure and these resources to cancer
and other diseases. And really now with these next generation sequencing tools coming along,
it allows us to do this in new and exciting ways.
So if you think back on how we approached cancer initially with sequencing, you heard
Richard talk just a little bit about this; we used PCR, and we’d been using PCR-based sequencing
for many years. But one of the things we tried to do maybe five, six years ago was to try
to understand how to do it in very large scale, high throughput PCR-based sequencing. So we
came up with all kinds of computer tools to pick primers and to keep track of things but
the most difficult thing was then going through and looking at all of the data and trying to
find where we were actually seeing synonymous changes or non-synonymous changes and so forth.
But this was the paradigm a few years ago.
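As an aside, the synonymous versus non-synonymous call just described is mechanical once the data are in hand. A minimal, illustrative sketch (the table construction and function name here are my own, not part of any pipeline described in the talk):

```python
# Illustrative sketch: classify a single-base coding change as synonymous
# or non-synonymous using the standard genetic code.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def classify_change(ref_codon, position, alt_base):
    """Return 'synonymous' or 'non-synonymous' for a substitution at
    the given position (0-2) within a reference codon."""
    alt_codon = ref_codon[:position] + alt_base + ref_codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    return "synonymous" if ref_aa == alt_aa else "non-synonymous"

# CTT -> CTC is Leu -> Leu; GAT -> GTT is Asp -> Val
print(classify_change("CTT", 2, "C"))  # synonymous
print(classify_change("GAT", 1, "T"))  # non-synonymous
```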
When we approached a cancer, we’d have a list of candidate genes that we thought might be involved
in a particular cancer that we were interested in, and then we’d get a large collection of
patient samples; large, a few years ago, was about 50, maybe a hundred. And it actually
worked, in some cases. This is one of the poster children for this type of approach.
A number of studies were done. We actually did one from our Genome center in collaboration
with Harold Varmus and colleagues at Memorial Sloan-Kettering Cancer Center, where we looked
at a number of kinase genes in non-small cell lung cancer patients and in the epidermal
growth factor receptor gene. Voila, we found quite a few mutations, all focused in the
tyrosine kinase domain of this protein and this coupled with some very nice phenotype
information. One of the things that we found was that quite a few patients with non-small cell
lung cancer, typically non-smokers, who were treated with tyrosine kinase inhibitor drugs
such as Iressa and Tarceva, had mutations in their EGFR genes more often than not. So
this was very exciting and sort of seemed to be a promise of things to come from additional
sequencing.
The work that we did there in lung cancer coupled with the work that was going on at
both the Baylor and Broad Genome centers, led all three of us to sort of put our heads
together on a little project we called the Tumor Sequencing Project, or TSP. And in this
Project we expanded our candidate gene list to about a thousand genes, it actually turned
out to be closer to 600 by the time we were finished, with about 200 very high-quality
adenocarcinoma samples. And just briefly, the TSP was organized late 2005. The focus,
as I said, was on lung adeno. We had a target list of about 600 genes, three sequencing
centers working together. We enlisted several cancer centers to mainly help us collect and
characterize lung adenocarcinoma tissue samples. Out of an initial set of about 800, we focused
down to about 200 that we thought were of sufficient quality for sequencing. We divided up the labor:
each of the three centers did about 3000 amplicons, roughly 300 genes, and we had a common set
of roughly a hundred genes that we could use as sort of a cross-center comparison. We now
have all the sequencing done, and we’re currently in the process of going through the data
analysis.
Some interesting things have come out of this already. This just shows you some mutations
that we’ve seen. If you then stratify all of these samples and you look at smokers versus
never smokers, you can find some differences. For example, you’re more apt to find mutations
in EGFR, GRB2 and GRB7 in never smokers, whereas KRAS and MET mutations are more common in
smokers. Likewise, you can use the mutations you find in some genes, notably TP53, ERBB3
and AKT3, to sort of get some idea of tumor grades. So as you can see over here, in grade
1 tumors we didn’t find any mutations in these three genes. In grade 2 we start to
find TP53 gene mutations popping up, and then in grade 3 we start to see more TP53, with
ERBB3 coming up as well as some mutations in AKT3. This is early data but interesting
trends, nonetheless.
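For a stratification like smokers versus never smokers, one straightforward way to ask whether a gene’s mutations are enriched in one group is a Fisher’s exact test on the 2x2 count table. A small self-contained sketch (one-sided hypergeometric tail; the counts below are made up purely for illustration, not data from the study):

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table
        [[a, b],   a = never-smokers with a mutation, b = without,
         [c, d]]   c = smokers with a mutation,       d = without.
    Returns P(at least `a` mutated never-smokers) under the null."""
    n, row1, mutated = a + b + c + d, a + b, a + c
    def pmf(k):  # hypergeometric probability of exactly k in row 1
        return comb(row1, k) * comb(n - row1, mutated - k) / comb(n, mutated)
    return sum(pmf(k) for k in range(a, min(row1, mutated) + 1))

# Made-up counts: 3 of 4 never-smokers mutated vs 1 of 4 smokers
print(round(fisher_one_sided(3, 1, 1, 3), 3))  # 0.243
```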
The other thing that we found that’s really exciting, and there’s now a paper in press
describing this work: early on, we used Affymetrix SNP arrays to sort of
qualify the DNAs that we wanted to sequence, get some idea of what we actually had in hand,
make sure these were of sufficient quantity and quality for sequencing and we actually
found several interesting amplifications that represented potentially new targets that we
could add to our sequencing list. So we did see through these amplifications some that
had oncogenes in the intervals, and this wasn’t necessarily surprising. But there were quite
a few; TITF1, for example, was exciting, and this is a lineage gene for the development
of lung tissue.
So the idea here was to start to plug together the two technologies: sort of a whole genome,
array-based approach with the PCR-focused sequencing. So this has sort of led to
our second paradigm. Rather than simply focusing on sort of hypothesis-driven gene lists, which
are somewhat biased in terms of the expectations that you have of what’s going on in the
genome in a particular type of cancer, we are now moving more to including data-driven
targets that come from using such orthogonal technologies as array-based RNA profiling,
CGH, and SNP genotyping. And this has been quite exciting.
The next sort of large-scale collaborative phase of cancer genomics is underway now,
known as the TCGA. This is a project that involves the same three sequencing centers
but now quite a few CGCCs or Cancer Genome Characterization Centers, and the idea is
the same. The sequencing centers have gotten started off on hypothesis-driven gene target
lists, and then using data that comes from various types of analyses that these centers
are doing, we add additional targets that the sequencing power can be focused on.
Well, I want to go back and tell a little story, because I want to take you into some
of this next generation sequencing technology. We started another cancer project in 2002,
as a collaboration between the Genome Sequencing Center and our colleague at Washington University,
Tim Ley, who was interested in acute myelogenous leukemia, which is a very nasty adult leukemia.
And when we started, it was the same sort of paradigm. It was an attempt to use high
throughput PCR-based sequencing, and focus on a hypothesis-driven list.
A couple of features of this project: we used primary tumors rather than cell lines as our
substrate for sequencing. In all cases, we would use matched normal tissue along with the
tumor tissue, so that we could get a quick idea as to what we were seeing in germ line
versus somatic variations. We had a discovery set, a small set of samples of 96 matched
tumor-normals, and then any time we would find what looked to be a mutation in that
discovery set we’d then go and resequence it in an additional validation set of 94 tumors.
We had a gene list of about 450 genes, and then we also used this sort of orthogonal approach
with CGH arrays, expression profiling, et cetera, to contribute additional targets to
the list. And this worked quite well, and guess what? We found mutations in genes, and
none of the genes in which we found mutations were all that surprising, considering we were
looking at leukemia.
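The matched tumor-normal comparison just described is simple in principle: a variant present in both samples is germ line, a tumor-only variant is a somatic candidate. A toy sketch of that call logic (the function name and set-based representation are mine, purely illustrative):

```python
def classify_call(tumor_alleles, normal_alleles, ref_base):
    """Label a variant position from a matched tumor-normal pair.
    Each allele argument is the set of bases confidently called there."""
    tumor_variants = set(tumor_alleles) - {ref_base}
    if not tumor_variants:
        return "reference"
    if tumor_variants & set(normal_alleles):
        return "germline"          # variant also present in normal tissue
    return "putative somatic"      # tumor-only: send on for validation

print(classify_call({"A", "G"}, {"A"}, "A"))       # putative somatic
print(classify_call({"A", "G"}, {"A", "G"}, "A"))  # germline
```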
So we asked ourselves a lot of questions and one of the things that certainly kept us up
at night was: what are we missing? We’re focused only on exons, mainly exons of genes
that we expect might have mutations; what about all the other genes?
So in terms of sort of moving away from hypothesis-driven gene sequencing, there are a lot of ways that
can go. The Hopkins study that came out, with Velculescu as the senior author,
looked at 13,000 genes using, again, a PCR-based approach, in 11 colon and 11 breast
cancer cell lines, and found interesting mutations. But this is just a start. This is a relatively
small number of samples, you know, how do we scale this up? The problem with PCR-based
resequencing, it’s relatively expensive, it’s diploid at best -- some tumors you’re
going to have many, many, many more copies of particular alleles -- and it’s low coverage.
So how can we improve and also, again, what are we missing outside of the exons?
Well, we decided to try to take the next step with our AML project, and we have started on
sequencing a whole genome using the Solexa technology. This is our case, referred to as
933124. This was a 57-year-old Caucasian female who presented with de novo M1 AML. At her
initial diagnosis, she presented with 100 percent myeloblasts in her bone marrow sample
and this is what we’ve used for our studies. The patient relapsed and died 11 months later.
In doing quite a few different types of analyses on her genome, we found that she has completely
normal cytogenetics, as best as one can tell. Using NimbleGen arrays, the 2.1 million array,
we found one tiny amplification on chromosome 7, about 7 kb. There was no loss of heterozygosity
detected on the Affymetrix 500K SNP array. We did find, through our PCR-based sequencing,
two mutations: an internal tandem duplication in the FLT3 gene, and then a
point mutation in her NPM1 gene. So we have informed consent for whole genome sequencing
and eventually data release, and off we went.
So this just shows you the histology sample -- this was 100 percent blasts on the slide
-- so we had a really clean tumor here for sequencing. There really weren’t any worries.
This is a liquid tumor so there weren’t concerns about stromal contamination. As I
said, she presented at 100 percent blasts, and she also has, as it appears, a completely
diploid genome.
Well, one of the things that we always ask ourselves when we start sequencing a genome
is what kind of coverage do we need? We had some idea of this for the old sort of
ABI-based sequencing. You could say that when we got to 8X or 10X coverage, we had a pretty good
idea of what the utility of that sequence might be. Well, using this new sequencing
platform where we’re getting much shorter reads, we had some questions as to what coverage
would we really need, and we spent a lot of time keeping our statisticians busy trying
to come up with coverage models, theoretical coverage models, that maybe were right
and maybe weren’t. One of the measures that we thought we could come up with perhaps
is that we’ve collected all of these polymorphisms using various types of arrays. Could we simply
use those as a way to measure coverage? So as we generate sequence, can we go and look
for all of these SNPs that we found with arrays and as we find 90, 95, closer to 100 percent,
can that then give us some sort of metric of how close to finished we are with the sequencing?
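That array-concordance idea is easy to state concretely: of the heterozygous positions already known from the genotyping arrays, what fraction has the shotgun sequence rediscovered? A toy version (names and positions are made up for illustration):

```python
def snp_recovery(array_snps, sequence_snps):
    """Fraction of SNP positions known from genotyping arrays that have
    been rediscovered in the shotgun sequence data -- an empirical
    stand-in for diploid coverage."""
    recovered = array_snps & sequence_snps
    return len(recovered) / len(array_snps)

# Positions are (chromosome, coordinate) pairs; values are made up.
array_positions = {("7", 101), ("7", 250), ("12", 90), ("12", 400)}
sequenced_positions = {("7", 101), ("12", 400), ("3", 77)}
print(snp_recovery(array_positions, sequenced_positions))  # 0.5
```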
So here’s just a quick update in terms of the numbers, as of last week. We had done
55 runs of the Solexa/Illumina instrument, collecting 32-base reads in each of these
runs: about 44 billion base pairs, which calculates out to about 14.5X haploid coverage. We detected
210,000 SNPs in this genome; 83 percent of these are present in dbSNP. And then, here’s
our coverage metric: out of the 481,000 SNPs that we saw on arrays, we’ve now identified
about 183,000, or roughly 38 percent. So this is just in terms of going through this
“how close are we to being finished” exercise, right? Using this metric, 14X haploid coverage represents
about 40 percent diploid coverage. Our theoretical calculations said that we were going to need
25-30X coverage with these short reads to reach a goal of about 99 percent of the sequence
covered and variants detected. So it looks like we’re on the right slope here, and it
also seems to converge nicely with what some of the other centers that are starting to use
this technology are seeing with regard to coverage.
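For what it’s worth, the simplest theoretical coverage models here are Poisson: each allele of a heterozygous site is sampled at half the haploid depth, and you require some minimum number of reads per allele before calling it. A back-of-the-envelope sketch (my own, not the statisticians’ actual model; idealized models like this tend to be optimistic relative to real short-read data, which is part of why an empirical metric like the array-SNP recovery is useful):

```python
from math import exp, factorial

def p_allele_seen(haploid_cov, min_reads=3):
    """P(one allele of a het site is sampled >= min_reads times),
    with reads landing Poisson at mean haploid_cov / 2 per allele."""
    lam = haploid_cov / 2.0
    p_too_few = sum(exp(-lam) * lam ** i / factorial(i) for i in range(min_reads))
    return 1.0 - p_too_few

def p_het_detected(haploid_cov, min_reads=3):
    """Both alleles each seen >= min_reads times (independent Poissons)."""
    return p_allele_seen(haploid_cov, min_reads) ** 2

print(round(p_het_detected(14.5), 3))   # ~0.95 under this idealized model
print(round(p_het_detected(27.5), 4))   # pushes past 0.999 at 25-30X
```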
We also use the new technology to do some cDNA sequencing. So cDNA sequencing is not
a technique that is dead and needs to be put away. This has actually been quite useful.
We’ve used a number of different cDNA library construction procedures and normalization
schemes that all fit very well with the idea of putting a little bit of DNA on a solid support,
as one needs to do for these new platforms. And these were sequenced on both 454 and Solexa.
So, I have our pipeline up here, and I think just the key point is that as these reads
from both genomic and cDNA libraries come through, and are checked here, looking for
SNPs and small indels and so forth, one of the key things that we do is at some point,
especially for non-synonymous and splice site putative variants, we then go back to the
old PCR-based sequencing pipeline to try and validate as well as look at the same sequence
variants in other AML patient samples.
So what have we done so far? We’ve really focused as a top priority on sequence variants
that appear to be non-synonymous, that are not in dbSNP, that are detected multiple times
in the cDNA sequencing effort and that are detected at least once in the tumor DNA.
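Those triage criteria amount to a simple filter over the candidate variants. A sketch with hypothetical field names and thresholds (not the actual pipeline’s data structures):

```python
def prioritize(variants, dbsnp_positions, min_cdna=2, min_tumor=1):
    """Keep candidate variants that are non-synonymous, not already in
    dbSNP, seen repeatedly in cDNA reads, and seen in tumor DNA reads."""
    return [
        v for v in variants
        if v["effect"] == "non-synonymous"
        and (v["chrom"], v["pos"]) not in dbsnp_positions
        and v["cdna_reads"] >= min_cdna
        and v["tumor_reads"] >= min_tumor
    ]

calls = [
    {"chrom": "13", "pos": 100, "effect": "non-synonymous",
     "cdna_reads": 5, "tumor_reads": 3},   # passes all four criteria
    {"chrom": "13", "pos": 200, "effect": "synonymous",
     "cdna_reads": 9, "tumor_reads": 4},   # fails: synonymous
    {"chrom": "5", "pos": 300, "effect": "non-synonymous",
     "cdna_reads": 6, "tumor_reads": 2},   # fails: known dbSNP site
]
print(len(prioritize(calls, {("5", 300)})))  # 1
```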
So this again is ongoing. We don’t have all the coverage that we’d like yet from
this particular patient’s genome but we’ve found 59 non-synonymous variants in 43 genes.
Most of these are likely rare SNPs. The two somatic mutations that we had found using
a PCR-based approach with this same genome were found again with the Solexa sequencing
method: the internal tandem duplication in FLT3 and the NPM1 mutation. One additional
somatic mutation was discovered in FLT3, a non-synonymous change here at amino acid
position 194. All of the other variants -- and these are all coding -- all of the other
variants were localized to genes that had not been previously implicated in AML pathogenesis
and hence were not on our original target list. And we have identified at least three
other putative somatic mutations, and these are currently going through our PCR-based sequencing
pipeline for confirmation.
This may be pretty hard to see from the back, but what I’m showing you here is just what
we can get out of the cDNA sequencing approach. So I’m showing you the one somatic mutation
that we found in FLT3, and what you have here is an alignment of Solexa reads from the tumor
genome, as well as from the cDNA sequencing effort. So we get a nice match of the Ts and the Cs,
reads with the Ts and reads with the Cs; instead of trying to figure out how high
a little peak is underneath another peak, you actually get a nice, almost digital
readout of the frequency of these two alleles, and we see the same thing over here.
This would suggest about a 50-50 match, 50-50 expression levels between the two alleles,
and we can then extrapolate that to quite a few other genes. So what you’re looking
at here is for several genes, listed here. These are the frequencies of the expressed
copies of the variants and of the germ line alleles. So, for example, in this one here,
PTPN11, we see about 200 to one, or perhaps to zero, for the variant allele as compared to the
germ line. Up here you get a little bit more of a mix, about four to one. In some cases
it’s a little bit more of a 50-50.
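The “digital readout” described above is literally just read counting at the variant position. A toy version (the pileup of base calls here is invented for illustration):

```python
def allele_counts(read_bases, variant_base, germline_base):
    """Count cDNA reads supporting each allele at a heterozygous site --
    a digital measure of allelic expression, in place of comparing
    peak heights on a sequencing trace."""
    v = sum(1 for b in read_bases if b == variant_base)
    g = sum(1 for b in read_bases if b == germline_base)
    return v, g

# Made-up pileup of base calls at one site: four T reads, two C reads
v, g = allele_counts("TTCTCT", "T", "C")
print(v, g)  # 4 2 -- roughly 2:1 expression of the variant allele
```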
We’ve also discovered several novel splice variants using this. Genes are listed over
here. For example, in RPAIM1 [spelled phonetically], five new splice variants have been detected
and characterized using this combined cDNA and genome sequencing approach, and quite a
few others as well.
So just to summarize a few points, so next generation sequencing is here, at least this
iteration of it. And we can clearly see already that it will have a substantial impact on
the study of the cancer genome as well as for other human diseases. Coverage models
for next generation whole genome sequencing are converging; we’re starting to better
understand exactly what we can do and how much work it takes. These ancillary or orthogonal
genome-based technologies are really crucial for understanding the target genome before
you actually start all this large genome sequencing. So the SNP arrays, I think,
still have some value. And then this transcriptome-based approach, using cDNAs either as a stand-alone
approach or in concert with a whole genome sequencing effort, represents a pretty powerful
adjunct for cancer genome analysis.
More is clearly needed. We’re at the very early days of all of these technologies, and
one of the things that I think you’ll get the message of here today is that all of us
are trying to figure out how to bring these to bear, what they can be used for, what sorts
of up-front strategies and tools and technologies we need to develop to make them even more
powerful.
So if my last slide will come up, which sometimes it does and sometimes it doesn’t, you can
see a list of some of my colleagues. The acknowledgments slide doesn’t want to happen. If anyone can explain
that bit of Macintosh-ology to me, that would be appreciated. So, thanks.
[applause]
Dr. Eric Green: Questions for Rick. Jeff?
[clears throat]
Jeff: So I’m trying to decide if your argument
to some extent might be for or against the Velculescu model, meaning
that ultimately, a year from now, three months from now, the whole genome association and
the whole genome sequencing will both be of sufficient depth and sufficient quality that
if today you wanted to define mutations at a level of quality at least based on
standard Sanger-based sequencing, you and others in this room have defined the value
proposition, which still seems to be weighted in that direction. So I almost, again, I’m
trying to understand the dichotomy between the short-term value of implementing ABI-based
sequencing in support of cancer sequencing today, across defined exons, versus waiting
for whole genome sequencing.
Dr. Richard Wilson: So I think, like I said, Jeff, I think
it’s still early days on a lot of these sorts of approaches that I described that we’re
trying to use on AML. I think there still is value in the PCR-based approach. I mean,
clearly, you find things that are of use in studying cancer. So I wouldn’t shut down
that pipeline quite yet. I think we can improve on it still, and then we can at some point,
I think, transition nicely to some of these next generation technologies. There’s still
a place for hypothesis-driven sequencing, and I think Richard’s example of the capture
array with the 454 sequencing is a nice sort of next place to go, if you will, for that
sort of targeted sequencing. And it’s cheaper, and allows us to do a lot more work.
Dr. Eric Green: Karen [spelled phonetically].
Karen: Hi, Rick. Nice talk. I was wondering how long
you think that you and the rest of the community will be tinkering with this iteration of next
generation sequencing before the next iteration of next gen sequencing comes along?
[laughter]
Dr. Richard Wilson: So, that’s a great question, Karen. I think
we have plenty to keep us busy at least for, what do you guys think, another ten years?
But we already see things that are right on the horizon -- I think in the next year or
two -- that will give all of these current platforms a run for their money. So that’s
exciting. And so it’s nice to have sort of a competitive situation, now. We all went
through the time when there was only one player and technology moved along as that player
allowed, so--
Dr. Eric Green: One more over here.
Female Speaker: Hi, Rick. This is actually something that
occurred to me during--
Dr. Richard Wilson: Uh-oh. You’re supposed to be on the mic, not just
shout.
Female Speaker: This is something that actually occurred to
me during Claire’s talk. I’m not sure you’re the best person to answer it but…
It seems to me that with the variety of microbial genomes that exist in humans and the collection
of variants that each individual would have, and then, again, the host genome of humans
and the difficulty we’ve had making associations between variants in the human genome with
disease, is there anybody who’s looking at associations between microbial, in particular
microbial variants people have, say in their colon or in their lungs, with the variants
in the human genome and the possible impact that has on cancer?
Dr. Richard Wilson: Yes.
[laughter]
And Claire talked a little bit about this microbiome initiative, which is just underway.
And I think it’s initially a cataloguing exercise, but the association, especially
between health and disease, is right around the corner, I think. I don’t know if Claire
wants to add to that.
Dr. Claire Fraser-Liggett: I agree.
Dr. Richard Wilson: She agreed.
Dr. Eric Green: We have concurrence. Okay, thank you, Rick.