TCGA - Assessing tumor heterogeneity and tracking clonal evolution - Christopher Miller

Chris Miller: Thank you. Hi, I'm Chris Miller from Wash U. I'm going to be talking to you today about tumor heterogeneity, clonal evolution, and how we're using sequencing to get some insight into both of these phenomena. So my first statement here will be that tumors are heterogeneous. This was suspected as far back as the '70s, but it's really taken the advent of high-throughput sequencing before we were able to dive deep into these tumors and see that they are, in fact, genetically diverse populations of cells, and that because of that, within these, evolution is occurring at the cellular level.
And then last year we were able to, you know, view this in action in a case of relapsed AML, where we sequenced an AML tumor, a matched normal, and a relapse. And using that we were able to put together a model of exactly how this clonal evolution works, at least in this case. It starts off with the hematopoietic stem cell, which gains initiating mutations here, and then as the tumor expands, some of these cells acquire additional mutations, represented here in purple, yellow, and orange. And these mutations may expand. And so when we assay the tumor at diagnosis what we're getting is really a cross-section of this clonal architecture of the tumor, where some of the cells look a lot like the founding clone, some of them are this subclone that occurs in about 50 percent of the cells, and others are smaller fractions of the tumor.
As chemotherapy, then, is induced, and treatment goes on, it creates a population bottleneck, where this population of cells is reduced, and only a few pass through. And then this expands back into a relapse, then, and acquires additional mutations after the treatment ceases. And so what's really interesting in this particular case is that the clonal fraction that actually went on to form the relapse only appeared in about 5 percent of the cells in the original tumor, which is a little frightening, to be perfectly honest.
And it makes us wonder, you know, whether we could have missed it if we hadn't looked more carefully. And so detecting these minor subclones is, we think, crucially important to understanding, you know, how these tumors are responding to therapy, and to make sure that we get the whole tumor and not just the major subclone.
I think there are several challenges that remain in detecting these. First of all, the genomes are sequenced with low coverage. I mean 30x whole genome sequencing is clearly not enough to detect events that are present only at 1 or 2 percent in the tumor, and even 100 or 150x, you know, exome sequencing may not be deep enough. But that at least seems like a tractable problem. The sequencing costs are dropping rapidly.
Perhaps a more pervasive problem that we're interested in is that algorithms aren't designed to detect these low frequency events, by and large. If you look at this power simulation from our SomaticSniper algorithm, which is one of the kind of first generation of variant callers, you'll see that even with 90x coverage, our power to detect events at 20 percent variant allele frequency is only 85 percent. And if we drop down to 10 percent variant allele frequency, it's only 10 percent. So, we're clearly missing a lot of these low frequency things.
And so that spurred us to develop an algorithm called BASSOVAC, Bayesian Scoring of Somatic Variant read Counts. It's a little convoluted, but it works. And so this incorporates purity, ploidy, base quality, and a host of other factors into a more complex model. We pull these all together into a Bayesian framework, and then obtain the probabilities that a particular single nucleotide variant is either heterozygous or homozygous, given the input data. And so we've tested this against other algorithms.
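
To make the coverage argument concrete, here is a minimal sketch of an idealized sampling bound. It is not the SomaticSniper power simulation mentioned in the talk: it assumes variant-supporting reads follow a simple binomial distribution with no sequencing error, and the `min_alt_reads` threshold is a hypothetical stand-in for whatever evidence a real caller requires.

```python
from math import comb


def detection_power(depth: int, vaf: float, min_alt_reads: int) -> float:
    """Probability of observing at least `min_alt_reads` variant reads.

    Assumes the variant read count is Binomial(depth, vaf). Real callers
    also model base quality, mapping artifacts, and contamination, so this
    is an optimistic upper bound on sensitivity, not a caller benchmark.
    """
    p_too_few = sum(
        comb(depth, k) * vaf**k * (1.0 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_too_few


if __name__ == "__main__":
    # Hypothetical rule: require at least 3 variant-supporting reads.
    for depth in (30, 150, 1000):
        for vaf in (0.02, 0.10, 0.50):
            power = detection_power(depth, vaf, min_alt_reads=3)
            print(f"depth={depth:5d}  VAF={vaf:.2f}  power={power:.3f}")
```

Even under these generous assumptions, a 2 percent event at 30x rarely produces three supporting reads, which is consistent with the point that depth alone limits what any caller can see.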
This is kind of a worst-case simulation, but you can see that even in this difficult environment it's pushing our curve of variant allele frequency far to the left compared to SomaticSniper and these kind of first-generation callers.
We've also done some real world testing, and I want to tell you about one particularly cool dataset that we've been working with. This is a quintet of samples, including a primary breast tumor, a matched normal, and three different metastases, from the spine, the liver, and the adrenal glands. And so we whole-genome sequenced all these to 30x, and ran these through our initial pipeline. This was prior to the development of BASSOVAC. And then capture validation was performed for all these variants. So we have very deep sequencing read counts for all of these variants and all of these samples. And so we were able to combine these all, and then make cool plots that look like this.
So I'm showing you here on the x-axis is the variant allele frequency of single nucleotide variants in the primary tumor, and on the y-axis you're seeing the frequency of events in the metastasis. And so several trends emerge from this kind of plot. You can see that in the -- at about 50 percent you see -- which corresponds to 100 percent of the cells in the tumor for heterozygous events -- we see the major clone, or the founding clone, which is also present in about 50 percent of the metastasis as we'd expect. Down here we see another cluster of a kind of minor clone; that's present at about 25 percent, and that, again, also passed through to metastasis. And then, in contrast, down here on the x-axis what we see is a clone that was present in the original tumor, but didn't pass through to the metastasis. So, this was a separate population of cells that didn't make it through the population bottleneck, or didn't make it through the metastasis event.
Then on the y-axis what we see are events that happened in the spinal metastasis, presumably after the split, since they're not present in the original tumor -- or at least that's what we thought. When we zoomed in a little bit closer to this y-axis what we can see is that these events I've highlighted in red actually were present in the tumor, just at a very low frequency. And so that suggests that maybe they either had a growth advantage in the environment of the metastasis, or just made up a majority of the cells that split off into the metastasis. But either way, they're clearly present.
And getting back to our variant calling, then, this gives us a source of very low frequency variants in this tumor that we know are real, because they're present in the metastasis as well. And so we use these kind of events to test the sensitivity of our algorithms. So, this is a comparison between three algorithms: BASSOVAC, our new caller; Sniper, our old caller; and Strelka, which is a caller from Illumina, which purports to do better on these kind of low variant allele frequency events. What you can see is that BASSOVAC and Sniper detect a lot of events and they have very comparable performance at kind of high variant allele frequencies and mid variant allele frequencies. Strelka maybe doesn't do as well, but at about 10 percent there's an inflection point where Sniper just isn't able to detect stuff, Strelka does a little bit better, but BASSOVAC detects a huge number of these very low frequency true positive events.
And, you know, in the end even with this kind of biased 30x approach originally, we can see that 50 percent of the variants present in the metastasis are present at a detectable level in the tumor, even though we would have expected a much smaller proportion if we hadn't looked closely and looked deeply with the capture validation.
But more importantly what we can say here is that we can use BASSOVAC to detect these true variants at very low frequencies, down to and even lower than 2 percent.
So, given this kind of information, then, about these low frequency variants, how can we put this to use to kind of infer the subclonal architecture of a tumor and find out, you know, how many clones are present in there? Which variants are present in the different subclones? And this really requires an integrative approach, where you look at both the variant allele frequencies of the SNVs, as well as information on copy number calls, purity, and ploidy. And so we can put it all together into beautiful charts that look like this. I'm going to zoom in so you can actually see what's going on here.
And so we segregate the SNVs according to copy number, and then we plot the variant allele frequencies along with the depth, just kind of for our reference on the y-axis. And you can see here in the 2x plot you get a clearer indication of the founding clone. And then we overlay it with a kernel density plot on top here. So you can see this, this clear peak at 50 percent which tells us this is the major, the founding clone. And then you see these variants down here with a little bit lower frequency correspond to blips up here that represent subclones. And then over on the right side you can see events that are copy number neutral, loss of heterozygosity, up here near 100 percent. And in the copy number 3 regions, what you can see is that instead of a 50 percent major clone, we expect peaks at 33 and 66 percent, depending on whether the wild-type or the mutant allele got amplified. And we do see that indeed in the data.
And so we can build these plots for all of our tumors, and you kind of eyeball it and say, "This clearly looks like it's a two-clone tumor, a major clone and a minor clone." But we get very leery of kind of eyeballing plots. We like to do it in a more rigorous fashion.
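
The peak positions described above (50 percent at copy number 2, 33 and 66 percent at copy number 3, near 100 percent for copy-neutral loss of heterozygosity) can be reproduced with a short calculation. This is a minimal sketch, not code from the talk: the function name and parameters are hypothetical, normal cells are assumed to be diploid, and tumor cells that lack the mutation are assumed to share the same local copy number as those that carry it.

```python
def expected_vaf(total_cn: int, mutated_copies: int,
                 purity: float = 1.0, cell_fraction: float = 1.0,
                 normal_cn: int = 2) -> float:
    """Expected variant allele frequency of a somatic SNV.

    total_cn        local copy number in tumor cells
    mutated_copies  copies carrying the variant in mutated tumor cells
    purity          fraction of sampled cells that are tumor cells
    cell_fraction   fraction of tumor cells carrying the variant
    normal_cn       local copy number in normal cells (assumed diploid)
    """
    if not 0 <= mutated_copies <= total_cn:
        raise ValueError("mutated_copies must be between 0 and total_cn")
    variant_alleles = purity * cell_fraction * mutated_copies
    all_alleles = purity * total_cn + (1.0 - purity) * normal_cn
    return variant_alleles / all_alleles


if __name__ == "__main__":
    print("CN2, heterozygous founding clone:", expected_vaf(2, 1))
    print("CN2, copy-neutral LOH:           ", expected_vaf(2, 2))
    print("CN3, wild-type allele amplified: ", expected_vaf(3, 1))
    print("CN3, mutant allele amplified:    ", expected_vaf(3, 2))
    print("CN2, subclone in half the cells: ", expected_vaf(2, 1, cell_fraction=0.5))
```

Lower purity pulls every peak toward zero, which is one reason the variant allele frequencies have to be read together with purity and copy number rather than on their own.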
So we decided to come up with a method that could do this in an automated and kind of unbiased manner. And so what we ended up doing was creating an algorithm that uses a mixture model of binomial distributions to kind of model this data, and then use maximum likelihood expectation to determine what the optimal number of clusters was for any given solution.
And so we can see that indeed this algorithm clusters this into two groups, says there's a major clone and a minor clone, and we have overlaid the calls here. And this is a bi-clonal sample. Here is a case where it's a tri-clonal sample that, you know, clearly agrees with our eyeballing of the data. And there are cases that are a little more -- less intuitive, I guess. This is a case where maybe if you looked at just the density plot you might say this was a two-clone tumor, but if you look carefully, you can see that there's a nice peak here, there's a nice peak here, and then kind of a smear in the middle. And the algorithm does a very nice job of picking that up, fitting another curve in the middle there, and saying this is indeed a three-clone tumor.
And then we also have, you know, more messy tumors. This is a multi-clonal sample with a smear of data, and we don't think we're overfitting too much in this kind of case. We think there really are a variety of clones here, but it's very difficult to segregate them accurately at this kind of -- with this kind of smear of mutations.
So we've applied this across a large sample set of tumors, looking mostly at AML, breast cancer, and endometrial cancer, and we can say that most of the tumors in those data sets have at least one founding clone and one or more subclones. I also want to emphasize that the numbers I'm showing here, these are going to be a lower bound on the number of clones. First of all, detection sensitivity hurts us, because not all of these calls were made using BASSOVAC.
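
As an illustration of the clustering idea described above, here is a minimal sketch that fits a mixture of binomial distributions to variant read counts. The talk does not give the fitting procedure or the model-selection criterion, so the expectation-maximization loop and the BIC comparison below are assumptions chosen for illustration, not the authors' implementation. All names are hypothetical and the read counts are invented toy data.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of k successes."""
    p = min(max(p, 1e-9), 1.0 - 1e-9)  # keep the logarithms finite
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def _component_logs(a, d, weights, vafs):
    return [math.log(w) + log_binom_pmf(a, d, p) for w, p in zip(weights, vafs)]


def _logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def fit_binomial_mixture(alt, depth, n_clusters, n_iter=200):
    """Fit by EM; returns (weights, cluster VAFs, log-likelihood)."""
    observed = sorted(a / d for a, d in zip(alt, depth))
    # Deterministic start: cluster centers spread over quantiles of the data.
    vafs = [observed[int((j + 0.5) * len(observed) / n_clusters)]
            for j in range(n_clusters)]
    weights = [1.0 / n_clusters] * n_clusters
    for _ in range(n_iter):
        # E-step: responsibility of each cluster for each variant.
        resp = []
        for a, d in zip(alt, depth):
            logs = _component_logs(a, d, weights, vafs)
            total = _logsumexp(logs)
            resp.append([math.exp(v - total) for v in logs])
        # M-step: update mixing weights and cluster VAFs.
        for j in range(n_clusters):
            rj = [r[j] for r in resp]
            weights[j] = max(sum(rj) / len(alt), 1e-12)
            reads = sum(r * d for r, d in zip(rj, depth))
            vafs[j] = sum(r * a for r, a in zip(rj, alt)) / max(reads, 1e-12)
    loglik = sum(_logsumexp(_component_logs(a, d, weights, vafs))
                 for a, d in zip(alt, depth))
    return weights, vafs, loglik


def choose_n_clusters(alt, depth, max_clusters=4):
    """Pick the cluster count with the lowest BIC (an assumed criterion)."""
    best = None
    for k in range(1, max_clusters + 1):
        weights, vafs, loglik = fit_binomial_mixture(alt, depth, k)
        bic = -2.0 * loglik + (2 * k - 1) * math.log(len(alt))
        if best is None or bic < best[0]:
            best = (bic, k, weights, vafs)
    return best[1], best[2], best[3]


# Invented toy data: variants near 50% and near 20% VAF at 200x depth.
toy_alt = [100 + s for s in (-8, -4, 0, 4, 8)] * 4 + [40 + s for s in (-6, -3, 0, 3, 6)] * 4
toy_depth = [200] * len(toy_alt)

if __name__ == "__main__":
    k, weights, vafs = choose_n_clusters(toy_alt, toy_depth)
    print("clusters:", k, "VAFs:", [round(v, 3) for v in sorted(vafs)])
```

Note that the sketch works on raw read counts rather than on a smoothed density, and that it cannot separate two independent clones that happen to sit at the same variant allele frequency.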
But more importantly, I think, is that we're unable to distinguish with this kind of data between two independent clones that both occur at say 20 percent variant allele frequency. Without, you know, kind of single cell methods, there's no way to get that from this data.
So, in conclusion, we can detect somatic mutations at very low frequencies using BASSOVAC, our new caller. And we developed an R package for automatically inferring the subclonal architecture in tumors. We hope to release beta versions of both of these by the end of the year. They're not currently available, but will be shortly. And really the overarching goal of this kind of research is to characterize these minor subclones at diagnosis rather than discovering their presence at the relapse when it may be already too late to design appropriate treatments.
So, in conclusion, I'd like to acknowledge a host of people who made this possible. Mike Wendl has been leading the BASSOVAC project, and Nathan Dees has been pushing the clonality analysis out the door. A host of people at the Genome Center over here who have contributed in one way or another, our collaborators who provided data, and expertise, and advice, and leadership at the Genome Center. And then our funding agencies at the NHGRI and the NCI, and, of course, The Cancer Genome Atlas. Thanks.
[applause]
Charles Perou: Lou, you go first.
Lou Staudt: I wonder if there isn't an important implication in your breast cancer metastasis findings. So if I understand correctly, in the metastasis you've got both the tumor dominant clone --
Chris Miller: [affirmative]
Lou Staudt: -- with all the 50 percent alleles, and you've got a tumor subclone.
Chris Miller: [affirmative]
Lou Staudt: So does that predict, then, the metastasis must not have come from a single cell, but rather a clump of cells that had both the minor and the major on it, or you're recreating all those mutations in the metastasis --
Chris Miller: Well, so anything -- any mutation that's present in the founding clone is going to be present in all the subclones as well, but the fact that it does appear at a lower variant allele frequency does indeed predict that it's not a single cell that caused that metastasis. That it was a clump of cells containing both those original ones, and a subset with additional mutations from the subclone, yeah.
Lou Staudt: I'm not an expert in this area, but I know --
Chris Miller: I'm not either. [laughs]
Lou Staudt: -- there's a lot been done -- a lot said about individual breast cancer cells being found in the bone marrow and et cetera, and whether clumps might be more the thing that metastasize.
Chris Miller: Yeah, I don't doubt that single cells may be capable of that, but in this case it's clearly not one cell.
Lou Staudt: Okay.
Male Speaker: Yes, hi. I have a technical question.
Chris Miller: Sure.
Male Speaker: It seems to me that the selection of the bandwidth of your kernel density estimate should affect the estimate -- the maximum likelihood estimate of the number of subclonal populations. Have you looked into that, or how do you choose that bandwidth?
Chris Miller: So, we don't actually use the bandwidth when we're doing the binomial fitting. The bandwidth is purely smoothing for the eye, just to get the pretty pictures. So we actually just take the raw data and feed it into the algorithm so, yeah.
Male Speaker: I got a question. So, if you were to compare the clonality analysis coming from exomes versus full genomes, are they -- do they give you similar answers, the same answer?
Chris Miller: That's very dependent upon the number of variants that we're finding in these tumors.
For example, even in some of the whole genomes, and definitely in some of the AML exomes where you see very few mutations, it's very hard to cluster with only 10 mutations, right? It's very hard to know what's going on there. The exomes -- so we do have to set a minimum threshold on the number of mutations that we have --
Charles Perou: We could probably do that on -- like the breast cancer data where there's 20 basal genomes fully sequenced and the exomes on those, right?
Chris Miller: Yeah, where we have whole genomes, it's really easy, because you can include all those tier two and tier three mutations, and get hundreds of mutations to get much finer resolution on your kind of subclone architecture. With just tier one exome stuff, it's a little bit harder, but we can do it, provided that there's enough mutations in the sample.
Charles Perou: There's enough mutations. Got you.
Chris Miller: Yeah.
Charles Perou: All right. Thank you. So our next speaker will be Adam Ewing from UC Santa Cruz.