Uncovering the Pseudo-Subclonal Structure of Tumor Samples with... - Yi Qiao

Male Speaker: -- talk about uncovering the pseudo-subclonal structure of tumor samples with copy number variation of next-generation sequencing data. Thank you. Yi Qiao: First I'd like to thank the organizers for granting me such a wonderful opportunity to share with you some of the work I have been doing since I joined the Gabor Marth Lab as a new graduate student. As you are probably all aware, a tumor sample is always going to be mixed with a certain amount of the surrounding normal cells, and if you sequence the sample, the sequenced material is going to be a mixture of DNA from both the tumor and the normal. We started this project to look at the mixture ratio of normal in any sequencing data by looking at the copy number information that you can gain from the BAM file, which is pretty straightforward to do. You can count the read depths in, in this specific example, a 10 kb moving window of the tumor BAM file. You can also count the same thing in the paired normal. If you calculate the ratio between the tumor and the normal read depths, then the ratio is going to capture the somatic copy number variants in the tumor sample. If you plot a histogram of the ratio, and you presume that the sample being called a tumor is a pure, homogeneous sample with some heterozygous deletion events, then, aside from the large peak around one, which corresponds to copy-number-neutral regions, you would expect a peak around 0.5, which corresponds to the heterozygous deletion events. However, what is interesting is that we actually observed the peak at 0.7, which means the peak has been pushed towards where the copy-number-neutral ratio resides.
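As a rough illustration of the windowed tumor/normal ratio described above, here is a minimal Python sketch. It assumes the per-window read counts have already been extracted from the two BAM files; the function name `depth_ratios`, the example counts, and the median-centering step are assumptions made for this illustration and are not taken from the talk.

```python
from statistics import median

def depth_ratios(tumor_counts, normal_counts):
    """Tumor/normal read-depth ratio per window, centered so the typical window is 1.0."""
    if len(tumor_counts) != len(normal_counts):
        raise ValueError("tumor and normal must cover the same windows")
    # Raw ratio per window; windows with no normal reads are left undefined.
    raw = [t / n if n > 0 else None for t, n in zip(tumor_counts, normal_counts)]
    valid = [r for r in raw if r is not None]
    if not valid:
        raise ValueError("no window has normal coverage")
    # Assumption: most windows are copy-number neutral, so the median raw
    # ratio reflects only the difference in sequencing depth.
    center = median(valid)
    if center == 0:
        raise ValueError("median ratio is zero; cannot normalize")
    return [r / center if r is not None else None for r in raw]

# Hypothetical counts for five windows; the third has lost reads in the tumor,
# and the fifth has no normal coverage.
tumor = [200, 200, 100, 200, 200]
normal = [100, 100, 100, 100, 0]
print(depth_ratios(tumor, normal))
```

A histogram of the returned values would then show the copy-number-neutral peak near one, with deleted windows falling below it.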
And since a cell cannot really have a non-integer copy number, the way to interpret this is to think of the original sample as a mixture of both a tumor clone and a contaminating normal clone. Based on a simple linear combination, you can actually solve for the contamination ratio in this case. For example, if the tumor purity is 80 percent, that will shift the line up, which is what we are observing in this case. If you locate the peaks and identify where copy number two and copy number one are, which show as the black and the blue lines in this slide, you can actually calculate the contamination ratio. Using that ratio again and the location of the copy-number-neutral line, you can predict where copy numbers three and four will appear in the histogram for this tumor. And that fits the data extremely well. If you overlay these lines on the plot of ratio versus chromosome location, you can see that these lines perfectly explain some of the segments that you're observing here. So we were really excited about the results, and when we applied the same technique to all the other chromosomes from the same sample (this is only chromosome 19 of a specific TCGA sample), we observed something quite odd. Different chromosomes turn out to have different tumor purity. And they tend to cluster together, with very large differences among clusters but very small differences within a cluster. Now, this can't happen, because you can only mix cells at a cellular level, so how come different chromosomes have different contamination ratios? Unless there is more than one subclone in the tumor. And that's where the tumor heterogeneity comes in.
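The linear combination mentioned above can be written out explicitly. The sketch below assumes a diploid normal (two copies) and a single tumor clone with an integer copy number c at the segment, so the expected ratio is r = neutral * (1 + p * (c - 2) / 2) for tumor purity p. The function names and the `neutral_ratio` parameter are hypothetical, chosen only for this illustration.

```python
def expected_ratio(purity, tumor_copies, neutral_ratio=1.0):
    """Expected tumor/normal ratio for a segment with `tumor_copies` in the tumor clone."""
    # Mixture of tumor cells (tumor_copies) and normal cells (2 copies),
    # relative to a pure diploid sample.
    return neutral_ratio * (1.0 + purity * (tumor_copies - 2) / 2.0)

def purity_from_peak(peak_ratio, tumor_copies, neutral_ratio=1.0):
    """Invert expected_ratio: solve for purity given an observed peak location."""
    if tumor_copies == 2:
        raise ValueError("a copy-neutral peak carries no purity information")
    relative = peak_ratio / neutral_ratio
    return 2.0 * (relative - 1.0) / (tumor_copies - 2)

# Under these assumptions, a heterozygous-deletion peak at 0.7 implies
# a purity of about 0.6.
purity = purity_from_peak(0.7, tumor_copies=1)
print(round(purity, 3))

# The same purity predicts where copy numbers three and four should appear.
for copies in (3, 4):
    print(copies, round(expected_ratio(purity, copies), 3))
```

Under the same formula, a purity of 80 percent would place the heterozygous-deletion peak at about 0.6 rather than 0.5, which is the kind of upward shift described in the talk.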
We modeled this as a hierarchical subclone structure where the most prevalent mutation should appear in all the tumor, de-masked [spelled phonetically] tumor cells, which leaves out, for example, 20 percent, which would be the normal contamination. But the less represented mutation will exist only in a subset of the tumor subclones. At those locations, the cells that don't have that minor mutation will act as if they were normal, and that is why you're observing different tumor purity at different chromosomes. The same applies for the even less prevalent copy number changes. So this is the result if we are able to reconstruct the entire structure. Algorithmically, we came up with a way of doing that, which is to approach this problem from the bottom up: look for the most diverged subclone in the sample first and then make your way up. For example, if you have observed ratio data like the one shown with the white lines, then you can initialize a model which consists of only the normal genotype, which is presumably 100 percent copy number two. Assuming that this is your tumor sample, you can predict where the ratio would be, and granted, that is going to be one all over the place. You can then calculate the differences between the model and the actual data, and you can come up with numbers that represent the contamination ratio. In this case they are different from each other. So what you do is break your normal clone into two subclones. The right leaf is still going to represent the normal subclone, but the left leaf is going to contain the mutations, snapped to an integer copy number, that explain the differences you observed in the actual data, along with the corresponding tumor purity.
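To make the nested structure concrete, the following sketch mixes a normal population with two tumor subclones and shows how a mutation carried by only one subclone produces a lower apparent purity. The fractions, segment names, and function names are all hypothetical and are not taken from the slides; each altered segment is assumed to be a single-copy loss against a diploid normal.

```python
NORMAL_COPIES = 2

def predicted_ratios(subclones):
    """Mix subclones into one ratio per segment.

    `subclones` is a list of (cell_fraction, {segment: copy_number}) pairs;
    segments missing from a profile are taken to have the normal two copies.
    """
    total = sum(fraction for fraction, _ in subclones)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("cell fractions must sum to 1")
    segments = sorted({seg for _, profile in subclones for seg in profile})
    return {
        seg: sum(f * profile.get(seg, NORMAL_COPIES) for f, profile in subclones)
        / NORMAL_COPIES
        for seg in segments
    }

def apparent_purity(ratio, tumor_copies):
    """Purity one would infer from a single segment, assuming a single tumor clone."""
    return 2.0 * (ratio - 1.0) / (tumor_copies - NORMAL_COPIES)

# Hypothetical hierarchy: 20% normal, 50% carrying only the prevalent deletion,
# 30% carrying both the prevalent and the less prevalent deletion.
mixture = [
    (0.2, {}),
    (0.5, {"seg_prevalent": 1}),
    (0.3, {"seg_prevalent": 1, "seg_minor": 1}),
]
for seg, ratio in predicted_ratios(mixture).items():
    print(seg, round(ratio, 3), round(apparent_purity(ratio, 1), 3))
```

In this example the prevalent deletion looks like 80 percent purity and the minor one like 30 percent, because cells without the minor deletion contribute two copies there, exactly as normal cells would.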
One thing to notice: the most diverged cell will have both the more prevalent and the less prevalent mutations. So when you initialize the copy number profile of the smaller tumor subclone, you have to account for all the differences that you are seeing here. With this model you can update your predicted values, and as you can see, one of the segments is going to be explained perfectly. Then you can update the numbers and pretty much do the same thing again, but each time you always break the normal subclone to account for the part that has not yet been explained, until your model fits your data well. We can devise a score here, which is pretty much the sum of the absolute values of the differences between the model and the actual data. And this entire thing is iterated over and over again until the score does not improve anymore. If you backtrace the leaf nodes, that is exactly what you have seen in my previous slides, which is the structure that we learn from this algorithm. However, it is important to keep in mind that given one set of ratio data, there is always a chance that more than one actual biological structure corresponds to the same ratio observation. For example, the model on the left and the model on the right. We don't think that it is possible to distinguish between these two models based on the copy number ratio data alone. So the algorithm chooses to produce the model on the right, just because we believe it is a better representation of the actual biological process. And as many of you who talked to me yesterday during the poster session have mentioned, this looks strikingly like what we expect when there is a cancer stem cell: they sort of aggregate mutations themselves, but along the way they produce the cells that constitute the entire tumor. And that's pretty much the method with these simulations.
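The score described above, the sum of absolute differences between model and data, can be sketched in a few lines. The example also shows one simple way two different structures can fit equally well; the two models used here are an illustrative assumption (they follow the homozygous versus heterozygous deletion case raised in the question period) and are not the two models shown on the slide. All names are hypothetical.

```python
def model_ratios(subclones):
    """Predicted ratio per segment for a list of (cell_fraction, copy_numbers) pairs."""
    n_segments = len(subclones[0][1])
    if any(len(copies) != n_segments for _, copies in subclones):
        raise ValueError("all subclones must cover the same segments")
    # Each segment's ratio is the fraction-weighted copy number over the diploid baseline.
    return [
        sum(fraction * copies[i] for fraction, copies in subclones) / 2.0
        for i in range(n_segments)
    ]

def score(subclones, observed):
    """Sum of absolute differences between predicted and observed ratios (lower is better)."""
    predicted = model_ratios(subclones)
    if len(predicted) != len(observed):
        raise ValueError("model and data must cover the same segments")
    return sum(abs(p - o) for p, o in zip(predicted, observed))

# Observed ratios for two segments: one neutral, one at 0.5.
observed = [1.0, 0.5]

# Model A: every cell carries a heterozygous deletion on the second segment.
model_a = [(1.0, [2, 1])]
# Model B: half the cells are normal, half carry a homozygous deletion.
model_b = [(0.5, [2, 2]), (0.5, [2, 0])]

print(score(model_a, observed), score(model_b, observed))
```

Both models reach the same score on these data, so a procedure that stops when the score no longer improves needs some additional rule, or additional data, to choose between them.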
If you start with a mixture of 40 percent normal and 60 percent tumor, with the copy number profile as shown here in the tumor sample, the method is able to predict 42 percent normal and 58 percent tumor, which is fairly close. We have incorporated a small error term in there to account for the error that you observe in actual sequencing data, so that's why the numbers are a little bit off, but overall it's doing a pretty good job. If you change the simulation profile to include a third tumor subclone, the result is comparable: 20 percent of one subclone and 57 percent of the other subclone, with 23 percent normal. And we applied the method to some actual TCGA data. This is the result that the algorithm produces from the data that I showed you earlier in the chromosome pileup figure, and it looks pretty close to the one that we came up with by hand. This is specifically ovarian cancer. This is another example, from glioblastoma, and you can see that there is much higher heterogeneity compared to ovarian cancer. And there are a bunch more examples from glioblastoma. So, just to give you a taste of what the program's output looks like: if you zoom in on the part where the actual result is, I don't know if you can see, but the purple line represents the observed ratio values after genome segmentation. So that's pretty much representing the somatic copy number events in your sample [spelled phonetically]. And the dark green line is what the final model is predicting, and they look pretty close. There are even segments where you only see the green line; that just means the green line is perfectly overlapping the purple line. And, yes, in conclusion, the method is able to simultaneously estimate both the normal cell contamination ratio and some measurement of the tumor heterogeneity, both of which I think will be very important for any downstream analysis.
And the algorithm is pretty fast. In the worst-case scenario, if you have as many different copy number states as the number of chromosome locations you investigated, then at most you will need that number of iterations to explain the entire data. The actual speed-limiting step is counting the read depths. It takes roughly a day and a half to count the read depths of whole-genome sequencing data with 40 times medium coverage, and roughly two minutes to come up with the subclone structure. The information can be used as a prior for any downstream analysis. For example, if you know that you have a certain percentage of normal contamination, you might require less evidence to call a variant in your tumor sample. The method is actually designed to be independent of the [unintelligible]. In this case I implemented my own very rudimentary [unintelligible], but if you have your favorite [unintelligible] you can just plug it in, and it takes [unintelligible] the segmented genome. The model it produces is a biologically motivated model, but there are definitely other possibilities. It represents a class of possible combinations based on copy number data alone. Which brings us to the future directions: I didn't say it here, but it might be helpful to incorporate sequence data to try to differentiate between the case where you have no overlapping mutations and the case with overlapping mutations. Another thing that we want to do is validation. Since we're a bioinformatics lab, we are actually looking for collaboration with [unintelligible] groups to do some validation work. The current examples that I showed were tested on whole-genome sequencing data, and the next [unintelligible] to test, and probably make modifications to make this algorithm work with, capture data. And, yep, I'd like to thank everybody in the lab.
I'm pretty new to this lab and everybody has shown a tremendous amount of support for my work. I'd also like to thank TCGA for the opportunity both to get my hands on some excellent data and to present my work to the community. Thank you. [applause] Male Speaker: Questions? I'll start out with one question. Very exciting, very nice work. How are we going to cope with representing the uncertainty? You gave a good example where there were a couple of different interpretations -- Yi Qiao: Yes -- Male Speaker: -- of the data, and I think we're struggling with that at Santa Cruz as we do similar work. We don't want to present one machine solution as the absolute truth when we're actually quite uncertain as to how the data should actually be interpreted. Do you have any thoughts on -- Yi Qiao: Well, I sort of think about that -- Male Speaker: -- how to -- Yi Qiao: -- for example, given the example when I was sort of breaking -- [talking simultaneously] Male Speaker: [unintelligible] Yi Qiao: Okay. So I was sort of breaking a tree structure -- Male Speaker: Yeah. Yi Qiao: -- and most of the time there are going to be different ways to break the left node and the right node that are equally good at explaining the data. And I think there might actually not be a way to differentiate which one is correct -- Male Speaker: Right. Yi Qiao: -- based on the copy number data alone. Male Speaker: Absolutely. That's the point: we will have uncertain situations where there's no one solution that we're confident in. There could be other equally good solutions. So how can we communicate that fact to the biological community? Yi Qiao: We should first of all make them aware of that -- Male Speaker: Yes [laughs] -- Yi Qiao: -- announcing [spelled phonetically], and then, most of the time I don't think that is going to matter.
For example, it might be equivalent if, for example, there is a location with a homozygous deletion -- Male Speaker: Yes -- Yi Qiao: -- but only 50 percent of the cells have it, versus a heterozygous deletion that all the cells have. If you break the genome into pieces and do what sequencing technology does nowadays, then there might not be a difference at the DNA level. Male Speaker: Right, but there's a very important difference at a biological level -- Yi Qiao: Yes -- Male Speaker: -- and so -- Yi Qiao: -- yes. Male Speaker: -- we need to know that until we -- Yi Qiao: Yeah, we might -- Male Speaker: Okay. Yi Qiao: -- need to incorporate other data then. Male Speaker: All right. So it's an issue. Any other questions? Male Speaker: So, actually I have a sort of related question to [unintelligible]. So their data means [spelled phonetically] there's an alternative scenario, but suppose there's only one scenario. Do we [unintelligible] also [unintelligible] confidence interval for the percentage of different tumors? Yi Qiao: Yes, that is actually the thing that I want to work on. Currently it is based on a linear model, but it might need to be changed to incorporate a statistical solution. For example, what is a percentage that -- I think the confidence is going to play more of a role in determining the actual integer copy number in a subclone, and less of a role in the actual ratio. So that's where the confidence is going to play an important role. Male Speaker: Thank you. Male Speaker: Okay, let's thank Yi again. Yi Qiao: Okay --