TCGA - Network-Based Stratification of Tumor Mutations - Matan Hofree

Matan Hofree: Hi, everyone. First, thank you to the organizers for giving me this opportunity to present the work on network-based stratification of tumor mutations. I am a student in Trey Ideker's group, and I have just realized that I am keeping all of you from lunch, so I'll try to keep it interesting and to the point. Let's start. We've discussed stratification at length, so I'll say just one word. It is clearly important, as all of you have tried to do it in various forms. We feel it's a major milestone on the way to patient-tailored treatment. And there have been many successful attempts; we've seen a few today. I'll remind us of the original work by Verhaak, who just preceded me, where they subtyped glioblastoma into four types -- I guess five types now -- and some of these subtypes have a significant association with survival. In other cases, however, for example ovarian cancer, the subtypes that were defined by expression were not as successful at recapitulating a clinical phenotype.
And we were asking ourselves: there remains a type of data that might harbor information for subtyping and hasn't been used yet, which is somatic mutations. Then the question arises, "Why are somatic mutations difficult?" or "Why haven't they been used before for subtyping?" Somebody already mentioned today on this podium that there's just not enough of them; it's a very sparse data type. Here, I just plotted the patient-by-mutation matrix for chromosome 17. The only feature that is apparent is this TP53 mutation, which is in the majority of the ovarian cancer cohort -- this is just for ovarian cancer here. And to quantify this a bit more carefully, here I show a histogram on the bottom, a histogram of the columns of this matrix, so basically the frequency of mutations for each patient. And here, basically, a histogram of the rows.
And as we can see, most mutations occur in a very small fraction of the patients. We could discount these as passengers. However, in some sense, these might be very important mutations for that specific patient. My lab does a lot of work from a network perspective, and we were curious whether it was possible to go from the clustering you see above, the stratification, which is not very meaningful, and use networks to provide something that's more meaningful. We proposed a network-based stratification approach, which is based on consensus clustering. We start, as in consensus clustering, with a bootstrap initialization of the data, so we draw out a sample. Next we apply a network smoothing step: starting from individual mutations, we propagate them onto a network, and I'll expand on this in a moment. On these propagated features, we next apply a network clustering approach, which is basically an NMF with an added regularization layer for the network. For anybody not familiar with NMF, you should just think of a fancier k-means. So we basically do a fancier k-means with the network. Finally, when we aggregate the results, we get something that actually seems to contain information, as opposed to the same data when we apply consensus clustering NMF out of the box.
So, an intuition for network smoothing, or why we think this captures something that makes sense. Here we see two virtual genotypes, genotype A and genotype B, and if you can see at the bottom here, they are very sparse -- just a few ones, mostly zeros -- with very little overlap between the two genotypes.
Through network propagation, we are able to smooth across the network, basically allowing influence from the individual mutations to seep into their network neighborhoods, forming a much denser vector that now has a lot more to compare between these two genotypes; basically, at the end, forming these areas of overlap between the individual genotypes.
We started by testing this out in simulation, and we built a simple simulation framework. The variables we varied are the size of the pathway that we believe is implicated in cancer, or the amount of pathway information that is tied to the cancer, and, on the bottom, the frequency of drivers: how many of the mutations are part of the background and how many are cancer drivers. When we compare the two approaches, without smoothing and with our network-based approach, we can see that without smoothing, if the mutations are very common and the pathways are small, we are able to capture them using the standard method. However, when we use a network-based approach, we can push this area of informative clustering much further into this space. We actually believe that real cancer lies somewhere in this space. When we apply this to a random network, as a sanity check, we see a degradation in signal, which we found very encouraging.
Of course, I would not come here if we only had results on simulation. We applied this to the TCGA ovarian cancer data, which, as I've already mentioned, had a not quite as interesting association with biology for the subtyping results. And here you see the conventional stratification, which really is not interesting: you get a monolithic cluster and something that looks like a few outliers. When we apply our approach, you get something that really looks like a meaningful signal. The question is, is it biological? And I'd like to argue that it is.
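The network smoothing step described above can be illustrated with a minimal sketch, assuming a random walk with restart of the form F <- alpha * F * A + (1 - alpha) * F0 over a row-normalized adjacency matrix A. The toy five-gene path network, the two genotypes, the value alpha = 0.7, and the function names `row_normalize`, `propagate`, and `overlap` are all assumptions made for illustration; they are not taken from the talk.

```python
# Toy illustration of network smoothing via random walk with restart.
# Everything here (network, genotypes, alpha) is a hypothetical example.

def row_normalize(adj):
    """Divide each row of an adjacency matrix by its row sum (node degree)."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def propagate(f0, adj_norm, alpha=0.7, tol=1e-9, max_iter=1000):
    """Iterate F <- alpha * F A + (1 - alpha) * F0 until F stops changing.

    alpha controls how far the signal spreads from the mutated genes;
    alpha = 0.7 is an assumed value, not one stated in the talk.
    """
    n = len(f0)
    f = list(f0)
    for _ in range(max_iter):
        nxt = [
            alpha * sum(f[i] * adj_norm[i][j] for i in range(n))
            + (1 - alpha) * f0[j]
            for j in range(n)
        ]
        if max(abs(a - b) for a, b in zip(nxt, f)) < tol:
            return nxt
        f = nxt
    return f


def overlap(u, v):
    """Dot product, used here as a simple measure of shared signal."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical network: five genes connected in a path 0-1-2-3-4.
ADJ = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
A_NORM = row_normalize(ADJ)

# Two sparse genotypes with no mutated gene in common.
GENOTYPE_A = [1, 0, 0, 0, 0]
GENOTYPE_B = [0, 0, 0, 0, 1]

SMOOTH_A = propagate(GENOTYPE_A, A_NORM)
SMOOTH_B = propagate(GENOTYPE_B, A_NORM)

print("overlap before smoothing:", overlap(GENOTYPE_A, GENOTYPE_B))
print("overlap after smoothing: ", overlap(SMOOTH_A, SMOOTH_B))
```

Before smoothing the two genotypes share nothing; after smoothing both vectors are dense, so their overlap becomes positive. That is the effect the talk relies on for comparing sparse mutation profiles.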
Here I plot the association of patient survival with the number of clusters. The y-axis here is the survival log-likelihood ratio, so higher is better. And we see that for quite a range of cluster numbers, we get a significant association with survival, compared to either the permuted or the standard NMF. If we drill down to four subtypes, which we find reasonable both in terms of the association result and due to other intrinsic measures of clustering performance, we can see here that one subtype actually performs much worse than the rest in terms of mean survival. This is also recapitulated when we look at the probability of platinum resistance. So acquiring platinum resistance seems to possibly be an event driving this result. If we compare it to other data types, which, thankfully, we could download from Firehose, we see that they also get some likelihood-ratio performance; however, it is inferior to what we get when we use different networks. So our method both recovers different subtypes and recovers subtypes that are actually more predictive of survival.
Finally, we asked ourselves: since expression measurements are still much easier to come by than somatic mutations, can we transfer our results, using what TCGA did, from the world of somatic mutations into expression subtypes? So we defined the subtypes as before, and we now used a supervised learning approach to predict the subtypes using expression data. The expression data are now used as biomarkers, basically, to predict the subtypes that were defined using somatic mutations. Now we can apply this to an independent dataset and see how well we're performing. First of all, as a measure of how much this actually works, we test this out in cross-validation.
Here we see the performance for somatic mutations and here we see the performance for the expression subtypes. There's a degradation of performance, but it is still above what we'd get by random chance. We are able to recapitulate some of the survival difference within the same dataset, and when we apply this to an independent dataset, in this case the Tothill expression cohort, we still maintain that separation for three of the four subtypes, where one subtype is lost due to lack of patients. But it does maintain the same trend as we've seen before. Just as a comparison, this is rerunning standard consensus clustering NMF. This results in a substantially different clustering that has a much less remarkable association with survival.
Just reading a bit into the biology, we can define which genes in the network are different for subtype one, the subtype that actually has lower survival. And we see a number of things that are very encouraging. The first thing that popped out to us is this caspase pathway, which has quite a few hits. It is quite widely reported as being tied to resistance to cisplatin and platinum drugs. A second result, which has been discussed here today, is this FGFR pathway. It has been discussed in the context of other cancers, but there is a significant amount of work showing that the FGF pathway is tied to platinum resistance in human bladder cancer and, finally, here in ovarian cancer, in two recent papers.
So, to summarize what we've shown: we can use network-based stratification to recover what we believe are biologically relevant subtypes. We believe that somatic mutation subtypes are different from those recovered from other downstream molecular phenotypes.
These subtypes can be recapitulated using gene expression as a biomarker signature, and each of these subtypes seems to have specific affected subnetworks, which might explain why we are unable to find these specific genes as mutated over entire cohorts: they are specific to a certain rare subtype of the disease. So, a one-slide summary: we go from here to here using a network. That's the gist of this talk. And I should give some acknowledgements. First of all, J.P. Shen, who helped me extensively and is here in the crowd. Janusz Dutkowski and Andy from my team also had a lot of insights during this work. And, of course, Trey, for his help and support. Thank you all.
Charles Perou: All right. Questions? Derek [spelled phonetically]?
Male Speaker: That's a fascinating talk. Could you describe, briefly, your method for network propagation?
Matan Hofree: So, the method is described in detail in a paper by Vanunu, from Roded Sharan's group. But very briefly, we use a normalized adjacency matrix, and we start with a somatic mutation matrix, basically matrix-multiplying it by the adjacency matrix. The way to think about it is as a random walk model with restarts. You have a parameter that sets how far you want to propagate the signal.
Male Speaker: Thank you.
Charles Perou: Question there.
Male Speaker: Yeah, so you show that the natural caspase grouping shows correlation with clinical survival. I just wonder, given those known clinical feature variables, do they provide additional predictive value?
Matan Hofree: So, the way we explored this approach has been in a completely unsupervised manner. We have not included any clinical phenotypes because I feel like, in some sense, that would make it more of a supervised approach.
Using it just as a feature -- we could take it offline if you have specific ideas.
Charles Perou: Well, I think he was asking whether, if you did a multivariable analysis that had the clinical features and your stratification, your stratification would be predictive.
Matan Hofree: Then I misunderstood the question, I'm sorry. I do have a slide somewhere at the end. We do show that the clinical variables, like stage, grade, and age, are actually not correlated with the subtypes that we've derived. So the subtypes are independent of these variables.
Male Speaker: But given those known clinical variables, does adding the grouping to that base give a better prediction?
Matan Hofree: So we have not done the survival analysis in this way, but as a post-processing analysis, we can show that these are independent across the different subtypes, if that answers your question.
Charles Perou: One more question.
Male Speaker: So, we have done a very similar analysis, based just on gene expression, not the mutation part. What we see is very interesting: treatment type comes out as one of the confounding variables. The model works very well across platforms, across different data-generating laboratories, and so forth. But if you transfer it to another treatment, it has no predictive power. Have you done something similar?
Matan Hofree: When you say treatment, you mean the kind of chemotherapy these patients receive?
Male Speaker: Yes.
Matan Hofree: So in the case of TCGA, I think the vast majority of patients got a platinum-based treatment, so there wasn't any variability in the treatment types, and we didn't really explore that sort of analysis. The results do transfer to the Tothill dataset. I am unsure what exactly the specifics of the treatment were there.
Charles Perou: Thank you.
Matan Hofree: Thank you.
Charles Perou: We'll thank the speakers again, and we will --
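The aggregation step of the consensus clustering outlined in the talk (draw a bootstrap sample, cluster it, then aggregate the results) can be sketched as a co-clustering matrix: for each pair of patients, the fraction of bootstrap runs containing both in which they landed in the same cluster. This is a minimal sketch under stated assumptions. The function name `consensus_matrix`, the four-patient example, and the hand-written cluster labels are hypothetical, and the clustering itself (the network-regularized NMF mentioned in the talk) is not implemented here.

```python
from itertools import combinations


def consensus_matrix(runs, n_patients):
    """Aggregate per-run cluster labels into a co-clustering matrix.

    runs: list of dicts mapping patient index -> cluster label, one dict
          per bootstrap sample (only the sampled patients appear).
    Returns an n_patients x n_patients matrix whose (i, j) entry is the
    fraction of runs containing both i and j in which they share a label.
    Pairs never sampled together get 0.0; the diagonal is set to 1.0.
    """
    together = [[0] * n_patients for _ in range(n_patients)]
    sampled = [[0] * n_patients for _ in range(n_patients)]
    for labels in runs:
        for i, j in combinations(sorted(labels), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if labels[i] == labels[j]:
                together[i][j] += 1
                together[j][i] += 1
    return [
        [
            1.0 if i == j
            else (together[i][j] / sampled[i][j] if sampled[i][j] else 0.0)
            for j in range(n_patients)
        ]
        for i in range(n_patients)
    ]


# Hypothetical labels from four bootstrap runs over four patients.
# Label values are only meaningful within a single run.
RUNS = [
    {0: 0, 1: 0, 2: 1},
    {0: 1, 1: 1, 3: 0},
    {1: 0, 2: 1, 3: 1},
    {2: 0, 3: 1},
]

CONSENSUS = consensus_matrix(RUNS, n_patients=4)
for row in CONSENSUS:
    print(row)
```

A final set of subtypes would then come from clustering this matrix. That step is left out because the talk does not specify how it was done.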