Matan Hofree: Hi, everyone. First, thank you to the organizers
for giving me this opportunity to present the work on network-based stratification of
tumor mutations. I am a student in Trey Ideker's group, and I have just
realized that I am keeping all of you from lunch, so I'll try to keep it interesting
and to the point. Let's start.
We've discussed, at length, stratification, so if I -- just to say one word. It is clear
that it is important, as all of you have tried to do it in various forms. We feel it's a
very major milestone in the way for patient-tailored treatment. And there have been many successful
attempts; we've seen a few today. And I'll remind us -- so the original work, by Verhaak,
who just preceded me, where they have sub-typed glioblastoma into four types -- I guess five
types now -- where some of these subtypes have significant association with survival.
In other cases, however, for example, ovarian cancer, the subtypes that were defined by
expression were not as successful at recapitulating a clinical phenotype.
And we were asking ourselves -- there remains this type of data which might harbor information
for subtyping that hasn't been used yet, which is somatic mutations. And then the question
arises, "Why are somatic mutations difficult," or "Why haven't they been used before for
subtyping?" And somebody, already today on this podium, sort of mentioned that it's just
their -- there's not enough of them; they're -- it's a very sparse data type.
Here, I just plotted the patient by mutation matrix just for chromosome 17. The only feature
that is apparent is this TP53 mutation, which is in the majority of the ovarian cancer cohort
-- this is just for ovarian cancer here. And if you were to quantify this a bit more carefully
-- here I show a histogram on the bottom, a histogram of the columns of this matrix,
so basically the frequency of mutations for each patient. And here, basically a histogram
of the rows. And as we can see, most mutations occur in a very small fraction of the patients.
And it is -- we can sort of discount these as passengers. However, in some sense, these
might be for that specific patient, very important mutations.
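[Editor's illustration, not part of the talk] The two histograms described here summarize the rows and columns of a binary patient-by-gene matrix. A minimal sketch of that summary follows; the toy cohort and the function name `mutation_frequencies` are hypothetical, and gene 0 only plays the role of a TP53-like common mutation.

```python
def mutation_frequencies(matrix):
    """Summarize a binary patient-by-gene matrix.

    `matrix` is a list of rows, one per patient, with a 0/1 entry per gene.
    Returns (mutations carried by each patient, fraction of patients mutated per gene).
    """
    n_patients = len(matrix)
    per_patient = [sum(row) for row in matrix]
    per_gene = [sum(col) / n_patients for col in zip(*matrix)]
    return per_patient, per_gene


# Hypothetical toy cohort: 5 patients x 6 genes (made-up values, not TCGA data).
toy = [
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0],
]
per_patient, per_gene = mutation_frequencies(toy)
print(per_patient)  # mutation count for each patient
print(per_gene)     # one common gene, the rest rare or absent
```

In a matrix like this, only the first gene is shared widely enough to drive a clustering, which is the sparsity problem the speaker is pointing at.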
So, we were sort of curious -- my lab does a lot of stuff with the -- from a network
perspective, and we were curious whether it was possible to basically go from the clustering
you see above, a stratification which is sort of not very meaningful, and use networks
to sort of provide something that's more meaningful. And we proposed a network-based stratification
approach, which is based on consensus clustering.
We start, as in consensus clustering, with a bootstrap initialization of the data, so
we draw out a sample. Next we apply a network smoothing step, in which, basically starting
from individual mutations, we propagate them onto a network, and I'll expand on
this further in a moment. On these propagated features, we next apply a network clustering
approach, which is basically an NMF with an added regularization layer for the network.
And for anybody not familiar with NMF, you should just think about a fancier k-means. So
we basically do a fancier k-means with the network. Finally, when we aggregate the results,
we get something that actually seems to contain information, as opposed to the same data when
we apply consensus clustering NMF out of the box.
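[Editor's illustration, not part of the talk] The aggregation step of consensus clustering can be sketched as below. This is a sketch under assumptions, not the speaker's implementation: `consensus_matrix`, `cluster_fn`, and the 80% subsampling fraction are hypothetical names and values, subsampling without replacement is assumed, and the toy threshold clusterer stands in for the smoothing plus network-regularized NMF step.

```python
import random


def consensus_matrix(n_samples, cluster_fn, n_rounds=100, frac=0.8, seed=0):
    """Fraction of subsamples in which each pair of patients lands in the same
    cluster, counted only over subsamples that contain both patients."""
    rng = random.Random(seed)
    together = [[0] * n_samples for _ in range(n_samples)]
    sampled = [[0] * n_samples for _ in range(n_samples)]
    size = max(2, int(frac * n_samples))
    for _ in range(n_rounds):
        idx = rng.sample(range(n_samples), size)  # draw a subsample of patients
        labels = cluster_fn(idx)                  # stand-in for smoothing + clustering
        for a, label_a in zip(idx, labels):
            for b, label_b in zip(idx, labels):
                sampled[a][b] += 1
                together[a][b] += int(label_a == label_b)
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
         for j in range(n_samples)]
        for i in range(n_samples)
    ]


# Hypothetical one-dimensional profiles with two obvious groups.
values = [0.1, 0.2, 0.15, 0.9, 0.8, 0.95]


def toy_cluster(idx):
    """Toy clusterer: split the subsample at a fixed threshold."""
    return [int(values[i] > 0.5) for i in idx]


consensus = consensus_matrix(len(values), toy_cluster, n_rounds=50)
print(consensus[0][1], consensus[0][3])
```

Pairs that always co-cluster score near 1 and pairs that never do score near 0, so the final subtypes can be read off this matrix rather than off any single run.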
So an intuition for network smoothing or why we think this captures something that makes
sense. Here we see sort of two virtual genotypes, genotype A and genotype B, and if you can
see at the bottom here, they are very sparse -- just a few ones, mostly zeros -- and very,
very few -- very little overlap between the two genotypes. Through network propagation,
we are able to smooth across the network, basically allowing influence from the individual
mutations to seep into their network neighborhoods, forming a much denser vector that now has
a lot more to compare between these two genotypes; basically, at the end, forming these areas
of overlap between these individual genotypes.
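[Editor's illustration, not part of the talk] The intuition can be checked on a toy network. The sketch below assumes a random-walk-with-restart style update, f <- alpha * f A + (1 - alpha) * f0, with a row-normalized adjacency matrix; the four-gene path network, the value alpha = 0.7, and the function names are all hypothetical choices made for illustration.

```python
def row_normalize(adj):
    """Divide each row of a 0/1 adjacency matrix by that gene's degree."""
    normalized = []
    for row in adj:
        degree = sum(row)
        normalized.append([x / degree if degree else 0.0 for x in row])
    return normalized


def propagate(f0, a_norm, alpha=0.7, n_iter=100):
    """Iterate f <- alpha * (f @ A) + (1 - alpha) * f0 for one genotype vector."""
    n = len(f0)
    f = list(f0)
    for _ in range(n_iter):
        f = [
            alpha * sum(f[i] * a_norm[i][j] for i in range(n)) + (1 - alpha) * f0[j]
            for j in range(n)
        ]
    return f


def dot(u, v):
    """Inner product, used here as a crude measure of overlap."""
    return sum(x * y for x, y in zip(u, v))


# Hypothetical 4-gene path network: 0 - 1 - 2 - 3
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
a_norm = row_normalize(adj)
genotype_a = [1, 0, 0, 0]  # mutated only in gene 0
genotype_b = [0, 0, 1, 0]  # mutated only in gene 2
smooth_a = propagate(genotype_a, a_norm)
smooth_b = propagate(genotype_b, a_norm)
print(dot(genotype_a, genotype_b))      # raw genotypes share no mutated gene
print(dot(smooth_a, smooth_b) > 0)      # smoothed profiles overlap
```

The raw vectors have nothing in common, while the smoothed vectors both carry weight on the shared neighborhood, which is what makes them comparable.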
We start by testing this out in simulation, and we formed a simple simulation framework.
What we varied -- the variables we varied are the size of the pathway that we believe
is implicated in cancer, or the amount of pathway information that is tied to the cancer.
And on the bottom is the frequency of driver. So how much of the mutations are part of the
background and how much are part of the cancer drivers.
And when we compare between these two approaches, without smoothing and when we use our network-based
approach, we can see that without the smoothing, if the mutations are very, very common and
the pathways are small, we are able to basically capture them using the standard method. However,
when we use a network-based approach, we could push this area of informative clustering much
further into this space. We actually believe that the real cancer lies somewhere in this
space. When we apply this to a random network, as a sanity check, we sort of see a degradation
in signal, which we found very encouraging.
Of course, I would not come here if we only had results on simulation. We applied this
to the TCGA ovarian cancer, which I've already mentioned had sort of not quite as interesting
association with biology for the subtyping results. And here you see the conventional
stratification, which really is not interesting: You get the monolithic cluster and something
that looks like a few outliers. When we apply our approach, you get something that really
looks like a meaningful signal. The question is, is it biological? And I'd like to argue
that it is. Here I plot the association of patient survival with the number of clusters,
so we see that the y-axis here is the survival log likelihood ratio, so higher is better.
And we see that for quite a number of cluster numbers, we could basically get a significant
association with survival, compared to either permuted or the standard NMF.
If we drill down to four subtypes, which we find reasonable both in terms of the association
result and due to other intrinsic measures of clustering performance, we can see here
that there are four subtypes where there's one subtype that actually performs much worse
than the rest in terms of mean survival. And this is also recapitulated when we look at
the probability of platinum resistance. So the acquiring of platinum resistance seems
to be -- seems to possibly be an event driving this result. If we compare it to other data
types -- which we, thankfully, could download from the Firehose -- we sort of see that they
also get some sort of likelihood ratio; the performance, however, is inferior to what
we get when we use different networks. So our method both recovers different subtypes,
and subtypes that are actually more predictive of survival.
Finally, we sort of asked ourselves, can we take these results, as expression measurements
are still much easier to come by than somatic mutations. Can we transfer using what TCGA
did, our results from the world of somatic mutations, into expression subtypes?
So we basically defined the subtypes as before, and we used a supervised learning approach
now to sort of predict the subtypes using expression data. So the expression are now
used as biomarkers, basically, to predict the subtypes that were defined using somatic
mutations. Now we can apply this to an independent dataset, and sort of see how well we're performing.
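[Editor's illustration, not part of the talk] The talk does not say which supervised learner was used. As one simple stand-in, a nearest-centroid classifier shows the idea of learning mutation-derived subtype labels from expression and then labelling new expression profiles. All names and numbers below are hypothetical.

```python
def fit_centroids(expr, labels):
    """Mean expression profile for each subtype label."""
    sums, counts = {}, {}
    for row, label in zip(expr, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for k, x in enumerate(row):
            acc[k] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Assign the subtype whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((x - c) ** 2 for x, c in zip(row, centroids[label])),
    )


# Hypothetical training cohort: expression profiles (2 genes) with subtype
# labels that would come from the mutation-based stratification.
train_expr = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
train_subtype = [1, 1, 2, 2]
centroids = fit_centroids(train_expr, train_subtype)

# Hypothetical independent cohort with expression data only.
new_patients = [[0.8, 0.3], [0.0, 0.7]]
predicted = [predict(centroids, row) for row in new_patients]
print(predicted)
```

Whatever learner is used, the key design point from the talk is that the labels come from somatic mutations while the features come from expression, so the classifier can be applied to cohorts that have no mutation calls.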
So first of all, just to -- sort of as a measure of how much this actually works, we sort of
test this out in cross-validation. Here we sort of see the performance for somatic mutations
and here we see the performance for the expression subtypes. And we see there's a degradation
of performance, but still above what we'd get by random chance. And we are able to recapitulate
some of the survival difference within the same dataset, and when we apply this to an
independent dataset, in this case, the Tothill expression cohort, we still maintain that
separation into three of the four subtypes, where one subtype is sort of lost due to lack
of patients. But it does maintain the same trend, as we've seen before. Just as a comparison,
this is rerunning standard consensus clustering NMF. And this results in a substantially
different clustering that has a much less remarkable association with survival.
Just reading a bit into the biology, we can sort of define a subtyping of what are the
genes in the network that are different for subtype one, the subtype that is -- actually
has lower survival. And we see a number of things that sort of are very encouraging.
The first thing that sort of popped into our eyes is this caspase pathway that has quite
a few hits. It is quite widely reported as being tied to cisplatin resistance and platinum
drugs. A second result which has been discussed here today is this FGFR pathway. It has been
discussed in the context of other cancers, but there is a significant amount of work that
sort of shows that the FGF pathway is tied to platinum resistance in human bladder cancer
and here in ovarian cancer, finally, in two recent papers.
So, really to summarize what we've shown, is that we can use a network-based stratification
to recover what we believe are biologically relevant subtypes. We believe that somatic
mutation subtypes are different from those recovered for other downstream molecular phenotypes.
These subtypes can be recapitulated using gene expression as a biomarker signature,
and each of these subtypes seems to have specific affected subnetworks, which might explain the
reason why we are unable to sort of find these specific genes as mutated over entire cohorts
just because they are specific to a certain rare subtype of the disease.
So a one slide summary as we go from here to here using a network. Just -- that's the
gist of this talk. And if I -- I should give some acknowledgements. First of all J.P. Shen,
who helped me extensively and is here in the crowd. Janusz Dutkowski and Andy from my team
were -- also had a lot of insights during this work. And, of course, Trey, for his help
and support. Thank you all.
Charles Perou: All right. Questions? Derek [spelled phonetically]?
Male Speaker: That's a fascinating talk. Could you describe,
briefly, your method for network propagation?
Matan Hofree: So, we used -- the method is described in
detail in a paper by Vanunu, from Roded Sharan's
group. But very briefly, what we do is we basically use a normalized adjacency
matrix, and we start with a somatic mutation matrix, basically multiplying it -- matrix
multiplying it by the adjacency matrix, so it's -- the way to think about it is like
a random walk model with restarts. You have a parameter that sort of sets how far you
want to propagate the signal.
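[Editor's illustration, not part of the talk] One common way to write the update just described is given below. The symbols are labels chosen here rather than taken from the talk: F_0 is the patient-by-gene somatic mutation matrix, A is the normalized adjacency matrix of the network, and alpha, between 0 and 1, is the parameter that sets how far the signal propagates.

```latex
F_{t+1} = \alpha \, F_t A + (1 - \alpha) \, F_0 , \qquad 0 < \alpha < 1

% At convergence (F_{t+1} = F_t = F), assuming I - \alpha A is invertible:
F = (1 - \alpha) \, F_0 \, (I - \alpha A)^{-1}
```

Larger values of alpha let each mutation influence more distant genes, while the (1 - alpha) term keeps pulling each profile back toward the patient's original mutations, which is the "restart" in the random walk.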
Male Speaker: Thank you.
Charles Perou: Question there.
Male Speaker: Yeah, so I just wonder, you show that the network-based
grouping shows correlation with clinical survival. I just wonder, given those known
clinical feature variables, do they provide additional prediction value?
Matan Hofree: So the way we explore this approach has been
in a completely unsupervised manner. And so we have not included any clinical phenotypes
because we sort of wanted to -- I feel like, in some sense, that would sort of make it
more of a supervised approach. Using it just as a feature -- we could take it offline if
you have specific ideas.
Charles Perou: Well, I think he was asking maybe if you did
a multi-variable analysis that had the clinical features and your stratification, would your
stratification be predictive.
Matan Hofree: Then I misunderstood the question, I'm sorry.
So we do -- I do have a slide somewhere at the end. We do show that the clinical variables,
like stage, grade, and age, are actually not correlated with the subtypes that we've derived.
So these are independent of these variables.
Male Speaker: But given those known clinical variables,
do you add those -- add to your base, the grouping to write a better prediction?
Matan Hofree: So we have not done the survival analysis
in this way, but as a sort of post-processing analysis, we can show that these are independent
across the different subtypes, if that answers your question.
Charles Perou: One more question.
Male Speaker: So, we have done very similar analysis, based
just on the gene expression, not mutation part. What we do see is very interesting.
So what we see is treatment type comes out of the confounding variables. The model works
very well across platforms, across different data generation laboratories, and so forth. But
if you transfer it to another treatment, it has no predictive power. Have you done something
similar?
Matan Hofree: When you say treatment, you mean like the
kind of chemotherapy these patients receive?
Male Speaker: Yes.
Matan Hofree: So in the case of TCGA, I think that the vast
majority of patients got a platinum-based treatment, so there wasn't any variability
in the treatment types. So, we didn't really explore that sort of analysis. The results
do transfer to the Tothill dataset. I am unsure exactly what were the specifics of the treatment
they got there.
Charles Perou: Thank you.
Matan Hofree: Thank you.
Charles Perou: We'll thank the speakers again, and we will
--