Josh Stewart: I want to thank the organizers for inviting
me. So I'm going to talk about our method called PARADIGM which integrates multiple
types of data on patient samples for inferring what's going on in these cancers. So as folks
know in TCGA, we generate lots and lots of data and it's often referred to as a flood.
The Broad calls their system Firehose [spelled phonetically] for an appropriate reason. And
my point is that when you participate in these projects you often want to do lots of different
types of comparisons, from comparing expression and methylation to figure out why something's
not expressed or, you know, looking at the copy number and expression and methylation
all together. This quickly gets out of control. You have lots of combinations of things you
want to look at and it can be overwhelming. More importantly, when you're thinking about
a gene and trying to figure out what's going on with that gene is it active, is it not
active, you've got all these different pieces of data telling you different things, you
feel like you're at this stop light and you don't know whether to go or not. Well, at
least that's how I feel and this is often how many of us feel [laughs]. This is what
it makes you want to do.
This is your brain on all these types of data. So our particular approach is to say let's
use a knowledge-based approach, and the analogy I like to use is you're kind of like a detective
or a car mechanic in this example and each patient is a different accident, let's say.
And something different went wrong and some things are more serious than others here.
And if you could try to do data mining on these car wrecks, if somebody handed you a
ream of data, how fast the car was going, the direction, what people were saying. Some
of it's relevant; some of it's not relevant. You're going to be better off if you use knowledge
about how the system works, and I like Car Talk, so I'm showing Click and Clack here,
right. People call them because they know a lot about cars and can figure out and diagnose
the problem. And this cartoon shows a radiator running off and the mechanics looking in the
engine and saying, "I know what the problem with the car is, that you don't have a radiator."
Now you laugh, but with this data set, you know, it took a little bit of knowledge in
this case to know what was missing.
So in cellular systems obviously we have put together at least some of the circuitry and
the machines inside cells and so we should use those. And I'm going to show you a system
that defines a computational model to represent these types of systems and we benefit from
all these efforts out there, and there are many I didn't list (that's the ellipsis at
the end). We've drawn from Reactome, KEGG, BioCarta, NCI PID, many different institutions, and our
favorite, of course, combines all of them: Pathway Commons from Memorial Sloan-Kettering. And
so we try to suck in all that data to learn something about what's going on in a cell.
So to motivate why we want to do this beyond just data fusion just think of a simple example,
we've got a transcription factor and you're looking at the expression of the transcription
factor and there are, let's say, three different transcription factors shown here. You know,
you've got two that have high expression, shown in red, and one that has lower expression.
And we know that expression isn't everything and so it's almost a teleological argument,
but how do you figure out whether something's working or not? How would you figure out that
an enzyme's working? You know, even if you had magic goggles and you could look inside
a cell and see that it's bumping around and moving in a cell and chewing things up, you're
going to look at its secondary effects. You're going to look, did it actually metabolize
substrate? Or, if it's a kinase, did it actually phosphorylate a target? And for a
transcription factor, is it turning on its targets? Right, and so that secondary evidence
tells you something about the activity of the transcription factor so in this case you
assume or you infer that the transcription factor's on and that might confirm your expression
evidence.
Another case you might see that oh, well, the targets aren't doing anything downstream
of the factor and in this case you would think it's off, either post-transcriptionally
or even translationally. We didn't activate this protein, or it's not localizing correctly,
or there's a mutation that's blocking its function, or its co-activators, right,
aren't around. On the reverse, you could have a low level of expression of a factor and yet
it's still enough to have potent transcriptional activity. So you want to look around the neighborhood,
is the argument here, to figure out what's going on in these things.
And one more, so that's one piece of the -- of the puzzle is to look at neighbors. And the
other idea too is, you know, in this previous example we infer that the factor was on because
of its downstream targets. But suppose I ask Gady [spelled phonetically] to give me GISTIC
plots. Now this is a different type of data, copy number data, and just serendipitously,
all the targets are amplified now. And so I could explain away that overexpression
via amplification, and so I'm less likely now to think that the factor's on. Maybe I'm
still -- maybe I still think, you know, over my prior expectation it's on. But it's not
as high anymore because I have another way to explain the up-regulation of those targets,
via a cis-regulation type of machinery.
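As a toy illustration of this explaining-away argument (a sketch only, not the PARADIGM implementation), the reasoning can be written as a three-variable Bayesian network: a transcription factor that is on or off, a target that is amplified or not, and an observation that the target is highly expressed. All probabilities below, the noisy-OR form, and the function names are assumptions made up for the example.

```python
from itertools import product

# Assumed priors, chosen only for illustration.
P_TF_ON = 0.3   # prior belief that the transcription factor is active
P_AMP = 0.2     # prior belief that the target gene is amplified


def p_target_high(tf_on, amplified):
    """Noisy-OR: either cause can independently drive high target expression."""
    p_stays_low = 0.95            # 5% chance of high expression with no cause
    if tf_on:
        p_stays_low *= 1 - 0.8    # assumed strength of the factor
    if amplified:
        p_stays_low *= 1 - 0.7    # assumed strength of amplification
    return 1 - p_stays_low


def posterior_tf_on(amplified=None):
    """P(factor on | target high), optionally also conditioning on copy number."""
    numerator = 0.0
    evidence = 0.0
    for tf_on, amp in product([True, False], repeat=2):
        if amplified is not None and amp != amplified:
            continue  # inconsistent with the observed copy-number state
        joint = (P_TF_ON if tf_on else 1 - P_TF_ON)
        joint *= (P_AMP if amp else 1 - P_AMP)
        joint *= p_target_high(tf_on, amp)
        evidence += joint
        if tf_on:
            numerator += joint
    return numerator / evidence


expression_only = posterior_tf_on()
with_amplification = posterior_tf_on(amplified=True)
print(f"prior={P_TF_ON:.2f}  expression only={expression_only:.2f}  "
      f"plus amplification={with_amplification:.2f}")
```

With these assumed numbers the belief that the factor is on rises once the target is seen to be high, then drops again when the amplification is also observed, while staying above the prior.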
So to model those two pieces of information, we're also standing on the shoulders of giants
here. There's been lots of development in the '80s and '90s and even currently by seminal
work in the field from Judea Pearl and Heckerman in the early '90s and more recently by Daphne
Koller and Nir Friedman and Aranci Gal [spelled phonetically]. There's lots of people in this
list and I would recommend folks read this really nice review article by Nir Friedman
in Science in 2004, so it's getting dated, but it's still a very nice read. So these
Bayesian networks and probabilistic graphical models that they describe give us a very nice
way of modeling lots of different data and dependencies and we can -- we can learn something
from data where we might have had a knowledge bottleneck before.
And so just a simple example here, let's go back to the diagram we had from the -- from
the nice work from Sloan-Kettering and the GBM study, and we have an oncogene, MDM2, that
is known to inhibit p53. So there are two parts to this system that we model: one has
to do with the regulation of MDM2's activity and the other part has to do with the interaction
between it and p53. And just as a quick toy [spelled phonetically] example, the model
that we have, so when you see our activities for genes it's actually a little bit more
of a rich representation that looks something like the central dogma for a gene, right?
You could -- you have a certain number of copies in the genome, you can express it,
you can have a certain level of protein and a certain activity in that protein and all
these variables are beliefs that you infer from data and these little black boxes show
you constraints that help you infer those beliefs from data or from other beliefs in
the system. And you can propagate this information to infer something about a higher level thing
like apoptosis or activities for these genes. And that's what we use downstream for our
downstream analysis.
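A minimal sketch of that central-dogma idea, assuming a simple chain (copies, mRNA, protein, activity) in which each variable takes one of three states and each "black box" is a constraint that favors a child agreeing with its parent. The state encoding, the error rates, and the brute-force enumeration are assumptions for illustration; they are not taken from the actual model or its inference procedure.

```python
from itertools import product

STATES = (-1, 0, 1)                       # down, unchanged, up (assumed encoding)
CHAIN = ("copies", "mrna", "protein", "active")


def agree(parent, child, eps=0.1):
    """Constraint between neighbors: the child usually matches its parent."""
    return 1 - eps if parent == child else eps / 2


def observe(hidden, measured, eps=0.2):
    """Constraint tying a hidden variable to a noisy measurement of it."""
    return 1 - eps if hidden == measured else eps / 2


def activity_belief(observations):
    """Belief over 'active' given {variable: measured state}, by enumeration."""
    belief = {state: 0.0 for state in STATES}
    for assignment in product(STATES, repeat=len(CHAIN)):
        values = dict(zip(CHAIN, assignment))
        weight = 1.0
        for parent, child in zip(CHAIN, CHAIN[1:]):
            weight *= agree(values[parent], values[child])
        for variable, measured in observations.items():
            weight *= observe(values[variable], measured)
        belief[values["active"]] += weight
    total = sum(belief.values())
    return {state: weight / total for state, weight in belief.items()}


# Copy number and expression both measured as "up" for one gene in one sample.
print(activity_belief({"copies": 1, "mrna": 1}))
```

Enumeration is only practical for a toy chain like this one; a pathway-scale model would need a proper inference algorithm.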
And so the big picture looks like we take a cohort of patients, various types of data,
run it through our pathway models and then we produce one matrix that we can now do analysis
on. So we don't have to think about all these different modalities anymore, we can just
think about is the gene active in the sample and provide this new matrix for analysis.
So for the ovarian study, the obvious signature here from the PARADIGM analysis was this FOXM1
signature, so when we zoomed in on this, all the patients pretty much had an up-regulation
of this known mitotic regulator, FOXM1. The slightly more interesting story about it is
that it has two isoforms and one part feeds into proliferation, the other part feeds into
DNA repair. And there's a lot of disruptions in the genome and, you know, in all the ovarian samples
they're getting constitutive signaling through, like, ATM and ATR, turning on genes
like FOXM1 that, if they're not being spliced correctly, are promoting two different, very
opposite kinds of things that you want to happen in a cell, both, you know, this proliferation
switch and this DNA repair switch. So FOXM1 also regulates BRCA2, for example.
So, very interesting story surrounding FOXM1, if you take the pathway activities and you
try to define subtypes for the ovarian samples then the good news there was that we could
actually start seeing a delineation of meaningful subtypes so this purple cluster shows you
that they have slightly better survival patterns than the rest of the patients.
We've recently worked on the colorectal paper led by Rajinder Kaul [spelled phonetically]
and David Wheeler [spelled phonetically] and in this case the story isn't so much FOXM1
but activated MYC throughout, and that's an interesting piece of information. We see
in the mutation data and other types of genomic perturbations that TGF-beta signaling pathway
genes are mutated, and those all impinge on this mis-regulation of MYC, and that also bears
out in the pathway analysis. And so one other type of analysis that we're doing with the
pathways is we can take two groups of samples or patients and look for markers of one subtype
versus another say, and then hone in on sub-networks that are markers for a particular cohort.
And we're working on this for the luminal-basal comparison in the breast cancer
model. And just to show you, this is the closest we get to the dreaded hairball, but you can
-- you can see that, you know, blue is more expressed or more active in luminal.
And you can see the expected sort of ER signaling pathways, and then you have some other intriguing
pathways among the proliferative ones for basal, shown in red, like HIF1-alpha, for example.
So, the way we can use that hairball is to do something like a master regulators analysis
like Andrea Califano [spelled phonetically] likes to do with ARACNE. You can look upstream
in this example of a -- of a basal marker such as FOXM1, like I showed and sort of by
chain of reasoning, up the regulation hierarchy you see that there's a polo kinase. And so
the prediction there is that basal cells will be more sensitive to a polo kinase inhibitor.
And this actually pans out in a cell line model shown in Joe Gray's [spelled phonetically]
lab with his cell lines. So this plot here shows you sensitivity to a polo kinase inhibitor
for basal and claudin-lows contrasted against those in luminal cells.
And the reverse is true as well. You can look up a marker for luminal, like a luminal hub,
and in this case it was an HDAC. And so the prediction is that luminal cells would
be more sensitive to an HDAC inhibitor, and that's what turns out to happen in these cell line
models.
And you saw a nice example yesterday from Sam Ng [spelled phonetically]. Just to go
through that real quick because I wanted to show you one more result that Sam didn't have
time to show. So he's developed a clever method where you can run our pathway analysis twice.
One where you connect the gene to its downstream targets and infer an activity
for it, another where you connect it to its upstream regulators and infer an activity. And
just look at the difference to get what he calls the discrepancy in the activities that
are inferred. And he showed you an example, sort of a positive control for Rb. You can
see that in the mutated cases he's seeing a lower discrepancy, which corresponds to a loss
of function event. And he showed you the pathway surrounding these things.
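The two-run comparison can be sketched roughly as follows. The sign convention (downstream-based activity minus upstream-based activity, so that negative values suggest loss of function), the sample names, and all numbers are assumptions invented for the example; they are not results from the talk.

```python
from statistics import mean


def discrepancy(downstream_run, upstream_run):
    """Per-sample difference between the two inferred activities of one gene."""
    shared = downstream_run.keys() & upstream_run.keys()
    return {sample: downstream_run[sample] - upstream_run[sample]
            for sample in shared}


def mutant_vs_wildtype_shift(scores, mutated_samples):
    """Mean discrepancy in mutated samples minus mean in the remaining samples."""
    mutant = [d for s, d in scores.items() if s in mutated_samples]
    wildtype = [d for s, d in scores.items() if s not in mutated_samples]
    return mean(mutant) - mean(wildtype)


# Invented activities for one gene across four hypothetical patients.
downstream = {"p1": -0.9, "p2": -0.7, "p3": 0.1, "p4": 0.2}  # inferred from targets
upstream = {"p1": 0.3, "p2": 0.4, "p3": 0.2, "p4": 0.1}      # inferred from regulators

scores = discrepancy(downstream, upstream)
shift = mutant_vs_wildtype_shift(scores, mutated_samples={"p1", "p2"})
print(f"shift in mutated samples: {shift:.2f}")
```

Under this convention a clearly negative shift in the mutated samples is what a loss-of-function event would look like, and a positive shift would point toward gain of function.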
So we've tried this for a few positive controls and he showed you p53. And you can kind of
squint and see that for the cases in red around the circle plot, the tick marks are patients,
sorry, I didn't mention that, you can see a lower activity being inferred. And so I
asked Sam late last night actually, "Can you please run this for the lung squamous results?"
And as you saw before he was predicting for NFE2L2 this known oncogenic gene that he's
getting a positive discrepancy. And there are 30 mutations in CDKN2A and consistent
with, you know, other deletions, homozygous deletions in CDKN2A, he's predicting loss
of function. So that's interesting.
But now the power is, and these are sort of for the more frequent events, but you can
now start actually drilling into some of these lower-frequency events, and there are
some intriguing stories I think in there. And I wanted to just point out that some
of the highest-scoring discrepant genes now are not the most frequent, right? So you have
a -- you actually have a HIF, a hypoxia-inducible factor, up here in seven samples. Why would
that be? And among these up here are going to be possible new targets that you could
go after for your druggable genome, for example. So we even have a MAP kinase kinase up there
that might be worthwhile. And on the other end of the spectrum, there are some other
loss-of-function events that we might want to pay attention to.
So you might ask, "What do you do if you don't have good pathway models for genes? How can
you infer activity? Or do these mutations mean anything?" You can plot them against
clinical information. And so this is just sort of an overview of -- you can show some
phenotypic information against these pathway activities and infer a connection between
mutations or phenotypes. And just really quickly, since I'm almost out of time, we've done this
for -- piloted this in the colorectal study and you can cluster the mutations based on
these signatures and you can see that APC and p53 tend to have the same
correlations in the colorectal study, for example. And it confirms that APC mutations
are correlated with MYC activity, in this case anti-correlated with the repressed targets
of MYC. And on the other end of the spectrum you have TGF-beta pathway mutations, so those
cluster together. And in the middle you have RTK and PI3 kinase pathway mutations.
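One hedged way to picture this association analysis: give each mutated gene a signature (the difference in mean pathway activity between samples with and without the mutation, one value per pathway feature), then compare signatures by correlation so that mutations with similar signatures group together. The gene names, feature names, and activity values below are hypothetical placeholders, and the mean-difference signature is an assumption rather than the statistic used in the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length signatures."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mutation_signature(activities, mutated):
    """Mean activity in mutated minus non-mutated samples, per pathway feature."""
    signature = []
    for feature in sorted(activities):
        row = activities[feature]
        mut = [v for s, v in row.items() if s in mutated]
        rest = [v for s, v in row.items() if s not in mutated]
        signature.append(sum(mut) / len(mut) - sum(rest) / len(rest))
    return signature


# Hypothetical pathway activities: feature -> {sample: inferred activity}.
activities = {
    "feature_1": {"s1": 1.0, "s2": 0.8, "s3": 0.9, "s4": 0.7, "s5": -0.5, "s6": -0.6},
    "feature_2": {"s1": -0.4, "s2": -0.5, "s3": -0.3, "s4": -0.2, "s5": 0.6, "s6": 0.7},
    "feature_3": {"s1": 0.1, "s2": 0.0, "s3": 0.2, "s4": 0.1, "s5": 0.0, "s6": 0.1},
}
# Hypothetical mutation calls: gene -> samples carrying a mutation.
mutations = {
    "geneX": {"s1", "s2", "s3"},
    "geneY": {"s1", "s2", "s4"},
    "geneZ": {"s5", "s6"},
}

signatures = {g: mutation_signature(activities, m) for g, m in mutations.items()}
xy = pearson(signatures["geneX"], signatures["geneY"])
xz = pearson(signatures["geneX"], signatures["geneZ"])
print(f"geneX vs geneY: {xy:.2f}   geneX vs geneZ: {xz:.2f}")
```

In this toy data geneX and geneY have nearly identical signatures and would cluster together, while geneZ sits at the opposite end.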
So, the obvious idea here is if you have a mutation in gene X and it has a -- and it
looks like it's associated with the same activities in possibly different patients,
perhaps it's also acting in the same pathway based on this type of association analysis
that Ted's [spelled phonetically] doing. And so I'm basically out of time. I'm going to
skip to the end. Obviously we want to use these to look across multiple cancers. The
pathway activities give us a way to do that and we're working on pan cancer analysis,
a basal comparison to ovarian for the breast work and so on.
So I hope I showed you that we have a nice model for integrating a lot of different data
sets. We use knowledge about pathways. We're trying to expand that with predicted interactions
now. We can stratify patients with that, find predictive sub-networks and so on and use
it to predict hopefully more of these rarer mutations. And the beliefs allow -- the inferences
allow us to connect cancers across different data sets. And hopefully, the last slide that
I just skipped there, it was just trying to make a point that we can connect subtypes
together, maybe get a clue about therapies. So, I wanted to just say a special thank you
to the Broad team here. They've got PARADIGM working in Firehose [spelled phonetically]
and this is not a trivial feat. And a lot of these big network methods, by the way, take
a lot of CPU time to run, so this is really nice that it's going to put the results in
the hands of the public, actually. And so you don't have -- you don't have to go off and implement
these yourself.
And this is my group that worked on the integration analysis. I've highlighted the work of the
folks circled there, especially Sam Ng who you saw speak earlier. And this is work in
collaboration with David Haussler who actually heads the whole team and Chris Benz and Jane
Ju [spelled phonetically] ran a tutorial yesterday and she runs the engineering staff. So thank
you and I'll take any questions. Sorry I went a couple minutes over.
[applause]
Male Speaker: Time for one quick question for Josh.
Josh Stewart: Crystal clear.
Male Speaker: No, okay, well I'm sure he'd be happy to take
it up over coffee if something emerges. So thank you, Josh.