Richard Gibbs: Our next speaker is going to talk about somatic
mutation analysis across many different diseases. If he's here. It's Petar Stojanov. I hope
I did okay with your name.
Petar Stojanov: Okay, I'd like to thank the TCGA and the organization
for giving me the pleasure of presenting the results that we've been collecting with Mike
Lawrence on the pan-cancer data that's being analyzed by the TCGA right now.
So many of you have probably seen this slide, and it just goes through the basic notion
of how genome and exome data is analyzed in many of these sequencing projects. So the
tumor and the matched normal are extracted from the patient, and they're compared against each
other to characterize the most important somatic genetic alterations
that we tend to look at nowadays, which is single nucleotide variants, indels, copy number
alterations, translocations et cetera. And then we combine these, we get cohorts of patients
either in the same tumor type, or now as you will see across tumor types, to look at -- to
perform statistical analysis to see which genes are -- which events are most recurrent,
significantly recurrent in the population, and then see what are these genes, what are
the pathways, and what is the selection that they provide to tumorigenesis.
So this is a little bit of an overview of the data that has been collected to date, and
the TCGA has contributed a large portion of it. And as you can see, we cover about
20 to 25 tumor types with about 500 tumor-normal pairs for each. And you can see that we have
many types of data for each. We have whole-exome sequences, we have whole-genome sequences,
RNA-seq, SNP arrays, and methylation data, and as you can see, the ICGC is also contributing
a huge amount, up to 50 different tumor types with 500 tumor-normal pairs for each.
So this, in a nutshell, says that there is a huge amount of data, and this is a flood
that we have to handle and we have to take advantage of the power that it gives us, and
also we have to deal with the complexity that it actually represents when we add so much
data together and the way this data is processed. And this is something that Mike Noble will
go over in his talk.
So, basically, going from sequencing data to a BAM file, which is the aligned data against
the reference genome, and then going through quality control, and then going through the
characterization pipelines, and they're all in Firehose at the Broad, and they detect
all of the genetic alterations that we try to look for, which is mutations, indels, characterizing
purity and ploidy of samples, copy number, rearrangements, and pathogens in tumor data.
So this is the overview of the pan-cancer dataset. So we have eight different tumor
types. So it's: breast, colon, glioblastoma, kidney, lung squamous, ovarian, ***, and
endometrial, to a total of 2,143 patients, and also that amounts to 436,755 mutations,
coding mutations. So that's an extremely large amount of data, and it's -- we can see here
the spread of the different mutation frequencies across tumor types. So we can see that they
can vary, and that lung has a significantly higher mutation frequency than the rest of
the tumor types. And we can see that within tumor types we have variable mutation frequency.
Also, we can see the different mutation categories on the bottom panel, and the C to A changes
are the ones that are prevalent in lung squamous, and those are the typical signature for smoking.
And also we can see that C to T transitions are prevalent across all tumor types, and
those are largely the CpG context mutations which are contributing to the high amount
of the background mutation rate.
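As a rough illustration of the mutation categories on that panel, here is a minimal Python
sketch that labels a single nucleotide variant by its base change and flags CpG context. The
function name classify_snv and its arguments are hypothetical and are not taken from the
pipeline described in the talk; it assumes the common convention of reporting every change
from the pyrimidine strand, so a G to A change is counted as C to T.

```python
# Hypothetical helper, not part of any pipeline named in the talk.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def classify_snv(prev_base, ref, alt, next_base):
    """Return (change, is_cpg) for one SNV given its two flanking bases.

    Changes are reported relative to a pyrimidine (C or T) reference base,
    so a purine reference is reverse-complemented first.
    """
    if ref == alt:
        raise ValueError("ref and alt must differ")
    if ref in ("A", "G"):
        # Reverse complement the trinucleotide: the flanks swap sides.
        prev_base, ref, alt, next_base = (
            COMPLEMENT[next_base],
            COMPLEMENT[ref],
            COMPLEMENT[alt],
            COMPLEMENT[prev_base],
        )
    change = f"{ref}>{alt}"
    # CpG context: a reference C immediately followed by a G.
    is_cpg = ref == "C" and next_base == "G"
    return change, is_cpg
```

Counting the returned labels per tumor type would give a spectrum like the one on the slide,
with C>A standing out in lung squamous and CpG C>T present everywhere.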
So, in order to deal with the 400,000 -- approximately 400,000 mutations and figure out what are
the recurring genes, we have been working on the pipeline MutSig for several years
now, and at this point in time, this algorithm takes into account multiple factors. So it
calculates sample-specific, gene-specific, and context-specific background mutation rates.
So this is for each gene, we try to estimate the background model just based on the number
of mutations that are there. And so then we look at the base-level evolutionary conservation
of the events, we look at the positional configuration along the cDNA to see where are the mutations
located and are there particular hot spots that they cluster in. And we also have a separate
metric for truncating mutations.
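To make the idea of testing a gene against a background model concrete, here is a minimal
Python sketch of a one-sided binomial test. It is not the MutSig implementation: it assumes
a single flat background rate per base, whereas the talk describes sample-specific,
gene-specific, and context-specific rates. The names binom_tail and gene_p_value and all
parameters are hypothetical.

```python
import math


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    if k <= 0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        term = math.exp(log_term)
        total += term
        # Past the mean the terms only shrink, so stop once they are negligible.
        if i > n * p and term <= total * 1e-17:
            break
    return min(total, 1.0)


def gene_p_value(n_mutations, covered_bases, n_patients, bg_rate):
    """Chance of seeing at least n_mutations in a gene under the background.

    Assumption: every covered base in every patient mutates independently
    at the same rate bg_rate (mutations per base).
    """
    return binom_tail(n_mutations, covered_bases * n_patients, bg_rate)
```

For example, gene_p_value(5, 1500, 100, 1e-6) asks how surprising five mutations are in a
1,500-base gene across 100 patients when only about 0.15 would be expected by chance.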
So this is an overview below of all of the different tumor types that we analyzed, and
we see that the number of significant genes varies from a couple of dozen to only a few
significant genes. And we see that in pan-cancer, we actually detect a lot more. And we'll go
into the details one by one, so I'll present -- I'll present all of these published studies
one by one in the chronological order in which they were published. We can go over some of
the genes that we find, and then we'll see how that actually is represented once we combine
all these datasets together.
So this is glioblastoma, and this was published a long time ago already, 2008, and it was
-- and we can see that here we find most of the genes that were published at the time
that are characteristic for GBM, like EGFR mutations, PTEN mutations, RB1 mutations,
PIK3R1 mutations, and then soon after this paper was published, also IDH1 was published.
And we see that the mutations here are clustered in two hot spots in two sites, and there's
15 of them, so I'm going to go over what the columns mean of this table.
So, basically, in this table we have a list of genes, we have a number of mutations, number
of patients, number of sites, and then we have the different P values from the algorithm.
So we have the background model P value, we have the clustering P value, we
have the P value for conservation, and then we have a P value after we combine all of
these different metrics together. So this is for glioblastoma, and then we see that IDH1
is up on the list, and we have GABA1 and integrin alpha might also be interesting because it's
clustered and also well-conserved.
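The talk does not say how the separate P values are combined into one. As one common
possibility, here is a minimal Python sketch of Fisher's method, which assumes the input
P values are independent; that assumption and the function name fisher_combine are
illustrative only and are not taken from MutSig.

```python
import math


def fisher_combine(p_values):
    """Combine independent P values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null, where k = len(p_values). For even
    degrees of freedom the upper tail has the closed form
    exp(-X/2) * sum_{i<k} (X/2)**i / i!, so no external library is needed.
    All inputs must lie in (0, 1].
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

With this rule a gene with a modest background P value can still rise in the table if its
clustering and conservation P values are also small, which matches the behaviour described
for low-recurrence genes.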
And so also for ovarian we see the basic highlight that most patients have P53 mutations, and
also by performing clustering conservation analysis we managed to pull up SRC, which
is, albeit on low recurrence, there is only four mutations in four patients, but they're
in two sites that are well-clustered and well-conserved.
So for colon we have, as you've seen in the published data, we have the two major pathways:
the Wnt pathway and the TGF beta receptor pathway. We have FBXW7, we have APC, we have
FAM123B. And then for TGF beta receptor we have SMAD2, we have TGFBR2, we have SMAD4.
Then we also have some PIK3CA mutations, we have BRAF V600E mutations that also cluster
and they're very well-conserved as you can see here.
So we managed to get most of the genes that were published in order on the top of the
list. There's a similar -- so the same two pathways are implicated in *** tumor as
well as you can see here.
For lung squamous we also managed to get a lot of the genes that were published in the
manuscript, and some of the most important ones are NFE2L2 and
KEAP1, which are binding partners. Then we have NOTCH1 loss-of-function
mutations that are similar to the head and neck paper that also came out with similar
type of mutations. We have RB1, we have MLL2. So, as you can see, we're getting the same
mutations that were published, and these are important pathways and they're important to
their respective tumor types.
For breast, we also get all the genes that were -- and even though we use -- in this
case we use different algorithms, we get largely the same genes that were published, so we
get the genes from every different -- each sub-type of breast cancer. Luminal A, Luminal
B, basal, et cetera, so we get GATA3, RUNX1. We get AKT1, we get MLL3, we get PTEN. And
also, very interestingly, we get SF3B1, which was published in a chronic
lymphocytic leukemia paper in New England, and also -- it was also implicated in MDS.
And it's a splicing factor that -- it's still being largely researched to see what the target
that it's mis-splicing is.
So for kidney cancer we have the two important genes that we just talked about in the previous
talk: PBRM1, VHL, and BAP1. And so here we have a much lower recurrence of P53, and also
there is an interesting gene here, DNAH9, which is clustered and conserved with plenty of mutated
sites.
So with endometrial, the most interesting genes here, apart from the ones that are in
the known pathways that we know about, are NFE2L2, which is implicated in lung cancer,
and also we have SPOP, which is important in prostate cancer. And NFE2L2 is also clustered
and conserved the same way it is in lung squamous, but we don't see its binding partner here,
so it would be interesting to investigate exactly how this pathway functions in endometrial
tumors.
So now, after putting all of these datasets together, we have a lot more power to detect
genes that are not significant in any of their respective tumor types because there is not
enough power to detect them in them, but now once we combine everything together, we'll
see that apart from the genes that we get that are significant and they're hallmark
genes for each separate tumor type, as we go down the list, we see genes that we can
only detect by combining tumor types.
So here on the top of the list, this is 150 genes, so there is five parts of this table,
and the top part shows the older genes that I just went over, and they're all from the
different tumor types that we described and they're all hallmarks of each respective tumor
type, up until we start getting to this part of the list where we have MTM1 and HCN1, which
are new, and then we have NFE2L2 and then we go forward to see another family member
of the hyperpolarization-activated cyclic nucleotide-gated channels, and then we have beta-2-microglobulin.
So we have a lot of genes that we wouldn't be able to find if we were doing all of these
tumor types separately.
Also, ATM is a famous -- it's a famous tumor suppressor gene. It was found to be significant
in CLL in the New England Journal paper where SF3B1 was found as
well, but we didn't find it as significant in any of these tumor types that we analyzed.
And also, ERBB2 is significant, and other genes as well on this list.
And here we have different transcription factors like E2F1, which is well-clustered
and well-conserved, and we have also TCF7 and STK3, which is a serine threonine kinase,
which is not very recurrent in the set, but it's really well-clustered and well-conserved.
So here are the genes represented as a percent of the maximum number of percent mutated for
their respective tumor types. So we can see in this table that the genes order in the
way that they were -- so we see most genes that we found significant by analyzing the
tumor types separately. So we have TP53, APC, PTEN, KRAS, PIK3CA, so we have all of the
genes that we talk about in all of the papers that we've published so far, and this represents
-- summarizes the recurrence of the results when we analyze all the tumor types separately.
And so when we combine the datasets together, we get a table where there's sort of a percent
recurrence in the overall dataset, the pan-cancer dataset. And so here we see the same genes
up on top that we get in all the papers but now we get genes that we think are significant
in the lung tumor type, but they're also mutated in every other tumor type, and that's DNAH9,
we have FAT4, we have MLL2, and then we have certain genes that we have to see if they're
real or not, like EYS and so on. NFE2L2 is only in lung but...
So, to summarize these results, as I said, we have different number of significant genes
per tumor type and we have a lot of significant genes on the pan-cancer dataset. And we have
a lot of new genes that come up, and I only talked about a few of them, but you see that
there's a big list. And here are some of them that I just mentioned that are family members
that might be important. And then we have beta-2-microglobulin which is an immune pathway,
an antigen marking pathway. And then we have MTM1, which is a muscle cell differentiation
molecule, and we know that the differentiation is an important mechanism in cancer, but there
are still genes that are part of it that we haven't found yet.
And so to conclude, I think that when combining tumor types together, there's two things that
we have to keep in mind. Combining tumor types gives us significantly more power to detect
putative driver genes that we're underpowered to detect in these tumor types separately.
And on the flip side, it also dilutes the power to detect driver genes that are potentially
important in their respective tumor types. So genes that are found on the bottom of this
significance table that are barely recurrent enough to actually be noticed by the analysis
team in the respective tumor type will not make it in the final list once we combine
all the datasets together.
But here's some future steps that we need to consider when we combine datasets, because
this is a pretty complex problem. So there's a couple of things we can do. We can incorporate
other information for potential functional role apart from conservation. So there is
PolyPhen-2, there is MutationAssessor, and there is CHASM, and Rachel Karchin will be
talking about those things in the next talk. And then we can perform the significance analysis
on curated gene sets, which we have done before for different types. And then we can extend
this analysis to look at correlation and mutual exclusivity with MEMo within and across tumor
types, and we can take into account the variable background mutation frequency across the genome.
And by taking the variable mutation rate across the gene, we can also look at pathways by
performing significance analysis and gene subnetworks by working with HotNet and PARADIGM
as well.
So it's important to collaborate with these groups together. And also, the other thing
that's really important, as the previous speaker mentioned, is that integrated analysis has
not been done yet, especially on these huge datasets where we get genes that are new and
that are not significant in their respective tumor types.
So with that, I would like to conclude, and thank, first of all, Gaddy Getz, who's been
spearheading this pan-cancer for the Broad, and then Matthew Meyerson, Stacey Gabriel,
Levi, Eric, Lynda, and Todd, who are, you know, the leaders of the Broad who help a
lot with this analysis and bringing about these ideas. And also I'd like to thank our
analysis team and our collaborators. Thank you.
[applause]
Richard Gibbs: There's time for one or two questions. Lou.
Lou Staudt: So how often -- two related questions. How
often do these look like gain-of-function versus loss-of-function, these rare ones that
you're pulling out? Do you see second -- sort of corollary -- do you see particular point
mutations at particular amino acids showing up in multiple cancers very infrequently,
or these are more often very loss-of-function in many different ways?
Petar Stojanov: So for certain genes that we found in the
last table that I showed, we haven't investigated if they're really loss-of-function, but we
have both cases. For DNAH9, for example, we have hot spots, and for beta-2-microglobulin,
we have to see if they're loss-of-function or not. But we haven't really looked at these
genes closely; they're just, you know, just fresh out of the computer and we have
to go through them.
Richard Gibbs: Just two more questions, and then we'll have
to move on.
Female Speaker: Hi, I'm Angela from Harvard. I have a question
on -- since you've done the pan-cancer analysis now, can you comment on which pathway had
the most mutated genes from all the different tumors you've analyzed? And my second question
is, are you going to look at the promoter regions in your whole genomes to look for
significantly altered regions?
Petar Stojanov: So for the first question, I think from what
I've been noticing, and the most implicated pathway seems to be, you know, the TGF beta
receptor and the Wnt signaling pathway mutations that are on top of the list, but we also have
to look and see if we can place the other genes that we just discussed that are more
rare in the different pathways. And for the second question, I'm not exactly sure how
many whole genomes we have to analyze this, but we can use the flanking regions and the
coverage in the flanking regions to see if there's any promoter mutations.
Richard Gibbs: Thanks, one more.
Female Speaker: Yeah. Basically, so I'd like to know that
if you have the percentage information or most today for this mutation. Are they like
a site-specific mutations, or are they more sort of related to [unintelligible], like
related to that? Also, I'd like to know what kind of software are you using to identify
this kind of mutations. Thank you.
Petar Stojanov: Excuse me, can you repeat what type of mutations?
Female Speaker: The first one? Yeah, my first question is
asking if you have any information related to this gene mutations are they mostly like
a site-specific mutations or are they like a --
Petar Stojanov: Oh, you mean the clustering?
Female Speaker: Yes. [unintelligible]
Petar Stojanov: So the first question is about the cluster
mutations right?
Female Speaker: Yeah, related to the mutations --
Petar Stojanov: So we have an add-on to the MutSig algorithm
that looks at -- jointly looks at the conservation and the clustering, and then we combine that
metric with the P value that we get for the different covariates that build the background
model. And so that we can pull up genes that are not as recurrent in the dataset but are
well-clustered and have some sort of conserved hot spot and might be important in a pathway.
And that's how we can get genes like SF3B1 with a canonical site, which is a splicing
factor, and then we can get other mutations like a lot of different kinases this way.
Richard Gibbs: Thanks, Petar. We've got to move on.
[applause]