TCGA - Analysis of Somatic Mutations Across Many Tumor Types - Petar Stojanov

Richard Gibbs: Our next speaker is going to talk about somatic mutation analysis across many different diseases, if he's here. It's Petar Stojanov. I hope I did okay with your name.
Petar Stojanov: Okay, I'd like to thank the TCGA and the organization for giving me the pleasure of presenting the results that we've been collecting with Mike Lawrence on the pan-cancer data that's being analyzed by the TCGA right now. So many of you have probably seen this slide, and it just goes through the basic notion of how genome and exome data are analyzed in many of these sequencing projects. So the tumor and the matched normal are extracted from the patient, and they're compared against each other to characterize the most important somatic genetic alterations that we tend to look at nowadays, which are single nucleotide variants, indels, copy number alterations, translocations, et cetera. And then we combine these into cohorts of patients, either in the same tumor type or, as you will see, across tumor types, to perform statistical analysis to see which events are significantly recurrent in the population, and then see what these genes are, what the pathways are, and what selective advantage they provide in tumorigenesis. So this is a bit of an overview of the data that has been collected to date, and the TCGA has contributed a large portion of it. And as you can see, we cover about 20 to 25 tumor types with about 500 tumor-normal pairs for each. And you can see that we have many types of data for each. We have whole-exome sequences, we have whole-genome sequences, RNA-seq, SNP arrays, and methylation data, and as you can see, the ICGC is also contributing a huge amount, up to 50 different tumor types with 500 tumor-normal pairs for each.
So this, in a nutshell, says that there is a huge amount of data, and this is a flood that we have to handle. We have to take advantage of the power that it gives us, and we also have to deal with the complexity that it represents when we add so much data together, and with the way this data is processed. And this is something that Mike Noble will go over in his talk. So, basically, that means going from sequencing data to a BAM file, which is the data aligned against the reference genome, and then going through quality control, and then going through the characterization pipelines. They're all in Firehose at the Broad, and they detect all of the genetic alterations that we try to look for, which are mutations, indels, the purity and ploidy of samples, copy number, rearrangements, and pathogens in tumor data. So this is the overview of the pan-cancer dataset. We have eight different tumor types: breast, colon, glioblastoma, kidney, lung squamous, ovarian, rectal, and endometrial, for a total of 2,143 patients, and that amounts to 436,755 coding mutations. So that's an extremely large amount of data, and we can see here the spread of the different mutation frequencies across tumor types. So we can see that they vary, and that lung has a significantly higher mutation frequency than the rest of the tumor types. And we can see that within tumor types we have variable mutation frequency. Also, we can see the different mutation categories on the bottom panel, and the C to A changes are the ones that are prevalent in lung squamous, and those are the typical signature of smoking. And also we can see that C to T transitions are prevalent across all tumor types, and those are largely the CpG context mutations, which contribute heavily to the background mutation rate.
So, in order to deal with the approximately 400,000 mutations and figure out which genes are recurrently mutated, we have been working on the MutSig pipeline for several years now, and at this point in time, this algorithm takes into account multiple factors. So it calculates sample-specific, gene-specific, and context-specific background mutation rates. So for each gene, we try to estimate the background model just based on the number of mutations that are there. And then we look at the base-level evolutionary conservation of the events, and we look at the positional configuration along the cDNA to see where the mutations are located and whether there are particular hot spots that they cluster in. And we also have a separate metric for truncating mutations. So this is an overview below of all of the different tumor types that we analyzed, and we see that the number of significant genes varies from a couple of dozen to only a few significant genes. And we see that in pan-cancer, we actually detect a lot more. And we'll go into the details one by one, so I'll present all of these published studies in the chronological order in which they were published. We can go over some of the genes that we find, and then we'll see how that is represented once we combine all these datasets together. So this is glioblastoma, and this was published a long time ago already, in 2008, and we can see that here we find most of the genes that were published at the time that are characteristic of GBM, like EGFR mutations, PTEN mutations, RB1 mutations, PIK3R1 mutations, and then soon after this paper was published, IDH1 was also published. And we see that the mutations here are clustered in two hot spots at two sites, and there are 15 of them. So I'm going to go over what the columns of this table mean.
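[Editorial sketch] The background-model idea described above can be illustrated with a simple one-sided binomial test: given a background mutation rate per base, how surprising is the number of mutations observed in a gene across the cohort? This is a minimal sketch, not the MutSig implementation. It assumes a single flat rate, whereas the talk describes sample-, gene-, and context-specific rates, and the function name and all numbers are hypothetical.

```python
from math import exp, lgamma, log


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are computed in log space so that large n (bases x patients)
    does not overflow the binomial coefficients.
    """
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + i * log(p) + (n - i) * log(1 - p))
        term = exp(log_term)
        total += term
        # Past the mean the terms only shrink; stop once they are negligible.
        if i > n * p and term < total * 1e-17:
            break
    return min(total, 1.0)


# Hypothetical gene: 1,500 coding bases covered in 300 patients,
# an assumed flat background rate of 2 mutations per million bases,
# and 15 observed mutations.
bases_at_risk = 1500 * 300
background_rate = 2e-6
p_value = binomial_tail(15, bases_at_risk, background_rate)
print(f"expected mutations: {bases_at_risk * background_rate:.2f}")
print(f"background-model P value: {p_value:.3g}")
```

With fewer than one mutation expected, 15 observed mutations give a very small P value; a gene with one or two mutations would not stand out under the same rate.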
So, basically, in this table we have a list of genes, we have the number of mutations, number of patients, number of sites, and then we have the different P values from the algorithm. So we have the background model P value, we have the clustering P value, we have the P value for conservation, and then we have a P value after we combine all of these different metrics together. So this is for glioblastoma, and we see that IDH1 is up on the list, and we have GABA1, and integrin alpha might also be interesting because it's clustered and also well-conserved. And for ovarian we see the basic highlight that most patients have P53 mutations, and also by performing clustering and conservation analysis we managed to pull up SRC. It has low recurrence, with only four mutations in four patients, but they're in two sites that are well-clustered and well-conserved. So for colon we have, as you've seen in the published data, the two major pathways: the Wnt pathway and the TGF beta receptor pathway. We have FBXW7, we have APC, we have FAM123B. And then for TGF beta receptor we have SMAD2, we have TGFBR2, we have SMAD4. Then we also have some PIK3CA mutations, and we have BRAF V600E mutations that also cluster and are very well-conserved, as you can see here. So we managed to get most of the genes that were published, in order, at the top of the list. The same two pathways are implicated in rectal tumors as well, as you can see here. For lung squamous we also managed to get a lot of the genes that were published in the manuscript, and among the most important ones are NFE2L2 and KEAP1, which are binding partners. Then we have NOTCH1 loss-of-function mutations, similar to the head and neck paper that also came out with a similar type of mutations. We have RB1, we have MLL2.
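[Editorial sketch] The table described above reports separate P values (background model, clustering, conservation) and one combined P value. The talk does not say how the combination is done, so the following sketch uses Fisher's method purely as an assumed stand-in. It treats the input P values as independent, which may not hold for clustering and conservation. The function name and example values are hypothetical.

```python
from math import exp, log


def fisher_combine(p_values):
    """Combine independent P values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the survival function has the closed form
    exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!.
    """
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("P values must lie in (0, 1]")
    half_stat = -sum(log(p) for p in p_values)  # this is X / 2
    term = 1.0
    total = 1.0  # j = 0 term of the series
    for j in range(1, len(p_values)):
        term *= half_stat / j
        total += term
    return min(1.0, exp(-half_stat) * total)


# Hypothetical gene with modest recurrence but strong clustering and
# conservation signals.
combined = fisher_combine([0.20, 0.003, 0.01])
print(f"combined P value: {combined:.3g}")
```

This shows how a gene with an unremarkable background-model P value can still reach a small combined P value when its mutations are well-clustered and well-conserved, which is the behavior the speaker describes for genes such as SRC.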
So, as you can see, we're getting the same mutations that were published, and these are important pathways and they're important to their respective tumor types. For breast, even though in this case we use different algorithms, we get largely the same genes that were published, so we get the genes from each subtype of breast cancer: Luminal A, Luminal B, basal, et cetera. So we get GATA3, RUNX1. We get AKT1, we get MLL3, we get PTEN. And also, very interestingly, we get SF3B1, which was published in a chronic lymphocytic leukemia paper in the New England Journal, and it was also implicated in MDS. And it's a splicing factor, and it's still being actively researched to see what target it is mis-splicing. So for kidney cancer we have the important genes that were just talked about in the previous talk: PBRM1, VHL, and BAP1. And here we have a much lower recurrence of P53, and also there is an interesting gene here, DNAH9, which is clustered and conserved with plenty of mutated sites. So with endometrial, the most interesting genes here, apart from the ones that are in the known pathways, are NFE2L2, which is implicated in lung cancer, and SPOP, which is important in prostate cancer. And NFE2L2 is also clustered and conserved the same way it is in lung squamous, but we don't see its binding partner here, so it would be interesting to investigate exactly how this pathway functions in endometrial tumors.
So now, after putting all of these datasets together, we have a lot more power to detect genes that are not significant in any of their respective tumor types because there is not enough power to detect them there. Once we combine everything together, we'll see that apart from the genes that are significant and are hallmark genes for each separate tumor type, as we go down the list, we see genes that we can only detect by combining tumor types. So here on the top of the list, this is 150 genes, so there are five parts to this table, and the top part shows the genes that I just went over. They're all from the different tumor types that we described and they're all hallmarks of each respective tumor type, up until we start getting to this part of the list where we have MTM1 and HCN1, which are new, and then we have NFE2L2, and then we go forward to see another member of the hyperpolarization-activated cyclic nucleotide receptor family, and then we have beta-2-microglobulin. So we have a lot of genes that we wouldn't be able to find if we were doing all of these tumor types separately. Also, ATM is a famous tumor suppressor gene. It was found to be significant in CLL in the New England Journal paper where SF3B1 was found as well, but we didn't find it as significant in any of these tumor types that we analyzed. And also, ERBB2 is significant, and other genes as well on this list. And here we have different transcription factors like E2F1, which is well-clustered and well-conserved, and we also have TCF7, and STK3, which is a serine/threonine kinase that is not very recurrent in the set, but it's really well-clustered and well-conserved. So here are the genes represented as a percent of the maximum percent mutated for their respective tumor types.
So we can see in this table that the genes are ordered so that we see most of the genes that we found significant by analyzing the tumor types separately. So we have TP53, APC, PTEN, KRAS, PIK3CA, so we have all of the genes that we talk about in all of the papers that we've published so far, and this summarizes the recurrence of the results when we analyze all the tumor types separately. And so when we combine the datasets together, we get a table that's sorted by percent recurrence in the overall dataset, the pan-cancer dataset. And so here we see the same genes up on top that we get in all the papers, but now we also get genes that we think are significant in lung tumors but are also mutated in every other tumor type, and that's DNAH9, we have FAT4, we have MLL2, and then we have certain genes where we have to see if they're real or not, like EYS and so on. NFE2L2 is only in lung but... So, to summarize these results, as I said, we have a different number of significant genes per tumor type and we have a lot of significant genes in the pan-cancer dataset. And we have a lot of new genes that come up, and I only talked about a few of them, but you see that there's a big list. And here are some of them that I just mentioned that are family members that might be important. And then we have beta-2-microglobulin, which is in an immune pathway, an antigen presentation pathway. And then we have MTM1, which is a muscle cell differentiation molecule, and we know that differentiation is an important mechanism in cancer, but there are still genes that are part of it that we haven't found yet. And so to conclude, I think that when combining tumor types together, there are two things that we have to keep in mind. Combining tumor types gives us significantly more power to detect putative driver genes that we're underpowered to detect in these tumor types separately.
And on the flip side, it also dilutes the power to detect driver genes that are potentially important in their respective tumor types. So genes that are found at the bottom of a significance table, that are barely recurrent enough to be noticed by the analysis team in the respective tumor type, will not make it into the final list once we combine all the datasets together. But here are some future steps that we need to consider when we combine datasets, because this is a pretty complex problem. So there are a couple of things we can do. We can incorporate other information about potential functional role apart from conservation. So there is PolyPhen-2, there is MutationAssessor, and there is CHASM, and Rachel Karchin will be talking about those things in the next talk. And then we can perform the significance analysis on curated gene sets, which we have done before for different tumor types. And then we can extend this analysis to look at correlation and mutual exclusivity with MEMo within and across tumor types, and we can take into account the variable background mutation frequency across the genome. And by taking into account the variable mutation rate across the genome, we can also look at pathways by performing significance analysis on gene subnetworks, working with HotNet and PARADIGM as well. So it's important to collaborate with these groups. And also, the other thing that's really important, as the previous speaker mentioned, is that integrated analysis has not been done yet, especially on these huge datasets where we get genes that are new and that are not significant in their respective tumor types. So with that, I would like to conclude, and thank, first of all, Gaddy Getz, who's been spearheading this pan-cancer effort for the Broad, and then Matthew Meyerson, Stacey Gabriel, Levi, Eric, Lynda, and Todd, who are, you know, the leaders of the Broad who help a lot with this analysis and with bringing about these ideas.
And also I'd like to thank our analysis team and our collaborators. Thank you. [applause]
Richard Gibbs: There's time for one or two questions. Lou.
Lou Staudt: So, two related questions. How often do these look like gain-of-function versus loss-of-function, these rare ones that you're pulling out? And second, sort of a corollary, do you see particular point mutations at particular amino acids showing up in multiple cancers very infrequently, or are these more often loss-of-function in many different ways?
Petar Stojanov: So for certain genes that we found in the last table that I showed, we haven't investigated if they're really loss-of-function, but we have both cases. For DNAH9, for example, we have hot spots, and for beta-2-microglobulin, we have to see if they're loss-of-function or not. But we haven't really looked at these genes closely. They're just, you know, fresh out of the computer and we have to go through them.
Richard Gibbs: Just two more questions, and then we'll have to move on.
Female Speaker: Hi, I'm Angela from Harvard. Since you've done the pan-cancer analysis now, can you comment on which pathway had the most mutated genes from all the different tumors you've analyzed? And my second question is, are you going to look at the promoter regions in your whole genomes to look for significantly altered regions?
Petar Stojanov: So for the first question, from what I've been noticing, the most implicated pathways seem to be, you know, the TGF beta receptor and the Wnt signaling pathways, whose mutations are on top of the list, but we also have to look and see if we can place the other genes that we just discussed, the rarer ones, in the different pathways. And for the second question, I'm not exactly sure how many whole genomes we have to analyze this, but we can use the flanking regions and the coverage in the flanking regions to see if there are any promoter mutations.
Richard Gibbs: Thanks, one more.
Female Speaker: Yeah. Basically, so I'd like to know if you have the percentage information or most today for this mutation. Are they like site-specific mutations, or are they more sort of related to [unintelligible], like related to that? Also, I'd like to know what kind of software you are using to identify this kind of mutations. Thank you.
Petar Stojanov: Excuse me, can you repeat what type of mutations?
Female Speaker: The first one? Yeah, my first question is asking if you have any information on whether these gene mutations are mostly like site-specific mutations or are they like a --
Petar Stojanov: Oh, you mean the clustering?
Female Speaker: Yes. [unintelligible]
Petar Stojanov: So the first question is about the clustered mutations, right?
Female Speaker: Yeah, related to the mutations --
Petar Stojanov: So we have an add-on to the MutSig algorithm that jointly looks at the conservation and the clustering, and then we combine that metric with the P value that we get for the different covariates that build the background model. And that way we can pull up genes that are not as recurrent in the dataset but are well-clustered and have some sort of conserved hot spot and might be important in a pathway. And that's how we can get genes like SF3B1, which is a splicing factor with a canonical site, and we can get other mutations, like in a lot of different kinases, this way.
Richard Gibbs: Thanks, Petar. We've got to move on. [applause]
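[Editorial sketch] The clustering idea discussed in the last answer can be illustrated with a small permutation test: compare the size of the largest hot spot observed in a gene with what uniformly scattered mutations would produce. This is only an assumed illustration, not the MutSig add-on, which the speaker says also uses conservation jointly. The statistic, function names, positions, and gene length are all hypothetical.

```python
import random
from collections import Counter


def hotspot_stat(positions):
    """Largest number of mutations falling on a single position."""
    return max(Counter(positions).values())


def clustering_p_value(positions, coding_length, n_perm=2000, seed=0):
    """Permutation P value for the observed hot-spot size.

    Each permutation scatters the same number of mutations uniformly
    along a coding sequence of the given length.
    """
    rng = random.Random(seed)
    observed = hotspot_stat(positions)
    n_mutations = len(positions)
    hits = 0
    for _ in range(n_perm):
        simulated = [rng.randrange(coding_length) for _ in range(n_mutations)]
        if hotspot_stat(simulated) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical gene of 1,200 coding bases with 15 mutations,
# 12 of them on one site and 3 on another.
clustered = [394] * 12 + [515] * 3
scattered = list(range(0, 1200, 80))  # 15 mutations, all on distinct sites
print(clustering_p_value(clustered, 1200))
print(clustering_p_value(scattered, 1200))
```

A gene with few mutations can still score well under this kind of test if they pile up on one or two sites, which matches the speaker's point about genes that are not very recurrent but are well-clustered.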