TCGA - High-Grade Serous Ovarian Adenocarcinoma Transcriptome Sequencing - Andrew Mungall

08 TCGA 112712 Andrew Mungall
Andrew Mungall: Thank you, Richard, and thank the local organizers for inviting me to talk at this prestigious event. So, I'm actually going to tell you about a -- essentially, it's a cancer that's already published as far as the marker paper is concerned, but I will be talking about two new datasets that we've added quite recently, those of the transcriptome sequencing, and that's both the messenger RNA sequencing and also microRNA sequencing. Previously, these things had been studied at the microarray level.
So just a little background about the high-grade serous ovarian cancer cohort that the TCGA consortium has collected. Most deaths from ovarian cancer are the result of this advanced-stage, high-grade serous ovarian cancer, about 70 percent of all ovarian cancer patients. And the TCGA group had published last year this marker paper, in which a cohort of 489 tumors was studied primarily at the expression level for messenger RNA and microRNA, DNA copy number evaluations, as well as DNA methylation. And additionally, you've heard already from the Broad Institute, these 316 cases of tumor and normal exome sequencing to complement the dataset.
Fundamental messages coming from that marker paper included that the disease is defined and characterized by a very simple mutational spectrum, in which TP53 mutation is predominant, in almost 96 percent. So, almost all patients have these TP53 mutations, and also a characteristically high frequency of somatic copy number alterations, both focal gains and focal losses. That was in stark contrast to the previous glioblastoma multiforme study, in which there was very low copy number alteration.
The aims of this study that I'll tell you about in the next 10 minutes are essentially the transcriptome sequencing, and to use this transcriptome sequencing, primarily the RNA-seq, to define subtypes, importantly structural variants that could not be established well with microarray-based technologies, and alternatively spliced transcripts, to name but a few.
So the dataset is described in this slide. We received 490 tumor total RNA samples from the Biospecimen Core Resource Repository. These samples had been collected from 15 different tissue source sites across the world, and we were able to generate RNA-seq libraries for sequencing of 420 of those, all of which have been submitted to the Cancer Genome Hub and the Data Coordinating Center. Of these 420, 300 were what we deemed high-quality expression datasets that passed very stringent quality control metrics. Those expression profiles have likewise been submitted to the DCC. A further 485 samples have microRNA sequences, again submitted and publicly available.
And then the preliminary analyses that we performed on these datasets, listed at the bottom of the slide here, include unsupervised consensus clustering to identify subtypes. I'll talk to you a little about that. The microRNA anti-correlations with the gene isoform expression. I'll very briefly touch on that. And then a little more on fusion identification using two platforms, our in-house Trans-ABySS, and then the University of Chicago's fusion-finder algorithm.
So to touch on subtypes. In this slide I'm showing Figure 2 from the ovarian marker paper published last year. In this study there were four different subtypes defined, corresponding here to the differentiated, immunoreactive, mesenchymal, and proliferative subtypes.
When we perform similar unsupervised cluster analysis using our sequence-based expression profiles from 300 tumors, we identify potentially two additional groupings. This NMF clustering is illustrated here, with the cophenetic scores showing high values here and the average silhouette widths also supporting that there may be additional clusters beyond the four previously published. Of course, if we then look for the correspondence of the samples within this new six-cluster solution to the existing four, we see four discrete clusters that map almost identically to those previously published. These are our clusters four, one, two, and five. But then additionally we see these two slightly smaller clusters, cluster six and cluster three, for which the samples don't map to a single pre-defined locus. And so this adds some support to the idea that there may be additional subtypes within this data that we're seeing through the sequencing work. Those are the two additional ones there.
We can perform the same analysis for microRNA sequencing. Again, in the consortium publication, three robust microRNA clusters were identified. We also see reasonably robust evidence for six clusters, and here we're putting some of the top driver microRNA signatures onto each of these clusters. Many of them are familiar to those of you working on pan-cancer and multiple different tumor types. But in this case, unlike the RNA sequencing data, we see very little correspondence between the novel cluster solution and those existing previously, and clearly we need to dig deeper into these analyses and perhaps add P values to these Bezier curves to identify whether there are enrichments between certain clusters.
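As an illustration only (not part of the talk), one simple way to attach a P value to the overlap between a new cluster and a previously published subtype is a one-sided hypergeometric test. The sketch below assumes each sample carries one label from each clustering. The function names and the toy labels are invented for the example, and the talk does not say which test the group intended to use.

```python
from collections import Counter
from math import comb


def hypergeom_tail(total, in_subtype, in_cluster, overlap):
    """P(X >= overlap) when in_cluster samples are drawn without replacement
    from total samples, of which in_subtype carry the published subtype label."""
    upper = min(in_subtype, in_cluster)
    favourable = sum(
        comb(in_subtype, k) * comb(total - in_subtype, in_cluster - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(total, in_cluster)


def cluster_overlap_pvalues(new_labels, old_labels):
    """Enrichment P value for every (new cluster, old subtype) pair that co-occurs."""
    total = len(new_labels)
    new_sizes = Counter(new_labels)
    old_sizes = Counter(old_labels)
    overlaps = Counter(zip(new_labels, old_labels))
    return {
        (new, old): hypergeom_tail(total, old_sizes[old], new_sizes[new], n)
        for (new, old), n in overlaps.items()
    }


# Toy labels, invented for illustration only (not TCGA data).
new_labels = ["c1"] * 4 + ["c3"] * 4
old_labels = ["proliferative"] * 4 + ["mesenchymal"] * 2 + ["immunoreactive"] * 2

pvalues = cluster_overlap_pvalues(new_labels, old_labels)
for (new, old), p in sorted(pvalues.items()):
    print(f"{new} vs {old}: p = {p:.4f}")
```

Because every cluster pair is tested, a real analysis would also need a multiple-testing correction, which this sketch leaves out.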
With these expression signatures in hand, we can turn to ask questions such as the interplay between microRNA and messenger RNA, and here, just to give an example, is a relationship that was actually published by Chad Creighton this last year between this microRNA-29a and the DNMT3A DNA methyltransferase gene. What we're showing here, for each of the six subclusters that we've identified from RNA sequencing, is the expression of DNMT3A in each of those clusters, and we can see, for example, increasing RNA expression in the gray cluster. Conversely, if we look at this bottom plot, this is the expression profile for microRNA-29a, and we see correspondingly decreasing expression, so anti-correlated with the RNA. But we only see this trend in cases for which the microRNA binding site is present in our isoform. And this is where the sequencing gives us additional resolution that may not be captured in microarray experiments. So in the example shown, the three top isoforms of this gene all contain a microRNA binding site; in the shorter isoform the binding site is absent, and that isoform has no expression correlation with the microRNA.
Turning now to the gene fusion detection within this cohort. We've applied our in-house assembly and analysis pipeline, Trans-ABySS, to all 420 cases, and identified about 4,300 candidate fusions. In the absence of remaining total RNA for verification, we've turned to orthogonal approaches, and we have been working with Kevin White's group at the University of Chicago. Their group has been running UC-fusion-finder on this same cohort. And looking only at the intersection, we identify approximately 1,500 such gene fusions called by both platforms. Of these 1,500, 64 are recurrent; that is, present in two or more cases. And the distribution of these is very interesting, and really in stark contrast to other studies, such as the acute myeloid leukemia study.
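As an illustration only (not part of the talk), the two steps just described, keeping calls made by both platforms and then counting fusions present in two or more cases, can be sketched as below. It assumes each caller's output has been reduced to (sample, 5' gene, 3' gene) tuples and that two calls match when all three fields agree. The talk does not state the actual matching rule, and the sample and gene names here are placeholders.

```python
from collections import defaultdict


def recurrent_shared_fusions(calls_a, calls_b, min_cases=2):
    """Intersect two callers' fusion calls, then find recurrent gene pairs.

    calls_a, calls_b: iterables of (sample_id, gene_5prime, gene_3prime).
    Returns (shared_calls, recurrent), where recurrent maps each gene pair
    seen in at least min_cases samples to the set of those samples.
    """
    # A call is kept only if both callers report it in the same sample.
    shared = set(calls_a) & set(calls_b)

    samples_by_fusion = defaultdict(set)
    for sample, gene5, gene3 in shared:
        samples_by_fusion[(gene5, gene3)].add(sample)

    recurrent = {
        fusion: samples
        for fusion, samples in samples_by_fusion.items()
        if len(samples) >= min_cases
    }
    return shared, recurrent


# Placeholder calls, invented for illustration only.
caller_a = [
    ("S1", "GENE_A", "GENE_B"),
    ("S2", "GENE_A", "GENE_B"),
    ("S3", "GENE_C", "GENE_D"),
    ("S4", "GENE_E", "GENE_F"),  # reported by caller A only
]
caller_b = [
    ("S1", "GENE_A", "GENE_B"),
    ("S2", "GENE_A", "GENE_B"),
    ("S3", "GENE_C", "GENE_D"),
    ("S5", "GENE_E", "GENE_F"),  # same gene pair, different sample
]

shared, recurrent = recurrent_shared_fusions(caller_a, caller_b)
print(len(shared), "calls made by both callers")
print(sorted(recurrent), "are recurrent")
```

Matching on gene pair alone, or on breakpoint coordinates, would give a different intersection; the tuple-level match is just the simplest choice for a sketch.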
In ovarian we see a high degree of duplication, and this is consistent with the findings in the marker paper of focal copy number gains and losses. And so, around this Circos plot, many of those arcs that are linked within the same chromosome block are essentially the result of duplication, and there were very few cases of translocation for ovarian, but you can see the density of the recurrent gene fusions. That's all that's being plotted here. In contrast, with AML there are very many more in-frame fusions, indicated by the green color, and the thickness of these bars corresponds to the level of recurrence. So, many more highly prevalent fusion events, and the result largely of translocations, within AML. So very stark differences.
If we tease apart the ovarian fusion events into in-frame and out-of-frame, we identify the most recurrent in-frame events in this chart, and the colors here indicate events that are seen in the Mitelman database of chromosome aberrations in cancer. So those in purple are known fusion events, where both gene partners have been previously reported. Those highlighted in green are where a single gene member of that fusion pair has been previously reported. And then the remainder in gray are entirely novel gene fusion constructs identified through our analysis. To draw your attention to -- there's a single case of TFG-GPR128, which is a known polymorphism within the database of genomic variants.
So the most highly prevalent in-frame gene fusion event we have involves MECOM, the MDS1 and EVI1 complex locus. And this was observed to be focally amplified in over 20 percent of the ovarian tumors in the TCGA early report. Of interest, MECOM is a target of a couple of FDA-approved therapeutic compounds listed here. And like I said, we've identified these in-frame fusions in approximately 3 percent of this cohort. Primarily, the fusion events fuse exon 1 of MECOM to an entire transcript of a novel partner.
And as cartooned in this slide, MECOM and the partner genes, as a result of the duplication events, are present on chromosome 3, band q26.2, and we have a fusion between exon 1 of MECOM and, in this particular case, in which six patient samples contain the fusion, the entire transcript locus for this leucine-rich repeat-containing protein. Of interest, the 5' end of MECOM contains a 12 amino acid signature sequence, which has previously been shown to recruit MAP kinases, SMAD3, and SUV39H1, and so transcriptional corepressors and the like.
So we've now taken the gene fusion partners for all 1,500 events and identified pathways which may be linked to these genes. So, of the 2,500 unique genes, we see an enrichment within the COSMIC database: 105 of these genes are seen in the cancer census as causally implicated in cancer. Some of the pathways listed on this slide are familiar to many of you. If we then remove these 105 genes from the total set, the one remaining pathway is ubiquitin-mediated proteolysis, and so certainly this warrants further investigation.
So to summarize, we've generated mRNA-seq and microRNA-seq for 420 and 485 of these TCGA ovarian samples, respectively. Unsupervised clustering of the expression profiles identifies potentially additional sample groupings, and an exploration of putative microRNA and mRNA interactions identifies significant expression anti-correlations, including the example I provided that was previously published. In contrast to other cancers, AML being an example, duplication is the primary rearrangement leading to gene fusions, and this is consistent with the TCGA publication. And MECOM fusions are the most recurrent in-frame events that we've identified within this tumor type.
So ongoing work includes the identification of recurrent partial tandem duplications and internal tandem duplications, and my colleague, Lucas Swanson, is here with poster number 106. I encourage you to visit.
Further pursuit of this MECOM, especially in light of the therapeutic target, is warranted, and, of course, differential expression and a discriminatory gene analysis, and further integration with existing and novel TCGA datasets, are in the pipeline. So I thank you for your attention. I thank my colleagues at the B.C. Cancer Agency Genome Sciences Centre, and I'll happily take any questions. Thank you.
[applause]
Richard Gibbs: Time for a quick question or two. So the correlative observation of the large number of fusions with the overall level of genomic rearrangement in ovarian, at what point do you say there is a strong causative association, you know, the genome is rearranged because that disease wants to see more fusions? Where, you know, where do you [inaudible] --
Andrew Mungall: It's key, I believe, that TP53 is mutated in almost all, if not all, of these cases, and so genome rearrangement is clearly an integral part of this disease and quite different to many of the other tumor types we see. Whether the transcription fusions -- I mean, I think it must be looked at through the pathway analysis, because the highest recurrence we've seen is still relatively low at around 3 percent, and so whether it's a combinatorial driving of the disease needs further exploration.
Richard Gibbs: Thank you. Well, we'd better move on. We've got --
[applause]
[end of transcript]
NHGRI/NCI: 08 TCGA 112712 Andrew Mungall, 12/13/12. Prepared by National Capitol Captioning, 200 N. Glebe Rd. #1016, Arlington, VA 22203, (703) 243-9696