Dr. Chris Kinsinger Presents - Proteogenomic Integration of CPTAC Data

>> The Speaker Series is a knowledge-sharing forum featuring both internal and external speakers on topics of interest to the biomedical informatics and research communities. We remind you that today's presentation will be available on the wiki for the Speaker Series as a screencast with voiceover and also posted on the Speaker Series YouTube playlist. If you Google NCI Speaker Series, you'll find that wiki page. Also, information about our future speakers is available via Twitter and on our blog. Just Google NCI blog to find that, and our Twitter handle is @nci_ncip. Today, I'm very happy to welcome Dr. Chris Kinsinger, who is Technology Program Manager for NCI's Clinical Proteomic Tumor Analysis Consortium, or CPTAC. Dr. Kinsinger focuses on the expansion and coordination of open data access and programmatic goals involving mass spectrometry, informatics, and biospecimens. He completed his postdoctoral training at NIST, where he researched fragmentation pathways of peptide ions in mass spec, and holds a Ph.D. in chemistry from the University of Minnesota. The title of his presentation is Proteogenomic Integration of CPTAC Data. And with that, I'll turn the floor over to Chris.
>> OK, thank you Tony. And thank you everyone for participating in the Speaker Series on this icy February morning. My goals today are to raise awareness of data analysis highlights and challenges at the interface of genomics and proteomics. And secondly, I'd like to convince you that proteomics does offer novel insights into cancer biology. So, the main program that I'm going to be talking about is the one that I manage, which is called CPTAC, or the Clinical Proteomic Tumor Analysis Consortium. So, the outline is: I'm going to give you a brief introduction to CPTAC.
We'll talk about some of the methods used in proteogenomic analysis and share some preliminary results from the tumor data that we have. The tumors actually came from TCGA, and then we did the proteomic analysis on top of them. I'll talk about how to get access to CPTAC data, some of the other resources that we have, and then the conclusions and challenges.
So, just a little background on CPTAC. If you go back about 12 years now, the human genome was published and the National Academy of Sciences held one of their workshops to look at the future of omics technologies. And at that time, what they determined was that the sequencing technology for DNA was pretty mature, and so they were ready to move forward into production mode. Transcriptomics was up and coming, and there were a lot of array methods that were established. And since then, obviously, we've had RNA-seq developed. But proteomics was the least developed in terms of technology back in 2002. So, the NCI took that and a lot of other input and, in 2006, launched what we know as The Cancer Genome Atlas. But at the same time, through the same report from the National Cancer Advisory Board, it also started a proteomics program that was called the Clinical Proteomic Technology Assessment for Cancer. You'll notice that's the same acronym as the tumor analysis consortium. We kept the branding of CPTAC for both phases of this program. But in 2006, we were really focusing on technology standardization and validation of those methods, because of the reproducibility issues that were apparent in the field. So, what we did for the first five years, 2006 to 2011, was really take a close look at the discovery technologies in proteomics as well as how we can beef up the validation of proteomic assays.
And the main output, I would say, was to develop a kind of middle ground, which we call verification, which is to use the mass spectrometers that do most of the discovery work to do targeted analysis, so that you can focus in on a few peptides or a few proteins and get much more reliable quantitative information, and can then move into the validation phase. In addition to that, we came up with a data release and sharing policy for proteomics data, which was a bit of a Wild West back then. We also developed the antibody characterization program, and I'll share more about that at the end of the talk.
So, then when we got to 2011, of course we sought to reissue the program, but it really took a different direction. Rather than technology assessment, we moved into more full-fledged, I guess pilot production, tumor analysis. So, right now we're looking at three tumor types that TCGA has already analyzed. We took the residual tissue and are doing the proteomic analysis of that. And the goal here is to understand the molecular basis of cancer through proteomic technologies, using the genomic data as well. So, our consortium consists of these five teams: the Broad Institute, Johns Hopkins University, Washington University in Saint Louis, Vanderbilt University, and Pacific Northwest National Lab. And each of them has a number of other centers that they partner with, as well as a number of support centers that help in the data analysis and in collecting future specimens for the program. And so, as I said, we're analyzing colorectal tissue, breast tissue, and ovarian tissue. We're doing about a hundred tumors per cancer type. The reason TCGA did 500 and we're doing 100 is, one, our throughput is not as fast as what genomics technologies are right now. And also, the residual tissue that was available to us was limited in number. So, this is the main workhorse technology for proteomics discovery.
This is shotgun mass spectrometry, and all the results I'm talking about today will be from the colorectal data, which is the data set that's currently publicly available. And Dan Liebler of Vanderbilt produced these data in his lab. So, what we do with a typical proteomics workflow is we start with the tissue, extract the protein, and at the same time digest the proteins into peptides using the enzyme trypsin. So, we have peptides that are about 7 to 25 or 30 amino acids in length. And then we run that through some two-dimensional chromatography. It gets pretty fancy here, so we're not going to dwell on that because this is mainly an informatics talk. So, you separate the peptides into different fractions and then run them through the mass spectrometer. The Orbitrap is a common instrument that has pretty good resolution, and the way that you identify the peptide spectrum is you actually break up the peptides into smaller fragment ions and you get a spectrum of MS/MS fragment ions. And so then we have software that can go into that and match a spectrum with a known peptide. And I'll discuss that a little bit more on the next slide. So, then what you're left with is a list of peptides, and then of course you have to assemble those peptides into proteins, and then you can get the quantitative information. There are a few different ways to get quantitative information. For this data set, we used the method called spectral counting, which is basically counting the number of times you identified a peptide across the data set, and you're making an assumption that that is correlated with the overall abundance of that peptide in the sample.
OK. So, let's talk about how we actually do the peptide identification. So, this is all we have. We've got the tandem mass spectrum here, which again is mass-to-charge ratio on the X axis and ion intensity on the Y axis. So, that's what we have, and by itself it doesn't really tell you anything about what peptide is in that sample.
So, we have to start with what we know, and what we do know is that to do proteomics you have to know the genome of the organism that you're studying. So, in this case, we're doing human tumors, so we have the human genome. So, we usually start with something like the Ensembl database or the RefSeq database. And from that, we can find the exons and translate those into theoretical protein sequences. And we call that a FASTA database of protein sequences. And we do an in silico trypsin digest to get the theoretical peptides we might expect to identify. We can expand that by various post-translational modifications that we might expect to be in the sample. However, we need to be careful with that, because if you expand it too far then you're broadening your search space too much and you're going to get a lot of false positive identifications. So, there are different ways of filtering the peptides that you have. And eventually, you get a list of fragment ions that you would expect from a given peptide. You've got a theoretical spectrum. And so then you're matching the theoretical spectrum with the real spectrum, and there are different scoring algorithms. There are probably at least a hundred different scoring algorithms that graduate students have developed to match these, and you take the best matches and so forth. Then there are ways to compute a false discovery rate by using, say, a randomly generated database of non-real peptides and seeing how well those match your spectra. That's kind of how you do the false discovery rate. So, this is what's going on. Now, what I want to say about this is that there are lots of different peptide identification algorithms. I think, you know, they're not perfect yet, but the field has come along. It's a nice tractable problem that graduate students like to work on.
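
As a minimal sketch of the in silico trypsin digest described above: the code below assumes the commonly used simplified rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P), and it keeps peptides of 7 to 30 residues, matching the length range mentioned in the talk. The function names, the `missed_cleavages` parameter, and the toy sequence are illustrative assumptions, not part of the CPTAC pipeline.

```python
def cleavage_fragments(protein):
    """Split a protein after each K or R that is not followed by P."""
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])  # C-terminal remainder
    return fragments


def tryptic_peptides(protein, missed_cleavages=0, min_len=7, max_len=30):
    """Return the set of theoretical peptides within the length window.

    missed_cleavages is the maximum number of skipped cleavage sites
    allowed inside one peptide (a hypothetical search setting).
    """
    fragments = cleavage_fragments(protein)
    peptides = set()
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


# Toy sequence, not a real protein.
toy = "MAAAAAAKGGGGGGGRPLLLLLLKEEEEEEE"
for peptide in sorted(tryptic_peptides(toy)):
    print(peptide)
```

Raising `missed_cleavages` or adding modifications enlarges the peptide list, which is the search-space growth the talk warns about.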
The harder problem in proteomics is this assembly problem, where you have a list of peptides and now you're trying to roll them up into various proteins. But we know that the protein is not translated the same way each time. For instance, VEGF consists of eight exons, but you can have a VEGF protein without exons 6 and 7, or with only exon 6, or with exons 6 and 7. So, we have these different isoforms of these proteins. And what do you do if you don't have all the peptides? And a lot of times, we don't have all the peptides for a protein. Then we have situations where it might be Protein A and Protein B, and we've identified Peptides 1, 2, and 3, but none of these peptides is unique to Protein A or Protein B. So, which one is really there, Protein A or Protein B? So, we have to work that out. You could also have a situation where there are three peptides that match Protein A, and two of those three peptides match Protein B. Is Protein B actually there? So obviously the likelihood that Protein A is there is pretty strong, but there might be some of Protein B there. Then you have subsumable proteins, where there are unique peptides that match Protein A and Protein C but there aren't any unique peptides that match Protein B. So, again, is Protein B really there? Or what's the likelihood? Of course, we assign probabilities and compute this. And I would just say I'm always recruiting new people to study this problem, because I think there's less work on this problem than on some of the other problems in proteomics. So, then here's a real situation where you have all these different peptides matching to many different proteins, and which proteins are actually there and which ones aren't? So, the algorithm that we're currently using is called IDPicker, and it tries to do this in the most parsimonious manner, so that you're not coming up with false positive proteins that are actually not in your sample.
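
To make the parsimony idea concrete, here is a rough sketch that treats protein assembly as a greedy set-cover problem: proteins with identical peptide evidence are merged into one indistinguishable group, and groups are then picked one at a time until every identified peptide is explained. This is only an illustration of the principle under those assumptions. It is not IDPicker's actual implementation, and the function name and toy data are hypothetical.

```python
def parsimonious_protein_groups(protein_to_peptides):
    """Greedy minimal list of protein groups explaining all peptides.

    protein_to_peptides maps a protein name to the set of identified
    peptides that match it.
    """
    # Proteins with exactly the same peptide set cannot be told apart.
    groups = {}
    for protein, peptides in protein_to_peptides.items():
        groups.setdefault(frozenset(peptides), []).append(protein)

    uncovered = set().union(*groups)  # every identified peptide
    chosen = []
    while uncovered:
        # Pick the group explaining the most still-unexplained peptides;
        # ties are broken alphabetically so the result is deterministic.
        best = min(
            groups,
            key=lambda peps: (-len(peps & uncovered), sorted(groups[peps])),
        )
        chosen.append(sorted(groups[best]))
        uncovered -= best
    return chosen


evidence = {
    "A": {"pep1", "pep2", "pep3"},
    "B": {"pep2", "pep3"},   # every peptide is also explained by A
    "C": {"pep4", "pep5"},
    "D": {"pep4", "pep5"},   # indistinguishable from C
}
print(parsimonious_protein_groups(evidence))
```

In this toy case Protein B is dropped because Protein A already accounts for all of its peptides, and C and D are reported together as one group. A greedy cover is not guaranteed to be the smallest possible one, which is part of why the assembly problem remains open.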
So, again, that's just a plug for more work to be done on the protein assembly piece.
All right. Well, let's move on to talking a little bit about the data. First, I mentioned that we had this reproducibility issue in proteomics in the 2000s. And that hasn't entirely gone away, and so at CPTAC we wanted to maintain our status as caring a lot about reproducibility and high-quality data. So, what we did for the first year of the second phase of the program, in 2011, was qualify labs and platforms to be able to have access to the TCGA samples. And the way we did that was using some xenograft breast tumors from Washington University in Saint Louis, where they took human breast tumors and grafted them into mice, and then you can grow up large lots of breast tumor material that's quite homogeneous. And so we used that as our performance mixture. So, every lab within CPTAC had to run this, and the consortium agreed, OK, yes, these labs are ready to receive the samples and these labs still need a little bit more work. And we had some hard conversations, but eventually everybody got there. So, we're happy about that. And then we continued to use this material, interspersing it among the TCGA tumors as we ran them for proteomics. So, every fifth sample for the colorectal set was one of these xenograft samples. And I think it took about nine months to get through 100 tumors, so you could have instrument variability, of course, over those nine months. But we're able to look at, you know, the xenograft sample that we ran the first week and the sample that we ran nine months later, and see how much drift there had been in the instrument. So, it was a good normalization method for the platform itself. This is the result of those CompRef samples across the sample set. And it's a big complicated slide with a lot of numbers, but the main thing to show is that there were two types of samples.
One was a basal breast tumor and the other was a luminal breast tumor. And so, we have very high correlation for luminal to luminal of about 0.85, and for basal to basal it was 0.88. And if you do the correlation across the two sample types, it was 0.68. So, the basal is matching really well with the basal and the luminal is matching really well with the luminal. So, we're pretty convinced that the platform is quite stable over the analysis time for these samples.
OK. So, let's talk about the tumors themselves. So, these are colorectal tumors, about 55 percent male, 45 percent female. Most of them came from the colon region rather than the rectal region, about two-thirds colon, one-third rectum. At that time, most of the people were still living. So, this is the hypermutated status. Most of the tumors were non-hypermutated, but about, what's that, 20 percent were hypermutated. And also within colon cancer there's this microsatellite stability versus instability. And so, here's the breakdown of that, where MSS is stable, microsatellite instability low is this yellow region here, and MSI high is the red region there. And then you can see the breakdown of stages of tumors also. So, all in all we had 90 colorectal samples. There were 95 total samples, but five of them were duplicates, and those looked quite reproducible as well.
So, we run these through the mass spectrometer. We've got a lot of data, millions and millions of spectra. And from those, when we identified the peptides, we came up with 94,000 distinct peptides. And that's at a 0.01 percent false discovery rate. So, I guess that means about one in 10,000 is likely to be a misidentification. So, we're looking at maybe 9 of those 94,000 that, on average, would be false. When we roll up those 94,000 peptides, we find about 7,500 different protein groups.
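
The false discovery rate arithmetic above, together with the decoy-database idea mentioned earlier in the talk, can be sketched as follows. This assumes one common estimator, the number of decoy hits divided by the number of target hits above a score cutoff. The talk does not say which estimator was actually used, and the function names and toy scores are hypothetical.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR for matches scoring at or above threshold.

    psms is a list of (score, is_decoy) pairs, where a higher score
    means a better peptide-spectrum match. The estimate is
    decoy hits / target hits, one common target-decoy estimator.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def expected_false_identifications(n_accepted, fdr):
    """Expected number of wrong identifications among accepted ones."""
    return n_accepted * fdr


# Toy matches: (score, is_decoy)
toy_psms = [(10, False), (9, False), (8, True), (7, False), (6, False), (5, True)]
print(fdr_at_threshold(toy_psms, threshold=7))

# 94,000 peptides at a 0.01 percent FDR (a fraction of 0.0001):
# roughly 9 expected false identifications.
print(expected_false_identifications(94_000, 0.0001))
```

Lowering the score threshold admits more peptides but also more decoy hits, which is the trade-off behind choosing a stringent or a relaxed cutoff.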
So, a protein group is a group of proteins that's indistinguishable, which I talked about earlier. So, there could be more than that, but that's how many we can find that are distinguishable. Now, compared with what's typically done in proteomics, this 0.01 percent FDR is quite stringent. So, since we had found these proteins, we went back to the data and looked and said, "Well, what other peptides map to those proteins?" So, we're not introducing any new proteins here. But if we have peptides that didn't quite meet that high threshold but do map to the proteins that we found, it's pretty likely that they would be there. So, we relaxed the false discovery rate to one percent, which is kind of the gold standard in proteomics today. And when we do that, we get 124,000 distinct peptides. It's still the 7,500 protein groups, and these map back to 7,200 genes. OK. Now, we can't quantify all of those proteins, because in some cases we just have a few spectra. But for the ones where we have more spectra, using the spectral counting method, we can quantify about half of those, 3,900 quantifiable proteins. So, that's the dataset we're working with when we do the quantification analysis that's coming up.
OK. So, let's shift gears now and talk about how we do proteogenomics. There are a lot of different approaches to proteogenomics. I'm just going to talk about one approach today. But it really has to do with what we put in this FASTA database. So, if we use the RefSeq database, then we're limiting ourselves to kind of the reference human genome. But for this sample set, TCGA has determined the DNA sequence and they've done the RNA sequencing. So, we may as well use the variants that they've found, and the different splice variations and things, and put those into the FASTA database. So, that's what our investigators at New York University, David Fenyo and Kelly Ruggles, have done. So, they've taken the non-tumor sample.
From the germline sequence they identify the variants and put those into the FASTA protein sequence. They do the same, mainly, with the RNA-Seq. If you take all the variants from the DNA, then again, you know, you've got to be careful. You're expanding your search space quite a bit, and you're going to open yourself up to a lot of false positives. So, at this preliminary stage, we're just looking at the variants within the RNA-Seq and putting those into the FASTA database. And if it's not in the FASTA database, we're not looking for the peptides in a way that we'll find them. So, that's our limitation. OK. So, then with that they look for alternative splicing, they look for novel expression, fusion genes, and of course single amino acid variants.
So, we'll look at the variants first. So, on the left are the variants that are predicted from the RNA-Seq, which you can see are on the order of 15,000 to 25,000. The blue are the germline variants, the purple are the somatic variants, and then on the right are the ones that we've found so far in the proteomics data, which are on the order of 50 to 100 variants. So, that's actually pretty common for human proteomics, that about one percent of the peptides you find tend to be variants and 99 percent are non-variants. Here's another way to look at how these variants break down. So, for the TCGA somatic variants, there are 64 peptides aligned with those. We also looked across all the somatic mutations in cancer, that's the COSMIC database, and there are 101 of those. And then the other variants are in dbSNP. Most of them show up there, some 550 or so. So, here on the bottom left, in this Venn diagram, you can see how they overlap between those three different databases. And then most of the variants that we see are only in one sample. We do have some that show up in 2 to 10 samples, and the variants that show up in more than 10 samples tend to be from dbSNP. So, they've been identified before. OK.
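
As an illustration of adding a single amino acid variant to the search database, the sketch below applies a substitution to a reference protein sequence and formats the result as a FASTA entry. The sequence, the variant, the header format, and the function names are all made up for the example. The actual customized-database pipeline used by the investigators is not described in the talk.

```python
def apply_variant(sequence, position, ref, alt):
    """Return the sequence with a single amino acid substitution.

    position is 1-based, as in notation such as A4V. The reference
    residue is checked so a mismatched annotation fails loudly.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[position - 1] != ref:
        raise ValueError(
            f"expected {ref} at position {position}, found {sequence[position - 1]}"
        )
    return sequence[:position - 1] + alt + sequence[position:]


def fasta_entry(header, sequence, width=60):
    """Format one FASTA record with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Hypothetical reference protein and variant, for illustration only.
reference = "MKTAYIAKQR"
variant = apply_variant(reference, position=4, ref="A", alt="V")
print(fasta_entry("EXAMPLE_PROTEIN|A4V", variant), end="")
```

Each variant entry added this way gives the search engine a chance to match a variant peptide, at the cost of a larger search space.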
Now, we'll talk about the splice variants that we found. Now, there are a few different ways to look for splice variants. You can have unannotated splice variants, where you just skip an exon and go right to the next exon. You can have partially novel expression, where you pick up in between exons or in the middle of an exon. Or you can have completely novel expression, where both the upstream and the downstream translation are starting outside of a known exon. And of course, we have fusion events, where two exons are coming together from different chromosomes. And we already talked about variants. So, here is the alternative splicing that we found in the colon. So, again, there are 10,000 to 15,000 predicted overall from the TCGA analysis. And so far, we haven't found more than one alternative splicing event in any sample, except for this one here where we found two. So, you can see we're finding, you know, six unique peptides that are splice peptides for partially novel splicing. For unannotated, we find 7 different peptides. For completely novel, we found 23 peptides, but never more than one in a sample. So, I think there are a few explanations that could explain the discrepancy here. One is that alternative splicing might really interrupt the translation to proteins, and so many of the genomic splice events are just not translated into protein. Another is that the splice peptides could be very low-abundance peptides and the mass spectrometer isn't sensitive enough to detect them. And the third, of course, is that we're somehow not looking in the right place or looking in the right way. And so, these are preliminary data. This is still kind of an unsolved mystery, but I think we'll get to the bottom of it as more and more people look into this. OK. And then the fusions: there are very few fusions identified at the genomic level, and so far none at the proteomic level.
We have looked, but we haven't found any yet.
All right. Well, now we're going to go on and talk about the proteomics data and not focus on the proteogenomics as much. But I've talked about this a couple of times before, and somebody always wants to know about the correlation between the protein expression and the mRNA expression. And so, this is what we've done to answer that. There are a couple of different ways you can correlate the protein with the RNA. You could get a Spearman correlation coefficient, you know, across an entire sample. And so, that's what's represented here on the left. And when you do that, there's the probability density, and the mean is about 0.47. You could also make protein-RNA pairs across the entire sample set and rank them that way. And when you do that, you know, now you're talking about, I guess, the 3,900 proteins that were quantified. And when you do that, we get a mean correlation coefficient of closer to 0.23. And when you do that, you do find some negative correlation for some pairs. So, I don't know if we learn that much from the average correlation. I think it's clear that in some cases protein and RNA correlate well, and in other cases they don't correlate so well. So, what's more interesting is whether we can identify the classes of genes whose protein and RNA correlate well, and the classes of genes that don't correlate well. So, that's what this slide is about. And what you can see is that a lot of the metabolic pathways have genes that tend to correlate well: arginine and proline metabolism, butanoate metabolism. But then when you get to pathways that are much more involved in protein production or oxidative phosphorylation, that's when we see a lot more of these pairs that are anti-correlated. So, again, that's preliminary. I think there's a lot more work to be done there, but it's certainly something I think a lot of people would be interested to look into. OK.
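
The two ways of correlating protein with mRNA described above can be sketched with a small pure-Python Spearman implementation: one coefficient per sample (computed across genes) and one coefficient per gene (computed across samples). The data layout, the function names, and the toy numbers are assumptions for illustration, not the analysis code behind the slide.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def per_gene(protein, rna):
    """One coefficient per gene, computed across samples."""
    return {gene: spearman(protein[gene], rna[gene]) for gene in protein}


def per_sample(protein, rna, n_samples):
    """One coefficient per sample, computed across genes."""
    genes = sorted(protein)
    return [
        spearman([protein[g][s] for g in genes], [rna[g][s] for g in genes])
        for s in range(n_samples)
    ]


# Toy abundances: gene -> values for four samples.
protein = {"g1": [1, 2, 3, 4], "g2": [4, 3, 2, 1], "g3": [2, 1, 4, 3]}
rna = {"g1": [10, 20, 30, 40], "g2": [1, 2, 3, 4], "g3": [5, 3, 9, 8]}
print(per_gene(protein, rna))
print(per_sample(protein, rna, n_samples=4))
```

The per-gene view is the one that can expose anti-correlated pairs, since each gene gets its own coefficient instead of being averaged into a whole-sample number.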
And then kind of the culmination of the proteomic work is to look at subtyping: when we subtype with the proteomics, do we get any new subtypes that are not in the genomics data? So, from TCGA, they identified three transcriptomic subtypes, which are called MSI/CIMP, Invasive, and CIN. So, here's that, which was kind of the feature figure in the Nature article from 2012. So, when we do the proteomics, we get about 77 of the samples clustering into these five different clusters here, A, B, C, D, and E. OK. And you can see how those map onto the TCGA transcriptomic subtypes, and that especially the MSI/CIMP samples, which include a lot of the hypermutated tumors, kind of break apart into the B subtype and C subtype at the proteomic level. So, that's what is shown here. Now, as I mentioned, a lot of the methylation subtypes show up in subtype B but not so much in subtype C. So, we kind of teased that out, which is interesting. Also, subtype B has a lot of the common mutations, the POLE mutation, the BRAF mutation, but, interestingly, it doesn't have the p53 mutation or the 18q loss. Those show up much more in proteomic subtype E. And then this is just showing how it maps to some other clustering analyses that have been done on the TCGA data, from Sadanandam and De Sousa. So, we don't fully know what this means clinically, or biologically even, but it's certainly interesting, and we're following up to see how this plays out in other sample sets. OK. So, that's the preliminary data from the colorectal dataset.
Now, I'm going to finish up here by talking about some of the other resources that we have available through CPTAC, and really how we're distributing that information. And, you know, CBIIT is certainly interested in distributing resource information, and so if you have any comments or advice, we'd be happy to hear that. OK. All right. So, what we're going to talk about is these three portals.
We have an antibody portal, an assay portal that's probably going to go live in the next couple of weeks, and then a data portal where you can download the raw data behind what I showed earlier this morning. So, first, we'll talk about the data portal. You can get to this from proteomics.cancer.gov and then click on the tab that says "Data Portal." And this is where we release our raw mass spec data and the peptide-spectrum matches. So, that's where it is right now. The protein assembly is in the works. It's coming, but we want to make sure that it's robust before we release it publicly. Also, the breast cancer data is coming soon. It's just about through quality control and should be released in the next couple of weeks. And I'll just throw out a teaser that that has phosphoproteomics data, which the colon data does not. And our friends at ESAC and Georgetown manage this contract, so they oversee that. So, right now, we've got about 11,419 raw data files in 2.2 terabytes. That's going to increase by about 30 percent once the breast data comes out in the next few weeks. We have generated some global interest, and when we released the colon data, people downloaded that like crazy, but I think mainly the proteomic informatics community.
And then we have this NCI Antibody Portal, which has been around for I think five or six years now, and we have about 280 antibodies that map to some 85 or 90 antigens. And what we're showing here is characterization data for antibodies that we've developed through contracts at NCI. And then you can acquire the antibodies through a third party, which is the Developmental Studies Hybridoma Bank at the University of Iowa. Last year, we had 40,000 unique visitors, which is really good for one of our websites. And the trick was to get these links into something called LinkOut, which is part of PubMed. I'm kind of happy that I figured that out.
So, you can query, you can sort the antibodies as you like, and we really want to get people to the antibodies as quickly as possible. So, we don't have a lot of fluff on the website. Here are the different characterization methods that we do: Western blot, ELISA, SPR for kinetics, immunohistochemistry, we run it against the reverse-phase array of the NCI-60 cell lines, and we do SPR pairing and immuno-mass spec. And the key, I think, is making all these characterization data available via the portal. And then lastly, I don't have any slides, but this assay portal is taking the quantitative mass spectrometry assays and doing the same thing that we did for the antibody portal, except in this case we have characterization data for mass spec assays. And this is kind of a new resource, and we're getting that out to the community. I think in 2012, targeted mass spec was the Nature Methods Method of the Year. So, there's quite a bit of interest in that method now.
All right, so that's the main content that I had. I'm always looking for new people to come into the proteomics informatics space, and I think here are some of the exciting challenges that we have before us today. So, the first, as I mentioned, is doing parsimonious protein assembly. I think we need some just really clever ways of doing that to improve it. Secondly, I didn't talk about it too much, but capturing and displaying quality control data, especially in these large datasets. You know, people want the data, but it's really important to understand the quality of the data underneath that. And as a community, we haven't yet figured out how to succinctly display that in an understandable way. Next, when you're combining a proteomics dataset, which has one false discovery rate, with a genomic sequence dataset, which might have different errors associated with it, how do you bring those different errors together and work that out? I think that has yet to be worked out.
And then just generally, the integrated distribution of multi-omics data. You know, TCGA has a data portal, CPTAC has a data portal. To get the integrated set, you have to do a fair amount of work to track down all that data. So, I think CBIIT is certainly interested in facilitating that. And then finally, I think there are probably new ways to search for genomic aberrations in the protein data, and there's probably a lot to learn there.
So, in conclusion, I hope that I have persuaded you that proteomic technology is able to extensively characterize tumors, that the proteomic analysis provides distinct information about tumors, and that much work remains on the integration of proteomic data with other datasets. So finally, I'll acknowledge the people that did the work. Much of this work, the proteogenomic analysis, was done by David Fenyo and Kelly Ruggles, and then Dan Liebler and his team at Vanderbilt developed the dataset. And the rest are just the other investigators associated with CPTAC, and you'll be seeing much more from them in years to come. So, I thank you for your attention, and I'm happy to answer any questions.