>> The Speaker Series is a knowledge sharing forum featuring both internal
and external speakers on topics of interest to the biomedical informatics
and research communities.
We remind you that today's presentation will be available on the wiki
for the Speaker Series as a screencast with voiceover and also posted
on the Speaker Series YouTube playlist.
If you Google NCI Speaker Series, you'll find that wiki page.
Also, information about our future speakers is available via Twitter and on our blog.
Just Google NCI blog to find that
and our Twitter handle is @nci_ncip.
Today, I'm very happy to welcome Dr. Chris Kinsinger
who is Technology Program Manager
for NCI's Clinical Proteomic Tumor Analysis Consortium or CPTAC.
Dr. Kinsinger focuses on the expansion and coordination of open data access
and programmatic goals involving mass spectrometry, informatics, and biospecimens.
He completed his postdoctoral training at NIST
where he researched fragmentation pathways of peptide ions in mass spec
and holds a Ph.D. in chemistry from the University of Minnesota.
The title of his presentation is Proteogenomic Integration of CPTAC Data.
And with that, I'll turn the floor over to Chris.
>> OK, thank you Tony.
And thank you everyone for participating in the speaker series
on this icy morning of February.
My goals today are to raise awareness of data analysis
highlights and challenges at the interface of genomics and proteomics.
And secondly, I'd like to convince you
that proteomics does offer novel insights into cancer biology.
So, the main program that I'm going to be talking about is the one
that I manage, which is called CPTAC
or the Clinical Proteomic Tumor Analysis Consortium.
So, the outline is I'm going to give you a brief introduction to CPTAC.
We'll talk about some of the methods used in proteogenomic analysis,
share some preliminary results from the tumor data that we have,
where the tumors actually came from TCGA
and then we did the proteomic analysis on top of them.
I'll talk about how to get access to CPTAC data,
some of the other resources that we have,
and then the conclusions and challenges.
So, just a little background on CPTAC, if you go back about 12 years now,
the human genome was published and the National Academy of Sciences held one
of their workshops to look at the future of omics technologies.
And at that time what they determined with the technology was
that the sequencing technology for DNA was pretty mature.
And so, they were ready to move forward into production mode.
The transcriptomics was up and coming and there are a lot
of array methods that were established.
And since then, obviously, we've had RNA-seq developed.
But proteomics was the least developed in terms of technology back in 2002.
So, the NCI took that and a lot of other input and in 2006,
launched what we know as The Cancer Genome Atlas.
But also at the same time through the same report
from the National Cancer Advisory Board started a proteomics program
that was called the Clinical Proteomic Technology Assessment for Cancer.
You'll notice that's the same acronym as the Clinical Proteomic Tumor Analysis Consortium.
We kept the branding of CPTAC for both phases of this program.
But in 2006, we were really focusing
on technology standardization and validation of those methods
because of the reproducibility issues that were apparent in the field.
So, what we did for the first five years
of 2006-2011 is really take a close look at the discovery technologies
in proteomics as well as how we can beef up the validation of proteomic assays.
And the main output, I would say, was to develop kind of a middle ground
which we call verification, which is to use these mass spectrometers that do most
of the discovery work and to do targeted analysis with those instruments
so that you could focus in on a few peptides or a few proteins
and get much more reliable quantitative information,
so you can then move into the validation phase.
In addition to that, we came up with some data release and sharing policy
for proteomics data, which was a bit of a Wild West back then.
We also developed the antibody characterization program and I'll share more
about that at the end of the talk.
So, then when we got to 2011, of course we sought to reissue the program
but it really took a different direction rather than technology assessment.
We moved into more full-fledged, I guess pilot production tumor analysis.
So, we are able to take-- right now we're looking at three tumors
that TCGA has already analyzed.
We took the residual tissue and are doing the proteomic analysis of that.
And the goal here is to understand the molecular basis of cancer
through proteomic technologies and using the genomic data as well.
So, our consortium consisted of these five teams, the Broad Institute,
Johns Hopkins University, Washington University at Saint Louis,
Vanderbilt University, and Pacific Northwest National Lab.
And each of them has a number of other centers that they partner with, as well
as a number of support centers that help in the data analysis as well
as in collecting future specimens for the program.
And so, as I said, we're analyzing colorectal tissue,
breast tissue, and ovarian tissue.
We're doing about a hundred tumors per cancer type. The reason that,
where TCGA did 500, we're doing 100 is, one, our throughput is not as fast
as what genomics technologies are right now.
And also the residual tissue that was available to us was limited.
So, this is the main workhorse technology for proteomics discovery.
This is shotgun mass spectrometry and all the results I'm talking
about today will be the colorectal data
which is the data set that's currently publicly available.
And Dan Liebler of Vanderbilt produced these data in his lab.
So, what we do with a typical proteomics workflow is we start with the tissue
and then extract the protein and at the same time digest the proteins
into peptides using the trypsin enzyme.
So, we have peptides that are about 7 to 25 or 30 amino acids in length.
And then we run that through some two-dimensional chromatography.
It gets pretty fancy here.
So, we're not going to dwell on that because this is mainly an informatics talk.
So, you separate the peptides into different fractions and then run them
through the mass spectrometer.
This is-- the Orbitrap is a common instrument that has pretty good resolution
and the way that you identify the peptide spectrum is you actually break
up the peptides into smaller fragment ions
and you get a spectrum of MS-MS fragment ions.
And so then we have software that can go into that
and match a spectrum with a known peptide.
And I'll discuss that a little bit more on the next slide.
So, then what you're left with is you get a list of peptides
and then of course you have to assemble those peptides into proteins,
and then you can get the quantitative information.
There's a few different ways to do quantitative information.
For this data set, we used the method called spectral counting
which is basically counting the number of times you identified a peptide
across the data set and you're making an assumption that that is correlated
with the overall abundance of that peptide in the sample.
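
As a minimal sketch of the spectral counting idea just described, the snippet below tallies how many identified spectra map to each peptide. The peptide strings and the `spectral_counts` helper are made up for illustration and are not taken from the CPTAC pipeline.

```python
from collections import Counter

def spectral_counts(psms):
    """Count peptide-spectrum matches (PSMs) per peptide.

    `psms` holds one peptide sequence per identified spectrum, so a
    peptide matched by three spectra appears three times.
    """
    return Counter(psms)

# Hypothetical identifications from one sample, one entry per matched spectrum.
psms = ["LVNELTEFAK", "LVNELTEFAK", "YLYEIAR", "LVNELTEFAK"]
counts = spectral_counts(psms)
print(counts.most_common())
```

The count is only a proxy: it assumes that more abundant peptides are sampled more often by the instrument, which is the assumption stated in the talk.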
OK. So, let's talk about how we actually do the peptide identification.
So, this is all we have.
We've got the tandem mass spectrum here which again is mass-to-charge ratio
on the X axis and ion intensity on the Y axis.
So, that's what we have and it doesn't really tell you anything
about what peptide is in that sample.
So, we have to start with what we know and what we do know is
to do proteomics you have to know the genome
of the organism that you're studying.
So, in this case, we're doing human tumors.
So, we have the human genome.
So, we usually start with something
like the Ensembl database or the RefSeq database.
And from that, we can find the exons and translate those
into theoretical protein sequences.
And we call that a FASTA database of protein sequences.
And we do an in silico trypsin digest
to get the theoretical peptides we might expect to identify.
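
A rough sketch of an in silico trypsin digest is shown here. It assumes the common simplified rule (cleave after K or R unless the next residue is P), allows no missed cleavages, and uses the 7 to 30 residue length window mentioned earlier as a default filter. The function name and the toy sequences are illustrative only.

```python
def tryptic_digest(protein, min_len=7, max_len=30):
    """Split a protein sequence into fully tryptic peptides.

    Simplified rule: cleave C-terminal to K or R, except before P.
    No missed cleavages are generated in this sketch.
    """
    peptides = []
    start = 0
    for i, aa in enumerate(protein):
        at_end = i == len(protein) - 1
        next_is_pro = (not at_end) and protein[i + 1] == "P"
        if at_end or (aa in "KR" and not next_is_pro):
            peptide = protein[start:i + 1]
            if min_len <= len(peptide) <= max_len:
                peptides.append(peptide)
            start = i + 1
    return peptides

# Toy sequence, not a real database entry.
toy_protein = "MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFK"
print(tryptic_digest(toy_protein))
```

Real search engines typically also allow a small number of missed cleavages, which enlarges the peptide list.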
We can expand that by various post translational modifications
that we might expect to be in the sample.
However we need to be careful with that
because if you expand it too far then you're broadening your search space too
much and you're going to get a lot of false positive identifications.
So, there's different ways of filtering the peptides that you have.
And eventually, you get a list of fragment ions
that you would expect from a given peptide.
You get a theoretical spectrum.
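
To make the theoretical spectrum concrete, here is a small sketch that computes singly charged b- and y-ion m/z values for an unmodified peptide from standard monoisotopic residue masses. The choice of b/y ions only, charge 1+ only, and the `fragment_ions` name are simplifying assumptions for this example, not a description of any particular search engine.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728

def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        # b ions keep the N-terminal i residues; y ions keep the C-terminal i.
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions

b, y = fragment_ions("GAR")
print([round(m, 4) for m in b], [round(m, 4) for m in y])
```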
And so then you're matching the theoretical spectrum with the real spectrum
and there's different scoring algorithms.
There are probably at least a hundred different scoring algorithms
that graduate students have developed to match these
and you take the best matches and so forth.
Then there are ways to compute a false discovery rate
by using say a randomly generated database of non-real peptides
and seeing how well those match your spectrum and then you get--
that's kind of how you do the false discovery rate.
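
The decoy-based estimate can be sketched as follows. This assumes each match carries a score and a flag saying whether it hit the decoy (non-real) database, and it uses the simple decoys-over-targets ratio, which is one common convention among several. The function names and toy scores are hypothetical.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR among matches scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) pairs. The estimate is
    decoy hits divided by target hits, one common convention.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

def threshold_for_fdr(psms, max_fdr):
    """Return the lowest score cutoff whose estimated FDR is within `max_fdr`."""
    for threshold in sorted({score for score, _ in psms}):
        if estimate_fdr(psms, threshold) <= max_fdr:
            return threshold
    return None

# Toy matches: (score, is_decoy).
psms = [(10, False), (9, False), (8, False), (7, True), (6, False), (5, True)]
print(estimate_fdr(psms, 6), threshold_for_fdr(psms, 0.1))
```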
So, this is what's going on.
Now, I also want to talk about the--
well, what I want to say about this is that there are lots
of different peptide identification algorithms.
I think, you know, they're not perfect yet but the field has come along.
It's a nice tractable problem that graduate students like to work on.
The harder problem in proteomics is this assembly problem,
where you have a list of peptides and then--
now, you're trying to roll them up into various proteins.
But we know that the protein is not translated the same way each time.
For instance, VEGF consists of eight exons but you can have a VEGF protein
without exon 6 and 7, or with only exon 6, or with exon 6 and 7.
So, we have these different isoforms of these proteins.
And what do you do if you don't have all the peptides?
And a lot of times, we don't have all the peptides for a protein.
Then we have situations where it might be Protein A and Protein B
and we've identified Peptides 1,
2, and 3, but none of these peptides are unique to Protein A
or Protein B. So, which one is really there, Protein A or Protein B?
So, we have to work that out.
You could also have a situation where there are three peptides
that match Protein A, and two of those three peptides match Protein B.
Is Protein B actually there?
So obviously the likelihood that Protein A is there is pretty strong
but there might be some of Protein B there.
Then you have subsumable proteins where there are unique peptides that match Protein A
and C but there aren't any unique peptides
that match Protein B. So,
again is Protein B really there?
Or what's the likelihood?
Of course we have to compute probabilities for this.
And I would just say I'm always recruiting new people to study this problem
because I think there's less work on this problem than some
of the other problems in proteomics.
So, then here's a real situation
where you have all these different peptides matching to many different proteins
and which proteins are actually there, which ones aren't there.
So, the algorithm that we're currently using is called IDPicker
and it tries to do this in the most parsimonious manner.
So that you're not coming up with false positive proteins
that are actually not in your sample.
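
One way to picture parsimonious assembly is as a set-cover problem: pick the fewest proteins that explain every observed peptide. The sketch below uses a plain greedy heuristic for that. It is an illustration of the parsimony principle only and is not the algorithm the named tool actually implements; the protein and peptide labels are invented.

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedily choose proteins until every observed peptide is explained.

    `protein_to_peptides` maps a protein name to the set of observed
    peptides it could have produced. Ties are broken alphabetically.
    """
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Pick the protein explaining the most still-unexplained peptides.
        best = max(sorted(protein_to_peptides),
                   key=lambda name: len(protein_to_peptides[name] & uncovered))
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen

# Protein B's peptides are a subset of Protein A's, so B is not needed.
candidates = {
    "A": {"pep1", "pep2", "pep3"},
    "B": {"pep2", "pep3"},
    "C": {"pep4"},
}
print(parsimonious_proteins(candidates))
```

Note that leaving Protein B off the list does not prove it is absent; it only means the observed peptides do not require it.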
So, again, that's just a plug for more work to be done on the protein assembly piece.
All right. Well, let's move on to talking a little bit about the data.
First, I mentioned that we had this reproducibility issue in proteomics in the 2000s.
And so, that hasn't entirely gone away and so we--
at CPTAC we wanted to maintain our status as caring a lot
about reproducibility and high-quality data.
So, what we did for the first year of the second phase of the program
in 2011 was qualify labs and platforms to be able
to have access to the TCGA samples.
And so, the way we did that was using some xenograft breast tumors
from the University of-- the Washington University in Saint Louis
where they took human breast tumors and grafted them into mice,
and then you can grow up large lots
of breast tumor material that's quite homogeneous.
And so we used that as our performance mixture.
So, every lab within CPTAC had to run this and the consortium agreed, OK,
yes these labs are ready to receive the samples
and these labs still need a little bit more work
and we had some hard conversations but eventually everybody got there.
So, we're happy about that.
And then we continued to use this material, running it interspersed
among the TCGA tumors as we ran them for proteomics.
So, every 5th sample for the colorectal set was one of these xenograft samples.
And so, I think it took about nine months to get through 100 tumors.
And so, you could have instrument variability of course over those nine months.
But we're able to look at, you know, the sample that we ran
the first week of the xenograft and the sample that we ran nine months later
and see how much drift there had been in the instrument.
There is also-- yes, so it was a good--
a good normalization method for the platform itself.
This is the result of those CompRef samples across the sample set.
And it's a big complicated slide with a lot of numbers but the main thing
to show-- there were two types of samples.
One was a basal breast tumor and the other was a luminal breast tumor.
And so, we have very high correlation for the luminals of about 0.85
and with the basals-- to basal it was 0.88.
And if you do the correlation across the two samples it was 0.68.
So, the basal is matching really well with the basal
and luminal is matching really well with the luminal.
So, we're pretty convinced that the platform is quite stable
over the analysis time for these samples.
OK. So, let's talk about the tumors themselves.
So, there are colorectal tumors, about 55 percent male, 45 percent female.
Most of them came from the colon rather than the rectum,
about two-thirds colon, one-third rectum.
At that time most of the people were still living.
So, this is the-- the hypermutated status.
Most of them-- most of the tumors were non-hypermutated.
But about-- what's that, 20 percent were hypermutated.
And also within-- colon cancer there's this microsatellite stability
versus instability.
And so, here's the breakdown of that where MSS is stable.
Microsatellite instability low is this yellow region here and the high,
MSI high is the red region there.
And so then you can see the breakdown of stages of tumors also.
So, all in all we had 90 colorectal samples.
There were 95 total samples but five of them were duplicates,
and those duplicates looked quite reproducible as well.
So, we run these through the mass spectrometer.
We've got a lot of data, millions and millions of spectra.
And from those when we identified the peptides we came up with 94,000 distinct peptides.
And that's at a 0.01 percent false discovery rate.
So, I guess that means like one in 10,000 is likely to be a misidentification.
So, we're looking at maybe 9 of those 94,000 that on average would be false.
We roll up those 94,000 peptides, you get to about--
we find 7,500 different protein groups.
So, that's-- a protein group is a group of proteins that's indistinguishable,
which I talked about earlier.
So, there could be more than that, but that's--
that's how many we can find that are distinguishable.
Now, compared with what's typically done in proteomics,
this 0.01 percent FDR is quite stringent.
So, since we found these proteins then we went back to the data and looked
and said, "Well, what other peptides map to those proteins?"
So, we're not introducing any new proteins here.
But if we have peptides that didn't quite meet that high threshold
but they did-- they do map to the proteins that we found.
It's pretty likely that they would be there.
So, we allowed the false discovery rate to relax to one percent which is kind
of the gold standard in proteomics today.
And so, when we do that, then we get 124,000 distinct peptides.
Still the 7,500 protein groups and these map back to 7,200 genes.
OK. Now, we can't quantify all of those proteins
because in some cases we just have a few spectra but for the ones
where we have more spectra, using the spectral counting method, we can
quantify about half of those, 3,900 quantifiable proteins.
So, that's the dataset we're working
with when we do the quantification analysis that's coming up.
OK. So, let's shift gears now and talk about how we do proteogenomics.
There's a lot of different approaches to proteogenomics,
I'm just going to talk about one approach today.
But it really has to do with what we put in this FASTA database.
So, if we're using the RefSeq database then we're limiting ourselves to kind
of the reference human genome.
But for this sample set TCGA has done the DNA sequencing,
they've done the RNA sequencing.
So, we may as well use the variants that they've found
and the different splice variations and things
and put those into the FASTA database.
So, that's what our investigators at
New York University, David Fenyo and Kelly Ruggles, have done.
So, they've taken the non-tumor sample,
the germline sequence, and identified the variants,
and put those into the FASTA protein sequences.
They do the same mainly with the RNA-Seq.
When you-- if you take all the variants from the DNA, then again you--
you know, you've got to be careful.
You're expanding your search space quite a bit.
You're going to open yourself up to a lot of false positives.
So, at this preliminary stage,
we're just looking at the variants within the RNA-Seq
and putting those into the FASTA database.
And if it's not in the FASTA database, we're not looking for the peptides
in a way that we'll find them.
So, that's our limitation. OK.
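
As a hedged illustration of customizing the search database, the sketch below adds a single amino acid variant of a reference protein as an extra FASTA entry. The gene name, sequence, variant, and helper names are invented for the example; the actual pipeline used by the investigators is not described in this talk at that level of detail.

```python
def apply_variant(sequence, position, ref_aa, alt_aa):
    """Return `sequence` with one amino acid substituted (1-based position)."""
    if sequence[position - 1] != ref_aa:
        raise ValueError(f"expected {ref_aa} at position {position}")
    return sequence[:position - 1] + alt_aa + sequence[position:]

def to_fasta(entries):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in entries)

# Hypothetical reference protein and a variant predicted from RNA-Seq.
reference = ("GENE1", "MKTAYIAK")
variant_seq = apply_variant(reference[1], 4, "A", "V")
fasta_text = to_fasta([reference, ("GENE1_A4V", variant_seq)])
print(fasta_text)
```

Only peptides derivable from entries like these can be matched by the search, which is the limitation noted above.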
So, then with that they look for alternate splicing,
they look for novel expression,
fusion genes, and of course single amino acid variants.
So, we'll look at the variants first.
So, on the left are the variants that are predicted from the RNA-Seq
which you can see is on the order of 15 to 25,000.
The blue are the germline variants, the purple are the somatic variants
and then on the right are the ones that we found so far in the proteomics data
which are on the order of 50 to 100 variants.
So, that's actually pretty common for human proteomics:
about one percent of the peptides you find tend to be variants,
and 99 percent are non-variants.
Here's another way to look at how these variants break down.
So, the TCGA somatic variants, there's 64 peptides aligned with those.
We also looked across all the somatic mutations in cancer,
that's the COSMIC database and there's 101 of those.
And then the other variants are in dbSNP,
most of them show up there, some 550 or so.
So, here's on the bottom left, this Venn diagram, you can see the--
how they overlap between those three different databases.
And then most of the variants that we see are only in one sample.
We do have some that show up in 2 to 10 and the variants that show up in more
than 10 samples tend to be from the dbSNP.
So, they've been identified before.
OK. Now, we'll talk about the splice variants that we found.
Now, there's a few different ways to look for splice variants.
You can have unannotated splice variants where you just skip an exon
and go right to the next exon.
You can have partially novel expression where you pick up in between exons
or in the middle of an exon.
Or you can have completely novel expression where both the upstream
and the downstream translation are starting outside
of an exon-- of a known exon.
And of course, we have fusion events where two exons are coming together
from different chromosomes.
And we already talked about variants.
So, here is the alternative splicing that we found in the colon.
So, again, there's 10 to 15,000 predicted overall from the TCGA analysis.
And so far, we've only-- we haven't found more
than one alternative splicing event in any sample except
for this one here where we found two.
So, you can see we're finding, you know,
six unique peptides that are splice peptides
for partially alternative splicing.
For unannotated, we find 7 different peptides.
For completely novel, we found 23 peptides that have that but never more than one in a sample.
So, I think there's a few explanations that could explain the discrepancy here.
One is alternative splicing might really interrupt the translation to proteins,
and so many of the genomic splice events are just not translated into protein.
Another is that the splice peptides could be very low-abundance
peptides and the mass spectrometer isn't sensitive enough to detect them.
And the third, of course, is that we're somehow not looking in the right place or looking in the right way.
And so, these are preliminary data.
So, I think we'll-- this is still kind of an unsolved mystery,
but I think we'll get to the bottom of it as more and more people look into this.
OK. And then the fusions: there are very few fusions that were identified
at the genomic level, and so far at the proteomic level
we have looked, but we haven't found any yet.
All right.
Well, now we're going to go on and talk about the proteomics data and not focus
on the proteogenomics as much. I've talked about this a couple of times before
and somebody always wants to know about the correlation
between the protein expression and the mRNA expression.
And so, this is what we've done to answer that.
There's a couple of different ways you can correlate the protein with the RNA.
You could get a Spearman correlation coefficient,
you know, across an entire sample.
And so, that's what's represented here on the left.
And when you do that, there's the probability density and the mean is about 0.47.
You could also make protein RNA pairs
across the entire sample set and rank them that way.
And when you do that, you know, now you're talking, I guess--
was it, 4,000, 3,900 proteins that were quantified.
And when you do that, we get a mean correlation coefficient of closer to 0.23.
And some-- when you do that,
you do find some negative correlation for some pairs.
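
The two ways of correlating protein with RNA can be sketched in plain Python. The toy numbers, gene labels, and helper names below are invented; the point is only to show the difference between correlating across genes within one sample and correlating one gene's protein and RNA values across samples.

```python
from statistics import mean

def ranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy data: each gene maps to values across four samples.
rna = {"G1": [1, 2, 3, 4], "G2": [4, 3, 2, 1], "G3": [2, 1, 4, 3]}
protein = {"G1": [10, 20, 30, 40], "G2": [1, 2, 3, 4], "G3": [5, 3, 9, 7]}
genes = sorted(rna)

# Gene-wise: one coefficient per protein/RNA pair, across samples.
gene_wise = {g: spearman(protein[g], rna[g]) for g in genes}

# Sample-wise: one coefficient per sample, across genes.
sample_wise = [spearman([protein[g][s] for g in genes], [rna[g][s] for g in genes])
               for s in range(4)]
print(gene_wise, sample_wise)
```

In this toy set the second gene is perfectly anti-correlated across samples, which mirrors the observation that some pairs show negative correlation.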
So, I don't know if we learn that much from the average correlation.
I think it's clear that in some cases, protein and RNA correlate well,
in other cases, they don't correlate so well.
So, what's more interesting is whether we can identify the classes of proteins and RNA
that correlate well, and the classes of genes that don't correlate well.
So, that's what this slide is about.
And what you can see is that a lot of the metabolic pathways have genes that tend to correlate well.
The arginine and proline metabolism, butanoate metabolism,
but then when you get to pathways that are much more involved
in protein production or oxidative phosphorylation,
that's when we see a lot more of these pairs that are anti-correlated.
So, again, that's preliminary, I think there's a lot more work to be done there
but it's certainly something I think a lot of people would be interested to look into.
OK. And then the kind of the culmination of the proteomic work is to look
at subtyping and can the proteomics--
when we subtype that do we get any new subtypes
that are not in the genomics data?
So, from TCGA, they identified three transcriptomic subtypes
which are called MSI/CIMP, Invasive, and CIN.
So, here's that which was kind of the feature figure
in the Nature article from 2012.
So, when we do the proteomics, we get about 77 of the samples cluster
into these five different clusters here, A, B, C, D, and E. OK.
And you can see how those map onto the TCGA transcriptomic subtypes,
and especially how the MSI/CIMP, which has a lot of the hypermutated subtypes,
kind of breaks apart into the B subtype
and C subtype at the proteomic level.
And then-- so, that's what is shown here.
Now, as I mentioned, a lot of the methylation subtypes show up in the subtype B
but not so much in subtype C. So,
we kind of teased that out, which is interesting.
Also, subtype B has a lot of the common mutations, the POLE mutation,
the BRAF mutation, but it isn't enriched
in the p53 mutation or the 18q loss.
Those show up much more in proteomic subtype E. And then this is just showing how it maps to some other clustering analyses
that have happened in the TCGA data from Sadanandam and De Sousa.
So, we don't fully know what this means clinically or biologically even
but it's certainly interesting and we're following
up to see how this plays out in other sample sets.
OK. So, that's the preliminary data from the colorectal dataset.
Now, I'm going to finish up here by talking about some of the other resources
that we have available through CPTAC, and really how we're distributing that information and, you know,
CBIIT is certainly interested in distributing resource information.
And so if you have any comments or advice, we'd be happy to hear that.
OK. All right. So-- yeah.
So, the main thing we're going to talk about is these three portals.
We have an antibody portal, an assay portal that's probably going to go live
in the next couple of weeks and then we have a data portal
where you can download the raw data of what I showed earlier this morning.
So, first, we'll talk about the data portal.
This-- you can get to this from proteomics.cancer.gov and then click
on the tab that says "Data Portal."
And this is where we released our raw mass spec data and the peptide-spectrum matches.
So, that's where it is right now.
The protein assembly is in the works, it's coming but we want to make sure
that it's robust before we release it publicly.
Also, the breast cancer data is coming soon.
It's just about through quality control.
That should be released in the next couple of weeks.
And I'll just throw out a teaser that that has phosphoproteomics data
which the colon data does not.
And our friends at ESAC and Georgetown managed this contract, so they oversee that.
So, right now, we've got about 11,419 raw data files in 2.2 terabytes.
That's going to increase by about 30 percent once the breast data comes out in the next few weeks.
We have generated some global interest and when we released the colon data,
people downloaded that like crazy.
But I think mainly the proteomic informatics community.
And then we have this NCI Antibody Portal which has been around for I think five
or six years now and we have about 280 antibodies that are linked--
that map to some 85, 90 antigens.
And what we're showing here is characterization data for antibodies
that we've developed through contracts at NCI.
And then you can acquire the antibodies through a third party
which is the Developmental Studies Hybridoma Bank at the University of Iowa.
Last year, we had 40,000 unique visitors,
which is really good for one of our websites.
And the trick was to get these links
into something called LinkOut which is part of PubMed.
I'm kind of happy about that, I figured that out.
So, the-- let's see-- yeah.
So, you can query, you can sort the antibodies as you like and we really want
to get people to the antibodies as quickly as possible.
So, we don't have a lot of fluff on the website.
Here's the different characterization methods that we do.
Western Blot, ELISA, SPR for kinetics, Immunohistochemistry,
we run it against the reverse phase array of the NCI60 cell lines,
we do SPR Pairing and Immuno-MS spec.
And the key I think is making all these characterization data available via the portal.
And then lastly, I don't have any slides but this assay portal is taking
that quantitative mass spectrometry assay and doing the same thing that we did
for the antibody portal except
in this case we have characterization data of mass spec assays.
And this is kind of a new resource and we're getting that out to the community.
I think 2012, the targeted mass spec was the Nature Methods, Method of the Year.
So, there's quite a bit of interest in that method now.
All right, so that's the main content that I had, I just wanted to--
I'm always looking for new people to come in to proteomics informatics space
and I think here are some of the exciting challenges that we have before us today.
So, the first, as I mentioned, is doing parsimonious protein assembly.
I think we need some, just really clever ways of doing that to improve that.
Secondly, I didn't talk about it too much but capturing
and displaying quality control data especially in these large datasets, you know,
people want the data but it's really important to understand the quality of the data underneath that.
And as a community, we haven't yet figured out how to succinctly display that in an understandable way.
Next, when you're combining a proteomics dataset
which has one false discovery rate with a genomic sequence dataset
which might have different errors associated with it,
how do you bring those different errors together and work that out?
I think that has yet to be worked out.
And then just generally, the integrated distribution resource
of multi-omics data, you know,
TCGA has a data portal, CPTAC has a data portal.
To get the integrated set, you have to do a fair amount of work to track down all that data.
So, I think CBIIT is certainly interested in facilitating that.
And then finally, I think there's probably new ways to search
for genomic aberrations in the protein data and there's probably a lot to learn there.
So, in conclusion, I hope that I have persuaded you
that proteomic technology is able to extensively characterize tumors,
that the proteomic analysis provides distinct information about tumors,
and that much work remains on the integration
of proteomic data with other datasets.
So finally, I'll acknowledge the people that did the work,
much of this work was done by David Fenyo and Kelly Ruggles,
the proteogenomic analysis, and then Dan Liebler
and his team at Vanderbilt developed the dataset.
So, then the rest are just the other investigators associated with CPTAC
and you'll be seeing much more from them in years to come.
So, I thank you for your attention and I'm happy to answer any questions.