Male Speaker: -- talk about uncovering the pseudo-subclonal
structure of tumor samples with copy number variation of next-generation sequencing data.
Thank you.
Yi Qiao: First I'd like to thank the organizers for
granting me such a wonderful opportunity to share with you some of the work I have been
doing since I joined the Gabor Marth Lab as a new graduate student. As you are probably
all well aware, a tumor sample is always going to be mixed with a certain amount
of the surrounding normal tissue, and if you sequence the sample, what you sequence
is going to be a mixture of DNA from both the tumor and the normal.
We started this project to look at the mixture ratio of normal in any
sequencing data by looking at the copy number information that you can gain
from the BAM file, which is pretty straightforward to do.
You can count the read depths in, for example, in this specific example, a 10 kb moving
window of the tumor BAM file. You can also count the same thing
in the paired normal. If you calculate the ratio between the tumor
and the normal read depths, then the ratio is going to capture the somatic copy number
variants that are happening in the tumor sample. If you plot a histogram of the ratio,
and you presume that the sample which is being called a tumor
is a pure homogeneous sample with some heterozygous
deletion events, then, aside from the large peak around one,
which corresponds to the copy number neutral regions, you would expect a peak around 0.5,
which corresponds to the heterozygous deletion events.
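
The windowed depth ratio described above can be sketched as follows. This is a minimal illustration, not the speaker's code: it assumes read start positions have already been extracted from the tumor and normal BAM files, uses non-overlapping 10 kb bins instead of a moving window, and rescales by the median ratio (assuming most windows are copy number neutral) so that the neutral peak sits at one. The names `window_counts` and `depth_ratio` are hypothetical.

```python
import statistics

def window_counts(read_starts, chrom_length, window=10_000):
    """Count reads per fixed-size window, keyed by read start position."""
    n_windows = (chrom_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1
    return counts

def depth_ratio(tumor_counts, normal_counts):
    """Tumor/normal depth ratio per window, rescaled so the median is 1.

    Windows with no normal reads get None. The median rescaling is an
    assumption: it only works if most windows are copy number neutral.
    """
    raw = [t / n if n > 0 else None for t, n in zip(tumor_counts, normal_counts)]
    median = statistics.median(r for r in raw if r is not None)
    return [r / median if r is not None else None for r in raw]

# A pure tumor with a heterozygous deletion in the third window.
ratios = depth_ratio([200, 200, 100, 200, 200], [100, 100, 100, 100, 100])
print(ratios)
```

A histogram of `ratios` over a whole chromosome is the plot the speaker refers to.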
However, what is interesting is that we actually observed the peak at 0.7, which
means the peak has been pushed towards where the copy number neutral ratio resides. And
since a cell cannot really have a non-integer copy number, the way to interpret
this is to think of the original sample as a mixture of both a tumor clone and the contaminating
normal clone. Based on a simple linear combination, you can actually solve for the contamination
ratio in this case. For example, if the tumor purity is 80 percent, that
will shift the line up, which is what we are observing in this case. If you locate the peaks and
identify where copy number two and copy number one are, which are shown as the black
and the blue lines in this slide, you can actually calculate the contamination ratio.
Using that ratio again with the location of the copy number neutral line, you can actually predict
where copy number three and four, in this tumor, will appear
in this histogram. And that fits the data extremely well. If you overlay these lines
on the plot of ratio versus chromosome location, you can see that these lines
explain perfectly some of the segments that you're observing here.
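
The "simple linear combination" can be written out as a short sketch. It assumes a diploid normal, a ratio scaled so that copy number two sits at one, and a single tumor clone; the function names are hypothetical. Under those assumptions the expected ratio for tumor copy number c at purity p is (p*c + (1 - p)*2) / 2, so a one-copy-loss peak at 0.7 implies a purity of about 60 percent, and a purity of 80 percent would put that peak at 0.6.

```python
def expected_ratio(copy_number, purity, normal_cn=2):
    """Expected tumor/normal depth ratio for a mix of tumor and normal cells."""
    return (purity * copy_number + (1 - purity) * normal_cn) / normal_cn

def purity_from_peak(peak_ratio, copy_number=1, normal_cn=2):
    """Invert expected_ratio: purity implied by a peak of known copy number."""
    return normal_cn * (peak_ratio - 1) / (copy_number - normal_cn)

purity = purity_from_peak(0.7, copy_number=1)
print(round(purity, 3))                      # purity implied by the 0.7 peak
print(round(expected_ratio(3, purity), 3))   # predicted copy number 3 line
print(round(expected_ratio(4, purity), 3))   # predicted copy number 4 line
```

Once the purity is fixed from one peak, the positions of the other copy number lines follow with no further free parameters, which is why their agreement with the histogram is a meaningful check.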
So we were really excited about the results, and when we applied the same technique to all the
other chromosomes from the same sample (this is only chromosome 19 of a specific TCGA sample),
we observed something quite odd. Different chromosomes appear to have different tumor
purity. And they tend to cluster together, with very large differences among clusters
but very small differences within a cluster. Now, this can't happen,
because you can only mix cells at a cellular level, so how come different chromosomes have
different contamination ratios? Unless there is more than one subclone
in the tumor. And that's where the tumor
heterogeneity comes in. We modeled this as a hierarchical subclone structure where the
most prevalent mutation should appear in all the tumor, de-masked [spelled phonetically]
tumor cells, which leaves out, for example, 20 percent, which would be the normal contamination.
But the less represented mutation will exist in a subset of the tumor subclones. And at
those locations, the cells that don't have that minor mutation will act as if they were
normal, and that is why you're observing different tumor purity at different chromosomes.
And the same applies for the even more minor copy number changes. And so this is the result.
To be able to reconstruct the entire structure algorithmically, we came
up with a way to approach this problem from the
bottom up: look for the most diverged subclone in the sample first and then make your way
up. So for example, if you have observed the ratio data which is shown
with the white lines, then you can initialize a model which consists
of only the normal genotype, which is presumably 100 percent copy number two. Assuming
that this is your tumor sample, you can actually predict where the ratio would be. And granted,
that is going to be one all over the place. You can then calculate the differences
between the model and the actual data, and you can come up with numbers
that represent the contamination ratio. In this case they are different from each other.
So what you would do is break your normal clone into two subclones.
The right leaf is still going to represent the normal subclone, but the left leaf is
going to contain the mutations, snapped to an integer copy number, that explain the differences
that you observed in the actual data, along with the corresponding tumor purity.
One thing to notice: the most diverged cell will have both the more
prevalent and the less prevalent mutations. So when you initialize the copy number profile
of the smaller tumor subclone, you have to account for all the differences that you are seeing
here. With this model you can update your predicted values. And as you can
see, one of the segments is going to be explained perfectly. Then you can update the numbers and
pretty much do the same thing again, but each time you always break the normal
subclone to account for the part that has not yet been explained, until
your model fits your data well.
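
The iteration described above can be sketched roughly as follows. This is a simplified reading of the talk, not the speaker's implementation: it assumes every altered segment is a single-copy gain or loss (the talk snaps to integer copy numbers more generally), a diploid normal, and one ratio value per segment. The name `decompose` and the tolerance `tol` are hypothetical.

```python
def decompose(ratios, tol=0.01):
    """Greedy bottom-up split of the normal clone (single-copy changes only).

    Returns (subclones, normal_fraction), where each subclone is a tuple
    (cell fraction, copy number per segment). The first subclone is the
    most diverged one and carries every alteration.
    """
    n = len(ratios)
    predicted = [1.0] * n      # normal-only model: ratio 1 everywhere
    normal_fraction = 1.0
    subclones = []
    for _ in range(n):         # at most one new subclone per segment
        residual = [r - p for r, p in zip(ratios, predicted)]
        unexplained = [i for i in range(n) if abs(residual[i]) > tol]
        if not unexplained:
            break
        # With one-copy changes, a residual d implies a cell fraction 2*|d|.
        # The smallest such fraction is the share of the most diverged cells.
        fraction = min(2 * abs(residual[i]) for i in unexplained)
        copy_numbers = [2] * n
        for i in unexplained:
            copy_numbers[i] = 3 if residual[i] > 0 else 1
        subclones.append((fraction, copy_numbers))
        normal_fraction -= fraction   # the new leaf is carved out of the normal
        predicted = [p + fraction * (cn - 2) / 2
                     for p, cn in zip(predicted, copy_numbers)]
    return subclones, normal_fraction

subclones, normal = decompose([0.75, 0.85, 1.0])
for fraction, copy_numbers in subclones:
    print(round(fraction, 3), copy_numbers)
print("normal:", round(normal, 3))
```

Each pass fully explains at least the segment with the smallest implied fraction, so the loop needs no more passes than there are segments.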
We can actually devise a score here, which is pretty much the sum of the absolute values
of the differences between the model and the actual data. And this entire thing is iterated
over and over again until the score does not improve anymore. And if you backtrace the
leaf nodes, that is exactly what you have seen in my previous slides, which is the structure
that we learn from this algorithm.
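
As a small sketch of that score, assuming a diploid normal and one ratio value per segment (the names `predict_ratios` and `model_score` are hypothetical), the model's predicted ratios are compared with the observed ones by summing absolute differences:

```python
def predict_ratios(subclones, n_segments, normal_cn=2):
    """Predicted ratio per segment for a list of (fraction, copy_numbers)."""
    predicted = [1.0] * n_segments
    for fraction, copy_numbers in subclones:
        for i, cn in enumerate(copy_numbers):
            predicted[i] += fraction * (cn - normal_cn) / normal_cn
    return predicted

def model_score(observed, predicted):
    """Sum of absolute differences between model and data (lower is better)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted))

observed = [0.75, 0.85, 1.0]
models = [
    [],                                        # normal only
    [(0.3, [1, 1, 2])],                        # one tumor subclone
    [(0.3, [1, 1, 2]), (0.2, [1, 2, 2])],      # two tumor subclones
]
for model in models:
    print(round(model_score(observed, predict_ratios(model, 3)), 3))
```

The score drops with each added subclone in this toy case, and the iteration would stop once another split no longer lowers it.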
However, it is important to keep in mind that given one set of ratio data, there is always
a chance that more than one actual biological structure corresponds to the
same ratio observation you have. For example, the model on the left and the model on the
right. We don't think that it is possible to distinguish between these two models based
on the copy number ratio data alone. So the algorithm chooses to produce the model
on the right, just because we believe it is a better representation of the actual
biological process. And as many of you who talked to me yesterday during the poster session
have mentioned, this looks strikingly like what we expect when there is a cancer stem
cell: they sort of aggregate mutations themselves, but along the way they produce
cells that constitute the entire tumor.
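
A toy calculation shows why ratio data alone cannot separate such models. The two models on the slide are not reproduced in the transcript, so this sketch assumes the contrast is between a nested structure (one subclone carries both deletions) and a non-overlapping one (each deletion sits in a different subclone); the fractions are made up for illustration.

```python
def predict_ratios(subclones, n_segments, normal_cn=2):
    """Predicted ratio per segment for a list of (fraction, copy_numbers)."""
    predicted = [1.0] * n_segments
    for fraction, copy_numbers in subclones:
        for i, cn in enumerate(copy_numbers):
            predicted[i] += fraction * (cn - normal_cn) / normal_cn
    return predicted

# Nested: 20% of cells lost a copy of segment A only, 30% lost A and B
# (50% normal cells).
nested = [(0.2, [1, 2]), (0.3, [1, 1])]

# Non-overlapping: 50% lost a copy of A, a different 30% lost a copy of B
# (20% normal cells).
disjoint = [(0.5, [1, 2]), (0.3, [2, 1])]

print([round(r, 3) for r in predict_ratios(nested, 2)])
print([round(r, 3) for r in predict_ratios(disjoint, 2)])
```

Both models predict the same ratios for segments A and B, even though they imply different cell populations.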
And that's pretty much the method. With these simulations, if you start with a mixture of 40 percent
normal and 60 percent tumor, with the copy number profile as shown here in the
tumor sample, the method is able to predict 42 percent normal and 58 percent tumor,
which is fairly close. We have incorporated a small error term in there that accounts for
the error that you observe in actual sequencing data, so that's why the numbers are
a little bit off, but overall it's doing a pretty good job.
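
A toy version of such a simulation, which is not the speaker's setup, might look like this. It assumes a single tumor clone at 60 percent purity, Gaussian noise on each window's ratio, and that the copy number of the segment used for the estimate is known; the names, the noise level, and the seed are all hypothetical.

```python
import random
import statistics

def simulate_ratios(purity, copy_numbers, windows_per_segment=500,
                    noise_sd=0.02, seed=0):
    """Noisy per-window ratios for each segment of a tumor/normal mixture."""
    rng = random.Random(seed)
    segments = []
    for cn in copy_numbers:
        expected = (purity * cn + (1 - purity) * 2) / 2
        segments.append([rng.gauss(expected, noise_sd)
                         for _ in range(windows_per_segment)])
    return segments

def estimate_purity(segment_ratios, copy_number):
    """Purity implied by the median ratio of a segment of known copy number."""
    peak = statistics.median(segment_ratios)
    return 2 * (peak - 1) / (copy_number - 2)

segments = simulate_ratios(purity=0.6, copy_numbers=[2, 1, 3])
print(round(estimate_purity(segments[1], copy_number=1), 3))
```

With noise in the ratios, the recovered purity lands near the simulated value without matching it exactly, which is the kind of small offset the speaker describes.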
If you change the simulation profile to include a third tumor subclone, the result is comparable,
like 20 percent of one subclone and 57 percent of the other subclone, whereas you have 23
percent normal. And we applied the method to some actual TCGA data. This is the
result that the algorithm produces from the data that I showed you earlier in the chromosome
pileup figure, and it looks pretty close to the one that we came up with by hand. This
is specifically ovarian cancer. This is another example, from glioblastoma, and
you can see that there is much higher heterogeneity compared to ovarian cancer. And there are
a bunch more examples from glioblastoma.
So, just to give you a taste of what the program looks like. If you zoom in on the part where
the actual result is, I don't know if you can see, but the purple line represents
the observed ratio values after genome segmentation. So that's pretty much representing
the somatic copy number events in your sample. And the green line,
the dark green line, is what the final model is predicting, and they look pretty close.
There are even segments where you only see the green line; that just means the green line
is perfectly overlapping the purple line.
And, yes, in conclusion, the method is able to simultaneously
estimate both the normal cell contamination ratio and some measure of the tumor
heterogeneity, both of which I think will be very important for any downstream analysis.
And the algorithm is pretty fast. In the worst-case scenario, if you have as many different copy
number states as the number of chromosome locations you investigated, then at most
you will need that number of iterations to explain the entire data. The actual rate-limiting
step is counting the read depths. It takes roughly a day and a half to count
the read depths of whole genome sequencing data with 40 times medium coverage, and it takes
roughly two minutes to come up with the subclone structure.
And the information can be used as a prior for any downstream analysis. For
example, if you know that you have a certain percentage of normal contamination, you might require
less evidence to call a variant in your tumor sample. The method is actually designed to
be independent of the [unintelligible]. In this case I implemented my own very rudimentary
[unintelligible], but if you have your favorite [unintelligible] you can just plug it in and
it takes [unintelligible] the segmented genome. The model it produces
is a biologically motivated model, but there are definitely other possibilities. It represents
a class of possible combinations based on copy number data alone.
Which brings us to the future directions. Something I didn't say
here, but it might be helpful to incorporate sequence data to try to differentiate between
the case where you have no overlapping mutations and the case with overlapping mutations. Another
thing that we want to do is validation; since we're a bioinformatics lab, we are actually
looking for collaboration with [unintelligible] groups to do some validation
work. And the current examples that I showed up there were tested on whole genome sequencing
data, and the next [unintelligible] to test and probably make modifications to make these
algorithms work with capture data.
And -- yep -- I'd like to thank everybody in the lab. I'm pretty new to this lab and
everybody has shown a tremendous amount of support for my work. And I'd also like to thank TCGA
for the opportunity both to get my hands on some excellent data and
to present my work to the community. Thank you.
[applause]
Male Speaker: Questions? I'll start out with one question.
Very exciting, very nice work. How are we going to cope with representing the uncertainty?
You gave a good example where there were a couple of different interpretations --
Yi Qiao: Yes --
Male Speaker: -- of the data, and I think we're struggling
with that at Santa Cruz as we do similar work. We don't want to present one machine solution
as the absolute truth when we're -- when we're actually quite uncertain as to what the -- how
the data should actually be interpreted. Do you have any thoughts on --
Yi Qiao: Well, I sort of think about that --
Male Speaker: -- how to --
Yi Qiao: -- for example, given the example when I sort
of breaking--
[talking simultaneously]
Male Speaker [unintelligible]
Yi Qiao: Okay. So I was sort of breaking a tree
structure --
Male Speaker: Yeah.
Yi Qiao: -- and most of the time there are going to be
different ways to break the left node and the right node with equally the same, you
know, goodness of explaining the data. And I think there might actually not be a way
to differentiate which one is correct --
Male Speaker: Right.
Yi Qiao: -- based on the copy number data alone.
Male Speaker: Absolutely. We will -- that's the point, we
will have uncertain situations where there's no one solution that we're confident in. There
could be other equally good solutions. So how can we communicate to the biological community
that fact?
Yi Qiao: We should first of all make that aware --
Male Speaker: Yes [laughs]--
Yi Qiao: -- announcing [spelled phonetically], and
then, most of the time I don't think that is going to matter. For example, it might
be equivalent if, for example, a location where there is a homozygous deletion --
Male Speaker: Yes --
Yi Qiao: -- but only 50 percent of the cells have that,
versus a heterozygous deletion where all the cells have that. If you break the genome into
pieces and sequence it with the technology that we have nowadays, then there might not be
a difference at the DNA level.
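
The equivalence in this answer can be checked with a two-line calculation. This is a minimal sketch assuming a diploid background and a depth ratio scaled so that copy number two is one; `mixture_ratio` is a hypothetical name.

```python
def mixture_ratio(fraction, copy_number, normal_cn=2):
    """Depth ratio when `fraction` of cells carry `copy_number` copies."""
    return (fraction * copy_number + (1 - fraction) * normal_cn) / normal_cn

homozygous_in_half = mixture_ratio(0.5, 0)    # both copies lost in 50% of cells
heterozygous_in_all = mixture_ratio(1.0, 1)   # one copy lost in every cell
print(homozygous_in_half, heterozygous_in_all)
```

Both cases give the same depth ratio, so read depth alone cannot tell them apart.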
Male Speaker: Right, but there's a very important difference
at a biological level --
Yi Qiao: Yes --
Male Speaker: -- and so --
Yi Qiao: -- yes.
Male Speaker: -- we need to know that until we --
Yi Qiao: Yeah, we might --
Male Speaker: Okay.
Yi Qiao: -- need to incorporate other data then.
Male Speaker: All right. So it's an issue. Any other questions?
Male Speaker: So, actually I have sort of related question
to [unintelligible]. So their data means [spelled phonetically] there's alternative scenario,
but suppose there's only one scenario. Do we [unintelligible] also [unintelligible]
confidence interval for the percentage of different tumors?
Yi Qiao: Yes, that is actually the thing that I
want to work on. Currently it is based on a linear model, but it might need to be changed
to incorporate a statistical solution. For example, what is a percentage that -- I think
the confidence is going to play more of a role in determining the actual
integer copy number in a subclone, and less of a role
in the actual ratio. So that's where the confidence is going to play an important role.
Male Speaker: Thank you.
Male Speaker: Okay, let's thank Yi again.
Yi Qiao: Okay --