Male Speaker: And Bahram will talk about the TCGA computational
histopathology pipeline.
Bahram Parvin: Well, thank you very much for the invitation.
This is a joint collaboration between Hong Chong Ju Hau [spelled phonetically], Kim Aulbulg
[spelled phonetically], and Jerry Fontinae [spelled phonetically] and Sandy Borowski
[spelled phonetically], Key [spelled phonetically], who is a pathologist, Joe Gray [spelled phonetically],
Paul Spillman [spelled phonetically], and I'm Bahram Parvin.
So, as you know, the TCGA system actually is outputting a number of tissue sections
as well, the frozen embedded, paraffin embedded tissue sections. These tissue sections are
often quite large, maybe about 40,000 pixels by 40,000 pixels upward
to 100k by 100k. So, they're downloaded from the NIH repository. Because they're very large,
they're broken up into blocks of 1k by 1k pixels. And, then, every block then is analyzed
for its nuclear characteristic. So, in this case, for example, a pinhole of this tissue
section could be somewhat heterogeneous or it could be homogeneous. Having computed these
nuclear morphology and the organization, then we then compute some kind of a distribution
function, which is then put into a database system, and then normalized across tissue
sections, and then, having had this information, then we proceed to do morphometric subtyping.
And, of course, if you can do the morphometric subtyping, you can actually compute the survival
functions for each subtype. And, of course, the next step is to link this information
with the molecular data, okay, which is transcript in this case.
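The tiling step described above can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function name `iter_blocks` is hypothetical, and treating "1k" as 1,000 pixels (rather than 1,024) is an assumption.

```python
def iter_blocks(width, height, block=1000):
    """Yield (x, y, w, h) for each block covering a width x height image.

    Blocks on the right and bottom edges are clipped to the image bounds.
    The default block size of 1000 pixels is an assumption.
    """
    for y in range(0, height, block):
        for x in range(0, width, block):
            yield x, y, min(block, width - x), min(block, height - y)


blocks = list(iter_blocks(40_000, 40_000))
print(len(blocks))  # 1600 blocks for a 40,000 x 40,000 section
```

Each block can then be processed independently, which is what makes the analysis easy to spread across a cluster.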
So, let me give you a specific example. We start with a glioblastoma. There are roughly
about 146 patients, 380 tissue sections. It takes about a week of computing time to process
these images on a cluster of several hundred nodes, and one of the challenges over here
is that there is technical and biological variation. Every tissue section has been prepared by
a different laboratory, and that they are very large data sets.
So, in order to address these issues, what we've done is that we have developed a new
algorithm for characterizing tissue sections. Actually, there is a collection of algorithms
that characterize tissue sections. Of course, they need to be efficient to address
the issue of the large database, data sets, and then, having done that, we compute a number
of features and metafeatures on the data, and then, having done that, then you actually
go proceed with your feature selection or dimensionality reduction, and
then proceed with the more familiar [spelled phonetically] molecular association.
So, in order to characterize the tissue sections on a block-by-block basis, what we learned
is that because of the technical and biological variation, you need to build a database of
dictionaries. So, there are a number of reference images, and these reference images
have been hand segmented by an undergraduate student. As a result of that, we have a number
of filter responses for the foreground and background model.
So, what do these reference images look like? They could have a highly pink background.
They can have a light pink background. Sometimes the background or foreground have similar
intensity. Or, the foreground could have a dark intensity at the background or could
almost be blue as well. So, you see, there's a quite a bit of heterogeneity in this. And,
so, what's happening here is that these LoG responses, computed for the foreground
and background model, are then represented as a Gaussian mixture model.
So, when a new image comes in, first, it's normalized against all the reference images,
and then, having had this bit of information, a local prior probability model, a global
probability model, is built based on the Gaussian mixture model, the global fitness term over
here, and then the other thing that we learn is that even within a 1k by 1k tissue section,
there's a quite a bit of heterogeneity. So, we add this local model as well, and then
there's, whatever labeling that takes place, it also needs to be spatially smooth. So,
we add the [unintelligible] constraint. The net result is a cost function, which is optimized
using the graph cut formalism, and the end result is validated through geometric
reasoning.
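A cost function of the kind described above can be sketched as a data term plus a smoothness term. This is only an illustration under stated assumptions: it uses a single scalar feature per pixel, a 1-D Gaussian mixture for each of the foreground and background models, and a simple Potts penalty for neighboring pixels with different labels. It omits the local fitness term, and it only evaluates the energy of a given labeling; it does not perform the graph cut optimization itself. The names `gmm_pdf` and `labeling_energy` are hypothetical.

```python
import math


def gmm_pdf(x, components):
    """Density of a 1-D Gaussian mixture; components = [(weight, mean, std), ...]."""
    return sum(
        w * math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        for w, mu, sd in components
    )


def labeling_energy(features, labels, fg_gmm, bg_gmm, smooth_weight=1.0):
    """Energy of a labeling: negative log-likelihood plus a Potts smoothness term.

    features -- 2-D list of scalar feature values (assumed, e.g. a filter response)
    labels   -- 2-D list of 1 (foreground) or 0 (background)
    """
    rows, cols = len(features), len(features[0])
    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            gmm = fg_gmm if labels[r][c] == 1 else bg_gmm
            # Data (global fitness) term; the small constant avoids log(0).
            energy += -math.log(gmm_pdf(features[r][c], gmm) + 1e-12)
            # Smoothness term: penalize label changes to the right and below.
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                energy += smooth_weight
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                energy += smooth_weight
    return energy


features = [[0.9, 0.8], [0.1, 0.2]]
fg = [(1.0, 0.85, 0.1)]  # illustrative foreground mixture (one component)
bg = [(1.0, 0.15, 0.1)]  # illustrative background mixture (one component)
coherent = [[1, 1], [0, 0]]
scrambled = [[0, 1], [1, 0]]
print(labeling_energy(features, coherent, fg, bg) < labeling_energy(features, scrambled, fg, bg))
```

A graph cut solver would search for the labeling that minimizes this kind of energy rather than comparing two candidate labelings by hand.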
So, this is showing an example of the variation from, within a block. As you see the background
over here is more pink than down here, and this method of [unintelligible] detection
actually provides the necessary information to establish the local statistic for the local
probability map.
So, let's look at some result. Again, you see a collection of dark and less than dark
nuclear over here, nuclei over here, and the segmentation result. Here's a far more complex
example. So, now, let's move on to the representation, now that we can actually delineate a nuclear
feature. Again, there is some information that is computed from the block-by-block information.
So, these are structural features where the nuclei formation and nuclei are represented
as a graph and a number of features are represented or computed from the graph. Simultaneously,
we make [unintelligible] on a cell-by-cell basis, and you end up with a multidimensional
density function. Having had this information, we then normalize across tissue sections.
So, let's look at the morphometric data now. So, it appears that if I go through the GBM
data, maybe you end up with four different subtypes. This is strictly based on morphometric
features. And, of course, having had this bit of information, now I can go to the gene
expression data and identify a group of genes that best describe each morphometric subtype.
This is done through a sparse coding or the [unintelligible] and multivariate analysis.
And the edges between these spots basically indicate how closely those two genes predict
or infer or classify a subtype.
So, let's do some sanity [spelled phonetically] check. This is across all the patients in
the GBM data, and, okay. And then, I lost my pointer. Okay. So, as you see, just doing
a sanity check, showing that basically the four different distributions in terms of cellularity
and the nuclear area, and one of these subtypes actually shows that it has a true survival
rate based on more aggressive therapy. Now, the reason for, I need to point out that basically
there are not that many patients in the GBM data set that have received therapy that is
less aggressive, and that's why this information could be a little bit noisy.
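For readers who want to see how a survival function per subtype could be computed, here is a minimal sketch using the Kaplan-Meier estimator. The talk does not say which estimator was used, so Kaplan-Meier is an assumption, and the function name and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  -- follow-up time per patient
    events -- 1 if death was observed at that time, 0 if the patient was censored
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = 0
        removed = 0
        # Group all patients that share the same follow-up time.
        while i < len(order) and times[order[i]] == t:
            deaths += events[order[i]]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Toy data for one subtype (months of follow-up, death observed or censored).
for t, s in kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0]):
    print(t, round(s, 3))
```

Running this once per morphometric subtype gives one curve per subtype. With few patients in a treatment group, as noted above, the curves will be noisy.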
So, having, and as I said before, we actually can do the molecular association, and for
each subtype, basically, you obtain the set of genes that you subject to pathway
or subnetwork enrichment analysis. And, among these things, STAT3 signaling, as
you see over here, is highly enriched, which basically is one of the pathways that are
necessary for maintenance and activity of the GBM, and it also inhibits the apoptosis.
So, the next question that we can ask is that since we're actually making measurements on
a cell by cell basis, what can we tell about this tissue composition? What we do is essentially
instead of doing the morphometric analysis on the patient by patient information, we
break the tissue section into blocks. So, now we are independent of the patient information,
and we do the subtype based on the blocks. So, this way, we can measure the composition.
So, here's the sanity check. There are basically, yes, four different subtypes across all the
GBM data sets.
So, now, for a given patient, I can have a highly homogeneous population or a highly
heterogeneous population. And, based on that, I can compute a heterogeneity index, and then
I can plot that information into multiple quad and basically identify patients that
belong to one subtype versus other subtypes.
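The talk does not give the formula for the heterogeneity index. One plausible way to build such an index, shown here purely as an assumption, is the Shannon entropy of a patient's block-level subtype composition, normalized so that 0 means all blocks share one subtype and 1 means the blocks are spread evenly over all subtypes. The function name is hypothetical.

```python
import math
from collections import Counter


def heterogeneity_index(block_subtypes, n_subtypes=4):
    """Normalized Shannon entropy of the subtype labels of a patient's blocks.

    block_subtypes -- one subtype label per block of the patient's tissue
    n_subtypes     -- number of possible subtypes (four in the talk)
    """
    total = len(block_subtypes)
    if total == 0:
        raise ValueError("at least one block is required")
    counts = Counter(block_subtypes)
    entropy = sum((c / total) * math.log(total / c) for c in counts.values())
    return entropy / math.log(n_subtypes)


print(heterogeneity_index(["A"] * 10))            # 0.0 for a homogeneous patient
print(heterogeneity_index(["A", "B", "C", "D"]))  # close to 1.0 for an even mix
```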
So, from this plot, you can see that cellularity and heterogeneity are anti-correlated, and
this sort of makes sense. So, what are the four subtypes? Maybe something which is necrotic?
Something which has low cellularity, middle, and high cellularity, and this is during the
survival analysis. So, for these two cases, we have some ID's and P value, again, bear
in mind that there are not that many patients that received the less aggressive therapy.
But, kind of keep this in mind for the next slide. The other way we can slice the data
is just plot the data for cellularity and maybe nuclear size, and essentially divide
this population into four pieces. And this one pops out, which basically says that if
it's highly cellular and low nuclear size, they, that population of patients are basically
very small, similar aggressive therapies. So, one possible interpretation would be that,
you know, high cellularity, low heterogeneity, highly proliferative cells, and therefore
respond better to cell cycle inhibitors.
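Dividing the population into four pieces by cellularity and nuclear size can be sketched as below. The talk does not state where the cut points were placed, so splitting at the median of each feature is an assumption, and the function name and patient values are hypothetical.

```python
from statistics import median


def quadrant_groups(patients):
    """Group patient ids into four quadrants by cellularity and nuclear size.

    patients -- {patient_id: (cellularity, nuclear_size)}
    Cut points are the medians of each feature (an assumption).
    """
    cell_cut = median(c for c, _ in patients.values())
    size_cut = median(s for _, s in patients.values())
    groups = {}
    for pid, (cellularity, size) in patients.items():
        key = (
            "high" if cellularity > cell_cut else "low",
            "large" if size > size_cut else "small",
        )
        groups.setdefault(key, []).append(pid)
    return groups


toy = {"a": (10, 1), "b": (10, 9), "c": (90, 1), "d": (90, 9)}
print(quadrant_groups(toy)[("high", "small")])  # ['c']
```

Survival can then be compared between the quadrants, for example between the highly cellular, small-nucleus group and the rest.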
So, in conclusion, there are many ways to slice and dice the data. We've shown a couple
of different examples of that in terms of cellularity and nuclear size, but there are
many other features that are computed and registered with our system. An example of a metafeature
is heterogeneity, and depending upon what features we use, we can have different biological
interpretations. We've shown new ways of actually linking this information to genomic data,
and all our data is available through our website, which also supports a Google map
type of viewing for tissue section as well as segmentation result overlaid on top of
it. Thank you very much.
[applause]
Male Speaker: Questions? Well, maybe I can get started.
Have you compared or perhaps thought about comparing the heterogeneity index that you
derive with other indices from molecular data? For example, from SNP data, you can also
derive a similar type of index of cellular heterogeneity. It would be interesting, I
think, to see whether they're correlated.
Bahram Parvin: No. That's something we haven't done. That's
a very good question, and that's probably an area of collaboration we've been talking
about, but that's definitely a right direction to go. So, that's one of the advantages of
having a tissue section analyzed, because it can actually offer you information on
heterogeneity, whereas most other genome-wide data is basically bulk measurements.
Male Speaker: Thank you. Please.
Male Questioner: Here. Hi. Have you tried to estimate the number
of fields required for kind of reliable estimation of the type and heterogeneity of the sample?
Bahram Parvin: Number of fields?
Male Questioner: Yeah, the number of pictures taken from the
same section.
Bahram Parvin: So, basically every tissue section is about,
as I've said, about 40,000 pixel by 40,000 pixel upward to 100k by 100k, and every, so,
every 1k by 1k block is represented as an independent, okay? And then, there are about
100, 336, you know, such tissue sections, lots of data. So, there is no shortage of
data in this case for estimating heterogeneity.
Male Questioner: Yeah, but like leave out approach or something
like you tried to see, if you used less of them, how it grows to be stable?
Bahram Parvin: Okay. In terms of sampling and all that?
Male Questioner: Yeah.
Bahram Parvin: So, we actually tried it with the cross-validation.
This is done in two ways. One is, you know, leave one out. Also cross-validation. You
end up with four subtypes regardless of how you do it in terms of number of, you know,
tissue signatures, and then we validated that by actually, you know, building a library
of these blocks and looking at it visually to see if the signature is identical, that
they are actually following the same signature.
Male Questioner: Okay. Thank you.
Bahram Parvin: Thank you.
Male Speaker: Linda?
Female Questioner: Great data. Have you, maybe you haven't gotten
that far. I don't know how many cases you've been able to, whether the 150 some odd cases
really represent the four subtypes on the molecular level and the spectrum or genotypic distribution.
I'm wondering whether you have been able to establish any kind of association of the morphologic
feature that you are detecting or measuring with a genotype or subtype. On our initial
first pass analysis, for example, the small cell nuclei nature is associated with, for
example, p53 mutation, you know, that kind of molecular correlation.
Bahram Parvin: So, one of the --
Female Questioner: You may not have enough cases, so I just wonder
about that.
Bahram Parvin: Yeah. So, one of the subtypes, in one of the
subtypes, you do have [unintelligible] okay. So, we haven't correlated the mutation data.
We've only done the correlation with gene expression data, and p53 does show up in one
of the subtypes.
Male Speaker: Next question.
Male Questioner: When I see your imaging technique, when you
seem to pick all the nuclei regardless of whether it's tumor or stroma, and, so, including
all sorts of cell nucleus, does that affect your results?
Bahram Parvin: So, that's also a very good question. So,
now it gets into this business of what are the cell types you're picking up. So, right
now, we're picking up everything, okay? So, there could be other cell types. In fact,
some of the cell types are easy to pick up. For example, when you look
at the tissue sections, there are quite a few lymphocytes, okay? And if you go to
your gene expression data, you see those kind of [unintelligible] you know, activities as
well being represented in the gene expression data, but do we separate them? No, we don't.
Not at this stage.
Male Speaker: Next question.
Male Questioner: Interesting talk, thank you. One, I mean,
no. I worked with some pathologists that they, even for the same endoscope, you know, they
take [unintelligible] biopsy. They have different readings. So, I'm more wondering is that worth,
you know, use Alex [spelled phonetically] method, you know, you'd be with a random forest
with multiple biopsy for same patients. Would that be better to classify the survival or
not?
Bahram Parvin: It's a good question. So, the question is,
instead of looking at a single variable, maybe we can look at multiple variables. That's one
approach of, yes, that's a good idea. There is also another piece of information that
I didn't present here, and that is there are essentially two sets of information that
are computed over here. One set of information comes from nuclear measurements, which is
the morphometric properties as well as their organization. And then
there is also another set of measurements that comes from patch-based analysis, like this
region is apoptotic, and this region is kind of half and half of necrotic and apoptotic.
So, all that information needs to be combined in order to provide a more reliable, you know,
correlation and association.
Male Speaker: Thank you. Thank you. Okay. No more questions.