Georges Natsoulis: So today I'm going to talk to you about a
more lab-oriented project here where we're trying to do mass validation of variants
identified in whole genome or exome sequencing, hopefully in a single lane of sequencing,
and possibly even indexing the samples so you can do multiple pairs in one lane. So just
to clarify here, we're talking about validation. I mean confirming that the mutation is really
present in the sample as well as, once you know that it is, whether it is present in the larger
population of clinical samples, which are often only available in FFPE form, which presents
its own challenges.
So -- oh, okay. So as I said, most genomic projects start with a large number of potential
variants that you want to confirm, and then you proceed through multiple stages where
you sort of narrow down your list; and ultimately, you want to be going into your clinical population,
and in the end, you possibly end up with a clinical diagnostic. So the two methods: one,
which we termed here OS-Seq, is applicable -- is this not working? Okay. Anyway, on the
top there, you see OS-Seq is applicable in the early stages of the process, mostly
on flash frozen, high quality DNA; and then the second method, single strand circularization
[spelled phonetically] there, is another targeting method, and it has the advantage of being applicable
to DNA that is not in double stranded form and may be partially degraded.
So the first method, instead of doing the capture in solution and then manipulating
the captured material, adding adapters and whatnot and then creating the sequencing library
and then going on to the flow cell of an Illumina sequencer, what we do is that we modify the
lawn [spelled phonetically] of the flow cell. And what you see is -- okay. So what you see
is [inaudible] okay, [inaudible] anyway, okay. So first step -- oh, okay. All right. So here
is the flow cell of an Illumina sequencer. This is the lawn, and we float in a population
of capture probes here. The green part is a 40 [spelled phonetically] base region that's homologous
[spelled phonetically] to genomic sequences; so there could be hundreds or [unintelligible]
using thousands or even tens of thousands of these green ones, and there's a portion
that's common that pairs to the lawn.
First step, we extend from this position, and then now you've modified the lawn because
you float away the template, and you have a flow cell that contains, sticking up there,
thousands of different capture probes. First step. Second step, we now float genomic DNA
in there that has been ligated to an adapter, the second adapter that's going to bring the
second sequencing primer for the Illumina sequencing; and you do another extension.
So the black portion is the genomic DNA; so you do an extension, float away the template
again, and now you've got your genomic DNA between the appropriate adapters that are
suitable for bridge PCR, and then you can do a read one, read two. You can do paired-end
sequencing.
So when you do that, the read two actually always reads the capture
sequence; so that's actually useful for binning [spelled phonetically] the [unintelligible]
pair reads and doing, for example, assemblies afterward. But the read ones are, you know,
staggered like this; and they basically start wherever the break point was on your fragmented
DNA in the beginning. And you capture, you know, at a very high depth, a region
that's 500 to 1,000 bases downstream from your capture probe. So you also capture in, I
think [spelled phonetically], a perfectly single stranded way, which can be useful
for certain applications such as structural variant [spelled phonetically] validation
and all of that. Now, if you want double stranded -- which, of course, you do -- you can place
a capture probe upstream and downstream of the -- of the position that you're targeting,
and then now you get the reads here. The purple distribution is for one of the probes, and
then the blue is for the other and the sum of the two is here.
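As an illustration of the binning idea mentioned above, here is a minimal Python sketch that groups read pairs by the capture probe that read two begins with. It assumes read two starts with the probe sequence in the same orientation and allows one mismatch; the function name, the probe names, and the toy sequences are all hypothetical and are not taken from the talk.

```python
from collections import defaultdict


def bin_pairs_by_probe(read_pairs, probes, max_mismatches=1):
    """Group read-one sequences by the probe that read two starts with.

    read_pairs: iterable of (read1, read2) strings.
    probes: dict mapping a probe name to its sequence (hypothetical names).
    """
    bins = defaultdict(list)
    for read1, read2 in read_pairs:
        for name, probe_seq in probes.items():
            prefix = read2[:len(probe_seq)]
            if len(prefix) < len(probe_seq):
                continue  # read two is shorter than this probe
            mismatches = sum(a != b for a, b in zip(prefix, probe_seq))
            if mismatches <= max_mismatches:
                bins[name].append(read1)
                break  # assign each pair to the first matching probe
    return dict(bins)


# Toy data: short made-up sequences, not real probes.
probes = {"probe_A": "ACGTACGTAC", "probe_B": "TTGGCCAATT"}
pairs = [
    ("GGGGAAAA", "ACGTACGTACTT"),  # exact match to probe_A
    ("CCCCTTTT", "TTGGCCAATTGG"),  # exact match to probe_B
    ("AAAATTTT", "ACGTACGAACGG"),  # one mismatch to probe_A
    ("TTTTCCCC", "GGGGGGGGGGGG"),  # matches no probe, left unbinned
]
binned = bin_pairs_by_probe(pairs, probes)
```

Each bin could then be handed to a local assembler separately, which is the use the speaker hints at.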
So this is what it looks like. This example is the KRAS gene; so this is an exome capture.
You see where the exons are; they're down there. You see high depth at the exons, and
you see a little bit of, actually, nonspecific capture, which is sometimes like these chains:
you capture one strand, another strand pairs to it. In our method here, it's totally clean;
so you [unintelligible] specific capture and you have a very high depth on the exon here.
And even here, you can see immediately the -- [unintelligible] here in this IGV [spelled
phonetically], but on a much bigger exon such as the APC exon 15 there -- so this is exome
capture up there; so you, you know, you can get plenty of data, but sometimes, you know,
the depth drops. Here, there's even regions where it drops almost to zero, while here in
our OS-Seq method, the depth never drops under 100, and this here, of course, we have to
put plenty of capture probes because this exon is so big.
Uniformity of capture -- so this blue line was the first experiments we were doing. We were
using column synthesized oligos [spelled phonetically] to set the size of capture probes, and that's
the highest quality. The curve is very flat. We started doing microarray versions of the
capture probes. Initially, they were not as even, and now we've sort of improved it. This
is the green. So in the blue, we have five -- 400 capture probes,
while in the green here, because we synthesized them on microarray, we now have 20,000 of
them, and they're almost as good as column synthesized and, of course, much cheaper.
So this method, which basically works on fresh DNA, so from flash frozen tissue, has quite
a few advantages. So for us in the lab, it's a very efficient workflow; so it's a lot easier
than exome capture. Low sample requirements; so we can start with less than a microgram of
DNA, and we have high sensitivity and specificity because we basically have very high depth on
the regions that we are targeting. And so one application, obviously, is validation,
but you could also use it for discovery if you have a long list of candidate genes and you
want to do mutation discovery in there.
So the second method, as I said, has the
advantage that it can start with DNA that's not double stranded; so single stranded or
FFPE material, which is mostly single stranded. So we mix this, now in
solution, with a population of capture probes.
The colored boxes here are 20 base regions that are homologous to the ends of the [unintelligible]
that you're trying to target. And so these capture probes mediate a circularization event;
and because this is random DNA, you're going to have a tail on either side, and in the
mix of enzymes that we put in this circularization reaction, we have two enzymes that can degrade
both of these tails, five prime to three prime and three prime to five prime; and then Ampligase
in there closes the circle. So we end up with a population of circles, hundreds
or thousands of circles, that can be re-amplified with a single pair of primers, which is this
black part here.
So for a pilot demonstration that this was working, we picked 628 exons from a previous
experiment, which total 123 Kb worth of DNA, and we looked at matched samples from
normal FFPE tissue versus fresh -- so no tumors here. So, theoretically, we should get the
exact same result. Why we're doing this is because we want to see, number one, what's
our efficiency of capture from FFPE versus fresh DNA and, number two, are we seeing false
positive calls in the FFPE material, because it's known that FFPE can have DNA damage
which would result in some false positive calls.
So -- okay. So I just said that; so we're going to look at yield as well as specificity
of detection. So here is the same sort of evenness plot that you saw before; so the
method is not quite as even as the previous one but, of course, because it works on FFPE,
that's of high interest. So you see that even on the fresh DNA, which is the blue, there's
about five to 10 percent of the regions that are not captured at the depth of 10, which
is our minimum for genotyping, but this is per probe; so, obviously, we can use more
than one probe for a region and then that would go higher. But you know, the
important thing here is that the red curve is quite close to the blue, and
it drops a little earlier. So there's about five percent possibly, up to five to 10 percent
of the regions which are captured from the fresh that are not captured in FFPE; so those
are going to be false negatives, obviously.
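To make the per-probe yield comparison concrete, the following Python sketch computes the fraction of probes that reach a minimum depth and lists probes that are genotypeable in fresh DNA but not in FFPE. The depth of 10 comes from the talk; the function names, probe names, and depth values are made up for illustration.

```python
def fraction_at_depth(depth_per_probe, min_depth=10):
    """Fraction of probes whose depth is at least min_depth."""
    depths = list(depth_per_probe)
    if not depths:
        return 0.0
    passing = sum(1 for d in depths if d >= min_depth)
    return passing / len(depths)


def missed_in_ffpe(fresh, ffpe, min_depth=10):
    """Probes that pass in the fresh sample but fall below min_depth in FFPE."""
    return [
        probe
        for probe, depth in fresh.items()
        if depth >= min_depth and ffpe.get(probe, 0) < min_depth
    ]


# Hypothetical per-probe depths for a matched fresh/FFPE pair.
fresh = {"p1": 250, "p2": 40, "p3": 8, "p4": 120}
ffpe = {"p1": 180, "p2": 6, "p3": 2, "p4": 95}

fresh_fraction = fraction_at_depth(fresh.values())
ffpe_fraction = fraction_at_depth(ffpe.values())
false_negative_probes = missed_in_ffpe(fresh, ffpe)
```

Probes returned by `missed_in_ffpe` correspond to the regions the speaker describes as false negatives; adding a second probe for those regions is the remedy he suggests.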
Now, for the specificity. So if I plot here, the percent variant -- so this is the same
DNA from the fresh [unintelligible] FFPE, right? So if I'm plotting here the percent
non-reference base in the FFPE DNA versus the fresh, you've got a bunch of positions here
where you have high variant in both, and those are the true heterozygotes. The vast majority
of the points are right here, thousands on there on top of each other; there's no variant there.
It's reference. There's a few positions here which are colored black as opposed to the
other ones because you have higher percent variant, but there typically is a very high strand
bias, so those don't result in false positive calls. And there's a handful here where on
both strands, we see a variant base and only in the FFPE and not in the normal; so those
would result in a false positive call, but the rate is not very high; so we find that it's about
one per 10 to 15 Kb, and if you were to be sequencing genes, it'd be an error per five
to 10 genes. That's reasonable.
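The strand-bias reasoning above can be sketched in Python as follows. This is a generic illustration, not the lab's caller: the 20 percent threshold, the function names, and the read counts are assumptions. The second function is back-of-envelope arithmetic that applies the quoted rate of one error per 10 to 15 Kb to a target of a given size.

```python
def classify_position(fwd_ref, fwd_alt, rev_ref, rev_alt, min_fraction=0.2):
    """Classify one position from per-strand reference and variant read counts.

    min_fraction is an arbitrary illustrative threshold.
    """
    fwd_total = fwd_ref + fwd_alt
    rev_total = rev_ref + rev_alt
    if fwd_total == 0 or rev_total == 0:
        return "no_call"  # cannot assess strand bias without both strands
    fwd_frac = fwd_alt / fwd_total
    rev_frac = rev_alt / rev_total
    if fwd_frac >= min_fraction and rev_frac >= min_fraction:
        return "variant"  # supported on both strands
    if fwd_frac >= min_fraction or rev_frac >= min_fraction:
        return "strand_biased"  # seen on one strand only, not called
    return "reference"


def expected_false_positives(target_kb, kb_per_error):
    """Expected count given one false positive per kb_per_error kilobases."""
    return target_kb / kb_per_error


heterozygote = classify_position(50, 48, 52, 47)
one_strand_only = classify_position(60, 40, 100, 0)
clean = classify_position(100, 1, 99, 0)

# Applying the quoted rate to the 123 Kb pilot target.
low_estimate = expected_false_positives(123, 15)
high_estimate = expected_false_positives(123, 10)
```

Under these assumptions, a position counts as a candidate false positive only when it is classified as `variant` in FFPE and `reference` in the matched fresh sample.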
And the sensitivity is that about 85 percent of the heterozygotes that we detect
in normal, we also see in FFPE. And that's, again, per probe, right? So
when we put multiple probes per position, that number goes up. The classes
of artifacts observed is that there's a portion that's transitions, and they [unintelligible],
you know, probably due to deaminations [spelled phonetically], G to A and C to T, and there's
some transversions. So -- but if you look at this, basically, we have a consensus. All
we see is G or C going to A or T and not at a very high rate, one per 10 Kb. So we applied
this to a whole genome project that we have going in the lab. It's a gastric genome.
We sequenced the entire genome of the normal -- the tumor and the primary -- the metastasis
from -- so we did whole genome as well as exome. And out of this analysis, there were
386 variants, including SNPs and indels [spelled phonetically] as well as structural variants,
and most of them are coding but some of them are outside and quite far from genes
because of the break points of structural variants.
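The FFPE artifact classes mentioned earlier in this passage (transitions, some transversions, and an overall pattern of G or C going to A or T) can be expressed as a small Python classifier. This is a generic sketch using the standard purine/pyrimidine definition; the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    same_family = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return "transition" if same_family else "transversion"


def is_gc_to_at(ref, alt):
    """True when a G or C reference base is read as A or T."""
    return ref.upper() in {"G", "C"} and alt.upper() in {"A", "T"}


def is_deamination_like(ref, alt):
    """True for C>T and its reverse-strand equivalent G>A."""
    return (ref.upper(), alt.upper()) in {("C", "T"), ("G", "A")}


examples = [("C", "T"), ("G", "A"), ("G", "T"), ("A", "G")]
summary = [
    (ref, alt, substitution_class(ref, alt), is_gc_to_at(ref, alt))
    for ref, alt in examples
]
```

Tallying candidate FFPE-only calls with these helpers is one way to check whether a sample shows the same consensus the speaker reports.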
So we devised a pool of capture probes to apply one method as well as the other method;
so from the fresh frozen tissue, we're going to do OS-Seq capture, and then we're going to
sequence it on a GAII or a HiSeq. And we're going to confirm from FFPE material -- which
we have; we have both fresh and FFPE material for the metastasis --
that the conclusion is the same, applying the other method. And actually, we're sequencing
that material on a MiSeq [spelled phonetically], which is one of these new sequencers that Illumina
has put out, because the cost is a fraction of the other ones and also the run time, instead of a week,
is more like a day; so it's very useful for this type of application, so the
whole thing sort of makes sense.
So -- and it basically works, and we could validate almost all those positions, and we
got the same results from both. And so here I'm only showing you an IGV plot on a particular
exon where there was a two base deletion in the metastasis that was not present in the
tumor, not present in the normal. And I'm blowing it up here; so you see it here. Of
course, you only see a few reads in this shook [spelled phonetically], but if you look at
the coverage, you have a coverage of close to a thousand in all three. So here we have
very high confidence that we have this deletion in the metastasis that's not present in the
tumor. Because of the high depth, obviously, if -- even if the tumor is not pure and -- which
is the case in our case -- if, like, even if it's like 20 -- 20 or 30 percent of the
cells that are tumoral [spelled phonetically], you can still detect the variant here. You
have high sensitivity.
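The point about depth and tumor purity can be checked with a simple binomial model, sketched below in Python. The model assumes a heterozygous variant in diploid tumor cells with no copy-number change, so the expected variant allele fraction is half the tumor purity; the threshold of 20 supporting reads and the function names are arbitrary choices for illustration.

```python
from math import comb


def expected_allele_fraction(tumor_purity):
    """Variant allele fraction for a heterozygous variant present only in
    tumor cells (assumes diploid cells and no copy-number change)."""
    return tumor_purity / 2


def prob_at_least(min_alt_reads, depth, allele_fraction):
    """P(at least min_alt_reads variant reads) under a binomial model."""
    below = sum(
        comb(depth, i) * allele_fraction**i * (1 - allele_fraction) ** (depth - i)
        for i in range(min_alt_reads)
    )
    return 1.0 - below


vaf_low_purity = expected_allele_fraction(0.2)  # 20 percent tumor cells
expected_alt_reads = 1000 * vaf_low_purity

# Chance of seeing at least 20 variant reads at depth 1000 versus depth 30.
p_deep = prob_at_least(20, 1000, vaf_low_purity)
p_shallow = prob_at_least(20, 30, vaf_low_purity)
```

Under this model a depth near a thousand leaves on the order of a hundred variant-supporting reads even at 20 percent purity, which is why the speaker can claim high sensitivity for impure tumors.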
Now we go to the -- confirm in the FFPE material of the metastasis, and you see it here; so
this is the ovarian metastasis; this is the FFPE versus normal -- that's flash frozen
-- and the tumor and the deletion is here. The reason why this profile looks so different
from before is that here we have a population of amplicons; and because the MiSeq can do
150 base paired-end sequencing, we can sequence the entire amplicon on both strands, coming
from either way. So this is end sequencing. That's why all the break points are
stacked up on top of each other. So -- and another thing I would mention is that
this reaction was four-plexed [spelled phonetically] in the MiSeq, and we got way more data than
we need; so I think that we could be doing 16-plex easily from a single run.
We made our -- the design of our oligos public on this website, oligogenome.stanford.edu,
and it was published recently in Nucleic Acids Research. And in conclusion, I would say that
we're moving towards trying to do validation of whole genome data in a single lane of sequencing.
And we here are offering two different methods. One would be mostly applicable early in the
process on high quality DNA, and it's the most even and gives the highest yield, but
we have this other method that's really good for follow-up studies in clinical samples.
Just acknowledgements here, Jason Buenrostro and Samuel Myllykangas are the ones that developed
the OS-Seq method; and Hua Xu developed the single strand circularization method; and
our funding -- NCI, NHGRI, and Doris Duke Foundation, Howard Hughes [spelled phonetically].
Thank you very much.
[applause]
Male Speaker: Questions?
Male Speaker: I had one, and it relates to the requirement
for high quality DNA for OS-Seq. Is that a function of the length?
Georges Natsoulis: No. It's because we basically have to add
the second adapter; so the constant portion of what we flow into the capture, it contains
one of the adapters and one of the sequencing primers of the standard Illumina. Okay? And
the other one is attached by ligation, you know, A-tailing [spelled phonetically]
and ligation to the DNA; and so the DNA needs to be double stranded.
Male Speaker: [inaudible]
Georges Natsoulis: So if it's not, then you've got to [unintelligible]
start repairing, start doing A-tailing, introduce biases and all of that.
Male Speaker: Okay.
Georges Natsoulis: I mean, it kind of works, but it doesn't work
as well because you have to turn the DNA into completely double stranded, and that's where
all the biases come up. While in the other method, you're doing strand specific capture,
and you never have to turn it into double stranded.
Male Speaker: One over there. Matthew.
Male Speaker: So great talk. I'm curious about the implications
of your OS-Seq capture method for alignment to the genome and kind of the advantages and
disadvantages of starting with a defined sequence at one end.
Georges Natsoulis: As you say, there's advantages and disadvantages.
So one advantage is you basically can bin by perfect or near perfect matches of the
P2 [spelled phonetically] read and then do assemblies which would be nice, you know,
if you sort of suspect a structural rearrangement [unintelligible] of that. One complication,
for example, is that you don't have the mate pair information; and for deduplication,
now, you know, one end is fixed and only the other one is variable, so you
are more prone to having bottlenecking artifacts which need to be solved by other methods,
and we have some ideas of sort of random tagging to basically weed out PCR duplicates.
But that would be one disadvantage. Yeah.
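The deduplication problem described in this answer can be illustrated with a short Python sketch. The speaker only mentions "ideas" of random tagging, so this is a generic molecular-tag approach rather than the lab's method; the record fields (`probe`, `read1_start`, `tag`) and the toy values are hypothetical.

```python
def deduplicate(reads, use_tag=True):
    """Keep the first read seen for each duplicate key.

    With one end fixed by the capture probe, position alone gives the key
    (probe, read1_start). Adding a random tag to the key separates distinct
    molecules that happen to share the same break point.
    """
    seen = set()
    unique = []
    for read in reads:
        key = (read["probe"], read["read1_start"])
        if use_tag:
            key += (read["tag"],)
        if key not in seen:
            seen.add(key)
            unique.append(read)
    return unique


# Three reads from the same probe with the same read-one start position.
reads = [
    {"probe": "probe_A", "read1_start": 1200, "tag": "ACGT"},
    {"probe": "probe_A", "read1_start": 1200, "tag": "ACGT"},  # PCR duplicate
    {"probe": "probe_A", "read1_start": 1200, "tag": "TTAG"},  # distinct molecule
]

kept_with_tags = deduplicate(reads)
kept_position_only = deduplicate(reads, use_tag=False)
```

In this toy case position-only deduplication collapses three reads to one, while the tag-aware key keeps two, which is the extra resolution random tagging is meant to provide.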
Male Speaker: And last one over here.
Male Speaker: Are you considering circularization also for
RNA, cDNA [spelled phonetically] analysis?
Georges Natsoulis: Well, there's all sorts of ideas that came
to mind, listening to the previous talks. I mean, there's plenty of applications for
this, right? So I'm presenting it as, you know, we can target out of a complex mixture
hundreds or thousands of regions, but maybe we could apply it to, say, cDNA material and
[laughs] do it on RNA and then look at alternative splice, for example. You're right.
Male Speaker: Great.
Male Speaker: Okay. Thank you very much.