Massively Parallel Validation of Cancer Mutations And Other Variants... - Georges Natsoulis

Georges Natsoulis: So today I'm going to talk to you about a more lab-oriented project here where we're trying to do mass validation of variants identified in whole genome or exome sequencing, hopefully in a single lane of sequencing, and possibly even indexing the samples so you can do multiple pairs in one lane. So just to clarify here, we're talking about validation. I mean confirming that the mutation is really present in the sample and, once you know that it is, asking whether it is present in the larger population of clinical samples, which are often only available in FFPE form, which presents its own challenges. So -- oh, okay. So as I said, most genomic projects start with a large number of potential variants that you want to confirm, and then you proceed through multiple stages where you sort of narrow down your list; and ultimately, you want to be going into your clinical population, and in the end, you possibly end up with a clinical diagnostic. So the two methods: one, termed here OS-Seq, is applicable -- is this not working? Okay. Anyway, on the top there, you see OS-Seq, applicable in the early stages of the process, mostly on flash frozen, high quality DNA; and then the second method, single strand circularization, there is another targeting method, and it has the advantage of being applicable to DNA that is not in double stranded form and may be partially degraded.
So the first method: instead of doing the capture in solution and then manipulating the captured material, adding adapters and whatnot, and then creating the sequencing library and then going on to the flow cell of an Illumina sequencer, what we do is modify the lawn of the flow cell. And what you see is -- okay. So what you see is [inaudible] okay, [inaudible] anyway, okay. So first step -- oh, okay. All right. So here is the flow cell of an Illumina sequencer. This is the lawn, and we float in a population of capture probes here.
The green part is 40 [spelled phonetically] bases that are homologous to genomic sequences; so there could be hundreds or [unintelligible] using thousands or even tens of thousands of these green ones, and there's a portion that's common that pairs to the lawn. First step, we extend from this position, and then now you've modified the lawn because you float away the template, and you have a flow cell that contains, sticking up there, thousands of different capture probes. First step. Second step, we now float genomic DNA in there that has been ligated to an adapter, the second adapter that's going to bring the second sequencing primer for the Illumina sequencing; and you do another extension. So the black portion is the genomic DNA; so you do an extension, float away the template again, and now you've got your genomic DNA between the appropriate adapters that are suitable for bridge PCR, and then you can do a read one, read two. You can do paired-end sequencing.
So when you do that, the read two actually always reads the capture sequence; so that's actually useful for binning the [unintelligible] pair reads and doing, for example, assemblies afterward. But the read ones are, you know, staggered like this; and they basically start wherever the break point was on your fragmented DNA in the beginning. And you capture, you know, at a very high depth, a region that's 500 to 1,000 bases downstream from your capture probe. So you also capture in a perfectly single stranded way, which can be useful for certain applications such as structural variant validation and all of that. Now, if you want double stranded -- which, of course, you do -- you can place a capture probe upstream and downstream of the position that you're targeting, and then now you get the reads here.
The purple distribution is for one of the probes, and then the blue is for the other, and the sum of the two is here. So this is what it looks like; this example is the KRAS gene. So this is an exome capture. You see where the exons are; they're down there. You see high depth at the exons, and you see a little bit of, actually, unspecific capture, which sometimes, like these chains, you capture one strand, another strand pairs to it. In our method here, it's totally clean; so you [unintelligible] specific capture and you have a very high depth on the exon here. And even here, you can see immediately the -- [unintelligible] here in this IGV, but on a much bigger exon such as the APC exon 15 there -- so this is exome capture up there; so you, you know, you can get plenty of data, but sometimes, you know, the depth drops. Here, there's even regions where it drops almost to zero, while here in the OS-Seq method, the depth never drops under 100, and this here, of course, we have to put plenty of capture probes because this exon is so big.
Uniformity of capture -- so this blue line was the first experiments we were doing. We were using column-synthesized oligos as the capture probes, and that's the highest quality. The curve is very flat. We started doing microarray versions of the capture probes. Initially, they were not as even, and now we've sort of improved it. This is the green. So in the blue, we have 400 capture probes, while in the green here, because we synthesized them on microarray, we have now 20,000 of them, and they're almost as good as column synthesized and, of course, much cheaper.
So this method, which basically works on fresh DNA, so from flash frozen tissue, has quite a few advantages. So for us in the lab, it's a very efficient work flow; so it's a lot easier than exome capture.
Low sample requirements; so we can start with less than a microgram of DNA, and we have high sensitivity and specificity because we basically have very high depth on the regions that we are targeting. And so one application, obviously, is validation, but you could also use it for discovery if you have a long list of candidate genes and you want to do mutation discovery in there.
So the second method, as I said, has the advantage that it can start with DNA that's not double stranded; so single stranded or FFPE material, which is mostly single stranded. So we mix this now in solution with a population of capture probes. The colored boxes here are 20-base regions that are homologous to the end of the [unintelligible] that you're trying to target. And so these capture probes mediate a circularization event; and because this is random DNA, you're going to have a tail on either side, and in the mix of enzymes that we put in this circularization reaction, we have two enzymes that can degrade both of these tails, five prime to three prime and three prime to five prime; and then Ampligase in there closes the circle. So we end up with a population of circles, hundreds or thousands of circles, that can be re-amplified with a single pair of primers, which is this black part here.
So for a pilot demonstration that this was working, we picked 628 exons from a previous experiment, which total 123 Kb worth of DNA, and we looked at matched samples from normal tissue, FFPE versus fresh -- so no tumors here. So, theoretically, we should get the exact same result. Why we're doing this is because we want to see, number one, what's our efficiency of capture from FFPE versus fresh DNA and, number two, are we seeing false positive calls in the FFPE material, because it's known that FFPE can have DNA damage which would result in some false positive calls. So -- okay. So I just said that; so we're going to look at yield as well as specificity of detection.
So here is the same sort of evenness plot that you saw before; so the method is not quite as even as the previous one but, of course, because it works on FFPE, that's of high interest. So you see that even on the fresh DNA, which is the blue, there's about five to 10 percent of the regions that are not captured at the depth of 10, which is our minimum for genotyping, but this is per probe; so, obviously, we can use more than one probe for a region and then that would go higher. But you know, the important thing here is that the red curve is quite close to the blue, and it drops a little earlier. So there's about five percent, possibly up to five to 10 percent, of the regions which are captured from the fresh that are not captured in FFPE; so those are going to be false negatives, obviously.
Now, for the specificity. So if I plot here the percent variant -- so this is the same DNA from the fresh [unintelligible] FFPE, right? So if I'm plotting here the percent non-reference base in the FFPE DNA versus the fresh, you've got a bunch of positions here where you have high variant in both, and those are the true heterozygotes. The vast majority of the points are right here, thousands of them on top of each other; there's no variant there. It's reference. There's a few positions here which are colored black as opposed to the other ones because you have higher percent variant, but there typically is a very high strand bias, so those don't result in false positive calls. And there's a handful here where on both strands we see a variant base, and only in the FFPE and not in the normal; so those would result in a false positive call, but it's not very high; so we find that it's about one per 10 to 15 Kb, and if you were to be sequencing genes, it'd be an error per five to 10 genes. That's reasonable. And the sensitivity is that about 85 percent of the heterozygotes that we detect in normal, we also see in FFPE.
And that, again, is per probe, right? So when we put multiple probes per position, that number goes up. The classes of artifacts observed: there's a portion that's transitions, and they [unintelligible], you know, probably due to deamination, G to A and C to T, and there's some transversions. So -- but if you look at this, basically, we have a consensus. All we see is G or C going to A or T, and not at a very high rate, one per 10 Kb.
So we applied this to a whole genome project that we are doing in the lab. It's a gastric genome. We sequenced the entire genome of the normal, the primary tumor, and the metastasis -- so we did whole genome as well as exome. And out of this analysis, there were 386 variants, including SNPs and indels as well as structural variants, and most of them are coding, but some of them are outside and quite far from genes because of the break points of structural variants. So we devised a pool of capture probes to apply one method as well as the other method; so from the fresh frozen tissue, we're going to do OS-Seq capture, and then we're going to sequence it on a GAII or a HiSeq. And we're going to confirm from FFPE material -- we have both fresh and FFPE material for the metastasis -- that the conclusion is the same, applying the other method. And actually, we are sequencing that material on a MiSeq, which are these new sequencers that Illumina has put out, because it's a fraction of the other ones and also the run time, instead of a week, is more like a day; so it's very useful for this type of application, so the whole thing sort of makes sense. So -- and it basically works, and we could validate almost all those positions, and we got the same results from both.
And so here I'm only showing you an IGV plot on a particular exon where there was a two-base deletion in the metastasis that was not present in the tumor, not present in the normal. And I'm blowing it up here; so you see it here. Of course, you only see a few reads in this shook [spelled phonetically], but if you look at the coverage, you have a coverage of close to a thousand in all three. So here we have very high confidence that we have this deletion in the metastasis that's not present in the tumor. Because of the high depth, obviously, even if the tumor is not pure -- which is the case in our case -- even if it's like 20 or 30 percent of the cells that are tumoral, you can still detect the variant here. You have high sensitivity.
Now we go to confirm in the FFPE material of the metastasis, and you see it here; so this is the ovarian metastasis; this is the FFPE versus normal -- that's flash frozen -- and the tumor, and the deletion is here. The reason why this profile looks so different from before is that here we have a population of amplicons; and because the MiSeq can do 150-base paired-end sequencing, we can sequence the entire amplicon on both strands, coming from either way. So this is end sequencing. That's why all the break points are stacked up on top of each other. So -- and another thing I would mention is that this reaction was four-plexed on the MiSeq, and we got way more data than we need; so I think that we could be doing 16-plex easily from a single run.
We made the design of our oligos public on this website, oligogenome.stanford.edu, and it was published recently in Nucleic Acids Research. And in conclusion, I would say that we're moving towards trying to do validation of whole genome data in a single lane of sequencing. And we here are offering two different methods.
One would be mostly applicable early in the process on high quality DNA, and it's the most even and gives the highest yield, but we have this other method that's really good for follow-up studies in clinical samples. Just acknowledgements here: Jason Buenrostro and Samuel Myllykangas are the ones that developed the OS-Seq method; and Hua Xu developed the single strand circularization method; and our funding -- NCI, NHGRI, and Doris Duke Foundation, Howard Hughes. Thank you very much.
[applause]
Male Speaker: Questions?
Male Speaker: I had one, and it relates to the requirement for high quality DNA for OS-Seq because that -- is that a function of the length?
Georges Natsoulis: No. It's because we basically have to add the second adapter; so the constant portion of what we flow into the capture contains one of the adapters and one of the sequencing primers of the standard Illumina. Okay? But the other one is attached by ligation, you know, A-tailing and ligation to the DNA; and so the DNA needs to be double stranded.
Male Speaker: [inaudible]
Georges Natsoulis: So if it's not, then you've got to [unintelligible] start repairing, start doing A-tailing, introduce biases and all of that.
Male Speaker: Okay.
Georges Natsoulis: I mean, it kind of works, but it doesn't work as well because you have to turn the DNA into completely double stranded, and that's where all the biases come up. While with the other method, your capture is strand specific, and you never have to turn it into double stranded.
Male Speaker: One over there. Matthew.
Male Speaker: So great talk. I'm curious about the implications of your OS-Seq capture method for alignment to the genome and kind of the advantages and disadvantages of starting with a defined sequence at one end.
Georges Natsoulis: As you say, there's advantages and disadvantages.
So one advantage is you basically can bin by perfect or near perfect matches of the P2 [spelled phonetically] read and then do assemblies, which would be nice, you know, if you sort of suspect a structural rearrangement [unintelligible] of that. One complication, for example, is that you don't have the mate pair information; and for deduplication, now, you know, one end is fixed and only the other one is variable, so you are more prone to having bottlenecking artifacts, which need to be solved by other methods, and we have some ideas of sort of random tagging to basically weed out PCR duplicates. But that would be one disadvantage. Yeah.
Male Speaker: And last one over here.
Male Speaker: Are you considering circularization also for RNA, cDNA analysis?
Georges Natsoulis: Well, there's all sorts of ideas that came to mind, listening to the previous talks. I mean, there's plenty of applications for this, right? So I'm presenting it as, you know, we can target out of a complex mixture hundreds or thousands of regions, but maybe we could apply it to, say, cDNA material and [laughs] do it on RNA and then look at alternative splicing, for example. You're right.
Male Speaker: Great.
Male Speaker: Okay. Thank you very much.
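[Editor's note: The answer above on binning read pairs by the capture sequence in read two, and on random tagging to weed out PCR duplicates, can be illustrated with a minimal Python sketch. This is not the speaker's pipeline. The function names, the tuple layout (read-one start, read-two sequence, random tag), the one-mismatch tolerance, and the example probe sequences are all assumptions made for illustration.]

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_probe(read2_seq, probes, max_mismatches=1):
    """Return the first probe whose sequence matches the start of read two
    with at most max_mismatches mismatches, or None if nothing matches."""
    for name, probe_seq in probes.items():
        prefix = read2_seq[:len(probe_seq)]
        if len(prefix) == len(probe_seq) and hamming(prefix, probe_seq) <= max_mismatches:
            return name
    return None


def bin_and_dedup(read_pairs, probes, max_mismatches=1):
    """Bin read pairs by capture probe, keeping one pair per
    (probe, read-one start, random tag) combination."""
    bins = defaultdict(list)
    seen = set()
    for read1_start, read2_seq, tag in read_pairs:
        probe = assign_probe(read2_seq, probes, max_mismatches)
        if probe is None:
            continue  # read two does not look like any capture probe
        key = (probe, read1_start, tag)
        if key in seen:
            continue  # same probe, same break point, same tag: treat as PCR duplicate
        seen.add(key)
        bins[probe].append((read1_start, read2_seq, tag))
    return dict(bins)


# Hypothetical probes and read pairs, for illustration only.
probes = {"probe_A": "ACGTACGTAC", "probe_B": "TTGGCCAATT"}
pairs = [
    (1000, "ACGTACGTACGGG", "AAT"),
    (1000, "ACGTACGTACGGG", "AAT"),  # exact repeat: dropped as a duplicate
    (1000, "ACGTACGTACGGG", "CCG"),  # same start, different tag: kept
    (1040, "ACGTACCTACGGG", "AAT"),  # one mismatch to probe_A: still binned
    (500, "TTGGCCAATTAAA", "GGA"),
    (700, "GGGGGGGGGGGGG", "TTT"),   # matches no probe: skipped
]
binned = bin_and_dedup(pairs, probes)
for name, kept in binned.items():
    print(name, len(kept))
```

[Because the read-two end is fixed by the probe, only the read-one start distinguishes fragments. In this sketch the third pair survives only because its assumed random tag differs, which is the kind of information the speaker suggests random tagging could add.]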