>> Okay. Welcome back.
I hope everybody is properly caffeinated and snacked.
We will take it to the next step. So I am going to present sort of phase 3. You heard from Jim Mullikin about some of the nuts and bolts: how the DNA goes into the instruments, and how raw data are generated and initially analyzed. You heard from Jamie about intermediate analysis and some user tools that people can use to
begin to manipulate the data. So what I want to present next are some general
considerations. I come at this from the perspective of a practicing
clinical human geneticist. And I think about what kinds of projects ought
to go into this pipeline and how I'm going to look at the data.
So I'll give you some examples of how those things work.
What everyone is asking themselves is: what can I do with these tools that are so widely available and becoming so inexpensive? Which of my patients should I bring to be sequenced using this technology, and which of the samples that I have previously collected might be appropriate for that? There's lots to choose from, and a lot of considerations that you need to take into account to decide which projects you want to pursue and how you will analyze them to answer the questions that you want to approach.
There's good news and bad news here, as always. Exomes are not magic. They can work wonders and do things absolutely not possible before. You can do things with small families that you could not do with traditional positional cloning in the past. That's certainly an incredibly powerful option with exomes. You can, if you will, unstick stuck cloning projects: basically, in the old days, if you landed in a huge region and you weren't in the mood to sequence 300 or 400 genes using a capillary instrument, you can now approach that with exomes. De novo dominant disorders were completely intractable using positional cloning, because there was no mapping to be done.
And there are other circumstances I'll talk about. This summer I attended the Gordon Conference on human genetics and genomics, and there was discussion about how often exome sequencing works for Mendelian disease projects. The consensus from a number of the speakers and presenters at the meeting, which is a bunch of people doing good work in this area, is that it works about 30 to 40% of the time. So it's not magic.
There are reasons why your project may not work.
So hopefully this course and my talk will help you figure out which ones are more likely
to work, so we can increase that number. We're also subject to publication bias.
Not too many people want to publish unsuccessful exome gene identification projects.
So we don't get to see those, and don't get to see what can go wrong, but I'll try to give you some background to think about that. There are lots of reasons for failures to happen.
Another thing to think about: it is not necessarily enough these days to do a gene identification by exome sequencing to get yourself a good publication out of your efforts.
So coupling your exome disease gene identification to additional genetic and functional data
should help that a lot. I'll give you some examples of that.
So what I'm going to talk about in this talk, from the perspective of a practicing physician who wants to understand what's going on in a patient, is: what is exome and what is not, because you have to think about that, and the exome is not the whole genome, so what are those differences; the differences between exome sequencing and positional cloning, since they each have strengths and weaknesses; and then some how-tos. I'm going to give examples of five projects, some from our lab, some from other people's labs, to give a flavor of different types of projects: recessive, dominant, sporadic de novo, and a mosaic case. Okay.
So what is a whole exome sequence? This is somewhat of an unfortunate label, because it is not all that whole. What it's supposed to include, by the definition of an exome, is the sequence of all exons in the genome. There are problems here. The first is that we don't currently know what all the genes in the genome are. The second is that we do not necessarily know all the exons of the genes we recognize to be there. Non-coding exons are not consistently targeted: some kits and approaches target those, some do not. Not all the exons that you do target are effectively captured. And if you do target them, not all the sequences that you generate can be aligned and have accurate bases called from them. That leaves you with a fraction of the exome that you're actually interrogating.
Things are missing from the exome sequence. Some genes are just not there at all; you'll see an example of how that tripped up some investigators in a study example I'm going to give later. Some parts of some genes may not be there. Almost all the non-genic control elements that govern gene regulation are going to be missing from an exome sequence. Non-canonical or deep intronic splicing elements, which are important sites for mutation in genes, may be missing. Structural DNA variation, such as small inversions or movements of pieces of DNA within your genome, is very poorly assessed by exome sequence, so if your disease is caused by a mutation of that nature, you're in trouble. Copy number variation is not tractable using exome sequence. Mitochondrial DNA is mostly not targeted, but some information on it can be derived from an exome. Some microRNA and other small RNA molecules are included in exome sequence, some are not. If you think your disease or trait can be caused by one of those DNA elements, you have to think hard about whether exome is the right approach, or whether you should go to whole genome, as Jim discussed in the first talk. Now, to set up the pros and cons, the strengths and weaknesses, of exome sequencing versus positional cloning.
On the second slide: with exome sequencing you can get away with using very small families, and sometimes single individuals, to identify potentially causative mutations. Those of you who practice positional cloning know it is nearly essential to have large families if you want to do meiotic mapping, which is the first step of a positional cloning project. For exome sequencing, in general what you want is locus homogeneity; it's going to dramatically reduce the number of samples you need to sequence to find the variants that you want to find, because in most cases you're going to be looking across samples for variation in genes shared among the different affected persons or samples. The more loci you have that cause your trait, the harder it will be to do that. Locus heterogeneity is less of a problem for positional cloning: when you're doing your meiotic mapping you are localizing genes and can tell right away whether the family that you're analyzing maps to a given locus or not, using linkage inclusion or exclusion. So this is an issue that is tricky for exome sequencing; it can cause difficulties in positional cloning, but there it is easily sorted out.
Allelic heterogeneity in an exome project is a great thing. Because of the overall sensitivity, as discussed and shown by Jim Mullikin: if you have 90% sensitivity for detecting any given part of the exome, and it were the case that your trait is caused by a single mutation in a single gene, then there is a 10% chance that no matter how you flog that exome, the variant will not be there. If you have allelic heterogeneity, that spreads your risk across multiple sequencing targets and increases the chance that you are going to see it. And for positional cloning, allelic heterogeneity is not a big issue. The "one in the hand versus one in the bush" trade-off is tricky to parse when doing exome sequencing. You're going to be faced with a list of variants
that are the potential cause of the traits that you're studying.
Your task at that juncture is to decide whether to pursue a subset of the variants identified, versus pursuing in more depth other variants that may not have been initially detected, or may be problematic to detect in your samples. That's a trade-off, and a critical decision juncture you have to consider as you move through the gene identification pipeline. For positional cloning this is not an issue. These exome candidates are most likely scattered all over the genome; with positional cloning you have a region of the genome nailed down, and you know you need to interrogate just about everything in that region to be thorough, which you can't do with an exome sequence. Phenocopies are not a big issue in exome sequencing, as you'll see in the examples I'll show: patients who appear to have the phenotype but don't have it for the underlying genetic reason that you think is the cause. Phenocopies can be a huge issue in mapping, because they can lead you to wrongly exclude or include regions in your mapping.
There are pros and cons, and you need to think about your project, phenotype, or trait, and see how it fits in here and how to approach it.
There was also a lot of talk at the Gordon Conference about type 1 error. The disparaging term thrown around at the meeting was that exome sequencing or whole genome sequencing has the potential of generating what are called "just so stories." This is based on the Kipling book, in which fanciful stories built on circular reasoning, meaning nothing, can be beautifully drawn up and be completely wrong. That's really a manifestation of type 1 error. When you're interrogating 20,000 candidate genes for a trait, the prior probability of any one variant that you find being the cause is small, and you can be led down the garden path if you don't do studies to buttress and validate the findings and carry them forward. Without meiotic mapping, which decreases your type 1 statistical error problem, you're going to need additional sources of evidence for causation. In fact, I'll show later how you can fold mapping and linkage data in to support the finding and reduce the probability of that error.
Now the first example of the five traits I'm going to talk about today. It's first because it's the easiest to do, because it's the smallest amount of the genome to interrogate: an X-linked disorder. It doesn't so much matter what the trait is, but its general nature is important, because the character of your trait will determine what filters you want to apply and how you think about the results of the filtering of that sequencing project. This is an X-linked trait that causes severe congenital anomalies, with 100% lethality, which helped a lot in the filtering. Beyond that, the fact that it was ultra-rare, with only 2 families known to be affected with this trait, allows you to do powerful things with your filtering against controls, as Jamie described. In this particular case you also have to consider sample availability and the quality and quantity of the DNA that you have; it turned out we didn't have usable DNA on the boys for exome, so we sequenced the carriers.
What do these numbers look like? For X, here are the raw starting numbers, what the initial data onslaught looks like that you need to deal with. The "exome capture" is in quotes because we got a hard time from reviewers who thought we shouldn't call it an exome at all, because it's the exons of a single chromosome. It covered all the exons on X that are not in the pseudoautosomal regions of the X chromosome, with genes defined as the UCSC-annotated coding exons; so right there, again, you see a decrement of interrogation from everything on X, starting to notch our way downwards. The sequences for the two samples were about 20 million reads off the Illumina instrument, two thirds to three quarters of a billion base pairs of sequence. Of that sequence, just under half aligned back to the intended sequencing target of the experiment. That sounds low, but it's good, because the coding region is a small fraction of the DNA that went into the experiment, so it's an incredibly strong enrichment: half aligned to those targeted exons, with high coverage. And as Jim mentioned, with the wide dynamic range of coverage you see in exome sequencing, you need to go deep to ensure you have a lot of bases covered by 10X or more sequencing, because of the tail on the left side of the distribution. We had about two million base pairs of sequence at 10X coverage, which in that case was 77% of the target. I think the hit rate would be higher now with new versions of this kit; this is old technology, 18 months old. Things change so fast here. Just as a note: I mentioned we were sequencing
carrier females, so you can use an autosomal base caller for calling the variants. But if you do an X-linked project on males, you need a different caller, because the algorithm matters for calling variants in hemizygous males. So here's what the filter looked like. It was based on a number of criteria from the biology and genetics of the trait. Since we were sequencing carrier females, we said the variant should be heterozygous. We knew that the trait was severe, so we could set a relatively high filtering threshold for the kinds of predicted changes in the genome: we didn't want mild missense variation, so we looked for non-synonymous variants, indels that frameshift the open reading frame, nonsense, and those types of variants; splice variants should be on here also, that's a typo on the slide. Again, because this trait was so rare, we could implement relatively stringent filters on the presence of the putative variant in controls.
And we started out here, with two samples, with 350 substitutions; this already excludes a number of changes, as thousands of variants had already been excluded by implementing some of these filters. Filtering on heterozygous, non-synonymous, and absent in dbSNP is, as Jamie mentioned, a dangerous thing to do, and you have to be careful about it. More to say about that in a few slides.
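As a rough illustration, the carrier-based filter just described might be sketched like this in code; the record layout, field names, and control-set representation are hypothetical, not the pipeline actually used:

```python
# Hypothetical sketch of the X-linked carrier filter described above.
# Field names and data layout are illustrative, not the real pipeline.

SEVERE_EFFECTS = {"nonsense", "frameshift", "splice", "non-synonymous"}

def passes_carrier_filter(variant, control_variant_sets, dbsnp_ids):
    """Keep variants consistent with a severe X-linked trait when the
    sequenced samples are heterozygous carrier females."""
    return (
        variant["chrom"] == "X"                      # X-linked disorder
        and variant["genotype"] == "het"             # carriers are heterozygous
        and variant["effect"] in SEVERE_EFFECTS      # severe predicted change
        and variant["id"] not in dbsnp_ids           # absent in dbSNP (risky!)
        and not any(variant["id"] in s for s in control_variant_sets)
    )
```

Each clause mirrors one row of the filter on the slide; in practice the dbSNP and control exclusions are the dangerous ones, for the reasons discussed later.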
We sequenced concurrently patients who had other disorders, and we excluded variants seen in those samples as well. What you can see here is that this worked exceedingly well. So this was an easy project. Family one had no substitution variants that met all the filters, and it had one variant, a frameshifting indel, that did pass the filters. Family 2 had a single substitution variant and no frameshifting variants in the exons on the X chromosome. It turned out this hit and this hit were in the same gene. So that's a nice tidy result.
This was an example of what I was talking about earlier with a stuck project. This was a project that had sat in the laboratory for five or six years, because it was mapped to a 25 to 30 megabase region of the X chromosome that included 200 or 300 candidate genes. We just didn't have the wherewithal to sequence those using PCR and a 3130 capillary sequencing instrument. So it sat, until the X exome came along and we applied that tool, and this worked. Now, this was a fortunate occurrence, because the simple stringent filter that we implemented took us right to variants in a single gene. You could say: what if that wasn't the case, and we weren't quite so lucky? What could you do? This is an example of what I was referring to earlier: you can use meiotic mapping, linkage, or haplotype data to exclude regions of the genome, to reduce the number of variants you have to consider.
If you use mapping data from a family like this, that takes us to a region of the X chromosome and eliminates between 60 and 80% of the variants from consideration, because they are excluded by linkage alone. The important thing here is that this does not require the high thresholds that we used to apply to ourselves for ab initio linkage mapping, a LOD score of 3.0 for an autosomal locus or 2.0 for an X locus. Whatever linkage you can extract from the family you have will help eliminate variants. Any amount of cranking that down, if the first filters don't give you a hint, can be very, very helpful in pushing variants out of consideration.
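The idea of folding linkage information into the variant filter is mechanically simple; the coordinates and data layout below are made up for illustration:

```python
# Illustrative only: drop variants that fall outside the interval
# supported by whatever linkage/haplotype data the family provides.

def within_linked_region(variant, region):
    """region = (chrom, start, end) from meiotic mapping."""
    chrom, start, end = region
    return variant["chrom"] == chrom and start <= variant["pos"] <= end

variants = [
    {"chrom": "X", "pos": 5_000_000},    # excluded by linkage
    {"chrom": "X", "pos": 80_000_000},   # inside the mapped interval
]
region = ("X", 60_000_000, 90_000_000)   # hypothetical linked interval
surviving = [v for v in variants if within_linked_region(v, region)]
```

Even a weak interval, far below a LOD score of 3.0, shrinks the candidate list this way.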
But again, you have to be careful: the old criteria we used in positional cloning, about whether you should allow a single recombination, or two, to exclude a region, still apply, because you can erroneously push things in and out of consideration. That's especially true if your disorder has phenocopies or ambiguous affection-status assignment among the patients you're studying. In this particular case, because this was a gene
of no known function with no assay available, the only thing we could look for as supportive evidence was expression. So we took this to some colleagues for mouse expression analysis; they cloned the mouse gene, hybridized it to mouse embryos, and showed that it was expressed in the tissues one would predict based on the phenotype in humans, which was nice. That's not the strongest supporting evidence, but if you're on the X chromosome, the probability of a type 1 error is much lower, so not quite as robust evidence is needed.
Next, autosomal recessive. This is a case we worked on with a colleague, Chuck Venditti, in our institute. This is a severe childhood-onset metabolic acidosis disorder. Chuck's group did work to exclude other known loci that can cause this phenotype. This was a cohort of patients in whom all known causes had been excluded, and this was an exome project where we sequenced only three samples: an affected child and his two parents. So this single trio led us to variants. These are autosomal traits, now looking across the genome, and this gives you the kinds of numbers that Jim and Jamie both started with, in the 100,000 to 150,000 range for variants. And then we started implementing filters to
crank the number down, because obviously this is completely intractable. We could exclude a number of variants based on the most-probable-genotype calling that Jim described. Then we used the zygosity status of the samples to exclude variants: the affected child should have at least two variant alleles in a gene, one inherited from each heterozygous parent. And then there's the type of mutation that we want; this, again, is something that you will need to play with in your own projects. Generally what we recommend is starting at a relatively stringent level to see what you can find, and if you don't find what you're looking for, you relax criteria one at a time, backing down your stringency to let more variants come through, to see if that shows what you're looking for. We did a risky thing at the outset: we excluded variants that were in dbSNP, and excluded as well variants homozygous in controls (that's in our set of adults, who are presumably normal for this trait), and you can see in a few minutes how much trouble that can get you into. We also excluded variants with an allele frequency greater than 10%, reasoning that the carrier rate couldn't be anywhere near that high if this is a rare recessive disorder.
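A minimal sketch of that trio zygosity filter, assuming a simplified genotype encoding ('ref', 'het', 'hom') rather than the actual most-probable-genotype output, and requiring the child to carry two variant alleles, one from each heterozygous parent:

```python
# Illustrative trio filter for a recessive trait; field names are
# hypothetical, not the real pipeline's.

from collections import defaultdict

def recessive_candidate_genes(variants):
    """variants: dicts with 'gene' plus 'child', 'mother', 'father'
    genotypes encoded as 'ref', 'het', or 'hom'."""
    by_gene = defaultdict(list)
    for v in variants:
        by_gene[v["gene"]].append(v)
    candidates = set()
    for gene, vs in by_gene.items():
        # Homozygous in the child, heterozygous in both parents...
        hom = any(v["child"] == "hom" and v["mother"] == "het"
                  and v["father"] == "het" for v in vs)
        # ...or compound heterozygous: one het variant from each parent.
        from_mom = any(v["child"] == "het" and v["mother"] == "het"
                       and v["father"] == "ref" for v in vs)
        from_dad = any(v["child"] == "het" and v["father"] == "het"
                       and v["mother"] == "ref" for v in vs)
        if hom or (from_mom and from_dad):
            candidates.add(gene)
    return candidates
```

Applied genome-wide, a rule like this is what cuts the six-figure variant list down to a dozen candidate genes.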
That left us with a dozen genes in which the affected child had two variants in the same gene, present one in each parent. Here is the list of genes. Then Jennifer Johnson in the group looked manually at this list, curated it, and discovered that ACSF3 was a gene predicted to have a mitochondrial leader sequence, and it was a good candidate for that reason. I want to mention here that we also had to buttress the genetic data: in this case, we had seven unrelated affecteds held back for genetic confirmation of variants in the gene.
confirmation of variance in the gene. It's always a question when you start the
outset of this project how many samples go into the exome pipeline versus how many issues
hold back for manual genotyping PCR 3130 or capillary sequencing post hock.
That's a decision that you can make based on cost accordances because Eomes are expensive
but don't forget technician an post doc time is also a very expensive commodity as well
so balancing that and the right number of exomes versus follow-up work is an important
accordance. Whether and how to use DB SNP.
As Jamie said, dbSNP is a helpful but potentially dangerous tool to use. It is a repository of genetic variation that is completely independent of any relationship of a variant to any particular disease. Individual variants in dbSNP can be pathogenic, and some variants are going into dbSNP from disease studies and even from clinical pathology labs. The cohorts in those studies may be very, very different from what you think they might be. We had a situation where we were annotating genomes for cardiac conduction defects and found that over 2% of a sample set had the same variant in their samples. When we dug deep into the data, it became clear that this was a sample set of patients recruited for a pharmaceutical trial for cardiac conduction defects. Be careful what you use; it can be tedious to dig down to that level, so we rely on our own internal control sets, because we know them better.
So your use of dbSNP in filtering is iterative: you can try dbSNP early on, but don't hesitate to back off on that filtering if you don't find what you want, and use cutoffs that are conservative. We generally shoot for 5 to 10 times the estimated frequency of the disorder as a threshold, saying that if the frequency of any single variant is above that, it's unlikely to be causative. Remember cystic fibrosis, a good example: cystic fibrosis has a peculiar allelic distribution, where 70% of disease alleles are a single allele. So you have to treat that population carefully, so that you don't exclude a variant based on commonality when it might be causing the disease. Okay.
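That rule of thumb, taking 5 to 10 times the frequency implied by the disorder as an allele-frequency ceiling, can be sketched as follows; the Hardy-Weinberg step for recessives is my own gloss on the heuristic, not something stated in the talk:

```python
import math

def allele_frequency_ceiling(disease_freq, inheritance="recessive", fudge=10):
    """Heuristic cutoff: variants more common than ~5-10x the allele
    frequency implied by the disorder are unlikely to be causative.
    Beware skewed allelic spectra (e.g. one allele accounts for ~70%
    of cystic fibrosis chromosomes), which can break this rule."""
    if inheritance == "recessive":
        implied = math.sqrt(disease_freq)   # disease_freq ~ q^2, so q = sqrt
    else:
        implied = disease_freq / 2          # dominant: ~one allele per case
    return fudge * implied
```

For a recessive disorder affecting 1 in 10,000, this allows alleles up to about 10% frequency, which matches the 10% cutoff used in the trio project above.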
So we used our ClinSeq control set to estimate the carrier frequency of the variants identified in the study. What you can see here are the raw data, exported from Jamie's program into an Excel spreadsheet: homozygous reference, heterozygous, and homozygous non-reference calls. You can notice that in the control set we had a person pop up who was homozygous for a variant, the variant found and thought to be pathogenic in the patients with this metabolic acidosis disorder.
We looked at this patient carefully; it turns out she was symptomatic, and when we evaluated her biochemically, she has this disease. So you have to be very careful: your control set may itself contain people with the disease. This is one of the powers of exome and genome studies: you can find phenotypes that you didn't know you ought to be looking for, and who is affected and who is not can be challenged by these data.
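The control-set check just described amounts to something like this in code (the genotype encoding is hypothetical):

```python
# Illustrative sketch: estimate allele frequency from control genotype
# counts, and flag homozygous controls, who may actually be affected.

def allele_frequency(hom_ref, het, hom_alt):
    """Fraction of non-reference alleles among 2N control chromosomes."""
    total_alleles = 2 * (hom_ref + het + hom_alt)
    return (het + 2 * hom_alt) / total_alleles

def homozygous_controls(genotypes):
    """genotypes: {sample_id: 'hom_ref' | 'het' | 'hom_alt'}.
    Any sample returned here deserves clinical re-examination."""
    return sorted(s for s, g in genotypes.items() if g == "hom_alt")
```

The second function is the step that surfaced the symptomatic "control" in the story above: the filter didn't just estimate a carrier rate, it pointed at a person.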
So Chuck's group went on to do functional studies, enzymatic studies, and localization studies showing that this gene was the cause of the disorder. Next: dominant.
I'll give an example of two papers that appeared recently on a rare disorder with skeletal dysplasia and other anomalies. It was thought to be a great project because it has severe osteoporosis associated with it, and the authors were hopeful it would be generalizable to more common forms of osteoporosis. This is ultra-rare and autosomal dominant; they instituted relatively general filters, and they had several cases that were de novo. So this was another group using essentially the same processes, including Agilent capture and Illumina sequencing; 3 to 4 gigabases per sample were generated. They used one simplex case, with only one affected in the family, and multiplex family probands, so three samples went into the sequencing instrument. The filtering criteria were relatively similar: non-synonymous, nonsense, splice, and insertion/deletion variants, and they again instituted a relatively stringent dbSNP filter, as well as 1000 Genomes and their own controls. Their disorder was so rare, and it was dominant, so the variant should appear zero times in any of these control samples. They came out with only one gene that was mutated in all three cases, and that was NOTCH2. They went on to buttress that with genetic data: they Sanger sequenced additional cases, and 11 of 12 had mutations in the same gene. Seven were simplex cases where they had both parents, and these were confirmed as de novo using manual sequencing. Interestingly enough, this paper appeared in Nature Genetics and was published without functional data, because of its high priority as a skeletal phenotype. Next, sporadic de novo disorders: this is a genetic pattern essentially completely intractable with positional cloning.
The first one done was Kabuki syndrome, or Kabuki make-up syndrome as some people call this disorder. Again, a dysmorphic syndrome with skeletal phenotypes and intellectual disability; quite rare, and most cases are simplex, or apparently de novo. They took a slightly different approach. They used a strategy that you might not expect: they did not exploit the de novo aspect of this disorder. What they did instead was sequence across a number of affected probands to see if they could find a variant that way. Reading between the lines, I'm guessing they were concerned about the error rate of the platform leading to too many false positives for potential de novos. Their selection platform, which works quite well in addition to the other ones you have heard about today, was an on-array selection approach, followed by sequencing at a relatively lower depth than we use, only about 40X coverage of the regions. This is a good example, and I applaud the authors: they published a filtering strategy that they used that did not work.
I thought it was illustrative. What they asked was: of their ten probands, how many shared variants in genes meeting the given criteria? Here that was non-synonymous, splicing, and nonsense variants; absent in dbSNP and 1000 Genomes; absent in their own controls; and absent in both data sets. For single-proband gene variants you can see they had about 7,000 genes across those specimens. Requiring variants in the same gene in two or more individuals narrowed it down, and then successively 3, 4, 5, 6, 7, 8, 9. As they went across and down the table, the numbers go down, because there's more and more filtering. At the end that left a single gene, and the variant in it was not the cause of the disease.
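That sharing grid can be reproduced in miniature; the gene names and proband data here are invented for illustration:

```python
# Illustrative: count how many probands carry a qualifying variant in
# each gene, then ask which genes are shared by at least n of them.

from collections import Counter

def genes_shared_by_at_least(proband_gene_sets, n):
    """proband_gene_sets: one set of candidate genes per proband."""
    counts = Counter(g for genes in proband_gene_sets for g in genes)
    return {gene for gene, c in counts.items() if c >= n}

probands = [{"GENE_A", "GENE_B"}, {"GENE_A"}, {"GENE_A", "GENE_C"}]
```

Raising `n` is exactly the across-and-down march they showed; the lesson of the Kabuki story is that locus heterogeneity and missed exons make the strictest `n` the wrong place to stop.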
So this is a good example of how filtering can leave you at a dead end: you have to then take a step back and ask, what's an alternative strategy, what are the potential explanations for not finding the gene, and how should I do it next? They then tried another strategy. They went back and did really careful phenotyping and rank-ordered their patients, using clinical experts to rank the patients from the most severe to the least severe, reasoning that they should be filtering based on the strongest, most clear-cut cases.
They also implemented here, and you'll notice on the previous slides there was no assessment of the functional consequence of the predicted change in the genome, a manual review of those data against the rank-ordered patients. They identified nonsense variants in this gene, MLL2, in four of the four highest-ranked, i.e., the most severe, cases, and then a couple of hits in the lower-ranked cases. So you can see why they failed with the first filtering strategy: they were trying to be too inclusive. They had two problems. One, there is probably locus heterogeneity for this disorder, so if you filter across all the probands, you're going to stumble on that.
you're going to stumble on that. Number 2, if you do not have great coverage
of the genome in the original capture so they had overall 96% coverage but missing a significant
number of exons in this particular gene and when they went back and did manual capillary
sequencing, they found mutations in two of the cases that were missed by the next GEN
sequencing. Then they when and did manual testing on that
candidate gene in 43 cases, found mutations in more than half of them and then went back
and did the de novo check for cases of DNA samples of both parents and in 12 of 12 cases
both parents were negative for the mutation confirming that it was de novo.
The lesson here: your friend is allelic heterogeneity. There was some locus heterogeneity, one thing that tripped them up in the earlier filtering. And here is an example of two groups working on this at the same time: a European group made public the fact that they did not find the causative gene because their exome targeting kit didn't include MLL2. If your targeting kit doesn't include your gene, spend all you want on sequencing and you won't find a mutation. No functional data were included here.
I'll go through briefly a mosaic disorder, to highlight sampling issues. This is a disorder of asymmetric overgrowth; it includes pigmentary lesions and vascular malformations, it's never familial, and it has been described in discordant monozygotic twins. The hypothesis was that this is a somatic, not a germline, disorder, which explains the clinical observations but poses some challenges: for the sequencing you have to take a different approach. Almost all sequencing is done using peripheral blood mononuclear cell DNA samples from affected patients; here you would not necessarily want to do that. If a disorder is mosaic, you might not want to sequence just any part of the patient; you might sequence and compare what's affected with what's unaffected. Fortunately for us as well, there was a pair of discordant monozygotic twins we had access to, and the unaffected twin was a great control. We sequenced DNA from skin biopsies and surgical specimens, from clinically affected and unaffected areas, based on the clinical judgment of those taking care of the patients, as well as tissues harvested in the operating room with a clinician standing at the surgeon's side designating specific tissues as affected or unaffected. No blood DNA was used in the sequencing study, because there was no known hematologic phenotype, so the affection status of blood couldn't be determined.
The filtering was pretty similar to the previous ones, though which tissues were compared was again quite different. So we have the same set: non-synonymous, nonsense, splice, and indels, absent in dbSNP; though Jamie showed how that could trip us up, because when TCGA sequenced tumors they found the same variant we did in cancers and deposited it in dbSNP, so that leads to a false negative result if you filter it out. Most affected/unaffected intrapatient pairs had between 100 and 300 sequence differences in the exome data; those had to be validated manually, and one in fact was persistent.
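Comparing affected and unaffected specimens ultimately comes down to comparing mutant-allele fractions; the read counts below are invented to illustrate the shape of the comparison:

```python
# Illustrative: mutant-allele fraction per specimen, averaged per group.

def mean_mutant_fraction(specimens):
    """specimens: list of (mutant_reads, total_reads) pairs."""
    return sum(m / t for m, t in specimens) / len(specimens)

affected = [(18, 100), (25, 100), (30, 120)]     # hypothetical counts
unaffected = [(2, 100), (0, 100), (5, 150)]      # hypothetical counts
```

A mosaic mutation shows up not as present-versus-absent but as a shifted allele fraction between the two groups, which is what the next slide's blue-and-white distribution displays.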
You can see here the distribution of the mutation in blue and the non-mutant allele in white: all the patients have the same variant, at different frequencies in different tissues. Looking across hundreds of specimens, you see a bias: the affected tissues have a higher mutation frequency than the unaffected ones, and blood was borne out to be not a fruitful place to look for this mutation. So, following up with the patients, 29 of 31 had the same mutation. The mutation was more often found in the affected tissues than in the unaffected tissues, peripheral blood was not positive, and it was absent in controls. It was absent as a call in the 1000 Genomes Project. But you can dig deeper into the 1000 Genomes data and ask how many raw sequence reads include the variant you're interested in. We did that, and we found that a single sequence read out of 30,000 at that position in the 1000 Genomes data in fact had this variant. But of course that base wasn't called, because it was below the base-calling stringency.
as predicted by the mutation. Where are we in the big picture?
You can see from these five examples, covering all the different genetic models of Mendelian traits in humans, that you can use exome sequencing to delineate the molecular etiology. It doesn't always work, but it can work a lot of the time, and it can be a fruitful way to get projects moving again. For malformation syndromes in humans, there are two and a half thousand clinical entities in this textbook, and this is an anonymized database for syndromes with four and a half thousand entities; few of these have a known genetic etiology.
For many, little is known about the natural history or how to diagnose patients, so it's a huge challenge from the clinical and basic science perspectives to push through these and find the etiologies, and I'm sure all of you have other classes of phenotypes with hundreds if not thousands more projects that need to be undertaken along these lines. I think as well, taking this beyond what we do here in a research environment, the exome and the genome are going to become clinical diagnostic tools. So I hope that as all of you start to work with these data, you think creatively about how to solve these problems and answer these questions, and put these tools out there, so that these things can be transformed from esoteric research tools into clinical ones, so we can improve clinical diagnosis going forward.
So in thinking about how you do a project, you have to think about the entire picture: all the way from what the disorder is that I'm studying, its genetic aspects, and the clinical and phenotypic attributes that allow me to determine what samples I should sequence; to what filters I should apply to the data that come out of the instrument; to the genetic data I need to buttress the results from the next-gen sequencing instrument; and coupling that to functional assays downstream that nail down the cause-and-effect relationship, so that we can have robust associations of genotype with the diseases we're studying, and not write a bunch of just so stories. Thank you very much for your time.
We have one more talk then we'll take some questions.