>> Okay. Welcome back.
I hope everybody is properly caffeinated and snacked.
We will take it to the next step. So I am going to present sort of phase 3. You heard from Jim Mullikin about some of the nuts and bolts: how the DNA goes into the instruments, and how raw data are generated and initially analyzed. You heard from Jamie about intermediate analysis and some user tools that people can use to
begin to manipulate the data. So what I want to present next are some general
considerations. I come at this from the perspective of a practicing
clinical human geneticist. And I think about what kinds of projects ought
to go into this pipeline and how I'm going to look at the data.
So I'll give you some examples of how those things work.
What everyone is asking themselves is: what can I do with these tools that are so widely available and becoming so inexpensive? Which of my patients should I bring to be sequenced using this technology, and which of the samples that I have previously collected might be appropriate for that? There's lots to choose from, and a lot of considerations that you need to take into account to decide which projects you want to pursue and how you will analyze them to answer the questions that you want to approach.
There's good news and bad news here, as always. Exomes are not magic. They can work wonders and do things absolutely not possible before. You can do things with small families that you could not do with traditional positional cloning in the past. That's certainly an incredibly powerful option with exomes. You can, if you will, unstick stuck cloning projects: basically, in the old days, if you landed in a huge region and you weren't in the mood to sequence 300 or 400 genes using a capillary instrument, you can now approach that with exomes. De novo dominant disorders were completely intractable using positional cloning, because there was no mapping to be done.
And there are other circumstances I'll talk about. This summer I attended the Gordon Conference on human genetics and genomics, and there was discussion about how often exome sequencing works for Mendelian disease projects. The consensus from a number of the speakers and presenters at the meeting, which is a bunch of people doing good work in this area, is that it works about 30 to 40% of the time. So it's not magic.
There are reasons why your project may not work.
So hopefully this course and my talk will help you figure out which ones are more likely
to work, so we can increase that number. We're also subject to publication bias.
Not too many people want to publish unsuccessful exome gene identification projects.
So we don't get to see those, and don't get to see what can go wrong, but I'll try to give you some background to think about that. There are lots of reasons for failures to happen.
Another thing to think about: it is not necessarily enough these days to do a gene identification by exome sequencing to get yourself a good publication out of your efforts.
So coupling your exome disease gene identification to additional genetic and functional data
should help that a lot. I'll give you some examples of that.
So what I'm going to talk about in this talk, from the perspective of a practicing physician who wants to understand what's going on in a patient, is: what is exome and what is not, because you have to think about that, and the exome is not the whole genome, so what are those differences; the differences between exome sequencing and positional cloning, since they each have strengths and weaknesses; and then some how-tos. I'm going to give examples of five projects, some from our lab, some from other people's labs, to give a flavor of different types of projects: recessive, dominant, sporadic de novo, and a mosaic case. Okay.
So what is a whole exome sequence? This is somewhat of an unfortunate label, because it is not all that whole. What it's supposed to include, by the definition of an exome, is the sequence of all exons in the genome. There are problems here. The first is that we don't currently know what all the genes in the genome are. The second is that we do not necessarily know all the exons of the genes we recognize to be there. Non-coding exons are not consistently targeted: some kits and approaches target those, some do not. Not all the exons that you do target are effectively captured. And if you do target them, not all the sequences that you generate can be aligned and have accurate bases called from them. That leaves you with a fraction of the exome that you're actually interrogating.
Things are missing from the exome sequence. Some genes are just not there at all; you'll see an example of how that tripped up some investigators in a study example I'm going to give later. Some parts of some genes may not be there. Almost all the non-genic control elements that govern gene regulation are going to be missing from an exome sequence. Non-canonical or deep intronic splicing elements, which are important sites for mutation in genes, may be missing. Structural DNA variation, such as small inversions or movements of pieces of DNA within your genome, is very poorly assessed by exome sequence, so if your disease is caused by a mutation of that nature, you're in trouble. Copy number variation is not tractable using exome sequence. Mitochondrial DNA is mostly not targeted, but some information on it can be derived from an exome. Some microRNA and other small RNA molecules are included in exome sequence, some are not. If you think your disease or trait can be caused by one of those DNA elements, you have to think hard about whether exome is the right approach, or whether you should go to whole genome, as Jim discussed in the first talk. Now, to set up the pros and cons, the strengths and weaknesses, of exome sequencing versus positional cloning.
On the second slide: with exome sequencing you can get away with using very small families, and sometimes single individuals, to identify potentially causative mutations. Those of you who practice positional cloning know it is nearly essential to have large families if you want to do meiotic mapping, which is the first step of a positional cloning project. For exome sequencing, in general what you want is locus homogeneity; it's going to dramatically reduce the number of samples you need to sequence to find the variants that you want to find, because in most cases you're going to be looking across samples for variation in genes shared among the different affected persons or samples. The more loci you have that cause your trait, the harder it will be to do that. Locus heterogeneity is less of a problem for positional cloning: when you're doing your meiotic mapping you are localizing genes and can tell right away whether the family that you're analyzing maps to a given locus or not, using linkage inclusion or exclusion. So this is an issue that is tricky for exome sequencing; it can cause difficulties in positional cloning, but there it is easily sorted out.
Allelic heterogeneity in an exome project is a great thing. Because of the overall sensitivity, as discussed and shown by Jim Mullikin: if you have 90% sensitivity for detecting any given part of the exome, and it were the case that your trait is caused by a single mutation in a single gene, then there is a 10% chance that no matter how you flog that exome, the variant will not be there. If you have allelic heterogeneity, that spreads your risk across multiple sequencing targets and increases the chance that you are going to see it. And for positional cloning, allelic heterogeneity is not a big issue. The "one in the hand versus one in the bush" trade-off is tricky to parse when doing exome sequencing. You're going to be faced with a list of variants
that are the potential cause of the traits that you're studying.
Your task at that juncture is to decide whether to pursue a subset of the variants identified, versus pursuing in more depth other variants that may not have been initially detected, or may be problematic to detect in your samples. That's a trade-off, and a critical decision juncture you have to consider as you move through the gene identification pipeline. For positional cloning this is not an issue. These exome candidates are most likely scattered all over the genome; with positional cloning you have a region of the genome nailed down, and you know you need to interrogate just about everything in that region to be thorough, which you can't do with an exome sequence. Phenocopies are not a big issue in exome sequencing, as you'll see in the examples I'll show: patients who appear to have the phenotype but don't have it for the underlying genetic reason that you think is the cause. Phenocopies can be a huge issue in mapping, because they can lead you to wrongly exclude or include regions in your mapping.
There are pros and cons, and you need to think about your project, phenotype, or trait, and see how it fits in here and how to approach it.
There was also a lot of talk at the Gordon Conference about type 1 error. The disparaging term thrown around at the meeting was that exome sequencing or whole genome sequencing has the potential of generating what are called "just so stories." This is based on the Kipling book, in which fanciful stories built on circular reasoning, meaning nothing, can be beautifully drawn up and be completely wrong. That's really a manifestation of type 1 error. When you're interrogating 20,000 candidate genes for a trait, the prior probability of any one variant that you find being the cause is small, and you can be led down the garden path if you don't do studies to buttress and validate the findings and carry them forward. Without meiotic mapping, which decreases your type 1 statistical error problem, you're going to need additional sources of evidence for causation. In fact, I'll show later how you can fold mapping and linkage data in to support the finding and reduce the probability of that error.
Now the first example of the five traits I'm going to talk about today. It's first because it's the easiest to do, because it's the smallest amount of the genome to interrogate: an X-linked disorder. It doesn't so much matter what the trait is, but its general nature is important, because the character of your trait will determine what filters you want to apply and how you think about the results of the filtering of that sequencing project. This is an X-linked trait that causes severe congenital anomalies, with 100% lethality, which helped a lot in the filtering. Beyond that, the fact that it was ultra-rare, with only 2 families known to be affected with this trait, allows you to do powerful things with your filtering against controls, as Jamie described. In this particular case you also have to consider sample availability and the quality and quantity of the DNA that you have; it turned out we didn't have usable DNA on the boys for exome, so we sequenced the carriers.
What do these numbers look like? For X, here are the raw starting numbers, what the initial data onslaught looks like that you need to deal with. The "exome capture" is in quotes because we got a hard time from reviewers who thought we shouldn't call it an exome at all, because it's the exons of a single chromosome. It covered all the exons on X that are not in the pseudoautosomal regions of the X chromosome, with genes defined as the UCSC-annotated coding exons; so right there, again, you see a decrement of interrogation from everything on X, starting to notch our way downwards. The sequences for the two samples were about 20 million reads off the Illumina instrument, two thirds to three quarters of a billion base pairs of sequence. Of that sequence, just under half aligned back to the intended sequencing target of the experiment. That sounds low, but it's good, because the coding region is a small fraction of the DNA that went into the experiment, so it's an incredibly strong enrichment: half aligned to those targeted exons, with high coverage. And as Jim mentioned, with the wide dynamic range of coverage you see in exome sequencing, you need to go deep to ensure you have a lot of bases covered by 10X or more sequencing, because of the tail on the left side of the distribution. We had about two million base pairs of sequence at 10X coverage, which in that case was 77% of the target. I think the hit rate would be higher now with new versions of this kit; this is old technology, 18 months old. Things change so fast here. Just as a note: I mentioned we were sequencing
carrier females, so you can use an autosomal base caller for calling the variants. But if you do an X-linked project on males, you need a different caller, because the algorithm matters for calling variants in hemizygous males. So here's what the filter looked like. It was based on a number of criteria from the biology and genetics of the trait. Since we were sequencing carrier females, we said the variant should be heterozygous. We knew that the trait was severe, so we could set a relatively high filtering threshold for the kinds of predicted changes in the genome: we didn't want mild missense variation, so we looked for non-synonymous variants, indels that frameshift the open reading frame, nonsense, and those types of variants; splice variants should be on here also, that's a typo on the slide. Again, because this trait was so rare, we could implement relatively stringent filters on the presence of the putative variant in controls.
And we started out here, with two samples, with 350 substitutions; this already excludes a number of changes, as thousands of variants had already been excluded by implementing some of these filters. Filtering on heterozygous, non-synonymous, and absent in dbSNP is, as Jamie mentioned, a dangerous thing to do, and you have to be careful about it. More to say about that in a few slides.
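As a rough illustration, the carrier-based filter just described might be sketched like this in code; the record layout, field names, and control-set representation are hypothetical, not the pipeline actually used:

```python
# Hypothetical sketch of the X-linked carrier filter described above.
# Field names and data layout are illustrative, not the real pipeline.

SEVERE_EFFECTS = {"nonsense", "frameshift", "splice", "non-synonymous"}

def passes_carrier_filter(variant, control_variant_sets, dbsnp_ids):
    """Keep variants consistent with a severe X-linked trait when the
    sequenced samples are heterozygous carrier females."""
    return (
        variant["chrom"] == "X"                      # X-linked disorder
        and variant["genotype"] == "het"             # carriers are heterozygous
        and variant["effect"] in SEVERE_EFFECTS      # severe predicted change
        and variant["id"] not in dbsnp_ids           # absent in dbSNP (risky!)
        and not any(variant["id"] in s for s in control_variant_sets)
    )
```

Each clause mirrors one row of the filter on the slide; in practice the dbSNP and control exclusions are the dangerous ones, for the reasons discussed later.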
We sequenced concurrently patients who had other disorders, and we excluded variants seen in those samples as well. What you can see here is that this worked exceedingly well. So this was an easy project. Family one had no substitution variants that met all the filters, and it had one variant, a frameshifting indel, that did pass the filters. Family 2 had a single substitution variant and no frameshifting variants in the exons on the X chromosome. It turned out this hit and this hit were in the same gene. So that's a nice tidy result.
This was an example of what I was talking about earlier with a stuck project. This was a project that had sat in the laboratory for five or six years, because it was mapped to a 25 to 30 megabase region of the X chromosome that included 200 or 300 candidate genes. We just didn't have the wherewithal to sequence those using PCR and a 3130 capillary sequencing instrument. So it sat, until the X exome came along and we applied that tool, and this worked. Now, this was a fortunate occurrence, because the simple stringent filter that we implemented took us right to variants in a single gene. You could say: what if that wasn't the case, and we weren't quite so lucky? What could you do? This is an example of what I was referring to earlier: you can use meiotic mapping, linkage, or haplotype data to exclude regions of the genome, to reduce the number of variants you have to consider.
If you use mapping data from a family like this, that takes us to a region of the X chromosome and eliminates between 60 and 80% of the variants from consideration, because they are excluded by linkage alone. The important thing here is that this does not require the high thresholds that we used to apply to ourselves for ab initio linkage mapping, a LOD score of 3.0 for an autosomal locus or 2.0 for an X locus. Whatever linkage you can extract from the family you have will help eliminate variants. Any amount of cranking that down, if the first filters don't give you a hint, can be very, very helpful in pushing variants out of consideration.
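The idea of folding linkage information into the variant filter is mechanically simple; the coordinates and data layout below are made up for illustration:

```python
# Illustrative only: drop variants that fall outside the interval
# supported by whatever linkage/haplotype data the family provides.

def within_linked_region(variant, region):
    """region = (chrom, start, end) from meiotic mapping."""
    chrom, start, end = region
    return variant["chrom"] == chrom and start <= variant["pos"] <= end

variants = [
    {"chrom": "X", "pos": 5_000_000},    # excluded by linkage
    {"chrom": "X", "pos": 80_000_000},   # inside the mapped interval
]
region = ("X", 60_000_000, 90_000_000)   # hypothetical linked interval
surviving = [v for v in variants if within_linked_region(v, region)]
```

Even a weak interval, far below a LOD score of 3.0, shrinks the candidate list this way.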
But again, you have to be careful: the old criteria we used in positional cloning, about whether you should allow a single recombination, or two, to exclude a region, still apply, because you can erroneously push things in and out of consideration. That's especially true if your disorder has phenocopies or ambiguous affection-status assignment among the patients you're studying. In this particular case, because this was a gene
of no known function with no assay available, the only thing we could look for as supportive evidence was expression. So we took this to some colleagues for mouse expression analysis; they cloned the mouse gene, hybridized it to mouse embryos, and showed that it was expressed in the tissues one would predict based on the phenotype in humans, which was nice. That's not the strongest supporting evidence, but if you're on the X chromosome, the probability of a type 1 error is much lower, so not quite as robust evidence is needed.
Next, autosomal recessive. This is a case we worked on with a colleague, Chuck Venditti, in our institute. This is a severe childhood-onset metabolic acidosis disorder. Chuck's group did work to exclude other known loci that can cause this phenotype. This was a cohort of patients in whom all known causes had been excluded, and this was an exome project where we sequenced only three samples: an affected child and his two parents. So this single trio led us to variants. These are autosomal traits, now looking across the genome, and this gives you the kinds of numbers that Jim and Jamie both started with, in the 100,000 to 150,000 range for variants. And then we started implementing filters to
crank the number down, because obviously this is completely intractable. We could exclude a number of variants based on the most-probable-genotype calling that Jim described. Then we used the zygosity status of the samples to exclude variants: the affected child should have at least two variant alleles in a gene, one inherited from each heterozygous parent. And then there's the type of mutation that we want; this, again, is something that you will need to play with in your own projects. Generally what we recommend is starting at a relatively stringent level to see what you can find, and if you don't find what you're looking for, you relax criteria one at a time, backing down your stringency to let more variants come through, to see if that shows what you're looking for. We did a risky thing at the outset: we excluded variants that were in dbSNP, and excluded as well variants homozygous in controls (that's in our set of adults, who are presumably normal for this trait), and you can see in a few minutes how much trouble that can get you into. We also excluded variants with an allele frequency greater than 10%, reasoning that the carrier rate couldn't be anywhere near that high if this is a rare recessive disorder.
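A minimal sketch of that trio zygosity filter, assuming a simplified genotype encoding ('ref', 'het', 'hom') rather than the actual most-probable-genotype output, and requiring the child to carry two variant alleles, one from each heterozygous parent:

```python
# Illustrative trio filter for a recessive trait; field names are
# hypothetical, not the real pipeline's.

from collections import defaultdict

def recessive_candidate_genes(variants):
    """variants: dicts with 'gene' plus 'child', 'mother', 'father'
    genotypes encoded as 'ref', 'het', or 'hom'."""
    by_gene = defaultdict(list)
    for v in variants:
        by_gene[v["gene"]].append(v)
    candidates = set()
    for gene, vs in by_gene.items():
        # Homozygous in the child, heterozygous in both parents...
        hom = any(v["child"] == "hom" and v["mother"] == "het"
                  and v["father"] == "het" for v in vs)
        # ...or compound heterozygous: one het variant from each parent.
        from_mom = any(v["child"] == "het" and v["mother"] == "het"
                       and v["father"] == "ref" for v in vs)
        from_dad = any(v["child"] == "het" and v["father"] == "het"
                       and v["mother"] == "ref" for v in vs)
        if hom or (from_mom and from_dad):
            candidates.add(gene)
    return candidates
```

Applied genome-wide, a rule like this is what cuts the six-figure variant list down to a dozen candidate genes.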
That left us with a dozen genes in which the affected child had two variants in the same gene, present one in each parent. Here is the list of genes. Then Jennifer Johnson in the group looked manually at this list, curated it, and discovered that ACSF3 was a gene predicted to have a mitochondrial leader sequence, and it was a good candidate for that reason. I want to mention here that we also had to buttress the genetic data: in this case, we had seven unrelated affecteds held back for genetic confirmation of variants in the gene.
confirmation of variance in the gene. It's always a question when you start the
outset of this project how many samples go into the exome pipeline versus how many issues
hold back for manual genotyping PCR 3130 or capillary sequencing post hock.
That's a decision that you can make based on cost accordances because Eomes are expensive
but don't forget technician an post doc time is also a very expensive commodity as well
so balancing that and the right number of exomes versus follow-up work is an important
accordance. Whether and how to use DB SNP.
As Jamie said, dbSNP is a helpful but potentially dangerous tool to use. It is a repository of genetic variation that is completely independent of any relationship of a variant to any particular disease. Individual variants in dbSNP can be pathogenic, and some variants are going into dbSNP from disease studies and even from clinical pathology labs. The cohorts in those studies may be very, very different from what you think they might be. We had a situation where we were annotating genomes for cardiac conduction defects and found that over 2% of a sample set had the same variant in their samples. When we dug deep into the data, it became clear that this was a sample set of patients recruited for a pharmaceutical trial for cardiac conduction defects. Be careful what you use; it can be tedious to dig down to that level, so we rely on our own internal control sets, because we know them better.
So your use of dbSNP in filtering is iterative: you can try dbSNP early on, but don't hesitate to back off on that filtering if you don't find what you want, and use cutoffs that are conservative. We generally shoot for 5 to 10 times the estimated frequency of the disorder as a threshold, saying that if the frequency of any single variant is above that, it's unlikely to be causative. Remember cystic fibrosis, a good example: cystic fibrosis has a peculiar allelic distribution, where 70% of disease alleles are a single allele. So you have to treat that population carefully, so that you don't exclude a variant based on commonality when it might be causing the disease. Okay.
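That rule of thumb, taking 5 to 10 times the frequency implied by the disorder as an allele-frequency ceiling, can be sketched as follows; the Hardy-Weinberg step for recessives is my own gloss on the heuristic, not something stated in the talk:

```python
import math

def allele_frequency_ceiling(disease_freq, inheritance="recessive", fudge=10):
    """Heuristic cutoff: variants more common than ~5-10x the allele
    frequency implied by the disorder are unlikely to be causative.
    Beware skewed allelic spectra (e.g. one allele accounts for ~70%
    of cystic fibrosis chromosomes), which can break this rule."""
    if inheritance == "recessive":
        implied = math.sqrt(disease_freq)   # disease_freq ~ q^2, so q = sqrt
    else:
        implied = disease_freq / 2          # dominant: ~one allele per case
    return fudge * implied
```

For a recessive disorder affecting 1 in 10,000, this allows alleles up to about 10% frequency, which matches the 10% cutoff used in the trio project above.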
So we used our ClinSeq control set to estimate the carrier frequency of the variants identified in the study. What you can see here are the raw data, exported from Jamie's program into an Excel spreadsheet: homozygous reference, heterozygous, and homozygous non-reference calls. You can notice that in the control set we had a person pop up who was homozygous for a variant, the variant found and thought to be pathogenic in the patients with this metabolic acidosis disorder.
We looked at this patient carefully; it turns out she was symptomatic, and when we evaluated her biochemically, she has this disease. So you have to be very careful: your control set may itself contain people with the disease. This is one of the powers of exome and genome studies: you can find phenotypes that you didn't know you ought to be looking for, and who is affected and who is not can be challenged by these data.
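The control-set check just described amounts to something like this in code (the genotype encoding is hypothetical):

```python
# Illustrative sketch: estimate allele frequency from control genotype
# counts, and flag homozygous controls, who may actually be affected.

def allele_frequency(hom_ref, het, hom_alt):
    """Fraction of non-reference alleles among 2N control chromosomes."""
    total_alleles = 2 * (hom_ref + het + hom_alt)
    return (het + 2 * hom_alt) / total_alleles

def homozygous_controls(genotypes):
    """genotypes: {sample_id: 'hom_ref' | 'het' | 'hom_alt'}.
    Any sample returned here deserves clinical re-examination."""
    return sorted(s for s, g in genotypes.items() if g == "hom_alt")
```

The second function is the step that surfaced the symptomatic "control" in the story above: the filter didn't just estimate a carrier rate, it pointed at a person.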
So Chuck's group went on to do functional studies, enzymatic studies, and localization studies showing that this gene was the cause of the disorder. Next: dominant.
I'll give an example of two papers that appeared recently on a rare disorder with skeletal dysplasia and other anomalies. It was thought to be a great project because it has severe osteoporosis associated with it, and the authors were hopeful it would be generalizable to more common forms of osteoporosis. This is ultra-rare and autosomal dominant; they instituted relatively general filters, and they had several cases that were de novo. So this was another group using essentially the same processes, including Agilent capture and Illumina sequencing; 3 to 4 gigabases per sample were generated. They used one simplex case, with only one affected in the family, and multiplex family probands, so three samples went into the sequencing instrument. The filtering criteria were relatively similar: non-synonymous, nonsense, splice, and insertion/deletion variants, and they again instituted a relatively stringent dbSNP filter, as well as 1000 Genomes and their own controls. Their disorder was so rare, and it was dominant, so the variant should appear zero times in any of these control samples. They came out with only one gene that was mutated in all three cases, and that was NOTCH2. They went on to buttress that with genetic data: they Sanger sequenced additional cases, and 11 of 12 had mutations in the same gene. Seven were simplex cases where they had both parents, and these were confirmed as de novo using manual sequencing. Interestingly enough, this paper appeared in Nature Genetics and was published without functional data, because of its high priority as a skeletal phenotype. Next, sporadic de novo disorders: this is a genetic pattern essentially completely intractable with positional cloning.
The first one done was Kabuki syndrome, or Kabuki make-up syndrome as some people call this disorder. Again, a dysmorphic syndrome with skeletal phenotypes and intellectual disability; quite rare, and most cases are simplex, or apparently de novo. They took a slightly different approach. They used a strategy that you might not expect: they did not exploit the de novo aspect of this disorder. What they did instead was sequence across a number of affected probands to see if they could find a variant that way. Reading between the lines, I'm guessing they were concerned about the error rate of the platform leading to too many false positives for potential de novos. Their selection platform, which works quite well in addition to the other ones you have heard about today, was an on-array selection approach, followed by sequencing at a relatively lower depth than we use, only about 40X coverage of the regions. This is a good example, and I applaud the authors: they published a filtering strategy that they used that did not work.
I thought it was illustrative. What they asked was: of their ten probands, how many shared variants in genes meeting the given criteria? Here that was non-synonymous, splicing, and nonsense variants; absent in dbSNP and 1000 Genomes; absent in their own controls; and absent in both data sets. For single-proband gene variants you can see they had about 7,000 genes across those specimens. Requiring variants in the same gene in two or more individuals narrowed it down, and then successively 3, 4, 5, 6, 7, 8, 9. As they went across and down the table, the numbers go down, because there's more and more filtering. At the end that left a single gene, and the variant in it was not the cause of the disease.
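That sharing grid can be reproduced in miniature; the gene names and proband data here are invented for illustration:

```python
# Illustrative: count how many probands carry a qualifying variant in
# each gene, then ask which genes are shared by at least n of them.

from collections import Counter

def genes_shared_by_at_least(proband_gene_sets, n):
    """proband_gene_sets: one set of candidate genes per proband."""
    counts = Counter(g for genes in proband_gene_sets for g in genes)
    return {gene for gene, c in counts.items() if c >= n}

probands = [{"GENE_A", "GENE_B"}, {"GENE_A"}, {"GENE_A", "GENE_C"}]
```

Raising `n` is exactly the across-and-down march they showed; the lesson of the Kabuki story is that locus heterogeneity and missed exons make the strictest `n` the wrong place to stop.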
So this is a good example of how filtering can leave you at a dead end: you have to then take a step back and ask, what's an alternative strategy, what are the potential explanations for not finding the gene, and how should I do it next? They then tried another strategy. They went back and did really careful phenotyping and rank-ordered their patients, using clinical experts to rank the patients from the most severe to the least severe, reasoning that they should be filtering based on the strongest, most clear-cut cases.
They also implemented here, and you'll notice on the previous slides there was no assessment of the functional consequence of the predicted change in the genome, a manual review of those data against the rank-ordered patients. They identified nonsense variants in this gene, MLL2, in four of the four highest-ranked, i.e., the most severe, cases, and then a couple of hits in the lower-ranked cases. So you can see why they failed with the first filtering strategy: they were trying to be too inclusive. They had two problems. One, there is probably locus heterogeneity for this disorder, so if you filter across all the probands, you're going to stumble on that.
you're going to stumble on that. Number 2, if you do not have great coverage
of the genome in the original capture so they had overall 96% coverage but missing a significant
number of exons in this particular gene and when they went back and did manual capillary
sequencing, they found mutations in two of the cases that were missed by the next GEN
sequencing. Then they when and did manual testing on that
candidate gene in 43 cases, found mutations in more than half of them and then went back
and did the de novo check for cases of DNA samples of both parents and in 12 of 12 cases
both parents were negative for the mutation confirming that it was de novo.
The lesson here: your friend is allelic heterogeneity. There was some locus heterogeneity, one thing that tripped them up in the earlier filtering. And here is an example of two groups working on this at the same time: a European group made public the fact that they did not find the causative gene because their exome targeting kit didn't include MLL2. If your targeting kit doesn't include your gene, spend all you want on sequencing and you won't find a mutation. No functional data were included here.
I'll go through briefly a mosaic disorder, to highlight sampling issues. This is a disorder of asymmetric overgrowth; it includes pigmentary lesions and vascular malformations, it's never familial, and it has been described in discordant monozygotic twins. The hypothesis was that this is a somatic, not a germline, disorder, which explains the clinical observations but poses some challenges: for the sequencing you have to take a different approach. Almost all sequencing is done using peripheral blood mononuclear cell DNA samples from affected patients; here you would not necessarily want to do that. If a disorder is mosaic, you might not want to sequence just any part of the patient; you might sequence and compare what's affected with what's unaffected. Fortunately for us as well, there was a pair of discordant monozygotic twins we had access to, and the unaffected twin was a great control. We sequenced DNA from skin biopsies and surgical specimens, from clinically affected and unaffected areas, based on the clinical judgment of those taking care of the patients, as well as tissues harvested in the operating room with a clinician standing at the surgeon's side designating specific tissues as affected or unaffected. No blood DNA was used in the sequencing study, because there was no known hematologic phenotype, so the affection status of blood couldn't be determined.
The filtering was pretty similar to the previous ones, though which tissues were compared was again quite different. So we have the same set: non-synonymous, nonsense, splice, and indels, absent in dbSNP; though Jamie showed how that could trip us up, because when TCGA sequenced tumors they found the same variant we did in cancers and deposited it in dbSNP, so that leads to a false negative result if you filter it out. Most affected/unaffected intrapatient pairs had between 100 and 300 sequence differences in the exome data; those had to be validated manually, and one in fact was persistent.
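Comparing affected and unaffected specimens ultimately comes down to comparing mutant-allele fractions; the read counts below are invented to illustrate the shape of the comparison:

```python
# Illustrative: mutant-allele fraction per specimen, averaged per group.

def mean_mutant_fraction(specimens):
    """specimens: list of (mutant_reads, total_reads) pairs."""
    return sum(m / t for m, t in specimens) / len(specimens)

affected = [(18, 100), (25, 100), (30, 120)]     # hypothetical counts
unaffected = [(2, 100), (0, 100), (5, 150)]      # hypothetical counts
```

A mosaic mutation shows up not as present-versus-absent but as a shifted allele fraction between the two groups, which is what the next slide's blue-and-white distribution displays.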
You can see here the distribution of the mutation in blue and the non-mutant allele in white: all the patients have the same variant, at different frequencies in different tissues. Looking across hundreds of specimens, you see a bias: the affected tissues have a higher mutation frequency than the unaffected ones, and blood was borne out to be not a fruitful place to look for this mutation. So, following up with the patients, 29 of 31 had the same mutation. The mutation was more often found in the affected tissues than in the unaffected tissues, peripheral blood was not positive, and it was absent in controls. It was absent as a call in the 1000 Genomes Project. But you can dig deeper into the 1000 Genomes data and ask how many raw sequence reads include the variant you're interested in. We did that, and we found that a single sequence read out of 30,000 at that position in the 1000 Genomes data in fact had this variant. But of course that base wasn't called, because it was below the base-calling stringency.
as predicted by the mutation. Where are we in the big picture?
You can see from these five examples, covering all the different genetic models of Mendelian traits in humans, that you can use exome sequencing to delineate the molecular etiology. It doesn't always work, but it can work a lot of the time, and it can be a fruitful way to get projects moving again. For malformation syndromes in humans, there are two and a half thousand clinical entities in this textbook, and this is an anonymized database for syndromes with four and a half thousand entities; few of these have a known genetic etiology.
For many, little is known about the natural history or how to diagnose patients, so it's a huge challenge from the clinical and basic science perspectives to push through these and find the etiologies, and I'm sure all of you have other classes of phenotypes with hundreds if not thousands more projects that need to be undertaken along these lines. I think as well, taking this beyond what we do here in a research environment, the exome and the genome are going to become clinical diagnostic tools. So I hope that as all of you start to work with these data, you think creatively about how to solve these problems and answer these questions, and put these tools out there, so that these things can be transformed from esoteric research tools into clinical ones, so we can improve clinical diagnosis going forward.
So in thinking about how you do a project, you have to think about the entire picture: all the way from what the disorder is that I'm studying, its genetic aspects, and the clinical and phenotypic attributes that allow me to determine what samples I should sequence; to what filters I should apply to the data that come out of the instrument; to the genetic data I need to buttress the results from the next-gen sequencing instrument; and coupling that to functional assays downstream that nail down the cause-and-effect relationship, so that we can have robust associations of genotype with the diseases we're studying, and not write a bunch of just so stories. Thank you very much for your time.
We have one more talk then we'll take some questions.