Dr. Andy Clark: Thanks, Adam, it’s really a delight to be
here. I want to thank Eric for inviting me, and also wish you a happy birthday, and many
more years of great success with the NISC. So, why population genomics?
What are we referring to here? This is simply taking the set of questions that population
genetics addresses and taking it to the genomic level. So population genetics is interested
in trying to infer the balance of the roles of the forces of mutation, drift, migration,
and selection, to make a statement about the way evolution works. In other words, it’s
not just a description of evolution, it’s sort of an attempt to really understand the
mechanism behind evolutionary change.
Unfortunately for population genetics, this means having to drag armies of students through
very tedious lectures dealing with these nasty statistics like pi, and theta, and rho. I’m
going to spare you much of this, but at least give you a little bit of the flavor of where
we can do this at a genomic level. Another question that one can address with population
genomics is having to do with past population demography. It turns out a lot of the issues
with the HapMap project with inference about complex traits and identifying genes associated
with complex traits are contaminated by confusion about the demographic history and population
structure, so we need to understand that.
In addition, having genome-wide data from multiple individuals allows us to do sort
of scan statistics, looking across the genome for heterogeneities in these various statistics
about levels of nucleotide diversity, recombination, and so forth. So the basic building block
for modern population genetics is built on the idea of having complete data, so sequence
data from multiple individuals ascertained in some uniform and random way, and so that
ideal dataset would look like this, where you have complete sequence with no errors,
of course, no missing data, and that’s the ideal. Of course, what we actually usually
get is something far from that, and what we want to do from those data in any case is
to estimate some of the primary parameters of how much variability is there, what’s
the sort of cause of that level of variability. The primary parameter that arises from every
which direction, if you look at theoretical population genetics at all, it's astonishing
the degree of convergence on this parameter, theta, which is four times the population
size times the mutation rate. You can derive this as being a primary determinant of the
level of variability in a population, forward in time with a Wright-Fisher model, backward
in time with a coalescent, from an infinite-sites or from an infinite-alleles model. It’s an amazingly
convergent sort of statistic as being a primary parameter.
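As a minimal sketch of what estimating theta looks like on the ideal dataset described above (complete sequences, no errors, no missing data), the following computes two classic estimators: Watterson's estimator, based on the number of segregating sites, and nucleotide diversity, pi, based on pairwise differences. The function names and the toy alignment are hypothetical illustrations, not data or code from the talk, and the values are per locus rather than per site.

```python
from itertools import combinations

def segregating_sites(seqs):
    """Count alignment columns where more than one base is observed."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def watterson_theta(seqs):
    """Watterson's estimator: S / a_n, with a_n = sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a_n

def nucleotide_diversity(seqs):
    """Pi: mean number of differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / len(pairs)

# Hypothetical toy alignment of four error-free sequences (an assumption).
toy = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "ACGTACGT"]
theta_w = watterson_theta(toy)
pi = nucleotide_diversity(toy)
```

Under a neutral model at equilibrium both quantities estimate the same theta, so dividing either by the alignment length gives a per-site value comparable to the diversity figures quoted later in the talk.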
So, one of the things we’d like to do is to estimate that from various sorts of genome
data. The particular kind of genome data I’d like to talk about is the very exciting short
read data that we’ve been hearing a little bit about already. We’ve heard several talks
dealing with short read data and the interest in using these short read technologies to
recover the sequence of the individual. Often this is for the purpose of identifying the
individual mutation or the set of mutations in that individual getting high accuracy for
an individual. And we’ve seen in those circumstances that it’s necessary to go to reasonably
high read depth, 10, 20, 30, as a read depth for those individuals. I want to look at the
opposite situation, where the data that are available are actually rather sparse, so this
is 10 different individuals where you’re looking at a coverage of well under 1x in
each individual, but you have many individuals. Can you do any kind of useful analysis with
that?
And it turns out, in the population genetics setting, the issue of estimating these sorts
of genome-wide parameters, this sort of data works very well, and I’m going to show you
some of the sort of directions to approach this. This was actually a problem that arose
at first, or at least the first that I was aware of it, with Celera Genomics data, when they had
sequences from six individuals, and that sort of fell upon my lap to estimate some things
like nucleotide diversity and so forth from those six individuals, and we did a rather
crude ad-hoc method at the end. It wasn’t published because they weren’t releasing
the SNPs at the time, but it was a problem that I kicked around with Rasmus Nielsen right
at the beginning, and he was sort of enchanted by the problem. It’s actually quite interesting,
not only is the data very gappy, but, of course, there was a relatively high error rate. Typical
single sequence error rate for Solexa reads might be on the order
of a percent or two. There are also site-specific errors, so particular regions of the genome
might have different sequence error rates. And, in addition, if you’re sampling from
diploid individuals, you’re actually sampling from a [unintelligible] alleles of that individual.
So there’s a binomial sampling from each individual. It’s a very interesting sampling
problem.
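The binomial sampling point can be made concrete with a small sketch. It assumes error-free reads and that each read at a site comes from either of the two alleles of a diploid individual with probability one half; both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
from math import comb

def prob_k_reads_from_first_allele(depth, k):
    """Binomial probability that k of `depth` reads come from one given allele."""
    return comb(depth, k) * 0.5 ** depth

def prob_both_alleles_seen(depth):
    """Probability that the reads at a site include both alleles.

    Both alleles are missed only if every read comes from the same allele,
    which happens with probability 2 * (1/2)^depth.
    """
    if depth < 2:
        return 0.0
    return 1.0 - 2.0 * 0.5 ** depth

# At the low depths discussed here, a heterozygous site is often seen as one allele only.
chances = {depth: prob_both_alleles_seen(depth) for depth in (1, 2, 3, 5, 10)}
```

With two reads the chance of seeing both alleles is only one half, which is why sparse coverage per individual calls for a population-level estimator rather than individual genotype calls.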
And along with Rasmus Nielsen and Ines Hellmann, we have a paper submitted
that actually derives a Watterson estimator in the face of all
those errors. And all that you need to do is you need to have a very good error model.
You have to know what are the determinants of error, what’s the rate of error, what’s
the sort of base neighborhood for those error rates. If I can specify an error rate model
and have that kind of data, I can estimate theta, actually, very well. So the particular
test dataset that we’re going to talk about was actually funded by Adam Felsenfeld as
a pilot project for one of those interminable committees one sits on for NIH, and this was
one that was dealing with selecting genome sequences, and, in particular, the idea of
using short read technologies, and that context came up for SNP finding across a whole genome.
How does one most effectively identify polymorphic sites across an entire genome without having
to fund another HapMap project for every organism? And so the idea of just throwing them into
454 Solexa was very appealing, and we proposed doing this with just 10 lines of fly, six
from North Carolina, four from Africa. 454 sequencing was done at the Wash U Genome Center.
Elaine Mardis was our primary contact, and she was terrific. And this was back in the
days of the GS 20, so one run was done for each of the 10 lines. It gave 3.4 million
reads, about 351 megabase pairs total of sequence. That alignment did look like this. This is
actually a region of chromosome 2 of the real data, and you can see that there are regions
that actually are gaps in the data, other regions where there’s a depth of about two
and a half average, across the whole project. About 74 percent of all of those reads had
a unique fit. This is Mosaik, one of the assemblers I’ll talk about in just a second, but it’s
a pretty interesting start to the data. The first thing to ask about is how homogeneous
is the depth, are we sampling some regions of the genome better than others, are there
gaps, and so forth. The North Carolina population, there were six lines, remember, so the depth
for North Carolina’s always going to look greater than the depth of Africa where there
were only four lines. And it was quite homogeneous across the -- this was the X chromosome --
except for occasional spikes one way or the other presumably for some kind of repetitive
element. In some cases we could clearly see that’s what it was. For these data there
was a reasonably good fit to the Lander-Waterman equation for coverage from a whole genome
assembly, whole genome shotgun.
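A simplified sketch of the Lander-Waterman expectation follows. It assumes reads land uniformly at random, so that per-site depth is Poisson with mean c, and it ignores read-length edge effects and repeats. The value c = 2.5 is the average depth quoted in the talk; the function names are hypothetical.

```python
from math import exp, factorial

def poisson_depth_pmf(k, mean_depth):
    """P(a site is covered by exactly k reads) under Poisson-distributed depth."""
    return exp(-mean_depth) * mean_depth ** k / factorial(k)

def expected_fraction_covered(mean_depth):
    """Expected fraction of sites with at least one read: 1 - exp(-c)."""
    return 1.0 - exp(-mean_depth)

c = 2.5  # average depth quoted in the talk
covered = expected_fraction_covered(c)
depth_distribution = [poisson_depth_pmf(k, c) for k in range(10)]
```

Comparing an observed histogram of per-site depth against `depth_distribution` is one simple way to look for departures from the model, such as too many sites at very low or very high depth.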
I have to say, however, since doing this, the realization was that the power for discrimination
of the goodness of fit to the Lander-Waterman equation was quite poor for these data because the
read depth was so low. You saw from Richard Gibbs’ slide quickly flashed by for the
Jim Watson data, that there was actually a bimodal distribution. There’s also been
something like 7x coverage of the C. elegans genome done by 454, and there’s a pronounced
excess fatness of the tails of the coverage distribution. So there are too many regions
of the genome with insufficient coverage, too many regions with excess coverage. And
those departures from the Lander-Waterman equation are really crucial for use of these
short read data for inferences of things like expression level by counting methods, so I
think this is a really important problem we need to get a handle on with these methods,
but what really determines coverage? The average coverage for the African lines is about 40
percent of the genome, the North Carolina lines, about 60 percent, so the [unintelligible]
was about three quarters. Most of the regions that were covered by one were covered by the
other, and so forth. So we could look at particular regions of the genome that had particularly
poor coverage, and so this is 10 kb fragments that had less than half coverage. And there,
again, it’s quite spiky. Particular regions of the genome are looking like they're falling
in those sort of gappy regions.
So the real problem, then -- the primary problem of this whole pilot was to infer polymorphic
sites. Where are there SNPs in the genome? And can we devise a sort of an inferential
method that would have a relatively low false positive rate and a reasonably good accuracy
of determining those SNPs? So this first began -- so we actually turned to some folks who
had actually already been thinking about this. Gabor Marth and Aaron Quinlan at Boston University
have been sort of working on problems of this sort with Sanger sequencing, and recently
turned their attention to short read technologies. And they realized right off the bat that the
critical thing was to understand error, so one of the lines that we sequenced was actually
the iso-1 stock, the stock that was done to something like 14x coverage by Sanger sequencing
of Drosophila melanogaster. That gave lots and lots of reads that did have errors in
them, and we could then build this error model. PyroBayes is then their
base caller that not only called bases but also called the confidence on the bases that
they devised. That’s actually currently -- online you can download this and start
to play with it. They're going to town with it.
So, anyway, they produced this multi-alignment of the 10 strains all across the reference
genome, the Drosophila melanogaster, and called some 660,000 SNPs across the whole genome.
About 1,200 of them were submitted for validation to the Washington Genome Center, and those
1,200 had a posterior Bayesian probability of being a SNP of about 90 percent, and 92
percent, in fact, validated. So it's actually looking pretty good. So this is not the 99
percent confidence of SNPs in each individual. It's, "Is this a polymorphism in that position
of the genome?" And that accuracy's pretty good. So there's actually a quite strong correlation
of the nucleotide diversity in particular regions. If it's very low diversity in Africa,
it will be very low diversity in North Carolina, and so forth. That's sort of expected. These
are populations that are derived one from the other. Basically, the fly population pretty
much followed the human population in migrating out of Africa, so we expect them to show that
kind of state. The divergence between species, so melanogaster versus simulans, is also correlated
with this level of diversity within Africa. So what we're asking here then is, "Is there
simply heterogeneity in mutation rate? Is the high diversity in some regions driven
by an elevated mutation rate in that part of the genome?" If so, you
would expect there to be an elevated divergence between species, and, in fact, you do see
this, to some extent.
However, this correlation is much stronger than this one, and so mutation doesn't drive
all of that difference. Something else must be going on, and one of the things that you
can do to get at what's going on is to actually look at the ratio of polymorphism to divergence.
So here's that level of polymorphism, which ought to be predicted by this parameter, theta.
The divergence is determined by twice the mutation rate times the time since the divergence
between the two species, and in the ratio of those the mutation rate, mu, divides out, as you can see. And
so we ought to get something that is a distribution that depends only on those other factors,
namely, effective population size and the time since divergence. And when you do this,
you have to simulate then to get what's the expected level. And the expected level is
shown here under a neutral simulation. This is, on this axis, the diversity to divergence
ratio across the whole genome, and those 10 kb chunks across the whole genome, we get
this distribution. What we actually observe has a much greater variance, in other words,
some regions of the genome have much more diversity than expected under that neutral
simulation, others have much less than expected.
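The ratio argument can be written out as a short worked equation, using only the two definitions given in the talk. The symbols are assumptions chosen for illustration: N_e for effective population size, \mu for the mutation rate per generation, and T for the time since divergence in generations. It follows the talk's simplification that divergence is 2\mu T, ignoring polymorphism in the ancestral population.

```latex
\begin{align}
  \theta &= 4 N_e \mu                       && \text{(expected polymorphism)} \\
  D      &= 2 \mu T                         && \text{(expected divergence)} \\
  \frac{\theta}{D} &= \frac{4 N_e \mu}{2 \mu T} = \frac{2 N_e}{T}
                                            && \text{(the mutation rate cancels)}
\end{align}
```

Since T is shared by all regions of the genome for a given species pair, excess variance in this ratio across regions points to variation in the local effective population size rather than in the mutation rate.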
So that means the other parameters, effective size and time since divergence, must be
what are differing between those different regions of the genome, and those are precisely
parameters that are driven up and down by things like natural selection, by correlations
with recombination rate, and so forth. So there's some interesting heterogeneity across
the genome in these parameters of evolution. So one of the things that's often done with
data that address polymorphism across the whole genome is to look for signatures of
natural selection. You've seen this, certainly, many times over now with the human genome.
It's still an area of interesting research going on. But we can detect selective sweeps
by troughs in diversity. If there's a favorable mutation, it's going to drag that particular
variant up in frequency, replacing all the others, and, hence, reducing the local diversity
around that particular adaptive mutation. And we can look at this then across the whole
genome and ask, "Are there particular regions where there are big dips in nucleotide diversity?"
And it doesn't jump out at you very whoppingly [spelled phonetically] for this particular
dataset, although there are regions where there's a curious dip for both African and
the North American sample. And it is actually significant at the genome-wide scale. Some
of them are actually regions of the genome that have otherwise shown signatures already
that seemed bag-of-marbles on the X chromosome as one interesting candidate that's being
pursued in Chip Aquadro's lab, for instance.
We've also seen in the human data efforts to look at differences between populations
as being a means of identifying particular regions that might have undergone region-specific
natural selection. You can do the same with flies. This is the difference between African
and North Carolinian diversity, so are the regions where there's a big spike up or down
in the diversity and, in fact, we see them. Those are then again candidates that are nominated
for potentially interesting region-specific selection. On the issue of demography, flies
show the same sort of demography as humans, namely, there was a smaller ancestral population
that grew rapidly at some point in the past, and in the African population there was not
particularly much change since that time, but there was a very narrow bottleneck in the expansion
into Europe and the Americas. And this is seen in the site frequency spectrum in
African versus European flies. European flies have a big excess of rare variants, because
the variation that made it through that bottleneck shows
this skew in the site frequency spectrum. So that was already published.
Do we see this with these data? It's not nearly so clear because we only have a depth of two
and a half, on average, but what you can ask is, "Are there differences between different
parts of the genome with respect to the relative frequency of different polymorphic sites?"
So one of the things you see is that there's a consistently reduced diversity in North
Carolina compared to Africa. Now, remember, there were only four African lines, six North
Carolina lines, and, nevertheless, there's more variability in the African lines again
for this demographic reason. It is well known that African populations have a larger effective
size, so they look more diverse. If you look at the X versus the autosomes, the mean
for the X chromosome is about .004, the mean for the autosomes is about .006 for the North
Carolina populations, and you see that X is less diverse than the autosomes. This is seen
in almost every organism. The reason for this, of course, is there are fewer X chromosomes
than autosomes, so the X chromosome has a smaller effective size in that mutation selection
balance, theta ought to be smaller, and you end up with less diversity. Even if it were
strictly neutral, you'd end up with less diversity. But, in fact, the theoretical expectation
is the X ought to have three quarters the diversity of the autosomes, and it's more
like half, in this case.
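The three-quarters expectation can be sketched from the standard textbook formulas for effective population size with separate sexes. This assumes a neutral, constant-size population with random mating; the function names and the population sizes are hypothetical illustrations.

```python
def ne_autosome(n_females, n_males):
    """Effective size for autosomal loci with separate sexes."""
    return 4.0 * n_females * n_males / (n_females + n_males)

def ne_x(n_females, n_males):
    """Effective size for X-linked loci (females carry two X copies, males one)."""
    return 9.0 * n_females * n_males / (4.0 * n_males + 2.0 * n_females)

def expected_x_to_autosome_ratio(n_females, n_males):
    """Expected ratio of X to autosomal diversity under neutrality."""
    return ne_x(n_females, n_males) / ne_autosome(n_females, n_males)

# Equal numbers of breeding females and males give the three-quarters expectation.
ratio_equal = expected_x_to_autosome_ratio(500, 500)
# A hypothetical male-biased breeding population lowers the ratio.
ratio_male_biased = expected_x_to_autosome_ratio(250, 750)
```

Because theta is proportional to effective size, this ratio is also the expected ratio of diversity. A skewed breeding sex ratio can move it somewhat, but that is a separate effect from the demographic explanation discussed in the talk.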
Well, it turns out there's good theory for this. In a population that's undergoing a
bottleneck, you actually expect to see a more severe reduction in diversity in the X compared
to the autosomes. This is a paper by John Pool and Rasmus Nielsen. If you compare them
now, in Africa, the X to autosome ratio is about 65 percent, which is again already lower
than the 75 percent expectation if it were neutral. For North Carolina, it's about 50
percent. North Carolina's a derived population. And you can see, in fact, there was a greater
reduction in diversity on the X compared to the autosomes, nicely consistent with that
expectation. The final, sort of, point I wanted to illustrate from these data at the sort
of genome-wide scale is that regions of low recombination, particularly around centromeres,
show dramatically reduced heterozygosity. We see that both in Africa and North Carolina,
if one actually looks then at the local recombination rate, intensity estimated as centimorgans
per megabase pair, so this is estimated again from many of the mapping experiments done
with flies over the years, so for the given local recombination rate, what's the diversity
in Africa and in North Carolina?
And you see a pronounced positive correlation. Now, this is widely described in the literature
as being attributable to a thing called the Hill-Robertson
effect. Regions of very low recombination are going to suffer from the fact that a positively
favored mutation is going to drag down diversity, so lower recombination will make a larger
region get swept to fixation and drop the diversity more than in a region
of high recombination. And also, in regions of low recombination, if there's negative selection,
so deleterious mutations are occurring, it will also reduce the effective population
size to a greater extent when there's a lower recombination rate, the so-called background selection
model. So some combination of those two is driving this positive correlation. You see
it in these completely independent samples of four individuals here and six individuals
here. We see this positive correlation between recombination rate and level of diversity.
One thing that might drive that is if recombination itself were slightly mutagenic so that when
recombinations occur you'll also get mutations. That would drive a positive correlation between
divergence and diversity. And, in fact, in flies we do not see that. This is the recombination
rate again on this axis against melanogaster-simulans divergence, and there's no correlation.
So it seems like recombination is not inducing this positive correlation. It really is a
Hill-Robertson-like effect: the local sort of environment for adaptive mutations is
favoring a greater reduction in diversity in regions of low recombination.
This made a paper back in 1999 with just 12 data points. Here are 30,000 data points.
It seems to be true still, and so that was Begun and Aquadro back in '99. I wanted to
take the last couple of minutes to sort of shift gears, and this is another project that
was -- this was funded by NHGRI to look at the sort of comparative genomic lessons that
we learn from a dozen different Drosophila genome sequences. And this is a particularly
sociologically interesting project. It featured data from all the genome centers except Stanford,
I think, including Agencourt Biosciences. I guess NISC didn't contribute to it, but
many, many different groups over many years contributed to these data.
The choice of these 12 species was made on the basis of the fact that there's a huge
diversity in the ecologies and sort of lifestyles of these different flies, over 400 million
years of evolution spanned by that tree, so there's phenomenal saturation of mutation
at many, many sites in the genome. Manolis Kellis, who was at MIT, was in charge of the
analysis of the sort of annotation of the melanogaster genome and how we could improve
the annotation of the melanogaster genome using these data. And this is just to illustrate
something that I think you all know very well, which is that in protein coding regions, of
course, you expect to see more substitutions between species at synonymous sites, because,
after all, they would still retain the same amino acid sequence, you expect to see substitutions
that preserve the reading frame, and you expect to see substitutions that replace one amino
acid with another one that has very similar properties.
So there are a number of different software tools, Exoniphy and so forth, and they’re very, very
good at finding exons in the genome based on these sorts of signatures. So we can color
code them based on their sort of attributes through these substitutions along this 12
species alignment. So this is the 12 fly species. Do those substitutions smell like they are
protein coding substitutions? And color them green. If they're substitutions that are radical
nonsynonymous changes or frameshifts, they’re less likely to be protein coding, we can color
them in red. When we do this, we can identify regions of the genome that are more likely
to be protein coding, and we can use it then to improve our annotation of any given genome.
This is just to indicate the difference between that kind of strategy, and the strategy that’s
based on just simple conservation. So this is the track that’s from the UCSC browser,
the phastCons track, showing this sort of degree of conservation. It’s a very nice metric
for overall conservation of the sequence. And it’s showing again, so here’s the
region of this particular gene, CG9945, the boxes again being the exons, and you'll see
high conservation for most of those exons. But if you look carefully, you see there are
regions where there actually is high conservation, but we failed to annotate an exon. Subsequently
we see that it also has a very high probability of being a codon based on this evolutionary
model. And, in fact, we go back and see that, in fact, that was another transcript. There
is, in fact, an exon in that region of the genome. So that’s a novel sort of annotation
to the melanogaster genome that came about because of our comparison to these 12 species,
giving us much more power to detect these things.
We also see things like this where FlyBase was right, it says there’s not an exon there,
but we see very high conservation when we look at the protein coding signal. In fact,
the substitutions that do occur in that region look very un-exon-like [spelled phonetically],
and, in fact, there’s not an exon there. So this is again where there’s a big difference
between simple conservation and the protein coding signal. So these sorts of methods resulted
in many new annotations of the melanogaster genome, particularly in protein coding regions,
some 413 cases of translation start changes, 912 cases of different splice signals, 240
cases of polycistronic genes, and so forth. So we gain considerable power in the annotation
of a genome by looking at this sort of comparative approach. Just two other stories with this
sort of comparative evolution analysis, one of them has to do again with this X versus
autosome comparison. Now, I made it sound simple. Comparing X and autosome, the X has
a smaller effective size, so it looks like it ought to have more drift than the autosomes,
because it’s a smaller effective size. Things bounce around stochastically more in a smaller
effective size. On the other hand, the X chromosome's hemizygous in males. So any mutation that
occurs that’s expressed in males, even if it’s a recessive mutation, is immediately
expressed.
So there’s no masking in the sense of being hidden -- for rarely [spelled phonetically]
or being hidden in males. There’s no such thing as recessivity, and so one expects then
that deleterious mutations ought to be more effectively screened by natural selection.
Recessive advantageous mutations ought to be more effectively, and more quickly identified
by natural selection, and dragged up in frequency. So these things lead in opposite directions.
The latter results in an expectation that the X chromosome ought to be evolving faster.
And so across the set of papers that have addressed this, of the evolution of X versus
autosome in different organisms, it’s a wildly chaotic literature with very poor consistency.
And if you look at the neutral divergence, so this is at synonymous sites, the divergence
at nonsynonymous sites, now with this full genome data and all these species, again you
see this sort of pattern where some lineages, it looks like the X is faster, other lineages,
the X is slower. And you can see why the literature has been so confused because even with whole
genome data, the pattern is still rather on a knife's edge. Depending on changes in the
demography and other things, the X or the autosome looks like it’s going to be evolving
more quickly.
The one thing that’s absolutely universal is codon bias: the degree of codon
bias is greater on the X chromosome than on the autosomes all the time. Now, codon bias refers
to the differential use of the synonymous codons. The fact that the codon bias is greater
in the X means that the population's better able to discriminate between these very weak
differences in selection between different alternative forms. And those are almost certainly
recessive differences. The fact that the X chromosome is detecting them is saying that
again it's probably because of this increased efficacy of natural selection to see variants
on the X chromosome than on autosomes, and remarkably consistent on every branch of this
tree except for a couple where there's just insufficient power.
One final point has to do with the evolution of innate immunity. This is the idea of taking
a pathway for sort of any process you can imagine and having it descend down this 12
species phylogeny. It's kind of exciting to imagine, so, "How is the pathway tuned, how
is there pressure from pathogens exerting itself on this pathway, do we see accelerated
evolution in recognition or effector molecules, how does this go?" And this is work of Tim
Sackton in my lab and a number of collaborators, in a paper that's still actually just in
review in Nature Genetics.
And one of the punch lines that came from this is centered around the gene, Relish,
which is one of the transcription factors that results in a launching of transcription
of a number of the antimicrobial peptides. There's an inhibitor domain on Relish that's
joined by a linker. That particular linker region -- this is showing again sort of the signature
of positive selection on this axis against position along the gene -- most of the signatures of
positive selection are right in that linker region. There are other proteins that are actually involved
in cleaving that linker: Dredd has a caspase domain that actually does that cleavage. And
again the caspase domain shows an excess signature of positive selection. So it's an intriguing case
where in just the melanogaster lineage we're seeing multiple signatures of positive selection
on just that part of the pathway. So this is a paper with over 244 authors. It's going
to be in the November 8th issue of Nature. It's been a tremendous sort of a thrill, and
opportunity, and privilege really to be working with this group. And at that I'll just close
for questions. Thanks.
[applause]
Dr. Eric Green: Questions from the floor. So, Andy, what are
the challenges to take these approaches and then apply them to studies of human populations?
Obviously flies are going to be much simpler, and clearly we want to be able to do the kinds
of comparisons you were showing with conservation and better annotation of proteins. But it
must not be an easy generalization.
Dr. Andy Clark: Well, I mean, there are a number of things
that are different in the human situation because we're sort of coming at it in a different
context. We have so much data from the million base pair SNP typing platforms and so forth
that I think doing something like short read sequencing on top of those million SNPs will
really, really sort of leverage each other in a very exciting way. Some of the primary
issues of estimation of these parameters are less important, perhaps, for humans than for
sort of model organism studies. We're much more interested in the medical sort of questions
where it's so important to get real accuracy of individual calls. But that's where we have
-- in the context of the HapMap project and understanding that haplotype background, these
methods that allow one to impute missing data, if you combine that sort of imputation with
this short read sequencing, I think they could really, really fuel each other.