Population Genomics of Drosophila with Parallel Sequencing Methods - Andy Clark

Dr. Andy Clark: Thanks, Adam, it’s really a delight to be here. I want to thank Eric for inviting me, and also wish you a happy birthday, and another many years in the future of great success with the NISC. So, why population genomics? What are we referring to here? This is simply taking the set of questions that population genetics addresses and taking it to the genomic level. So population genetics is interested in trying to infer the balance of the roles of the forces of mutation, drift, migration, and selection, to make a statement about the way evolution works. In other words, it’s not just a description of evolution, it’s sort of an attempt to really understand the mechanism behind evolutionary change. Unfortunately for population genetics, this means having to drag armies of students through very tedious lectures dealing with these nasty statistics like pi, and theta, and rho. I’m going to spare you much of this, but at least give you a little bit of the flavor of where we can do this at a genomic level. Another question that one can address with population genomics is having to do with past population demography. It turns out a lot of the issues with the HapMap project with inference about complex traits and identifying genes associated with complex traits are contaminated by confusion about the demographic history and population structure, so we need to understand that. In addition, having genome-wide data from multiple individuals allows us to do sort of scan statistics, looking across the genome for heterogeneities in these various statistics about levels of nucleotide diversity, recombination, and so forth. So the basic building block for modern population genetics is built on the idea of having complete data, so sequence data from multiple individuals ascertained in some uniform and random way, and so that ideal dataset would look like this, where you have complete sequence with no errors, of course, no missing data, and that’s the ideal. 
Of course, what we actually usually get is something far from that, and what we want to do from those data in any case is to estimate some of the primary parameters of how much variability is there, what’s the sort of cause of that level of variability. The primary parameter that arises from every which direction, if you look at theoretical population genetics at all, it’s astonishing the degree of convergence on this parameter, theta, which is four times the effective population size times the mutation rate. You can derive this as being a primary determinant of the level of variability in a population, forward in time with a Fisher-Wright model, backward in time with a coalescent, from an infinite sites model, from an infinite alleles model. It’s an amazingly convergent sort of statistic as being a primary parameter. So, one of the things we’d like to do is to estimate that from various sorts of genome data. The particular kind of genome data I’d like to talk about is the very exciting short read data that we’ve been hearing a little bit about already. We’ve heard several talks dealing with short read data and the interest in using these short read technologies to recover the sequence of the individual. Often this is for the purpose of identifying the individual mutation or the set of mutations in that individual, getting high accuracy for an individual. And we’ve seen in those circumstances that it’s necessary to go to reasonably high read depth, 10, 20, 30, as a read depth for those individuals. I want to look at the opposite situation, where the data that are available are actually rather sparse, so this is 10 different individuals where you’re looking at a coverage of well under 1x in each individual, but you have many individuals. Can you do any kind of useful analysis with that?
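To make the idea of estimating theta concrete, here is a minimal sketch of two standard estimators, pairwise nucleotide diversity (pi) and Watterson's estimator, applied to the ideal case described above: a complete, error-free alignment. This is not the low-coverage estimator discussed in the talk; the function names and the toy alignment are illustrative assumptions.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Count alignment columns at which the sampled sequences differ."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs):
    """Watterson's estimator of theta per site: S / a_n / L,
    where a_n = sum_{i=1}^{n-1} 1/i for n sampled sequences."""
    n = len(seqs)
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a_n / len(seqs[0])


def nucleotide_diversity(seqs):
    """Pi per site: mean number of pairwise differences divided by length."""
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total_diffs / len(pairs) / len(seqs[0])


# Toy alignment (hypothetical data): 4 sequences, 10 sites, 2 variable sites.
toy = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC", "ACGAACGTAC"]
pi_hat = nucleotide_diversity(toy)
theta_w = watterson_theta(toy)
```

Both quantities estimate the same theta under the neutral model with constant population size, which is why differences between them are used as a diagnostic of demography or selection.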
And it turns out, in the population genetics setting, for the issue of estimating these sorts of genome-wide parameters, this sort of data works very well, and I’m going to show you some of the sort of directions to approach this. This was actually a problem that arose at first, and the first that I was aware of it, with Celera Genomics data, when they had sequences from six individuals, and that sort of fell upon my lap to estimate some things like nucleotide diversity and so forth from those six individuals, and we did a rather crude ad-hoc method at the end. It wasn’t published because they weren’t releasing the SNPs at the time, but it was a problem that I kicked around with Rasmus Nielsen right at the beginning, and he was sort of enchanted by the problem. It’s actually quite interesting, not only is the data very gappy, but, of course, there was a relatively high error rate. Typical single sequence error rate for Solexa reads might be on the order of a percent or two. There’s also site-specific errors, so particular regions of the genome might have different sequence error rates. And, in addition, if you’re sampling from diploid individuals, you’re actually sampling from a [unintelligible] alleles of that individual. So there’s a binomial sampling from each individual. It’s a very interesting sampling problem. And along with Rasmus Nielsen and Ines Hellmann, we have a paper submitted that actually derives a Watterson estimator in the face of all those errors. And all that you need to do is you need to have a very good error model. You have to know what are the determinants of error, what’s the rate of error, what’s the sort of base neighborhood for those error rates. If I can specify an error rate model and have that kind of data, I can estimate theta, actually, very well.
So the particular test dataset that we’re going to talk about was actually funded by Adam Felsenfeld as a pilot project for one of those interminable committees one sits on for NIH, and this was one that was dealing with selecting genome sequences, and, in particular, the idea of using short read technologies, and that context came up for SNP finding across a whole genome. How does one most effectively identify polymorphic sites across an entire genome without having to fund another HapMap project for every organism? And so the idea of just throwing them into 454 or Solexa was very appealing, and we proposed doing this with just 10 lines of fly, six from North Carolina, four from Africa. 454 sequencing was done at the Wash U Genome Center. Elaine Mardis was our primary contact, and she was terrific. And this was back in the days of the GS 20, so one run was done for each of the 10 lines. It gave 3.4 million reads, about 351 megabase pairs total of sequence. That alignment did look like this. This is actually a region of chromosome 2 of the real data, and you can see that there are regions that actually are gaps in the data, other regions where there’s a depth of about two and a half on average, across the whole project. About 74 percent of all of those reads had a unique fit. This Mosaik is one of the assemblers I’ll talk about in just a second, but it’s a pretty interesting start to the data. The first thing to ask about is how homogeneous is the depth, are we sampling some regions of the genome better than others, are there gaps, and so forth. The North Carolina population, there were six lines, remember, so the depth for North Carolina’s always going to look greater than the depth of Africa where there were only four lines. And it was quite homogeneous across the -- this was the X chromosome -- except for occasional spikes one way or the other presumably for some kind of repetitive element. In some cases we could clearly see that’s what it was.
For these data there was a reasonably good fit to the Lander-Waterman equation for coverage from a whole genome assembly, whole genome shotgun. I have to say, however, since doing this, the realization was that the power for discrimination of the goodness of fit, the Lander-Waterman, was quite poor for these data because the read depth was so low. You saw from Richard Gibbs’ slide quickly flashed by for the Jim Watson data, that there was actually a bimodal distribution. There’s also been something like 7x coverage of the C. elegans genome done by 454, and there’s a pronounced excess fatness of the tails of the coverage distribution. So there are too many regions of the genome with insufficient coverage, too many regions with excess coverage. And those departures from the Lander-Waterman equation are really crucial for use of these short read data for inferences of things like expression level by counting methods, so I think this is a really important problem we need to get a handle on with these methods, but what really determines coverage? The average coverage for the African lines is about 40 percent of the genome, the North Carolina lines, about 60 percent, so the [unintelligible] was about three quarters. Most of the regions that were covered by one were covered by the other, and so forth. So we could look at particular regions of the genome that had particularly poor coverage, and so this is 10 kb fragments that had less than half coverage. And there, again, it’s quite spiky. Particular regions of the genome are looking like they're falling in those sort of gappy regions. So the real problem, then -- the primary problem of this whole pilot was to infer polymorphic sites. Where are there SNPs in the genome? And can we devise a sort of an inferential method that would have a relatively low false positive rate and a reasonably good accuracy of determining those SNPs? 
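As a rough illustration of the Lander-Waterman expectation mentioned above, here is a minimal sketch of the idealized model, in which reads land uniformly at random and per-base depth is Poisson with mean c (the average coverage). The function names are illustrative assumptions, and edge effects are ignored. The fat tails described in the talk are exactly the departures from this Poisson shape.

```python
import math


def expected_fraction_covered(c):
    """Idealized expected fraction of the genome covered by at least one read
    when the mean depth is c (Poisson approximation)."""
    return 1.0 - math.exp(-c)


def poisson_depth_pmf(k, c):
    """Idealized probability that a given base is covered by exactly k reads."""
    return math.exp(-c) * c ** k / math.factorial(k)


# Example: at a mean depth of 1x, roughly 63% of bases are expected to be covered.
covered_at_1x = expected_fraction_covered(1.0)
```

Comparing an observed depth histogram against `poisson_depth_pmf` is one simple way to see over-dispersion, though at very low depth such a comparison has little power, as noted above.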
So we actually turned to some folks who had already been thinking about this. Gabor Marth and Aaron Quinlan at Boston College have been sort of working on problems of this sort with Sanger sequencing, and recently turned their attention to short read technologies. And they realized right off the bat that the critical thing was to understand error, so one of the lines that we sequenced was actually the iso-1 stock, the stock that was done to something like 14x coverage by Sanger sequencing of Drosophila melanogaster. That gave lots and lots of reads that did have errors in them, and we could then build this error model. PyroBayes is then their base caller that they devised, that not only called bases but also called the confidence on the bases. That’s actually currently -- online you can download this and start to play with it. They're going to town with it. So, anyway, they produced this multi-alignment of the 10 strains all across the reference genome, the Drosophila melanogaster, and called some 660,000 SNPs across the whole genome. About 1,200 of them were submitted for validation to the Washington Genome Center, and those 1,200 had a posterior Bayesian probability of being a SNP of about 90 percent, and 92 percent, in fact, validated. So it's actually looking pretty good. So this is not the 99 percent confidence of SNPs in each individual. It's, "Is this a polymorphism in that position of the genome?" And that accuracy's pretty good. So there's actually a quite strong correlation of the nucleotide diversity in particular regions. If it's very low diversity in Africa, it will be very low diversity in North Carolina, and so forth. That's sort of expected. These are populations that are derived one from the other. Basically, the fly population pretty much followed the human population in migrating out of Africa, so we expect them to show that kind of state.
The divergence between species, so melanogaster versus simulans, is also correlated with this level of diversity within Africa. So what we're asking here then is, "Is there simply heterogeneity in mutation rate? Are the regions where there's high diversity driving that high diversity due to an elevated mutation rate in that part of the genome?" If so, you would expect there to be an elevated divergence between species, and, in fact, you do see this, to some extent. However, this correlation is much stronger than this one, and so mutation doesn't drive all of that difference. Something else must be going on, and one of the things that you can do to get at what's going on is to actually look at the ratio of polymorphism to divergence. So here's that level of polymorphism, all to be predicted by this parameter, theta. The divergence is determined by twice the mutation rate times the time since the divergence between the two species, and the ratio of those has mu divide out, as you can see. And so we ought to get something that is a distribution that depends only on those other factors, namely, effective population size and the time since divergence. And when you do this, you have to simulate then to get what's the expected level. And the expected level is shown here under a neutral simulation. This is, on this axis, the diversity to divergence ratio across the whole genome, and those 10 kb chunks across the whole genome, we get this distribution. What we actually observe has a much greater variance, in other words, some regions of the genome have much more diversity than expected under that neutral simulation, others have much less than expected. So that means the other parameters, effective size and time since divergence, must be what are differing between those different regions of the genome, and those are precisely parameters that are driven up and down by things like natural selection, by correlations with recombination rate, and so forth.
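The cancellation of the mutation rate described above can be written out explicitly. This follows the talk's simplified expressions, with the assumed notation that N_e is the effective population size, mu the per-site mutation rate, T the time since the two species diverged (in generations), and D the expected divergence; the contribution of ancestral polymorphism to divergence is ignored.

```latex
\[
\theta = 4 N_e \mu, \qquad D = 2 \mu T, \qquad
\frac{\theta}{D} = \frac{4 N_e \mu}{2 \mu T} = \frac{2 N_e}{T}.
\]
```

Under these assumptions, regional variation in the ratio beyond what neutral simulations produce points to regional variation in N_e or T rather than in mu.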
So there's some interesting heterogeneity across the genome in these parameters of evolution. So one of the things that's often done with data that address polymorphism across the whole genome is to look for signatures of natural selection. You've seen this, certainly, many times over now with the human genome. It's still an area of interesting research going on. But we can detect selective sweeps by troughs in diversity. If there's a favorable mutation, it's going to drag that particular variant up in frequency, replacing all the others, and, hence, reducing the local diversity around that particular adaptive mutation. And we can look at this then across the whole genome and ask, "Are there particular regions where there are big dips in nucleotide diversity?" And it doesn't jump out at you very whoppingly for this particular dataset, although there are regions where there's a curious dip for both the African and the North American sample. And it is actually significant at the genome-wide scale. Some of them are actually regions of the genome that have otherwise shown signatures already; bag-of-marbles on the X chromosome is one interesting candidate that's being pursued in Chip Aquadro's lab, for instance. We've also seen in the human data efforts to look at differences between populations as being a means of identifying particular regions that might have undergone region-specific natural selection. You can do the same with flies. This is the difference between African and North Carolinian diversity, so are there regions where there's a big spike up or down in the diversity and, in fact, we see them. Those are then again candidates that are nominated for potentially interesting region-specific selection.
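A scan for troughs in diversity of the kind described above can be sketched as follows. This is only an illustration: the window size, the half-of-genome-mean cutoff, and the toy per-site values are arbitrary assumptions, and judging genome-wide significance would require neutral simulations rather than a fixed threshold.

```python
def windowed_mean(values, window):
    """Mean of per-site values in consecutive non-overlapping windows."""
    return [
        sum(values[i:i + window]) / window
        for i in range(0, len(values) - window + 1, window)
    ]


def flag_troughs(window_means, fraction=0.5):
    """Indices of windows whose mean falls below `fraction` of the overall mean."""
    overall = sum(window_means) / len(window_means)
    return [i for i, m in enumerate(window_means) if m < fraction * overall]


# Toy per-site diversity (hypothetical values) with a dip in the middle window.
per_site = [0.006] * 4 + [0.001] * 4 + [0.006] * 4
troughs = flag_troughs(windowed_mean(per_site, 4))
```

The same windowed values computed separately for two populations can be subtracted window by window to look for population-specific spikes.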
On the issue of demography, flies show the same sort of demography as humans, namely, there was an ancestral smaller population that grew rapidly at some point in the past, and in the African population there was not particularly much change since that time, but a very narrow bottleneck in expansion into Europe and the Americas. And that's seen in the site frequency spectrum in African versus European flies. European flies have a big excess of rare variants, because the nature of the variation that would make it through that bottleneck would show the skew in the site frequency spectrum. So that was already published. Do we see this with these data? It's not nearly so clear because we only have a depth of two and a half, on average, but what you can ask is, "Are there differences between different parts of the genome with respect to the relative frequency of different polymorphic sites?" So one of the things you see is that there's a consistently reduced diversity in North Carolina compared to Africa. Now, remember, there were only four African lines, six North Carolina lines, and, nevertheless, there's more variability in the African lines, again for this demographic reason. The reason is well-known: African populations have a larger effective size, so they look like they're more diverse. If you look at the X versus the autosomes, the mean for the X chromosome is about 0.004, the mean for the autosomes is about 0.006 for the North Carolina populations, and you see that X is less diverse than the autosomes. This is seen in almost every organism. The reason for this, of course, is there are fewer X chromosomes than autosomes, so the X chromosome has a smaller effective size in that mutation selection balance, theta ought to be smaller, and you end up with less diversity. Even if it were strictly neutral, you'd end up with less diversity.
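The counting argument behind the smaller effective size of the X can be made explicit. This is a minimal sketch, assuming equal numbers of breeding females and males and no other differences between the sexes; the function name is illustrative. It also compares the means quoted above for North Carolina against that neutral expectation.

```python
from fractions import Fraction


def x_to_autosome_copy_ratio(females=1, males=1):
    """Ratio of X-chromosome copies to autosome copies in a population.
    Females carry two X copies, males one; both sexes carry two copies of
    each autosome. With equal numbers of each sex this is the simple
    neutral expectation for relative diversity."""
    x_copies = 2 * females + males
    autosome_copies = 2 * (females + males)
    return Fraction(x_copies, autosome_copies)


expected = x_to_autosome_copy_ratio()   # equal sex ratio
observed_nc = 0.004 / 0.006             # X and autosome means quoted in the talk
```

With the quoted means, the observed ratio is about two thirds, below the three quarters that the copy count predicts.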
But, in fact, the theoretical expectation is the X ought to have three quarters the diversity of the autosomes, and it's more like half, in this case. Well, it turns out there's good theory for this. In a population that's undergoing a bottleneck, you actually expect to see a more severe reduction in diversity in the X compared to the autosomes. This is a paper by John Pool and Rasmus Nielsen. If you compare them now, in Africa, the X to autosome ratio is about 65 percent, which is again already lower than the 75 percent expectation if it were neutral. For North Carolina, it's about 50 percent. North Carolina's a derived population. And you can see, in fact, there was a greater reduction in diversity on the X compared to the autosomes, nicely consistent with that expectation. The final, sort of, point I wanted to illustrate from these data at the sort of genome-wide scale is that regions of low recombination, particularly around centromeres, show dramatically reduced heterozygosity. We see that both in Africa and North Carolina, if one actually looks then at the local recombination rate, intensity estimated as centimorgans per megabase pair, so this is estimated again from many of the mapping experiments done with flies over the years, so for the given local recombination rate, what's the diversity in Africa and in North Carolina? And you see a pronounced positive correlation. Now, this is widely described in the literature as being attributable to the fact that there's a thing called the Hill-Robertson effect. Regions of very low recombination are going to suffer from the fact that a positively favored mutation is going to drag down diversity, so lower recombination will make a larger region swept to fixation and drop the diversity more than a region of high recombination.
And also in regions of low recombination, if there's negative selection, so deleterious mutations are occurring, it will also reduce the effective population size to a greater extent when there's a lower recombination rate, the so-called background selection model. So some combination of those two is driving this positive correlation. You see it in these completely independent samples of four individuals here and six individuals here. We see this positive correlation between recombination rate and level of diversity. One thing that might drive that is if recombination itself were slightly mutagenic so that when recombinations occur you'll also get mutations. That would drive a positive correlation between divergence and diversity. And, in fact, in flies we do not see that. This is the recombination rate again on this axis against melanogaster-simulans divergence, and there's no correlation. So it seems like recombination is not inducing this positive correlation. It really is a Hill-Robertson-like effect, the local sort of environment force, adaptive mutations is favoring a greater reduction in diversity in regions of low recombination. This made a paper back in 1999 with just 12 data points. Here are 30,000 data points. It seems to be true still, and so that was Begun and Aquadro back in '99. I wanted to take the last couple of minutes to sort of shift gears, and this is another project that was -- this was funded by NHGRI to look at the sort of comparative genomic lessons that we learn from a dozen different Drosophila genome sequences. And this is a particularly sociologically interesting project. It featured data from all the genome centers except Stanford, I think, including Agencourt Biosciences. I guess NISC didn't contribute to it, but many, many different groups over many years contributed to these data.
The choice of these 12 species was made on the basis of the fact that there's a huge diversity in the ecologies and sort of lifestyles of these different flies, over 400 million years of evolution spanned by that tree, so there's phenomenal saturation of mutation at many, many sites in the genome. Manolis Kellis, who was at MIT, was in charge of the analysis of the sort of annotation of the melanogaster genome and how we could improve the annotation of the melanogaster genome using these data. And this is just to illustrate something that I think you all know very well, which is that in protein coding regions, of course, you expect to see more substitutions between species at synonymous sites, because, after all, they would still retain the same amino acid sequence, you expect to see substitutions that preserve the reading frame, and you expect to see substitutions that replace one amino acid with another one that has very similar properties. So there are a number of different software tools, sort of Exoniphy and so forth. They’re very, very good at finding exons in the genome based on these sorts of signatures. So we can color code them based on their sort of attributes through these substitutions along this 12 species alignment. So this is the 12 fly species. Do those substitutions smell like they are protein coding substitutions? And color them green. If they're substitutions that are radical nonsynonymous changes or frameshifts, they’re less likely to be protein coding, we can color them in red. When we do this, we can identify regions of the genome that are more likely to be protein coding, and we can use it then to improve our annotation of any given genome. This is just to indicate the difference between that kind of strategy, and the strategy that’s based on just simple conservation. So this is the track that’s from the UCSC browser, the phastCons track, showing this sort of degree of conservation, it’s a very nice metric for overall conservation of the sequence.
And it’s showing again, so here’s the region of this particular gene, CG9945, the boxes again being the exons, and you'll see high conservation for most of those exons. But if you look carefully, you see there are regions where there actually is high conservation, but we failed to annotate an exon. Subsequently we see that it also has a very high probability of being a codon based on this evolutionary model. And, in fact, we go back and see that, in fact, that was another transcript. There is, in fact, an exon in that region of the genome. So that’s a novel sort of annotation to the melanogaster genome that came about because of our comparison to these 12 species, giving us much more power to detect these things. We also see things like this where FlyBase was right, it says there’s not an exon there, but we see very high conservation when we look at the protein coding signal. In fact, the substitutions that do occur in that region look very un-exon-like [spelled phonetically], and, in fact, there’s not an exon there. So this is again where there’s a big difference between simple conservation in the protein coding region. So these sorts of methods resulted in many new annotations of the melanogaster genome, particularly in protein coding regions, some 413 cases of translation start changes, 912 cases of different splice signals, 240 cases of polycistronic genes, and so forth. So we gain considerable power in the annotation of a genome by looking at this sort of comparative approach. Just two other stories with this sort of comparative evolution analysis, one of them has to do again with this X versus autosome comparison. Now, I made it sound simple. Comparing X and autosome, the X has a smaller effective size, so it looks like it ought to have more drift than the autosomes, because it’s a smaller effective size. Things bounce around stochastically more in a smaller effective size. On the other hand, the X chromosome's hemizygous in males. 
So any mutation that occurs that’s expressed in males, even if it’s a recessive mutation, is immediately expressed. So there’s no masking in the sense of being hidden -- for rarely [spelled phonetically] or being hidden in males. There’s no such thing as recessivity, and so one expects then that deleterious mutations ought to be more effectively screened by natural selection. Recessive advantageous mutations ought to be more effectively, and more quickly identified by natural selection, and dragged up in frequency. So these things lead in opposite directions. The latter results in an expectation of the X chromosome ought to be evolving faster. And so across the set of papers that have addressed this, of the evolution of X versus autosome in different organisms, it’s a wildly chaotic literature with very poor consistency. And if you look at the neutral divergence, so this is at synonymous sites, the divergence at nonsynonymous sites, now with this full genome data and all these species, again you see this sort of pattern where some lineages, it looks like the X is faster, other lineages, the X is slower. And you can see why the literature has been so confused because even with whole genome data, the pattern is still rather on knife's edge. Depending on changes in the demography and other things, the X or the autosome looks like it’s going to be evolving more quickly. The one thing that’s absolutely universal is codon bias, which is the degree of codon bias is greater on the X chromosome than the autosomes all the time. Now, codon bias refers to the differential use of the synonymous codons. The fact that the codon bias is greater in the X means that the population's better able to discriminate between these very weak differences in selection between different alternative forms. And those are almost certainly recessive differences. 
The fact that the X chromosome is detecting them is saying that again it's probably because of this increased efficacy of natural selection to see variants on the X chromosome than on autosomes, and remarkably consistent on every branch of this tree except for a couple where there's just insufficient power. One final point has to do with the evolution of innate immunity. This is the idea of taking a pathway for sort of any process you can imagine and having it descend down this 12 species phylogeny. It's kind of exciting to imagine, so, "How is the pathway tuned, how is there pressure from pathogens exerting itself on this pathway, do we see accelerated evolution in recognition or effector molecules, how does this go?" And this is work of Tim Sackton in my lab and a number of collaborators, in a paper that's still actually just in review at Nature Genetics. And one of the punch lines that came from this is centered around the gene, Relish, which is one of the transcription factors that results in a launching of transcription of a number of the antimicrobial peptides. There's an inhibitor domain on Relish that's joined by a linker. That particular linker region -- this is showing again sort of signatures of positive selection on this axis against position along the gene -- most of the signatures of positive selection are right in that linker region. Other proteins are actually involved in cleaving that linker: Dredd has a caspase domain that actually does that cleavage. And again the caspase domain shows us an excess of signatures of positive selection. So it's an intriguing case where in just the melanogaster lineage we're seeing multiple signatures of positive selection on just that part of the pathway. So this is a paper with over 244 authors. It's going to be in the November 8th issue of Nature. It's been a tremendous sort of a thrill, and opportunity, and privilege really to be working with this group. And at that I'll just close for questions. Thanks. [applause]
Dr. Eric Green: Questions from the floor. So, Andy, what are the challenges to take these approaches and then apply them to studies of human populations? Obviously flies are going to be much simpler, and clearly we want to be able to do the kinds of comparisons you were showing with conservation and better annotation of proteins. But it must not be an easy generalization.
Dr. Andy Clark: Well, I mean, there are a number of things that are different in the human situation because we're sort of coming at it in a different context. We have so much data from the million base pair SNP typing platforms and so forth that I think doing something like short read sequencing on top of those million SNPs will really, really sort of leverage each other in a very exciting way. Some of the primary issues of estimation of these parameters are less important, perhaps for human than for sort of model organism studies. We're much more interested in the medical sort of questions where it's so important to get real accuracy of individual calls. But that's where we have -- in the context of the HapMap project and understanding that haplotype background, these methods that allow one to impute missing data, if you combine that sort of imputation with this short read sequencing, I think they could really, really fuel each other.