Composition and Dynamics of the Human Virome - Frederic Bushman

Frederic Bushman: Well, thank you very much for inviting me to speak at this exciting meeting. It's been really interesting to hear the outcome of many of the microbiome projects and to hear the summaries from the people who did the work. So my title is Composition and Dynamics of the Human Virome. So I'll briefly introduce the planetary virome, the human virome, and a little bit about approaches you can take to studying this problem. Then I'll focus much of the lecture on what is the composition of the human virome and how does it change over time, with an emphasis on gut. And then at the end I'll touch on some open questions and future directions, introduce a little more data, and talk about challenges going forward. So the global virome is remarkable. In seawater, there are judged to be something like 10^7 viral particles per mL. Viruses outnumber their hosts by a factor of something like 10 in seawater, and near the end I'll be arguing that that's probably true in gut also. Estimates, multiplying this out, there's something like 10^31 viral particles on Earth, which makes viruses numerically the most successful biological entities on the planet, and also some of the coolest. So this picture has been developed by our own Lita Proctor of HMP, Forest Rohwer, Curtis Suttle, and many other workers and this picture shows the EM -- some of the viruses that Lita caught in one of her studies of marine phages. The human virome is similarly gigantic. Perhaps most familiar will be the persistent latent infections that are extremely common: *** virus, for example; papillomavirus; many of the *** viruses -- in fact, almost all of the people in the U.S. population -- papilloma also very common; *** and HCV, something like 1 percent of the population. We're also very familiar with the viruses that infect us transiently, like cold viruses or flu. Sometimes we deliberately infect ourselves with viruses like the vaccinia live smallpox virus vaccine strain.
But still, this is really the tip of the iceberg. The human genome itself is composed of something like 8 percent fragments of DNA that are discoverably remnants of retroviruses that infected the primate lineage on its way leading to humans. And so here's a bit of one of the human chromosomes shown here. Here's some genes. This track that says LTR is the long terminal repeats of a retrovirus, and you can see there are many, many of them in this bit of the human genome. So our genomes themselves are, in part, viral. And on top of that, our bacteria and archaea have enormous numbers of viral predators, bacteriophage predators. Counts suggest something in the range of 10^10 to 10^11 viruses per gram of human stool, and although these numbers, I think, are kind of loose, there's no question that the size of the communities is totally gigantic. So, obviously, one of the things that's really launched this area has been the development of the new deep sequencing methods, and so you can apply those to studies of viruses in a variety of ways. One is to hunt for new pathogens; another is to watch viral evolution, for example, *** quasispecies evolving in response to the introduction of a new anti-viral agent and developing resistance. You can characterize where integrating viruses are integrating into genomes. And you can characterize complex uncultured communities of viruses. So what I'll be emphasizing for much of the talk is going to be this last one, using deep sequencing, mostly Illumina Hi-Seq, to characterize the human virome and see what's out there. And many strong labs have contributed these kinds of studies: Forest Rohwer, Suttle, Lipkin, Beatrice Hahn, and many, many others. So what is the composition of the human virome, and how does it change over time? So we've been studying this in the context of our Human Microbiome Demonstration Project focused on diet, genetic factors, and the gut microbiome in Crohn's disease.
It's a pleasure to acknowledge my co-PIs: Gary Wu and Jim Lewis. Gary is here. Our main project -- our main study of Crohn's disease longitudinally in pediatric cases is just finishing up, and we'll tell you about that soon, but in this talk I'm going to focus on a set of papers from a really strong graduate student, Sam Minot, characterizing the virome piece of the gut microbiome. And we've been fortunate to have funding from NIDDK, HMP, and others. So we've been purifying viruses from stool, and it's kind of a general principle in these sorts of microbiome studies that exactly what you do has a very strong effect on the interpretations you can make at the end, and, you know, your choices have a lot of effect on the outcome. So the way we've been doing this, we've been highly purifying viral fractions. We've almost certainly been losing some of the viruses along the way. We already -- I'll be telling you about DNA viruses. For the most part, we haven't put much effort into studying RNA viruses. So we're only looking at a slice of the population, but because it's highly purified, we can point to each read and say, "That probably came from a virus." Even if it doesn't look like anything we've seen before, even if it doesn't look like anything in the databases, that's probably still something sequenced that was inside a viral particle. So what we've been doing is taking stool, homogenized, several filtration steps; then either purified by banding in cesium chloride or Centricon ultrafiltration; then chloroformed to rupture membranes, degrade unprotected DNA with DNase to get rid of contaminating DNA; and then break open the capsids and get out the DNA from inside the viral capsids; and then either 454 or Hi-Seq metagenomic sequencing. So this shows an example of one of Sam's studies.
This is 12 healthy subjects, cross-sectional, about 40 billion bases of Hi-Seq data assembled using one of the De Bruijn graph methods, and then comparing the contigs that we see. So here's the contig spectrum with the length here of the contigs and the sequencing depth on the y-axis. So you can see we got lots and lots of contigs. This is all 12 subjects. Some of them are out to, like, 100 kilobases. Some are circular, they close as circles, which suggests we probably got all of the genome for those guys, so you can see we've got a lot of ones that are probably complete. The rest are a mixture of complete and partial. When we try to align these to databases, we see very little resemblance to anything that's in the databases for the vast majority. So that's kind of illustrated down here at the bottom. We did find one animal cell virus, a human papillomavirus type 6b; not unexpected, very high coverage. Everything else we saw had either low resemblance to bacteriophages in the database or no resemblance to anything at all. And so this shows some of the phage data where we're aligning our reads and you can see little bits of alignment. Gray means perfect, red means mismatched, and you can see we're seeing patchy, sketchy matches, and these were the best of them. So, really, most of what we're seeing in these kinds of samples is new, and between humans we see very little resemblance also. So really striking: mostly phage that we can identify, one eukaryotic virus, and then a lot of stuff we don't know what it is for sure, though we're guessing most of it's phage. Something like 500 to 1,000 types per individual. Okay, so we can do a little better assigning the genes if we align genes within this data set.
We've got so many genes, open reading frames, from the Hi-Seq data that we could ask who resembles whom. We could find something like 25 percent of phage ORFs have a match to a database ORF at a very permissive threshold, but then within the datasets, 58 percent have at least one match at 30 percent identity or better. So that allowed Sam to then ask, well, how about arrangements, multigenic regions? Do we see conservation of gene order and gene type into cassettes, which is well known to be a structural feature of bacteriophage genomes? And indeed we do. On the right you can see several of these cassettes where you can see multiple gene types being similar, and these are derived from different subjects. The subject numbers are on the left here. And here's a cassette with several different gene types seen over lots of subjects. So the mean proportion of contigs covered by cassettes looks like something like 27 percent, so when we look within the deep sequencing data itself, we can start to see some forms of order. We've also got several cases where we've sequenced whole stool DNA, mostly bacteria, assigned the genes and looked at the kinds that are there, and then similarly done the same for viruses or virus-like particles, assembled, assigned ORFs to the ontology, and then compared the two, and that's shown here. So you can see that bacteria, in yellow, often devote a lot of their coding capacity to carbohydrate metabolism, amino acid metabolism, translation, ribosomes, phage. Very little, if any, viruses; very little, if any. Viruses, on the other hand, in red, show a lot of genes -- a high proportion of their gene content devoted to replication or recombination and repair. So viruses are parasites, and you can really see it coming through in the metagenomic data in a comparison like this. Okay, so one thing we wanted to investigate was how does the human gut virome change over time?
And in part this is investigating at the same time the question of why are humans so different from each other? You know, what is going on with the gut virome that might explain this? So we studied one human individual for two and a half years by a dense time series analysis of stool samples, and so that's sort of diagrammed here. We have a whole bunch of time points we studied, and importantly, Sam studied a number of time points twice. He took the same stool sample and did two separate purifications of viruses from those samples, sequenced and analyzed. So we have an internal measure of -- within sample variation or within time-point variation, and then we can compare that to between time-point variation, and, you know, ask if that's larger. So we purified viruses. Fifty-seven billion bases of sequence from Hi-Seq assembled with the De Bruijn graph method. We also did Hi-Seq analysis of stool DNA for three widely-spaced time points. So we have a look at the bacterial communities at least at three time points. Assemble -- we get something like in the range of 500 contigs, average of 82-fold sequence coverage. So the contig spectrum is shown here. Again, it's on the left. It's contig length by fold coverage. Some of these we now get up to million-fold coverage, these -- for these small circular guys. We're now well over 100 kilobases for some of these viruses, and now more of them -- a larger proportion seem to be circular, so those we're getting complete sequences for at least some of them. The middle panel shows Jaccard index on the y-axis; that's asking for resemblance in community membership versus time interval with long-time intervals, but between time points on the right and short ones on the left. And you can see that even after our longest time points, in the range of two and a half years, we're still seeing something like 80 percent of the community membership is still the same, as with the earlier time points. 
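[Editorial sketch] The Jaccard index mentioned above measures shared community membership between two time points as the size of the intersection divided by the size of the union. The following is a minimal illustration only, assuming each time point is summarized as a set of detected contig identifiers; the contig names and the helper `jaccard_index` are hypothetical and are not taken from the study's analysis.

```python
def jaccard_index(members_a, members_b):
    """Return |A intersect B| / |A union B| for two collections of contig IDs."""
    a, b = set(members_a), set(members_b)
    union = a | b
    if not union:
        # Two empty communities are treated as identical by convention here.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical contig sets detected at an early and a late time point.
early = {"contig_1", "contig_2", "contig_3", "contig_4", "contig_5"}
late = {"contig_1", "contig_2", "contig_3", "contig_4", "contig_6"}

# Four contigs are shared out of six distinct contigs overall.
similarity = jaccard_index(early, late)
print(f"Jaccard index: {similarity:.2f}")
```

A value near 1 means the two time points contain nearly the same set of contigs; plotting this value against the interval between samples gives the kind of curve described for the middle panel.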
So, for the most part, the virome is hanging around. We're seeing the same forms all through the time series studied. We did a form of rarefaction analysis here. We took our contigs, and then asked how many reads or how many samples did it take to get those contigs. So at least for the major contigs we were picking up, we seemed to be pretty saturated. But I now think this is a little misleading. I don't think we did as well as this might imply, and the reason is that we've recently done sequencing in the same communities with another method, using the wonderful Pacific Biosciences single molecule sequencing approach, which brings in its own set of issues but at least the biases are different compared to Illumina. And we acquired 138 megabases of single molecule sequencing data, and when we compared to the Illumina contigs, we found only 30 percent overlap. So this is just in the last few weeks. We're still trying to put this all together into a single picture, but it seems quite clear that our numbers are headed upward for the different kinds as we layer in more sequencing methods. And this just shows that some PacBio contigs can link two Illumina contigs in this sort of dot matrix representation, and we have examples where Illumina contigs have linked several PacBio contigs. So the picture on this viral community I think is getting better, but it does really illustrate that how you measure has a pretty strong effect on what you find. And so we're trying to put together a hybrid assembly with this, and, who knows, maybe further methods to try to get a more complete picture. So with these contigs in hand and this longitudinal data, we're in a position to be able to ask, well, how did these communities change over the two and a half year period studied? So one is simple accumulation of base substitutions. We broke it out into different viral groups. Some changed very little. The temperate phages didn't seem to evolve very fast.
But this group, the Microviridae, which includes phiX174, evolved very fast. These are single-stranded small genomes in the five kilobase range, single-stranded circular DNAs. It turns out it's known that single-stranded DNA viruses evolve more like RNA viruses, really fast, compared to the double-stranded DNA viruses. And this shows a phylogenetic tree of some of the Microviridae we caught, and these show a time series for several of these where the first sample's at the top, and time is proceeding going downward, and you can see accumulation of base substitutions in the genome. Now, the champions here were over 4 percent substitution over the time series studied, and that's subtracting out that within time-point variation; this is an increase over time. And so this is sort of cool. In the Microviridae taxonomy, we see some species separated by smaller values; 3.5 percent base substitution differences distinguish species for some members of this group, and we had a couple of viruses that changed more than that. So you could say we're coming into the range where we were watching speciation events in the gut virome over the two and a half years that we studied this. Oops. So we could also see longitudinal changes associated with the CRISPR systems. So, remember, we have bacterial DNA sequences, we have phage, so we could see that six viral genomes were targeted by bacterial CRISPRs. So this is just an example of one contig, and CRISPR spacers from bacteria targeted the virus in several positions. Probably most of you are familiar with the CRISPR system. It's akin to RNAi. The bacteria have spacer sequences that are transcribed and then used as recognition elements to destroy incoming genetic parasites. We had one example of a possible viral escape mutant, where we have a CRISPR targeting a sequence in blue.
That sequence goes away in the population and an orange sequence takes over that has a point mutation in that CRISPR recognition site. So that might have been a bit of an escape event. So that's one form of change associated with CRISPR systems that we're seeing. Another: the bacteriophages, the viral contigs themselves, have CRISPR systems. So some of the phage are encoding CRISPRs themselves, and in one case, we had a CRISPR spacer in one phage that targeted another phage in the same person. So it was as though phages were fighting it out with each other using the CRISPR system, which was pretty cool to see. And so we could see longitudinal change associated with the CRISPR arrays also. We could watch them change over time for a couple of the phages. So a third form of variation is associated with the diversity-generating retroelements. These are these amazing reverse transcriptase-based targeted hypermutagenesis systems first discovered by Jeffrey Miller and coworkers. So in Bordetella, phage BPP-1 has a problem: Bordetella changes its surface coat periodically. But the phage hyper-mutagenizes the gene for its tail fiber recognition moiety, so that once in a while, when the Bordetella phase varies, there are phages in the population that actually can recognize the new protein and then proliferate. And so we saw systems that looked like Miller's where there's the major tropism determinant gene with this hypervariable region, the tail fiber protein. Nearby is a template region, identical region, and nearby is a reverse transcriptase. And it turns out the mechanism involves transcription of the template region, error-prone reverse transcription of that copy to make a DNA copy, and absorption of that DNA copy into the MTD locus. So in just model-independent studies, where we tracked along these phage genomes looking for regions of high variation, these things really fell into our laps.
And so this shows the colors or reads. If there was no base changes, there'd be -- the columns would be the same color as in these template regions, but in these variable regions you can see extreme levels of variation, so extreme that in those short regions, every read's different from every other read in some cases. And these were associated with this kind of target template structure and a nearby reverse transcriptase gene. So we saw systems that looked like Miller's MTD. We saw others that were hyper-variablizing other types of coding regions, including IG family proteins, which is pretty cool. The vertebrate immune system is hypermutating such genes to make the T-cell receptor and immunoglobulins. Phage are hyper-mutagenizing these using a completely different method based on reverse transcriptase. And so there -- we could see these reverse transcriptases. They're a specific subset of the reverse transcriptase family. We could see them quite abundantly in these phage and associate them with these hypermutagenesis mechanisms. One of these seemed to be active in the longitudinal data that we got. So we could look at these kinds of DGR, diversity-generating retroelement systems, identify them, and for one of them we could say that we were seeing it be active over the two and a half year time period that we studied. Others we -- it's not clear if we didn't have enough power to be sure they were active, or whether they, in fact, were inactive, which raises an interesting question of whether the mutagenesis mechanism may be biologically regulated, and maybe there's something to figure out there. So circling back, why do humans harbor such huge viral populations, and why are humans so different from each other? Well, it's -- one of the most central messages for many of these microbiome studies has been that humans have different microbes that are colonizing them in a great deal of individual variation. 
And so naturally, the predators on those microbes, the phages, are likely to be different also for that reason, at least in part. But something we can add from this study is that at least some of the phage seem to be changing really, really fast, so that when a virus colonizes a human, it diversifies pretty quickly, and, you know, over the lifetime of an individual, there will be a lot of diversification. The phage populations do seem to be pretty stable, so we think that this rapid change may be another piece helping to understanding why human viral populations look so different from each other. So let me now go on to the open questions and future directions. One key thing in these sort of virome studies over many labs and sort of many related issues is the question of new -- finding new viruses and associating them with diseases. And so a number of really strong labs have carried out a lot of these sorts of studies, Lipkin, Craig, ***, Darise [spelled phonetically], ***, Realm [spelled phonetically], and many others. And so this is a big challenge because -- just because some virus is there, you don't know about cause and effect, as several speakers have mentioned. Did it cause a disease? Did the disease state make the individual susceptible to the virus? Did some third thing cause both of them? You really don't know, but, obviously, there's a lot that could be done to streamline and develop this process. So this just shows an example from our work, where we have fecal shotgun data on two severe combined immune deficiency kids, and one healthy kid. One of these kids was having GI problems, and when we sequence whole stool DNA, we find 20 percent of the reads are this little-studied bocavirus, a parvovirus. It was only discovered in 1995, benign as far as anyone knows, but boy, this is one heck of a lot of virus in this kid's gut, and the kid was having GI problems at the time. 
So this is just one of many possible illustrations of the general problem of associating a virus that you see by molecular methods with a disease state. And then other questions. What is the relative abundance of phages and their hosts in human feces, and then dynamics of predation following from that. So what I'm going to tell you about now is sort of a sketch for a calculation. We're trying to work our way through this. A provisional picture would be as follows. So, from purified phage DNA in one human, the deeply-studied one I described, we can see most of the viruses, or at least a lot of them, so we can recognize them, even though they mostly don't look like anything you've ever seen before. We also have whole stool DNA from this individual, so we can ask what proportion of the whole stool DNA is comprised by the viral DNA in the purified sample. So we find something like 5 percent of the DNA total looks like phage DNA, though this is headed upward. We haven't added in the PacBio data yet. So, however, the genomes of the bacteria and phage are much different in size, maybe differing by a factor of 100. So multiplying by that, we infer that the phage outnumber their host in gut by about fivefold, at least as a provisional first look, and, again, we have more work to do to make this all real. So if that's true, we can start to think about predation rates. So material moves through the gut continuously, so if phage are at a constant abundance, then they must be created at a rate to replace the ones that are getting washed out as material moves through the gut. So let's say the transit time for this individual was one day -- I'm making this up -- and estimate that there are 10^10 bacteria per gram of stool, then there must be something like 5 x 10^10 phage per gram. So if a phage burst contains 100 phage, then you have to kill off 5 x 10^8 bacteria to supply the number of phages that you're measuring that do seem to be at a steady state.
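[Editorial sketch] The back-of-the-envelope calculation above can be written out as a few lines of arithmetic. This is only an illustration of the reasoning in the talk, using the speaker's provisional numbers (5 percent of stool DNA from phage, a 100-fold genome size difference, 10^10 bacteria per gram, a burst size of 100, and a made-up one-day transit time). The function names are hypothetical, and the ratio uses the same simple "fraction times size ratio" approximation as the talk rather than a more exact formula.

```python
def phage_per_host(phage_dna_fraction, genome_size_ratio):
    """Approximate phage:bacteria particle ratio, as estimated in the talk.

    phage_dna_fraction: share of total stool DNA that looks like phage DNA.
    genome_size_ratio: bacterial genome size divided by phage genome size.
    """
    return phage_dna_fraction * genome_size_ratio


def steady_state_predation(bacteria_per_gram, ratio, burst_size, transit_days=1.0):
    """Return (phage per gram, bacteria lysed per transit, fraction killed per day).

    Assumes phage abundance is constant, so every phage washed out during one
    gut transit must be replaced by phage released from lysed bacteria.
    """
    phage_per_gram = bacteria_per_gram * ratio
    lysed_per_transit = phage_per_gram / burst_size
    fraction_per_day = lysed_per_transit / bacteria_per_gram / transit_days
    return phage_per_gram, lysed_per_transit, fraction_per_day


ratio = phage_per_host(phage_dna_fraction=0.05, genome_size_ratio=100)
phage, lysed, fraction = steady_state_predation(
    bacteria_per_gram=1e10, ratio=ratio, burst_size=100, transit_days=1.0
)
print(f"phage per host: about {ratio:.0f}")
print(f"phage per gram: about {phage:.0e}")
print(f"bacteria lysed per gram per transit: about {lysed:.0e}")
print(f"fraction of bacteria killed per day: about {fraction:.0%}")
```

Because every input is soft, the useful part is how the output scales: doubling the phage DNA fraction or halving the burst size doubles the inferred daily kill fraction.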
So that would say that something like 5 percent of all bacteria are killed per day by phage predation. So, again, all these numbers are very soft, and an ongoing project in the lab is to try to make these numbers more real, but it gives you a sense that maybe a substantial fraction of all the bacteria in the gut are getting killed off by phage daily. And then the last open question, or new direction, I want to introduce we could call virome epigenetics. As you all very well know, the human genome is subject to CpG methylation. Recently, hydroxymethylation was discovered and generated a great deal of interest. Phage totally dwarfs that. There are dozens of kinds of covalent DNA modification that have been reported in prokaryotic viruses. This lists a few of them. They get very exotic, including alpha-putrescinylthymine, dihydroxypentyluracil, here's glucosylated DNA, which is a characteristic of phage T4. There's a gigantic zoo of DNA modifications that are present in bacterial viruses. They are known to block attack by nucleases, contribute to gene control. We're guessing they have additional functions also. So what's cool with the PacBio method is that you can read out some of these modifications by their sequencing method. So this just illustrates their technology. It's single-molecule, immobilized polymerase, the template tracks through. What they noticed is if you have a modification, this red base here, you can get a characteristic change in the kinetics of incorporation, and different chemistries seem to have different effects on the interpulse interval and P-chitin [spelled phonetically]. So they have a way of reading out at least some of the modifications that are present in these viral genomes. So we've started to look through this for the deeply-sampled subject I described. We're seeing various kinds of modification shown by the bars on the sequences.
For some of them we can assign recognition sites. It looks like we've got one form at least that's new because it's on G, and none of the other ones I showed you there were G. And as a general summary, looking over all the contigs, it seems like 80 percent of all the viral contigs are showing signs of covalent DNA modification. And so this looks like a really, really cool area to begin to explore and try to understand the functional significance. Okay, so that concludes what I wanted to tell you. I introduced the global virome and the human virome. There is one heck of a lot of viruses associated with our bodies. The human virome -- most viruses hang around longitudinally in the carefully-studied individual we looked at, but some specific viruses changed a lot over time, and that may help understand why humans are so different from each other. And open questions and future directions: assigning viruses to disease efficiently using metagenomic data, dynamics of predation, and then, lastly, viral epigenetics. And so it's a great pleasure to again acknowledge my colleagues Gary Wu and Jim Lewis, and Sam Minot led a lot of the viral work that I described, and Tyson Clark at PacBio has helped us a lot also. So thank you very much for your attention. [applause]
Male Speaker: Okay, so, quickly, I'd like to ask you your current thoughts on the role viruses play in horizontal gene transfer of microbiota.
Frederic Bushman: Oh. Viruses are big players in horizontal gene transfer. There are typically three main mechanisms in prokaryotes: transduction, transformation, and mating. So temperate phages move genes around. They're known to be medically important because integrating phages can carry toxins, adhesins that modify the phenotype of their bacterial hosts. So they're one of the major agents, but not the only major agent.
Owen White: Okay, let's thank Ric and all of the speakers from this morning.
[applause] We just have a few quick announcements that are important. The first is that at 1:30 sharp, the afternoon session will begin, so we're now into the lunch and poster. Speaking of the posters, today are the even number presenters and tomorrow are the odd numbers, and I think Friday are prime numbers or something. So one of the benefits of being here physically in person is that you get to vote on what you consider the best poster, after you've seen all of the posters. And there's an NSA-proof secret compartment on the back of your badge with a green piece of paper that is to be used for the voting. So, again, I want to personally thank this -- the speakers. I had a really good time hosting this morning's session, and we'll see you at 1:30. [applause]