Frederic Bushman: Well, thank you very much for inviting me
to speak at this exciting meeting. It's been really interesting to hear the outcome of
many of the microbiome projects and the -- hear the summaries from the people who did the
work.
So my title is Composition and Dynamics of the Human Virome. So I'll briefly introduce
the planetary virome, the human virome, and a little bit about approaches you can take
to studying this problem. Then I'll focus much of the lecture on what is the composition
of the human virome and how does it change over time, with an emphasis on gut. And then
at the end I'll touch on some open questions and future directions, introduce a little
more data, and talk about challenges going forward.
So the global virome is remarkable. In seawater, there are judged to be something like 10^7
viral particles per milliliter. Viruses outnumber their hosts by a factor of something like
10 in seawater, and near the end I'll be arguing that that's probably true in gut also. Estimates,
multiplying this out, there's something like 10^31 viral particles on Earth, which makes
viruses numerically the most successful biological entities on the planet, and also some of the
coolest. So this picture has been developed by our own Lita Proctor of HMP, Forest Rohwer,
Curtis Suttle, and many other workers and this picture shows
the EM -- some of the viruses that Lita caught in one of her studies of marine phages.
The human virome is similarly gigantic. Perhaps most familiar will be the persistent latent
infections that are extremely common: *** virus, for example; papillomavirus; many of
the *** viruses -- in fact, almost all of the people in the U.S. population -- papilloma
also very common; *** and HCV, something like 1 percent of the population. We're also very
familiar with the viruses that infect us transiently, like cold viruses or flu. Sometimes we deliberately
infect ourselves with viruses like the vaccinia live smallpox virus vaccine strain. But still,
this is really the tip of the iceberg. The human genome itself is composed of something
like 8 percent fragments of DNA that are discoverably remnants of retroviruses that infected the
primate lineage on its way leading to humans.
And so here's a bit of one of the human chromosomes shown here. Here's some genes. This track
that says LTR is the long-terminal repeats of a retrovirus, and you can see there are
many, many of them in this bit of the human genome. So our genomes themselves are, in
part, viral. And on top of that, our bacteria and archaea have enormous numbers of viral
predators, bacteriophage predators. Counts suggest something in the range of 10^10 to
10^11 viruses per gram of human stool, and although these numbers, I think, are kind
of loose, it's -- there's no question that the size of the communities is totally gigantic.
So, obviously, one of the things that's really launched this area has been the development
of the new deep sequencing methods, and so you can apply those to studies of viruses
in a variety of ways. One is to hunt for new pathogens; another is to watch viral evolution,
for example, *** quasi species evolving in response to a new -- introduction of a new
anti-viral agent and developing resistance. You can characterize where integrating viruses
are integrating into genomes. And you can characterize complex uncultured communities
of viruses. So what I'll be emphasizing for much of the talk is going to be this last
one, using deep sequencing, mostly Illumina HiSeq, to characterize the human virome and
see what's out there. And many strong labs have contributed these kinds of studies: Forest
Rohwer, Suttle, Lipkin, Beatrice Hahn, and many, many others.
So what is the composition of the human virome, and how does it change over time? So we've
been studying this in the context of our Human Microbiome Demonstration Project focused on
diet, genetic factors, and the gut microbiome in Crohn's disease. It's a pleasure to acknowledge
my co-PIs: Gary Wu and Jim Lewis. Gary is here. Our main project -- our main study of
Crohn's disease longitudinally in pediatric cases is just finishing up, and we'll tell
you about that soon, but in this talk I'm going to focus on a set of papers from a really
strong graduate student, Sam Minot, characterizing the virome piece of the gut microbiome. And
we've been fortunate to have funding from NIDDK, HMP, and others.
So we've been purifying viruses from stool, and it's kind of a general principle in these
sorts of microbiome studies that exactly what you do has a very strong effect on the interpretations
you can make at the end, and, you know, your choices have a lot of effect on the outcome.
So the way we've been doing this, we've been highly purifying viral fractions. We've almost
certainly been losing some of the viruses along the way. We already -- I'll be telling
you about DNA viruses. For the most part, we haven't put much effort into studying RNA
viruses. So we're only looking at a slice of the population, but because it's highly
purified, we can point to each read and say, "That probably came from a virus." Even if
it doesn't look like anything we've seen before, even if it doesn't look like anything in the
databases, that's probably still something that -- sequenced that was inside a viral
particle.
So what we've been doing is taking stool, homogenizing it, several filtration steps;
then either purifying by banding in cesium chloride or Centricon ultrafiltration; then
chloroform to rupture membranes; degrade unprotected DNA with DNase to get rid of contaminating
DNA; and then break open the capsids and get out the DNA from inside the viral
capsids; and then either 454 or HiSeq metagenomic sequencing.
So this shows an example of one of Sam's studies. This is a 12 healthy subjects cross-sectional,
about 40 billion bases of Hi-Seq data assembled using one of the De Bruijn graph methods,
and then comparing the contigs that we see. So here's the contig spectrum with the length
here of the contigs and the sequencing depth on the y-axis. So you can see we got lots
and lots of contigs. This is all 12 subjects. Some of -- out to, like, 100 kilobases. Some
are circular, they close as circles, which suggests we probably got all of the genome
for those guys, so you can see we've got a lot of ones that are probably complete. The
rest are a mixture of complete and partial.
When we try to align these to databases, we see very little resemblance to anything that's
in the databases for the vast majority. So that's kind of illustrated down here at the
bottom. We did find one animal cell virus, a human papillomavirus type 6b; not unexpected,
very high coverage. Everything else we saw looked either -- had either low resemblance
to bacteriophages in the database or no resemblance to anything at all. And so this shows some
of the phage data where we're aligning our reads and you can see little bits of alignment.
Gray means perfect, red means mismatched, and you can see we're seeing patchy, sketchy
matches, and these were the best of them. So, really, most of what we're seeing in these
kinds of samples are new, and between humans we see very little resemblance also. So really
striking: mostly phage that we can identify, one eukaryotic virus, and then a
lot of stuff we don't know what it is for sure, though we're guessing most of it's phage.
Something like 500 to 1,000 types per individual.
Okay, so we can do a little better assigning the genes if we align genes within this data
set. We've got so many genes from the HiSeq data, open reading frames, that we could
ask who resembles whom. We could find something like 25 percent of phage ORFs have a match
to a database ORF at a very permissive threshold, but then within the datasets, 58 percent have
at least one match at 30 percent identity or better. So that allowed Sam to then ask, well,
how about arrangements, multigenic regions? Do we see conservation of gene order and gene
type into cassettes, which is well known to be a structural feature of bacteriophage genomes?
And indeed we do. On the right you can see several of these cassettes where you can see
multiple gene types being similar, and these are derived from different subjects. The subject
numbers are on the left here. And here's a cassette with several different gene types
seen over lots of subjects. So the mean proportion of contigs covered by cassettes looks like
something like 27 percent, so when we look within the deep sequencing data itself, we
can start to see some forms of order.
We've also got several cases where we've sequenced whole stool DNA, mostly bacteria, assigned
the genes and looked at the kinds that are there, and then similarly done the same for
viruses or virus-like particles, assembled, assigned ORFs to the ontology, and then compared
the two, and that's shown here. So you can see that bacteria, in yellow, often devote
a lot of their coding capacity to carbohydrate metabolism, amino acid metabolism, translation,
ribosomes. Phage: very little, if any. Viruses: very little, if any. Viruses, on the other
hand, in red, show a lot of genes -- a high proportion of their gene content devoted to
replication or recombination and repair. So viruses are parasites, and you can really
see it coming through in the metagenomic data in a comparison like this.
Okay, so one thing we wanted to investigate was how does the human gut virome change over
time? And in part this is investigating at the same time the question of why are humans
so different from each other? You know, what is going on with the gut virome that might
explain this? So we studied one human individual for two and a half years by a dense time series
analysis of stool samples, and so that's sort of diagrammed here. We have a whole bunch
of time points we studied, and importantly, Sam studied a number of time points twice.
He took the same stool sample and did two separate purifications of viruses from those
samples, sequenced and analyzed. So we have an internal measure of -- within sample variation
or within time-point variation, and then we can compare that to between time-point variation,
and, you know, ask if that's larger.
So we purified viruses. Fifty-seven billion bases of sequence from Hi-Seq assembled with
the De Bruijn graph method. We also did Hi-Seq analysis of stool DNA for three widely-spaced
time points. So we have a look at the bacterial communities at least at three time points.
Assemble -- we get something like in the range of 500 contigs, average of 82-fold sequence
coverage. So the contig spectrum is shown here. Again, it's on the left. It's contig
length by fold coverage. Some of these we now get up to million-fold coverage, these
-- for these small circular guys. We're now well over 100 kilobases for some of these
viruses, and now more of them -- a larger proportion seem to be circular, so those we're
getting complete sequences for at least some of them.
The middle panel shows Jaccard index on the y-axis; that's asking for resemblance in community
membership versus time interval, with long time intervals between time points on the
right and short ones on the left. And you can see that even after our longest time points,
in the range of two and a half years, we're still seeing something like 80 percent of
the community membership is still the same, as with the earlier time points. So, for the
most part, the virome is hanging around. We're seeing the same forms all through the time
series studied.
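As a minimal sketch of the Jaccard comparison described here, assuming each time point is reduced to the set of contigs detected in it: the index is the number of shared members divided by the number of distinct members overall. The contig identifiers and the two sets below are hypothetical and are not data from the study.

```python
def jaccard_index(a: set, b: set) -> float:
    """Shared members divided by total distinct members of the two sets."""
    if not a and not b:
        # Two empty communities are treated as identical by convention.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical contig identifiers detected at two time points.
early = {"contig_1", "contig_2", "contig_3", "contig_4", "contig_5"}
late = {"contig_1", "contig_2", "contig_3", "contig_4", "contig_6"}

# Four contigs are shared and six distinct contigs appear overall.
similarity = jaccard_index(early, late)
print(f"Jaccard index: {similarity:.2f}")
```

Plotting this value for every pair of time points against the interval between them gives the kind of panel described above; a flat, high curve means membership is stable.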
We did a form of rarefaction analysis here. We took our contigs, and then asked how many
reads or how many samples did it take to get those contigs. So at least for the major contigs
we were picking up, we seemed to be pretty saturated. But I now think this is a little
misleading. I don't think we did as well as this might imply, and the reason is that
we've recently done sequencing in the same communities with another method, using the
wonderful Pacific Biosciences' single molecule sequencing approach, which brings in its own
set of issues but at least the biases are different compared to Illumina.
And we make -- we acquired 138 megabases of single molecule sequencing data, and when
we compared to the Illumina contigs, we found only 30 percent overlap. So this is just in
the last few weeks. We're still trying to put this all together into a single picture,
but it seems quite clear that our numbers are headed upward for the different kinds
as we layer in more sequencing methods. And this just shows that some PacBio contigs can
link to Illumina contigs in this sort of dot matrix representation, and we have examples
where Illumina contigs have linked several PacBio contigs. So the picture on this viral
community I think is getting better, but it does really illustrate that how you measure
has a pretty strong effect on what you find. And so we're trying to put together a hybrid
assembly with this, and, who knows, maybe further methods to try to get a more complete
picture.
So with these contigs in hand and this longitudinal data, we're in a position to be able to ask,
well, how did these communities change as -- over the two and a half year period studied?
So one is simple accumulation of base substitutions. By different -- we broke it out into different
viral groups. Some changed very little. The temperate phages didn't seem to evolve very
fast. But this group, the microviridae, which includes phiX174, evolved very fast. These
are single-stranded small genomes in the five kilobase range, single-stranded circular DNAs.
It turns out it's known that single-stranded DNA viruses evolve more like RNA viruses,
really fast, compared to the double-strand DNA viruses. And this shows a phylogenetic
tree of some of the microviridae we caught, and these show a time series for several of
these where the first sample's at the top, and time is proceeding going downward, and
you can see accumulation of base substitutions in the genome.
Now, the champions here were over 4 percent substitution over the time series studied,
and that's taking account -- that's subtracting out that within time-point variation, this
is an increase over time. And so this is sort of cool. We -- in the microviridae taxonomy,
we see some species separated by smaller values; 3.5 percent base substitution differences
distinguish species for some members of this group, and we had a couple of viruses
that changed more than that. So you could say we're coming into the range where we were
watching speciation events in the gut virome over the two and a half years that we studied
this.
Oops.
So we could also see longitudinal changes associated with the CRISPR systems. So, remember,
we have bacterial DNA sequences, we have phage, so we could see that six viral genomes were
targeted by bacterial CRISPRs. So this is just an example of one contig, and CRISPR
spacers from bacteria targeted the virus in several positions. Probably most of you are
familiar with the CRISPR system. It's akin to RNAi. The bacteria have spacer sequences
that are transcribed and then used as recognition elements to destroy incoming genetic parasites.
We had one example of a possible viral escape mutant, where we have a CRISPR targeting a
sequence in blue. That sequence goes away in the population and an orange sequence takes
over that has a point mutation in that CRISPR recognition site. So that might have been
a bit of an escape event. So that's one form of change associated with CRISPR systems that
we're seeing.
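A minimal sketch of how spacer targeting and a point-mutation escape could be checked, assuming exact string matching on both strands. This is a simplification: it is not the pipeline used in the study, and real analyses usually tolerate some mismatches. The spacer and contig sequences are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def spacer_hits(spacer: str, contig: str) -> list:
    """Return (position, strand) for every exact spacer match in the contig."""
    hits = []
    for strand, query in (("+", spacer), ("-", reverse_complement(spacer))):
        start = contig.find(query)
        while start != -1:
            hits.append((start, strand))
            start = contig.find(query, start + 1)
    return hits


# Hypothetical bacterial CRISPR spacer and the viral contig it targets.
spacer = "ACGTTGCAAGGCTA"
targeted = "TTTT" + spacer + "GGGG"

# Hypothetical escape variant: one base changed inside the recognition site.
escaped = targeted.replace("TGCAAGG", "TGCTAGG")

print(spacer_hits(spacer, targeted))
print(spacer_hits(spacer, escaped))
```

Under exact matching, a single substitution in the recognition site is enough to remove the hit, which is the pattern an escape mutant would show.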
Another, the viral -- the bacteriophages, the viral contigs themselves have CRISPR systems.
So some of the phage are encoding CRISPRs themselves, and in one case, we had a CRISPR
contig in one phage -- a CRISPR spacer in one phage that targeted another phage in the
same person. So it was as though phages were fighting it out with each other using the
CRISPR system, which was pretty cool to see. And so we could see longitudinal change associated
with the CRISPR arrays also. We could watch them change over time for a couple of the
phages.
So a third form of variation is associated with the diversity-generating retroelements.
These are these amazing reverse transcriptase-based targeted hypermutagenesis systems first
discovered by Jeffrey Miller and coworkers. So in Bordetella, phage BPP-1 has a problem:
Bordetella changes its surface coat periodically. But the phage hypermutagenizes the gene for
its tail fiber recognition moiety, so that once in a while, when the Bordetella phase
varies, there are phages in the population that actually can recognize the new protein
and then proliferate. And so we saw systems that looked like Miller's where there's the
major tropism determinant gene with this hypervariable region, the tail fiber protein. Nearby is
a template region, identical region, and nearby is a reverse transcriptase. And it turns out
the mechanism involves transcription of the template region, error-prone
reverse transcription of that copy to make a DNA copy, and incorporation of that DNA copy
into the MTD locus.
So in just model independent studies, where we tracked along these phage genomes looking
for regions of high variation, these things really fell into our laps. And so this shows
the colors of reads. If there were no base changes, there'd be -- the columns would be
the same color as in these template regions, but in these variable regions you can see
extreme levels of variation, so extreme that in those short regions, every read's different
from every other read in some cases. And these were associated with this kind of target template
structure and a nearby reverse transcriptase gene.
So we saw systems that looked like Miller's MTD. We saw others that were hyper-variablizing
other types of coding regions, including Ig family proteins, which is pretty cool. The
vertebrate immune system is hypermutating such genes to make the T-cell receptor and
immunoglobulins. Phage are hyper-mutagenizing these using a completely different method
based on reverse transcriptase. And so there -- we could see these reverse transcriptases.
They're a specific subset of the reverse transcriptase family. We could see them quite abundantly
in these phage and associate them with these hypermutagenesis mechanisms.
One of these seemed to be active in the longitudinal data that we got. So we could look at these
kinds of DGR, diversity-generating retroelement systems, identify them, and for one of them
we could say that we were seeing it be active over the two and a half year time period that
we studied. Others we -- it's not clear if we didn't have enough power to be sure they
were active, or whether they, in fact, were inactive, which raises an interesting question
of whether the mutagenesis mechanism may be biologically regulated, and maybe there's
something to figure out there.
So circling back, why do humans harbor such huge viral populations, and why are humans
so different from each other? Well, it's -- one of the most central messages for many of these
microbiome studies has been that humans have different microbes that are colonizing them
in a great deal of individual variation. And so naturally, the predators on those microbes,
the phages, are likely to be different also for that reason, at least in part. But something
we can add from this study is that at least some of the phage seem to be changing really,
really fast, so that when a virus colonizes a human, it diversifies pretty quickly, and,
you know, over the lifetime of an individual, there will be a lot of diversification. The
phage populations do seem to be pretty stable, so we think that this rapid change may be
another piece helping to understand why human viral populations look so different
from each other.
So let me now go on to the open questions and future directions. One key thing in these
sort of virome studies over many labs and sort of many related issues is the question
of new -- finding new viruses and associating them with diseases. And so a number of really
strong labs have carried out a lot of these sorts of studies, Lipkin, Craig, ***, Darise
[spelled phonetically], ***, Realm [spelled phonetically], and many others. And so this
is a big challenge because -- just because some virus is there, you don't know about
cause and effect, as several speakers have mentioned. Did it cause a disease? Did the
disease state make the individual susceptible to the virus? Did some third thing cause both
of them? You really don't know, but, obviously, there's a lot that could be done to streamline
and develop this process.
So this just shows an example from our work, where we have fecal shotgun data on two severe
combined immune deficiency kids, and one healthy kid. One of these kids was having GI problems,
and when we sequence whole stool DNA, we find 20 percent of the reads are this little-studied
bocavirus, a parvovirus. It was only discovered in 2005, benign as far as anyone knows, but
boy, this is one heck of a lot of virus in this kid's gut, and the kid was having GI
problems at the time. So this is just one of many possible illustrations of the general
problem of associating a virus that you see by molecular methods with a disease state.
And then other questions. What is the relative abundance of phages and their hosts in human
feces, and then dynamics of predation following from that. So what I'm going to tell you about
now is sort of a sketch for a calculation. We're trying to work our way through this.
A provisional picture would be as follows. So, from purified phage DNA in one human,
the deeply-studied one I described, we can see most of the viruses, or at least a lot
of them, so we can recognize them, even though they mostly don't look like anything you've
ever seen before. We also have whole stool DNA from this individual, so we can ask what
proportion of the whole stool DNA is comprised by the viral DNA in the purified sample. So
we find something like 5 percent of the DNA total looks like phage DNA, though this is
headed upward. We haven't added in the PacBio data yet. So, however, the genomes of the
bacteria and phage are much different in size, maybe differing by a factor of 100. So multiplying
by that, we infer that the phage outnumber their host in gut by about fivefold, at least
as a provisional first look, and, again, we have more work to do to make this all real.
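The back-of-envelope conversion from DNA fraction to particle ratio can be sketched as follows, using the provisional figures quoted above (5 percent of stool DNA is phage, genomes differ in size by a factor of 100). The function name and the assumption that all non-phage DNA is bacterial are illustrative choices, not part of the original analysis.

```python
def phage_to_host_ratio(phage_dna_fraction: float, genome_size_ratio: float) -> float:
    """Estimate phage particles per bacterial cell from DNA mass fractions.

    phage_dna_fraction: share of total stool DNA that is phage DNA.
    genome_size_ratio: bacterial genome length divided by phage genome length.
    Assumes every non-phage base is bacterial (a simplification).
    """
    bacterial_dna_fraction = 1.0 - phage_dna_fraction
    # Each unit of DNA mass holds genome_size_ratio times more phage genomes
    # than bacterial genomes, so scale the mass ratio by that factor.
    return (phage_dna_fraction / bacterial_dna_fraction) * genome_size_ratio


ratio = phage_to_host_ratio(phage_dna_fraction=0.05, genome_size_ratio=100)
print(f"Roughly {ratio:.1f} phage per bacterium")
```

Multiplying 5 percent by 100 directly gives the fivefold figure quoted in the talk; dividing by the bacterial share first nudges it slightly higher, which does not matter at this level of precision.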
So if that's true, we can start to think about predation rates. So material moves through
the gut continuously, so if phage are at a constant abundance, then they must be created
at a rate to replace the ones that are getting washed out as material moves through the gut.
So let's say the transit time for this individual was one day -- I'm making this up -- and estimate
that there are 10^10 bacteria per gram of stool, then there must be something like 5 x 10^10
phage per gram. So if a phage burst contains 100 phage, then you have to kill off 5 x 10^8
bacteria to supply that -- the number of phages that you're measuring that do seem to be at
a steady state. So that would say that something like 5 percent of all bacteria are killed
per day by phage predation. So, again, all these numbers are very soft, and an ongoing
project in the lab is to try to make these numbers more real, but it gives you a sense
that maybe a substantial fraction of all the bacteria in the gut are getting killed off
by phage daily.
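The steady-state argument can be written out as a short calculation. This is only a sketch under the same soft assumptions stated in the talk (one-day transit time, 10^10 bacteria per gram, fivefold phage excess, burst size of 100); the function and parameter names are illustrative.

```python
def daily_fraction_lysed(
    bacteria_per_gram: float,
    phage_per_bacterium: float,
    burst_size: float,
    transit_days: float,
) -> float:
    """Fraction of bacteria that must be lysed per day to hold phage constant."""
    phage_per_gram = bacteria_per_gram * phage_per_bacterium
    # At steady state, phage washed out per day must be replaced by new bursts.
    phage_replaced_per_day = phage_per_gram / transit_days
    bacteria_lysed_per_day = phage_replaced_per_day / burst_size
    return bacteria_lysed_per_day / bacteria_per_gram


fraction = daily_fraction_lysed(
    bacteria_per_gram=1e10,   # assumed, as in the talk
    phage_per_bacterium=5,    # provisional fivefold excess
    burst_size=100,           # assumed phage released per lysed cell
    transit_days=1.0,         # assumed gut transit time
)
print(f"{fraction:.0%} of bacteria lysed per day")
```

The bacterial density cancels out, so the result depends only on the phage-to-host ratio, the burst size, and the transit time; a smaller burst size or faster transit raises the implied predation rate.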
And then the last open question I want to sort of -- or new direction I want to kind
of introduce, we could call virome epigenetics. As you all very well know, the human genome
is subject to CpG methylation. Recently, hydroxymethylation was discovered and generated a great deal
of interest. Phage totally dwarfs that. There are dozens of kinds of covalent DNA modification
that have been reported in prokaryotic viruses. This lists a few of them. They get very exotic,
including alpha-putrescinylthymine, dihydroxypentyluracil;
here's glucosylated DNA, which is a characteristic of phage T4. There's a
gigantic zoo of DNA modifications that are present in bacterial viruses. They are known
to block attack by nucleases, contribute to gene control. We're guessing they have additional
functions also.
So what's cool with the PacBio method is that you can read out some of these modifications
by their sequencing method. So this just illustrates their technology. It's single-molecule, immobilized
polymerase, the template tracks through. What they noticed is if you have a modification,
this red base here, you can get a characteristic change in the kinetics of incorporation, and
different chemistries seem to have different effects on the interpulse interval and P-chitin
[spelled phonetically]. So they have a way of reading out at least some of the modifications
that are present in these viral genomes.
So we've looked -- started to look through this for the deeply-sampled subject I described.
We're seeing various kinds of modification shown by the bars on the sequences. For some
of them we can assign recognition sites. It looks like we've got one form at least that's
new because it's on G, and none of the other ones I showed you there were G. And as a general
summary, looking over all the contigs, it seems like 80 percent of all the viral contigs
are showing signs of covalent DNA modification. And so this looks like a really, really cool
area to begin to explore and try to understand the functional significance.
Okay, so that concludes what I wanted to tell you. I introduced the global virome and the
human virome. There is one heck of a lot of viruses associated with our bodies. The human
virome -- most viruses hang around longitudinally in the carefully-studied individual we looked
at, but some specific viruses changed a lot over time, and that may help understand why
humans are so different from each other. And open questions and future directions: assigning
viruses to disease efficiently using metagenomic data, dynamics of predation, and then, lastly,
viral epigenetics.
And so it's a great pleasure to again acknowledge my colleagues Gary Wu and Jim Lewis, and Sam
Minot led a lot of the viral work that I described, and Tyson Clark at PacBio has
helped us a lot also. So thank you very much for your attention.
[applause]
Male Speaker:
Okay, so, quickly, I'd like to ask you your current thoughts on the role
viruses play in horizontal gene transfer of microbiota.
Frederic Bushman:
Oh. Viruses are big players in horizontal gene transfer. There are typically
three main mechanisms in prokaryotes: transduction, transformation, and mating. So temperate phages
move genes around. They're known to be medically important because integrating phages can carry
toxins, adhesins that modify the phenotype of their bacterial hosts. So they're among
-- they're one of the major agents, but not the only major agent.
Owen White:
Okay, let's thank Ric and all of the speakers from this morning.
[applause]
We just have a few quick announcements that are important. The first is that at 1:30 sharp,
the afternoon session will begin, so we're now into the lunch and poster. Speaking of
the posters, today are the even number presenters and tomorrow are the odd numbers, and I think
Friday are prime numbers or something. So one of the benefits of being here physically
in person is that you get to vote on what you consider the best poster, after you've
seen all of the posters. And there's an NSA-proof secret compartment on the back of your badge
with a green piece of paper that is to be used for the voting.
So, again, I want to personally thank this -- the speakers. I had a really good time
hosting this morning's session, and we'll see you at 1:30.
[applause]