Multi-Omics of the Human Microbiome: Filling in the Missing Links - Janet Jansson

Janet Jansson: First I'd like to thank Lita Proctor for the invitation, and also the organizing committee. It's great to have a chance to present our research, and also to have the opportunity to give my own opinion about some future directions in the area. So I'll be discussing multi-omics technologies, and giving a few examples of how we use them in our own research. But first I'd like to give a little bit of background. We already saw this slide earlier this morning. I think we've seen it a couple of times already. And I think it's important to keep in mind that the information we get from these DNA-based studies is only part of the way towards where we want to go. So if we look at these particular graphs, we can see the 16S data that shows different body sites and the variation in the microbial community composition between different individuals; so, just a reminder that these are different persons, different body sites. You can see there's considerable variability in the composition of the communities, whereas the functional genes are relatively constant at this very gross level of examination of function. So this is a DNA-based, COG-level examination of function, and things look very similar. Sometimes, however, composition does matter. I did take out a few slides where I was going to show this, but I know Alex Khoruts is going to give a talk about this later, so I won't talk about it. But we do have examples where we know, for example, through fecal transplantation, that the community composition is very important for health, and so I'll let Alex talk about that. So this is actually one of my own photographs. I was just in Greenland a couple of weeks ago, and it's great to have your own iceberg to show. But what we know is just really the tip of the iceberg. We do know quite a bit of information about the community composition, if we specifically focus on the gut environment.
We know a lot about which microorganisms are there, which are the most dominant species, and how variable it can be across human populations. So this knowledge is starting to really consolidate. What we don't know is what lies underneath the water. We don't really know what these organisms are doing, just at a very basic level. Most of the microbes still, you know, they're not very well characterized. There are many that have not been cultivated. So in order to understand function, we do have access to omics tools nowadays. And for those of you that aren't familiar with these technologies, I kind of refer to this as the omics pipeline. So, depending on the level you look at, you can get different types of information about expression. So, at the very beginning of the pipeline, this is where we've really focused a lot of our work so far, and it's on the composition. So using 16S sequencing, for example, to understand the microbial community composition, so I would call that the microbiome. The next level is the metagenome, so sequencing not only the phylogenetic genes, but also the functional genes, so we know the gene composition. The next step would be to look at the RNA, so which of those genes are expressed. Then the proteins: of those expressed genes, which ones are translated and form proteins. And then, finally, the metabolites or the metabolome. And the metabolites are important for carrying out a lot of the reactions in our bodies. So I'd like to think a bit about these different kinds of omics tools. So if we think about genomics or metagenomics, this is information about gene content that has the potential for being expressed. So just because you see a functional gene, that doesn't mean it's being expressed. It has the potential to be expressed. And what is particularly problematic is that the DNA that you're looking at could be extracted from dead cells or from dormant cells. So the cells don't even need to be active or even alive in order to get DNA.
I'm not saying that this information isn't important, but it's also important to keep in mind that this is a limitation. If we go to the next level and look at RNA, that provides you with a snapshot of activity at that moment in time when the RNA is extracted. It's also very important to understand that that's the expression profile under that given set of circumstances and conditions. And, as we do know, cells are experiencing a lot of different kinds of regulation, and so not all genes are going to be transcribed. But at least you get information about activity at that moment in time. If you do look at the proteins, or the metaproteome, which would be the community proteins or complement of proteins, that provides evidence, then, that this protein has passed all of the regulatory steps at the RNA level, and also has been translated to produce this protein. A caveat with that is that you don't know if the protein is actually active in all cases if you do detect it. However, the genes must have been transcribed and translated to produce a protein. And I would say that looking at the proteins is better for assessment of microbial function. This does require an annotated gene database, so then you do need the metagenome information anyhow because, otherwise, it's not possible to identify what the proteins are. So what do we know from model microorganisms? So I'm just going to kind of go back a little bit here. This is an E. coli cell. And a lot of work has been done on systems biology of single organisms. And so this is a reference from Corbin from PNAS in 2003, where they detected a positive relationship between protein abundance and transcript abundance during exponential growth of E. coli. So I think this is a nice confirmation that the kind of information you get from RNA and protein is very consistent, and you can use both kinds of data sets, just exchange them.
However, if you look at the single cell level, and this is a reference from 2010 from Taniguchi, they found no correlation between messenger RNA and protein levels in single E. coli cells at a particular point in time. So I think that even with a single organism, we're still really at the beginning of understanding, at a systems level, how we can use these different kinds of information to understand function. Now, there hasn't been that much done in the microbiome yet using these kinds of tools, but there has been quite a bit done in the marine environment. And what can we learn from other ecosystems? Well, this is a relatively -- or a very recent paper from Mary Ann Moran, and what she did was she looked at the amount of macromolecules in a single milliliter of seawater. And here you can see the amount of genes, transcripts, and proteins from the same milliliter of seawater. So you can see that the abundances vary dramatically, and this is a log scale. So this is a quote from Mary Ann Moran. She said that "the most important factor responsible for the poor messenger RNA-protein correlations is the long half-life of proteins relative to messenger RNAs." And I like it that she actually did these calculations: a typical bacterial protein half-life is about 20 hours, which is about two orders of magnitude longer than a messenger RNA half-life. That means that most proteins persist in a bacterial cell long after the messenger RNAs that encoded them have been degraded. So, this is important also to keep in mind if you're using a different omics method. And again, I think this is a reason that I like proteins, because at least they're going to give you more information about the history of activity of those cells. So this was the first publication using this relatively new technology to look at fecal samples, so to understand what the protein complement was in human fecal samples.
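The half-life comparison quoted above can be worked through numerically. The following is a minimal sketch, not a calculation taken from the paper: it assumes simple first-order decay with no new synthesis, takes the 20-hour protein half-life from the talk, and reads "two orders of magnitude" as a factor of exactly 100. The function and constant names are illustrative.

```python
def fraction_remaining(hours_elapsed: float, half_life_hours: float) -> float:
    """Fraction of a molecule pool left after first-order decay (assumes no new synthesis)."""
    return 0.5 ** (hours_elapsed / half_life_hours)


PROTEIN_HALF_LIFE_H = 20.0  # figure quoted in the talk
# Assumption: "two orders of magnitude" shorter is taken as exactly 100x, i.e. 0.2 h (12 minutes).
MRNA_HALF_LIFE_H = PROTEIN_HALF_LIFE_H / 100.0

for hours in (0.5, 2.0, 20.0):
    protein_left = fraction_remaining(hours, PROTEIN_HALF_LIFE_H)
    mrna_left = fraction_remaining(hours, MRNA_HALF_LIFE_H)
    print(f"{hours:5.1f} h: protein fraction {protein_left:.3f}, mRNA fraction {mrna_left:.2e}")
```

Under these assumptions, two hours after transcription stops the protein pool is still largely intact while the messenger RNA pool has gone through about ten half-lives, which is the sense in which proteins record a longer history of activity.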
And I have to say that this technology has really been a major revolution in the use of proteomics for microbiome studies, because, up until that point, everything was based on 2D gel separations and extraction of spots and sequencing those. However, this is a shotgun approach, so from the human, the sample is taken. In this case, we use differential centrifugation to extract the bacterial cells. The cells are lysed, the protein is extracted, and then directly digested with trypsin into fragments. Those fragments are separated by 2D-LC-MS/MS, so it's completely gel-free. And then they're collected on this column, and go by electrospray into a very high mass accuracy mass spec. There are even better mass specs for this purpose now, but this was in 2009. So then you get these spectra, and those need to be searched against your databases. And this is where the metagenome data and also reference genome data are extremely important because you need to have those annotated genes, and you rely on exact matches to understand what your proteins are. So those that can be identified, you can then predict their functions. But in the case of hypothetical proteins, it's also possible to look at the sequence and to be able to do a hypothetical protein identification. So this is an average of the metagenomes that we had available at the time. So just looking at average COG categories of function from the metagenome level and comparing that to the average COG categories for metaproteomes. And what we can see is, if you look at the metagenomes, it's a relatively even distribution of COG categories. But if you look at the proteomes, it's really enriched in certain functions. For example, translation, energy production, and carbohydrate metabolism, and these are functions you would expect to be dominating in the gut environment because the microbes have to, for example, metabolize carbohydrates.
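The reliance on exact matches against an annotated database, described above, can be illustrated with a toy example. This is a simplified sketch, not the search software used in the study: real search engines match measured spectra against predicted fragment masses, whereas this sketch compares peptide strings directly. The gene names and sequences are invented, and the digestion rule is the common simplification that trypsin cleaves after K or R unless the next residue is P. It needs Python 3.7 or later, where `re.split` accepts zero-width matches.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical annotated database (gene name -> protein sequence); all entries are invented.
REFERENCE_PROTEINS = {
    "hypothetical_gene_A": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "hypothetical_gene_B": "MSLNPEQLKAAGVTPWDIRLLVEGK",
}


def tryptic_digest(protein: str, min_length: int = 6) -> List[str]:
    """Cleave after K or R, except when the next residue is P; drop short fragments."""
    fragments = re.split(r"(?<=[KR])(?!P)", protein)
    return [fragment for fragment in fragments if len(fragment) >= min_length]


def build_peptide_index(proteins: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each predicted tryptic peptide to the genes that could have produced it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for gene, sequence in proteins.items():
        for peptide in tryptic_digest(sequence):
            index[peptide].add(gene)
    return index


def identify(observed: List[str], index: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Exact lookup: a peptide with a single differing residue is not identified."""
    return {peptide: sorted(index.get(peptide, set())) for peptide in observed}


peptide_index = build_peptide_index(REFERENCE_PROTEINS)
# The last peptide differs from a database peptide by one residue, so it stays unidentified.
observed_peptides = ["QISFVK", "AAGVTPWDIR", "QISFVR"]
matches = identify(observed_peptides, peptide_index)
for peptide, genes in matches.items():
    print(peptide, "->", genes if genes else "no exact match")
```

The unidentified third peptide shows why a matched metagenome matters: if a person's own gene variants are missing from the database, the corresponding peptides go unassigned.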
So this is a good sign that the information we're getting from the proteome is more indicative of the function that is actually being carried out in that system. Another nice thing about doing the proteome is that, at first -- I always say this, but as a microbiologist, I thought that human proteins were contaminants, but they actually turned out to be very useful because you can study the microbiome's interaction with the host by looking at the human proteins. So we get the human proteins for free, at least the proteins that are attached to the bacterial cells because we enrich the bacteria. And when we look at the human proteins, the largest groups of proteins are usually digestive enzymes, and those involved in cell adhesion. However, we can see these very interesting proteins, including antimicrobial peptides. And this is just an example of one protein that was identified early on. It's DMBT1, which is thought to play a role in cellular immune response. These little blue bars just show the peptides that were lined up along this protein, or the gene for the protein. So that's proteomics. What about metabolites? Well, metabolites are the ultimate proof of processes and pathways that have occurred. And it's really the final signature of metabolic processes. The thing that's different about looking at the metabolites compared to the other kinds of omics is that it's not so easy to key it to a particular organism. You don't have that way to track back to a gene. Instead, you're dependent on massive data correlations. So metabolomics is, of course, very important. When you consume food, the food is digested. You can have rather insoluble carbohydrates, or more soluble polysaccharides, oligosaccharides, and depending on the organisms that encounter those in the intestine, you're going to have different kinds of metabolites that are produced.
And you can have primary degraders and also hydrogen utilizers that are consuming the hydrogen that's produced, and eventually the metabolites, some of them are used by the community, but some of them are actually taken into the systemic circulation and can have impacts on the body. So that's a background about the omics technologies, and I will give you a couple of examples of projects that we've carried out where we used multi-omics approaches. The first is for IBD cohorts. We have a twin cohort, and also a longitudinal study, where we looked at microbiomes, metagenomes, metaproteomes, and metabolomes. And the second is a dietary study, which is looking at microbiomes, metagenomes, metaproteomes, and metabolomes. The first was an earlier study, so it was using the 454 sequencing platform, whereas we migrated over to Illumina for the second. So first I'll show you the example for inflammatory bowel disease, and this is a disease that has many different consequences for the body, and it has a very complex etiology. But one thing that is of interest for this meeting is that there's often been reported a dysbiosis, or an altered microbiome, in individuals that have inflammatory bowel disease compared to healthy persons. So this is just an example of publications that have reported dysbiosis. There are many more papers than this, so these are just a few examples. I know you can't read it, but that's fine. It just lists a lot of different bacterial species that are either reported to be more prevalent in individuals that have inflammatory bowel disease, or lower in individuals with inflammatory bowel disease. And so I summarized some of the key points here from the publications, and one is that dysbiosis in IBD is characterized by an overall decreased diversity of bacteria in the gut compared to healthy people, and typically, a greater relative abundance of Proteobacteria, such as Enterobacteriaceae.
And another important point is there's often a loss of beneficial microbes, such as butyrate producers and other producers of short-chain fatty acids. So the study we did was to study twins. The reason for doing that, and I think we heard a beautiful example from Ruth Ley this morning, is that you have these genetically matched individuals, and therefore you can discount a lot of the confounding impact of genetics and early childhood exposures when you're looking at dysbiosis. So it's a Swedish twin cohort, 46 twin pairs. They included healthy twins; those that had ulcerative colitis, both those that were discordant for the disease, so the healthy one is a smiley face and the sick one is not smiling, and those that were concordant, so both were sick; and the same for Crohn's disease. We had discordant twin pairs and concordant pairs. So these were all the tools that were used on the same samples. So it was the same fecal sample, and we used everything on the same samples. It included, also, biopsies taken from five locations. So we got a lot of information from these individuals, including all of the different parts of the pipeline. For the microbiome, at that time, we first started with a fingerprinting method called T-RFLP. We used qPCR, but then we moved to pyrotag sequencing, the metagenome pyrosequencing. And then did the shotgun proteomics and metabolomics. So this is a T-RFLP profile survey of 90 different children. The reason I like to show this is that it was one of our first indications that every single one of these children had an individual fingerprint. That's the only reason I'm showing this here. But when we looked at the identical twins, this was amazing to me, that they were so similar, and these were adults that had lived apart for decades. And the T-RFLP profiles of their fecal samples were very, very similar. And this also supports what Ruth Ley was talking about earlier today.
By contrast, if we look at these discordant twin pairs, their fecal microbiomes were very different. So this, again, is an indication that there is a dysbiosis. There is something different in these individuals. So if we look at this pipeline again, and this is just looking at data from one pair of healthy identical twins, and the correlations between the two individuals in the twin pair, we can see at the microbiome level, we have a very high correlation between the OTUs present, an r-squared of 0.9. If we look at the proteome, we start to get more of a separation, some more individuality, an r-squared of 0.396. And at the metabolome, even more individualized, an r-squared of 0.301. So this means that as you go through this pipeline, you start to get more and more individual characteristics. It gets to be more discriminating. So when we looked at the 16S data, we could see very distinct clusters. So this is all of the patients that have inflammation in the ileum, which I'll call ileal CD, and sometimes I abbreviate ICD. They clustered separately from those that had inflammation in the colon, so colonic CD, which I sometimes abbreviate CCD, and from those that were healthy, which are in green, and the blue all had ulcerative colitis. Now, this grouping was much more significant than the twin-pair similarities. So, okay, the healthy twins did cluster together, but disease was the major clustering factor over zygosity. So once we saw this data, we focused more on the disease comparisons. Now, the reason we looked at the biopsies was to study the mucosa-associated microbiota. And here, these are just different locations, ileal and distal colon, but what we basically saw, when we included the biopsies and fecal samples, was again this distinction between ileal Crohn's disease and those that were healthy or had colonic Crohn's disease.
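The twin-pair r-squared values quoted earlier (0.9 for the microbiome, 0.396 for the proteome, 0.301 for the metabolome) compare the same set of features measured in two individuals. The talk does not say exactly how they were computed, so the following is only a sketch under the assumption that r-squared is the squared Pearson correlation between the two twins' abundance vectors. The function name and the abundance values are invented for illustration and are not data from the study.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long abundance vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length with at least two features")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("a constant vector has no defined correlation")
    return covariance ** 2 / (variance_x * variance_y)


# Invented relative abundances for the same five features in each twin of one pair.
twin_one = [0.40, 0.25, 0.20, 0.10, 0.05]
twin_two = [0.38, 0.27, 0.18, 0.12, 0.05]
print(f"r-squared between the two profiles: {r_squared(twin_one, twin_two):.3f}")
```

Running the same function on each layer of the pipeline for one twin pair would give one value per layer, which is how a decline from the microbiome to the metabolome could be tabulated.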
And also the individual biopsies and fecal samples clustered together. So when we look at the composition at a phylum level -- these were individuals that were healthy or had colonic Crohn's disease, just averages, ileal biopsies and fecal samples -- we can see that there are differences in the biopsies compared to the fecal samples. If you look at the blue, which is Lachnospiraceae, you can see that it's much greater in the biopsies of the healthy, which is H here, compared to the fecal samples. So there are differences. But still, when you look at one person, their biopsies cluster with their fecal samples. And so we were interested in seeing which particular organisms were higher or lower in abundance. Now, I already told you that some of these butyrate producers are known to be more abundant in healthy individuals compared to those with some of the IBD phenotypes. And definitely with ileal Crohn's disease, this organism is basically absent in the biopsies, in either the ileum or the colon, and in the fecal samples, compared to healthy and those with colonic Crohn's disease. So here again we see that separation between these different Crohn's disease phenotypes. Whereas other organisms were more abundant, and this is an example of E. coli, which -- these are different biopsy locations, the five different locations -- was much more prevalent in Crohn's disease compared to healthy. And we found one Ruminococcus albus that was higher in the biopsies in those with ileal Crohn's disease, compared to the other locations and to healthy. So that was looking at single-time-point studies. A gap is really to look at this on a longitudinal scale. So previous studies have focused only on these single time points. This really provides limited insight, especially for something like IBD, where you can have a flare-up, remission, and different things going on over time, drug therapy. IBD has active and quiescent disease states.
Therefore, it's really important to have a temporal study to properly assess IBD. So, more recently, we did a longitudinal study with 139 subjects, and up to 10 time points were collected for these individuals, one every three months. And during that time we have, from our clinical collaborators, information about remission, drug therapy, et cetera. So what we do find when we look at all of this data is that we still get this major clustering based on disease. It might be a little bit hard to see. But this is ileal Crohn's disease here in purple. Ulcerative colitis is the light blue, colonic Crohn's disease in darker blue, and healthy in green. So even when all of these time points are taken into account, we still get this clustering, but this is the super interesting thing here, if I can get it to work. So Rob Knight's group did this for me. This is looking, then, at the trajectory of these individuals over time. And so if you follow the orange and the yellow, so the healthy and the ulcerative colitis, they are starting to form a cluster here on one side, whereas the different IBD phenotypes are varying dramatically. They're jumping back and forth in this space over time. And so each of these segments represents a three-month sampling period. And here you can see another healthy person is still continuing in this plane. So when it is finished rotating, here you can see the healthy and the ulcerative colitis are almost as flat as a pancake on that plane. This is where they rotate. But the IBDs, they inhabit a different space. So I think it's really important to understand what's going on there. If you look at individual temporal dynamics, you can see -- this is a healthy person. There is some variability. For example, there is a bloom in this Bacteroides, I think, I can't really see the color. But there's some difference over time, but not nearly what you see when you look at the IBD phenotypes.
So here's an example where there's a real enrichment of Enterobacter, and then the Bacteroidaceae come in, and then Lachnospiraceae, so it's a lot more dynamic. And it's individual, though. You have a different pattern for each person. So what we're currently doing is the metagenomes and metaproteomes for five of these patients at five time points. And so I don't have that data yet, but that is ongoing. So -- I have to go faster. For our HMP Demonstration Project, we examined a subset of these pairs that had matched metagenomes and metaproteomes. And these are just showing the proteome similarities in the twin pairs, so we see a lot of similarity with the healthy twin pairs and with the colonic, but much less with the discordant twin pairs. And the metaproteomes, they cluster according to disease phenotype, and you can see that here as well. This is ileal Crohn's disease, healthy, and colonic Crohn's disease. And when you look at individual pathways, so this is the lowest phylogenetic level where we can identify the proteins and what they're assigned to, we can see that all of these pathways are less abundant in ileal Crohn's disease at the protein level. But there are some proteins, especially outer membrane proteins, that are more abundant in Crohn's disease. That's just saying what I just said. If we look at the human proteins, we find in healthy more proteins that function in mucosal integrity, and in the ileal Crohn's disease, a higher abundance of proteins involved in inflammatory response, this human alpha-defensin, and pancreatic enzymes, so we think that's demonstrating a defective epithelium, or a leaky gut symptom. Looking at the metabolites, so, again, just to emphasize, these are from the same samples, so the pellet was sent for proteome analysis and the fecal water was sent to Germany for mass spec analysis of the metabolites. Again, the same pattern, we see the clustering.
Here, red is colonic Crohn's disease, blue is ileal Crohn's disease, and green is healthy. And so we get this very distinct clustering. And this is just showing some of the differentiating metabolites. We had so many differentiating metabolites. Almost 8,000 metabolites significantly differed between these, and we had over 18,000 metabolites and most of them are unidentified. One example is bile acid biosynthesis. That was higher in Crohn's disease. And we think this may also be due to inhibition of bile acid absorption by inflammation. So I just want to mention this study. I won't have time to really go through it, but this is an ongoing dietary study funded by General Mills and NIH NIDDK. And we're looking at different high carbohydrate/low carbohydrate diets in a crossover study. And this is just showing the study and the different kinds of analysis. One thing we find is that -- let's see -- with a high resistant starch diet, we do get lower insulin resistance. But these are different patients. Now, we're interested in differences in the microbiome, and so with these different arms of the diet, when you do the crossover, there is definitely a significant difference between the high carb and low carb, and also with the high resistant starch and low resistant starch in both branches. And we find our favorite, Faecalibacterium, is enriched in the high resistant starch diet. And these are metabolites that were detected, and we do find the metabolites separate according to high resistant starch diet. Okay, I'm going to have to finish here. So I need to mention where we should go from here. So I think the current grand challenge is how to analyze all of this multi-omics big data. I'm so thankful there's the call coming out for big data analysis because this is really an enormous amount of data, and we generate it and we want to correlate it.
And what we want to avoid is this, interpreting the hairball, because often I get the data back and it -- this is what it looks like. That's an example, an anonymous hairball. So I think that what we need is more multi-disciplinary collaborations with microbial ecologists, clinicians, bioinformaticians, biostatisticians, to be able to really dig down in this data. We have a huge resource of data, but we need to be able to analyze it. And I'd like to conclude with acknowledgements. Thank you very much. [applause] Female Speaker: We have time for one question. And -- no? If not, then we can move on to our next speaker. We thank you, Dr. Janet Jansson, for an excellent presentation, and we're moving on to our next speaker, Dr. Dan Rudolf Littman, from NYU and the Skirball Institute, and he will talk about Approaches for Host Immune and Microbiome Studies.