Owen White: I've been -- I've seen Rob give presentations
like that several times, and -- with the moving ball through the PCA cloud, and I've been
trying to come up with a good name for that. And at first, I thought it could be just "Ping-Pong
Ball in Hail Storm," but I actually realized that it's just -- it's all the topics that
Rob wants to cover, and then the Ping-Pong ball is just touching each one of them as
he sort of proceeds forward.
Okay, I have a very, very gentle commentary for everyone, which is I respect and admire
you all tremendously, but I think that there's a way in which, when you've been talking about
your charge, which is to tell us a little bit about the gaps, you've been, in a sense,
a bit too polite, and I really -- I mean this sincerely, when Lita and I talked a lot about
what we wanted to accomplish with this meeting, we really had in mind that one of the things
that you would do -- and it's certainly happened. People are definitely touching on this. It's
certainly happened -- that we were hoping that we would get schooled in really, no kidding,
what are the obstacles to you getting your science done? And what -- if you just could
spend a little bit of time, like, really letting us know about what are the important things.
And I say this respectfully, Curtis, when he presented, he said, "Well, let's talk about
the gaps in our knowledge," and other people have talked about the gaps in that context.
And I -- and that's great, and I think it's just that we're all sort of getting to know
each other in this new field, and we're behaving in a very polite way, but I would -- I strongly
encourage people to -- if you have time, to definitely dedicate some thoughts in your
slides to being very direct about what types of things are preventing you from accomplishing
your research when you're starting to talk about gaps. And on that subject, I will do
what I usually do in most of my presentations, which is to show almost no data whatsoever,
and the entirety of the talk will be about some of the gaps, or limitations, or at least
concerns that I have about the field as it goes forward.
So to introduce myself, I just want to say that I am the PI of the Data Analysis and
Coordination Center, which was jumpstarted at the same time at the beginning of the HMP.
We do have funding for another year, so we're around. And we're trying to do the best that
we can to help meet the needs of the microbiome community, which is obviously getting a lot
larger. We supply lots of resources. We supply lots of training. We also have a computational
infrastructure that we really would like to encourage people to use. We've developed virtual
machine-based software that could be run out in the cloud. And that's it. That's my presentation
on the DACC. I'm not really going to talk about that much more. When I do talk about
other issues, feel free to editorialize with me later on about what might be issues that
the DACC could take on in our last year as we go forward.
So another way to introduce myself is to say that I was involved in this fabulous constellation
of working groups that were associated with the Human Microbiome Project, and you can
just kind of see them here. But there were a number of different groups that got together
for data analysis, and it's a pleasure to be presenting after Rob Knight and Curtis
Huttenhower, who contributed enormously to the data analysis of the HMP. And one of the
things that we did for the HMP in our data analysis was different ways to start examining
the community composition of the healthy human subjects that were studied. And this is by
way of saying I'm going to look at a different community composition. I would like to talk
a little bit about the community composition of all the scientists in this room. So that's
a new meaning for the term microbiome community composition.
And to do that I just first would like to make a point. This may look like I'm showing
you data, but I'm not actually doing that. This is a slide that demonstrates what -- the
size in terabytes of the sequence that was generated by the HMP, and the only point that
I really want to make from it is just by talking to two people at this meeting, they've already
warned me that there will be some very -- much larger bubbles that are coming. This was around
four terabytes. They've already told me that there's going to be some data sets coming
along that are a much larger size, but we can also expect that there will be lots of
small data sets, as described by a lot of the presenters here, and there's probably
going to be a great diversity of them, okay?
So, backing up, let me talk a little bit about sort of two extremes of the NIH approach for
how they get science done, and this is exemplified by Francis Collins's great legacy of establishing
these really large consortia. And some characteristics of these consortia are that they tend not to
be as hypothesis-driven in comparison to an R01 study, and they tend to be very, very
large engineering projects, like completing the genome of some macro species. And they
tend to be -- the source of funding tends to be from one institute, so it's much easier
to obtain kind of a top-down control about the way people go about things, like collecting
samples, or making the data available, or just, you know, getting very simple things
done, like the way all the DNA would be prepped, for example, in order to do a large sequencing
effort. So that's one approach: the consortium style of NIH research.
All right, here's a word cloud of some microbiome text, and let's just say what we do is really
diverse. We do a lot of different things, and that's the only thing I was trying to
demonstrate with this. And if you look at -- I won't -- you obviously can't read it.
We could -- I could -- happy to give this to people, but this is a table of the current
RFAs that have now emerged, all of which are associated with microbiome work. So there're
three -- there are four that are out of date. There're 20 that are out there to support
microbiome work. And the good news, as Francis mentioned, was that there are lots of different
institutes that are taking this up, okay? So this is sort of to contrast in terms of
how consortium-style science might be done. This is to convince you that -- this is a
table to just say that there're a lot of different diseases that are associated with this, and
Curtis talked about this, and others -- people -- others talked about it, so I don't really
have to go into that.
And then when I did a quick survey of all the session speakers and made a word cloud
from the different departments they came from, there were lots of different disciplines that
they all came from. So if I'm going to characterize the microbiome community, one of the things
that I'm going to say is that we are very diverse in research areas, and there's implications
to that. We're supported across lots of different institutes. We tend to be hypothesis-driven,
which is great. We're studying many diseases in many systems. We're multi-disciplinary,
and we are both big data generators, for some of you, and then there's also going to be
a diversity of, let's say, smallish data generators. So that's our characterization of the community,
and there're implications to that. And don't get me wrong; I love you all enormously. This
is -- I'm not working up to some criticism that you're somehow doing it wrong. I celebrate
the fact that you are coming from many different disciplines, you'll have many different levels
of expertise, you're asking many different questions, and you're hypothesis-driven; that's
great, okay?
So, contrast to the consortium style, and that's one way that we could go about it,
or we could talk about this diverse population. And one of the things that was attractive
about the consortium style is that there were common processes behind where a sample came
from. There were common approaches for the way that we dealt with data. It would all
get released. You know, there was metadata; it was pretty easy to make available to the
community. For these, when you did the cow genome it was obvious -- there was one straightforward
protocol for identifying that DNA and how it was getting produced. There was also a
centralized computing infrastructure; so all these members of a consortium were beneficiaries
of the large genome centers or other centers that supplied computing equipment to help
them get their job done. So that was really great, okay?
We're not so much like that. This group of people are not so much like that. You don't
have one computing infrastructure that you count on. You don't have one protocol that
you could count on for the way that you go about isolating DNA, and there're some implications
to that. So I'm going to humbly ask if there are ways in which we could be a consortium
of sorts that is coordinated in certain ways, and the only reason that you'd want to belong
to a consortium is if it was a benefit to you. So let me try to make an argument for
how it might be a benefit to you. If we, for example, engaged in some type of protocol
standardization, there might be some increased certainty about data, there might be an increased
usability, and also, more importantly, reusability. If we've got lots of R01 researchers out there,
and all of our budgets are really tight, wouldn't it be nice if we could count on being able
to combine the data from other experiments to increase the ultimate power of what we're
working with? And Rob just gave exquisite examples about how it would be nice to be
able to reuse data.
Wouldn't it be nice if we, as a large consortium, had a streamlined IRB process, okay? Wouldn't
it be nice, as a large consortium, if you had somewhere to go when you had computational
needs, and also, one of the consequences of you coming from different disciplines is you're
going to have different levels of expertise with knowing how to generate lots of those
beautiful figures that you've seen already. And I would like it if everybody in the room
knew exactly how to do that, and I would hazard that you don't, okay? And wouldn't it be nice
if there were some common training?
So this -- it was already said by Rob very well. Rob speaks very fast, and he skated
over some difficult issues, but if you want to get all the stool samples that are out
there, at some stage you're going to have to be presented with a thing called the Short
Read Archive, and this is an example of when you go to the Short Read Archive and you type
in something like "human microbiome stool," and you get these entries, and it says, "Results:
1 to 20 out of 746 pages." Okay?
There's a lot of problems associated with just trying to retrieve data from SRA, and
it's not that I want to kick SRA in the shins, it's that they have a very difficult job,
and they're being supplied with data that is incomplete, okay? So we have -- there's
that list, okay, and there are many ways in which you have a type of uncertainty about
that list, okay? You don't know a whole lot about its origin. You don't know where those
samples were prepped. You don't know what patients they came from. You don't know how
the library was made. If there were primers used, you don't know what those primers necessarily
were. There's other things, though. You don't know if there is some generous researcher
out there that collected those samples and put them in a refrigerator, and wants them
to be available. Same topic: They may have been recruiting subjects or volunteers for
a study, and those people may still be available to -- for downstream studies, and we don't
know that. We don't -- when you're going to this list, you don't necessarily know if the
data has a publication behind it, or if that publication was cited by other downstream
publications, okay?
Then this is a big one that concerns me: You simply don't know the quality of the data
in one -- from one study to another. When you're presented with that list, okay, you're
presented with a list, it came from stool, but it doesn't tell you if there were any
types of errors along the way in terms of generating it. It's very difficult to pull
out things like patient phenotype, and this is the one that really startles me a lot.
This is true. Right now there are lots of data sets that are in SRA that were associated
with a study, but you don't know which data set was associated with a disease, okay? And
that's astonishing to me, okay? So there are issues in terms of data uncertainty, and I
just want to call your attention to it.
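To make the uncertainty described above concrete, here is a minimal sketch of a completeness check over repository records. It is an illustration only: the field names (`sample_origin`, `prep_site`, and so on) and the example accessions are hypothetical and do not reflect an actual SRA schema.

```python
# Hypothetical metadata fields a reuser would want to see filled in.
# These names are assumptions for illustration, not a real SRA schema.
REQUIRED_FIELDS = [
    "sample_origin",
    "prep_site",
    "library_method",
    "primers",
    "publication",
    "disease_phenotype",
]


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in one record."""
    return [f for f in required if not str(record.get(f) or "").strip()]


def completeness_report(records):
    """Map each record's accession to the list of fields it is missing."""
    return {r.get("accession", "<unknown>"): missing_fields(r) for r in records}


if __name__ == "__main__":
    # Made-up example records.
    example = [
        {"accession": "EX1", "sample_origin": "stool", "primers": ""},
        {"accession": "EX2", "sample_origin": "stool", "prep_site": "Lab A"},
    ]
    for accession, gaps in completeness_report(example).items():
        print(accession, "is missing:", ", ".join(gaps) or "nothing")
```

A report like this would not fix incomplete submissions, but it would at least show a reuser which studies can be combined with confidence.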
Okay, so I'm arguing that there are some benefits to increased coordination, and I'll talk a
little bit about them, okay? I think that with improved submission standards and with
provenance, it would be possible to learn more about your data. If you wanted to know
if a sample was available, I would say that we might be able to do something like create
an investigator registry and to track things like biosamples. That registry might also
tell you about volunteers. With respect to publications, I'm going to show you an example
of being able to track investigator publications. There's also ways that we could do large-scale
QC on those different data, so you'd have information about it. We could get smarter
about improving the way that people are submitting data to dbGaP, and deal with this bogey of
essentially being able to track things like disease phenotype.
So I'm arguing to you that maybe if we start pulling together and operating as a group,
these things may happen with the existing data that's out there. We have a fighting
chance of that happening. So you can imagine me: I stared the data management for the HMP
in its ugly face. That was four centers that were getting together and trying to herd data,
and it was difficult enough. As I imagine this diversity, this wonderful, beautiful,
tremendous rainbow diversity of researchers out there that are going to be doing all these
different types of studies, it really worries me if we aren't trying to be a bit more coordinated
about this, and I just think that there are certain benefits to that. Has my accent changed
to sound more like Rachel Maddow?
[laughter]
I really feel like I'm channeling her right now.
[laughter]
So with respect to sequence submission, I humbly suggest to you that it might be useful
to have a data-coordinating center that was helping as a submission broker. That coordination
center -- there may be many different models for that sequence submission. We could just
give you improved submission tools. We could be giving you better submission tools for
sequence and metadata standards, and things like provenance. I was working on an effort
to help -- to get letters of support from different journals that were actually quite
interested in enforcing things like metadata standards, and having people submit them at
the same time as when they were getting their paper published. And also, they could describe
things like -- would we -- we could require investigators to describe things like sample
and volunteer availability. So let me put out there that there's lots of NIH staff here
who are also putting out their RFAs. You may want to be considering requiring that people
are making this happen, or at least describing if that's happening. That, to me, is a type
of recommendation that I'd like to see emerge out of this meeting.
The journals could also play a role in capturing protocols, but the nice thing about that is
that all of this -- this could all be described as supplemented data that would be hosted
at journals, which I think would be a good thing. It'd be sort of a distribution of where
that data resided, in addition to being at a coordination center.
Okay, now when I say the term investigator registry, I don't want you to think I'm using
-- I'm going to slap a radio collar on you and then require you to do a bunch of things.
I shouldn't really use the term investigator registry, but I would like to maybe say a
participant -- consortium participant list, or something. And I would argue that it'd
be useful to be tracking PIs based on, maybe, the literature that's out there, or the grant
funding that's out there, and contacting them and asking them do they have volunteers, or
do they have biological samples, do they have publications, do they have protocols? And
I think that the investigator registry should -- could also play a role in helping with
coordination of IRB approval.
So I'll just make a point that this is something called the NIH RePORTER, which is actually
a very nice system. It's obvious that there's a lot of resources that have gone into it.
I've contacted them and asked them if it's possible for an independent entity like a
coordinating center, or somebody else, to be adding more data to this, and they said,
"Yes, there are ways." It didn't seem like the greatest model in the world, but they
said that it would be possible. So you can imagine that this could basically be a good
use of the taxpayers' money. Here's this great system, and we could use it as something of
an investigator registry, and we could attach keywords to investigators and say, "Is this
person a member of the consortium or not?" And then you could pull down all the people
that were and get a lot of information. So that's just one thing that's out there.
I think that there are ways to capture names of investigators. There's a brilliant young
man named Illia [spelled phonetically], who works with Ari Forney [spelled phonetically],
and he's developed methods for culling information from PubMed by capturing data from abstracts,
and doing a little bit of text mining, and essentially creating networks of these investigators
that are associated with these investigators, or depending on what keywords you might find,
you might be able to pull out these are all the investigators or all the publications
that are associated with a vaginal microbiome study, and I think
that that system, which is called CoPubNet, could also be capitalized on
to start sort of seeding the investigator registry that's out there. This is a network
that comes out of it of different people's names that are associated with a particular
research project.
I'll also make the point that there are plenty of people out there who have actually written
software for things like volunteer registries. There are some great efforts out there, and
we could be using software like this and using it to the benefit of this consortium if we
wanted to, so that's out there. I'd also like to point out this effort. This is an effort
by the CTSA program -- the Clinical and Translational Science Award program -- and it's called IRBshare.
And the IRBshare is one creative solution to the way that we might be dealing with IRBs.
There's probably people who have a lot more experience with IRBs than I do out in the
audience, so I don't want to insult anybody. There's -- I'm sure that there're plenty of
policy obstacles, and there might be some details of this that I'm not getting absolutely
right. I'm just suggesting to people that maybe we could be a bit more creative about
things.
And the way that IRBshare works is that if there's a local IRB -- well, this -- first
I should also say that this is strictly for multi-institutional sites -- multi-institutional
studies when there's just one study and multiple institutions, okay? And in that scenario,
normally each investigator has to go through their own IRB. What they do here at IRBshare
is if there's one IRB and they've gone through a process, they could submit all their documents
to this -- it's an actual server where they get all these documents, and then they've
got different people that sit on a centralized IRB and go through the review process, okay?
And so the common institutions can submit review documents. They sort of split up the
review process, and the nice thing is that it promotes consistency and compliance, and
kind of eases the process of IRB, because you can see, "Hey, this is how another institution
handled it." And so I'm just offering to people that there are things that, maybe as a consortium,
we could do.
Again, you could be presented with this list, and there's a really big issue, and Rob talked
about this, that you could have lots of samples, and you just don't know that much about the
metadata behind them. You don't know that much about the quality or many other things.
And we went through an exercise with all the demonstration projects for the HMP where Steve
Sherry at dbGaP looked at patient variables; that's what these are. These are different
patient variables that were collected and submitted to dbGaP, and he colored a column
if a particular study had a variable that was in common to some other studies. And what
-- the thing that was really astonishing is things, even like age, or height, or if the
people were smoking; all those variables were tracked in a different way, okay? So it was
very hard to retrieve: "Give me all the studies that were associated with people who had taken
a certain antibiotic."
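The problem described above, where the same patient variable is recorded under a different name in each study, can be sketched as follows. This is a hedged illustration, not the dbGaP or PhenX method: the synonym table, column names, and study labels are all invented for the example.

```python
# Hypothetical synonym table mapping study-specific column names to one
# canonical variable name. A real effort would curate this per study.
SYNONYMS = {
    "age": {"age", "age_years", "subject_age"},
    "height": {"height", "height_cm", "ht"},
    "smoking": {"smoking", "smoker", "tobacco_use"},
}


def canonical_name(raw):
    """Return the canonical variable for a raw column name, or None."""
    key = raw.strip().lower().replace(" ", "_")
    for canonical, variants in SYNONYMS.items():
        if key in variants:
            return canonical
    return None


def shared_variables(studies):
    """Return the canonical variables that every study records."""
    per_study = [
        {canonical_name(col) for col in columns} - {None}
        for columns in studies.values()
    ]
    return set.intersection(*per_study) if per_study else set()


if __name__ == "__main__":
    # Made-up studies with inconsistently named columns.
    studies = {
        "study_a": ["Age", "HT", "smoker"],
        "study_b": ["age_years", "height_cm", "bmi"],
    }
    print(sorted(shared_variables(studies)))
```

Only after a mapping like this exists can a query such as "all studies with subjects who smoke" be answered across studies.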
And I found that surprising, and I just -- Rob already mentioned it: We have a standardized
system for describing metadata, and I think it should be exploited. And we also have something
called PhenX. PhenX is a study that's sponsored by the NIH that's meant to be dealing with
the process of converging on common variables across studies. So the way that they do this
is they establish a working group, the working group looks at a small number of measures
-- a workable set of measures, just like I was showing you with that previous slide from
Steve Sherry, and they get input from the research community, they review that data,
and then, eventually, they make a final set of measures. And they also have published
a tool called PhenX, which assists people with the process of marking up their data
dictionary from their clinical studies -- it's actual software that helps you do this and
helps you make sure that it's marked up the proper way, and then it helps you create the
submission documents to dbGaP, okay? This is another thing that, as a group, as a consortium,
we could be considering.
Okay, so, you know, imagine, if you're a member of this registry, imagine if you decided,
"Okay, I'm going to drink the Kool-Aid. I belong to this consortium." We could do things
like help you with management of IRB forms. We could help you -- if we have that Party
A is searching these researchers -- or these subjects with these variables, and you gave
us your IRB and told us what variables you were interested in, we could alert you to
the cases where there were study participants who would like to participate in your study.
All -- we could help you with, you know, tracking publications. We could help you manage stuff
into SRA. This is the sort of scenario we could be working towards.
The last thing I just want to mention very quickly, and Rob already talked about this,
this is a slide that was created by a group of people that were trying to standardize
assays for a completely different domain. And the point was that there's a lot of
give and take, but they established working groups where eventually they made harmonized
protocols, which does not mean identical protocols, harmonized protocols that meant lots of labs
could be generating data that made -- the resulting data was comparable between studies.
And I would argue that we should establish some working groups that were really setting
about the business of doing this, and Rob gave some great examples of how, for specific
domains, like with protocols having to do with stool, protocols having to do with different
body sites, we could establish some harmonized protocols.
Okay, so I'm just going to skate forward really quickly. I just want to say that at one point
we performed an email poll asking the community what types of analysis methods they needed.
The most popular response was metagenomic assembly. So we held a two-hour webinar given
by two luminaries in the field on assembly, and 109 people sat in on that seminar to listen.
It was a fantastic success, and I just want to bring that up. That's one of the things
that the DACC is doing, but that gives you an idea of the hunger that's out there for
training, and there needs to be more of it.
Okay, so for -- very quickly, if I were going to identify gaps, I would say we have gaps
in training, okay? You're experts coming from lots of different fields, and you may not
know how to do some of the statistical analyses that you saw presented today, or you may not
have the computational equipment to be able to do it. So we need more training.
I would argue that some -- I shouldn't use the term PI Registry. We should have some
type of consortium centralization site that is tracking lots of information of this. There
are lots of different types of QC that we could be doing on all the data that's in SRA.
Another big gap that we have is, and I only heard one person really mention it so far,
is that there aren't a lot of resources out there to help people do processing or adding
value to the data that exists at SRA. We clearly could be going a long way with harmonization
of protocols. Imagine if all the data that you were looking at was stamped with a bunch
of standardized protocols that it was derived from, and you could, with confidence, make
some of the PCA comparisons that Rob was showing, and it would just simply be adding
power to your own data.
There are gaps in terms of how people go about being able to submit data to the different
repositories that are out there, and I'd really like to see some progress there.
Okay, so I don't have an attribution slide. I just want to give thanks to all of you and
everybody in the audience for hearing me out, and we'll just take it from there. And there's
the names, so...
[applause]
Female Speaker: Okay, thank you. I think we have time for
one or two questions. Don't be shy.
Owen White: Don't be shy.
Female Speaker: Don't be shy. Okay, there you go, Owen.
Male Speaker: So one of the things I thought was missing
there was the ability of -- to get help in analysis of data. So when you generate a data
set and have never seen it before, it would be very helpful if there were people that
say, "Yeah, we could Skype with you one day, and help you go through the data and analyze
it." And we just had a case where we hadn't done arrays for a long time. We got some IT
people who showed us ways to analyze the arrays we never would have thought of, because we
hadn't done that for a while. So I think setting up some type of technology analysis data set
of people who would volunteer to do that would be very helpful.
Owen White: Well -- so, thank you. I'm sorry, I obviously
didn't emphasize that enough. I agree strongly. I think that we could literally have help
desks where people were able to contact some place and get assistance. At the DACC website
-- I'm not saying it has to be us, but at the DACC website we have things called walkthroughs
that are these step-by-step processes that tell users how to perform analyses. I think
we need to -- at meetings like this, we need to have breakout meetings where, you know,
the latest cool publication that's out there, you're just being walked through, and people
are describing how they generated the figure -- the information in Figure 2. I think all
of those things should be happening, so I agree emphatically there should be more training.
Male Speaker: I also wanted to state about the standardization,
I think it's very important. I was involved with the Cytokine and Interferon Society,
where we were worried that people were reporting cytokine levels in all different papers and
using all different kits.
Owen White: You bet.
Male Speaker: And so what did it mean when you used a Bio-Rad
kit, or you -- and so we went to the companies, and we went to the journals, and said, "They
all have to be standardized against WHO standards if they're available," and that way you could
interpret data between papers and -- to mean something.
Owen White: I'm totally with you. My concern is we have
this wonderful -- we have a Cambrian explosion of RFAs that are coming out from the different
ICs, the different institutes here at NIH. And I'm just really hoping that they start
to each, as individuals, understand that it'd be nice to standardize across the entire NIH
for the reasons -- exactly what you're saying. One more question.
Male Speaker: So I may be missing something, so maybe you
can bring me up to speed. But my experience in trying to find a control spike to put into
genomic or -- 16S or metagenomic data sets, is that there is one available from the BEI
and the HMP. But, in practice, when you request, although it's free, you're told very specifically
you can only have one, and that's got to last you for the year. And so if you actually start
using the sample in all of your library preps, it doesn't really -- at least in my experience,
wasn't really a practical solution to what I think is a very important problem. So I
wonder if you could comment about the importance of a control spike that people would be putting
into these data, into these experiments, and then also what it means that we can only request
it once a year.
Owen White: So I'm with you 100 percent. I don't have
first-hand experience to be able to account for the -- what sounds like a ridiculous policy
associated with a reagent, but I will certainly use this bully pulpit to say I think that
an -- just a small amount of resources should be put towards a -- some type of working group
that's generating reagents like that, and they absolutely have to be made available
without any encumbrance whatsoever, preferably with some very, very nice publications to
go along with them to tell people about the -- what -- the value, and what they could
be doing with reagents of that kind. So I can't quite account for that, what sounds
like an odd story. Yeah, go right ahead.
Maria Giovanni: So, this is Maria from NIAID, and we're the
ones who actually sponsor BEI. So I would be very interested in talking to you if there
is an issue with the amount of reagent you get, because, I mean -- and we can talk offline
about it, because I think that, you know, maybe it was set up that way, and maybe we
can change things. We can be flexible with things like that. So I don't know what the
issue is, but we can talk about it.
Owen White: Okay, I think we're now moving on to a question-and-answer
period, and I'm saying this very sincerely: I think that there's a way in which we're
all getting to know each other. We know that there's a lot of different people from a lot
of different backgrounds in the room, but I really do strongly want to encourage people
to speak frankly about the concerns you have for us to be able to go forward and just succeed
at our job. And so, Lita, I think --
Female Speaker: Maybe we should thank the speakers.
Owen White: Oh, well, let's thank me.
Female Speaker: I think we should thank -- we should thank
Owen and Rob for two very good talks. Thank you.
[applause]