Large Data Management, Data Standards, Data Sharing - Owen White

Owen White: I've been -- I've seen Rob give presentations like that several times, and -- with the moving ball through the PCA cloud, and I've been trying to come up with a good name for that. And at first, I thought it could be just "Ping-Pong Ball in Hail Storm," but I actually realized that it's just -- it's all the topics that Rob wants to cover, and then the Ping-Pong ball is just touching each one of them as he sort of proceeds forward. Okay, I have a very, very gentle commentary for everyone, which is I respect and admire you all tremendously, but I think that there's a way in which, when you've been talking about your charge, which is to tell us a little bit about the gaps, you've been, in a sense, a bit too polite, and I really -- I mean this sincerely, when Lita and I talked a lot about what we wanted to accomplish with this meeting, we really had in mind that one of the things that you would do -- and it's certainly happened. People are definitely touching on this. It's certainly happened -- that we were hoping that we would get schooled in really, no kidding, what are the obstacles to you getting your science done? And what -- if you just could spend a little bit of time, like, really letting us know about what are the important things. And I say this respectfully, Curtis, when he presented, he said, "Well, let's talk about the gaps in our knowledge," and other people have talked about the gaps in that context. And I -- and that's great, and I think it's just that we're all sort of getting to know each other in this new field, and we're behaving in a very polite way, but I would -- I strongly encourage people to -- if you have time, to definitely dedicate some thoughts in your slides to being very direct about what types of things are preventing you from accomplishing your research when you're starting to talk about gaps. 
And on that subject, I will do what I usually do in most of my presentations, which is to show almost no data whatsoever, and the entirety of the talk will be about some of the gaps, or limitations, or at least concerns that I have about the field as it goes forward. So to introduce myself, I just want to say that I am the PI of the Data Analysis and Coordination Center, which was jumpstarted at the same time at the beginning of the HMP. We do have funding for another year, so we're around. And we're trying to do the best that we can to help meet the needs of the microbiome community, which is obviously getting a lot larger. We supply lots of resources. We supply lots of training. We also have a computational infrastructure that we really would like to encourage people to use. We've developed virtual machine-based software that could be run out in the cloud. And that's it. That's my presentation on the DACC. I'm not really going to talk about that much more. When I do talk about other issues, feel free to editorialize with me later on about what might be issues that the DACC could take on in our last year as we go forward. So another way to introduce myself is to say that I was involved in this fabulous constellation of working groups that were associated with the Human Microbiome Project, and you can just kind of see them here. But there were a number of different groups that got together for data analysis, and it's a pleasure to be presenting after Rob Knight and Curtis Huttenhower, who contributed enormously to the data analysis of the HMP. And one of the things that we did for the HMP in our data analysis was different ways to start examining the community composition of the healthy human subjects that were studied. And this is by way of saying I'm going to look at a different community composition. I would like to talk a little bit about the community composition of all the scientists in this room. 
So that's a new meaning for the term microbiome community composition. And to do that I just first would like to make a point. This may look like I'm showing you data, but I'm not actually doing that. This is a slide that demonstrates what -- the size in terabytes of the sequence that was generated by the HMP, and the only point that I really want to make from it is just by talking to two people at this meeting, they've already warned me that there will be some very -- much larger bubbles that are coming. This was around four terabytes. They've already told me that there's going to be some data sets coming along that are a much larger size, but we can also expect that there will be lots of small data sets, as described by a lot of the presenters here, and there's probably going to be a great diversity of them, okay? So, backing up, let me talk a little bit about sort of two extremes of the NIH approach for how they get science done, and this is exemplified by Francis Collins's great legacy of establishing these really large consortia. And some characteristics of these consortia is they tend not to be as hypothesis-driven in comparison to an R01 study, and they tend to be very, very large engineering projects, like completing the genome of some macro species. And they tend to be -- the source of funding tends to be from one institute, so it's much easier to obtain kind of a top-down control about the way people go about things, like collecting samples, or making the data available, or just, you know, getting very simple things done, like the way all the DNA would be prepped, for example, in order to do a large sequencing effort. So that's one approach: the consortium style of NIH research. All right, here's a word cloud of some microbiome text, and let's just say what we do is really diverse. We do a lot of different things, and that's the only thing I was trying to demonstrate with this. And if you look at -- I won't -- you obviously can't read it. 
We could -- I could -- happy to give this to people, but this is a table of the current RFAs that have now emerged, all of which are associated with microbiome work. So there're three -- there are four that are out of date. There're 20 that are out there to support microbiome work. And the good news, as Francis mentioned, was that there are lots of different institutes that are taking this up, okay? So this is sort of to contrast in terms of how consortium-style science might be done. This is to convince you that -- this is a table to just say that there're a lot of different diseases that are associated with this, and Curtis talked about this, and others -- people -- others talked about it, so I don't really have to go into that. And then when I did a quick survey of all the session speakers and made a word cloud from the different departments they came from, there were lots of different disciplines that they all came from. So if I'm going to characterize the microbiome community, one of the things that I'm going to say is that we are very diverse in research areas, and there's implications to that. We're supported across lots of different institutes. We tend to be hypothesis-driven, which is great. We're studying many diseases in many systems. We're multi-disciplinary, and we are both big data generators, for some of you, and then there's also going to be a diversity of, let's say, smallish data generators. So that's our characterization of the community, and there're implications to that. And don't get me wrong; I love you all enormously. This is -- I'm not working up to some criticism that you're somehow doing it wrong. I celebrate the fact that you are coming from many different disciplines, you'll have many different levels of expertise, you're asking many different questions, and you're hypothesis-driven; that's great, okay? So, contrast to the consortium style, and that's one way that we could go about it, or we could talk about this diverse population. 
And one of the things that was attractive about the consortium style is that there was common processes behind where a sample came from. There were common approaches for the way that we dealt with data. It would all get released. You know, there was metadata; it was pretty easy to make available to the community. For these, when you did the cow genome it was obvious -- there was one straightforward protocol for identifying that DNA and how it was getting produced. There was also a centralized computing infrastructure; so all these members of a consortium were beneficiaries of the large genome centers or other centers that supplied computing equipment to help them get their job done. So that was really great, okay? We're not so much like that. This group of people are not so much like that. You don't have one computing infrastructure that you count on. You don't have one protocol that you could count on for the way that you go about isolating DNA, and there're some implications to that. So I'm going to humbly ask if there are ways in which we could be a consortium of sorts that is coordinated in certain ways, and the only reason that you'd want to belong to a consortium is if it was a benefit to you. So let me try to make an argument for how it might be a benefit to you. If we, for example, engaged in some type of protocol standardization, there might be some increased certainty about data, there might be an increased usability, and also, more importantly, reusability. If we've got lots of R01 researchers out there, and all of our budgets are really tight, wouldn't it be nice if we could count on being able to combine the data from other experiments to increase the ultimate power of what we're working with? And Rob just gave exquisite examples about how it would be nice to be able to reuse data. Wouldn't it be nice if we, as a large consortium, had a streamlined IRB process, okay? 
Wouldn't it be nice, as a large consortium, if you had somewhere to go when you had computational needs, and also, one of the consequences of you coming from different disciplines is you're going to have different levels of expertise with knowing how to generate lots of those beautiful figures that you've seen already. And I would like it if everybody in the room knew exactly how to do that, and I would hazard that you don't, okay? And wouldn't it be nice if there were some common training? So this -- it was already said by Rob very well. Rob speaks very fast, and he skated over some difficult issues, but if you want to get all the stool samples that are out there, at some stage you're going to have to be presented with a thing called the Short Read Archive, and this is an example of when you go to the Short Read Archive and you type in something like "human microbiome stool," and you get these entries, and it says, "Results: 1 to 20 out of 746 pages." Okay? There's a lot of problems associated with just trying to retrieve data from SRA, and it's not that I want to kick SRA in the shins, it's that they have a very difficult job, and they're being supplied with data that is incomplete, okay? So we have -- there's that list, okay, and there are many ways in which you have a type of uncertainty about that list, okay? You don't know a whole lot about its origin. You don't know where those samples were prepped. You don't know what patients they came from. You don't know how the library was made. If there were primers used, you don't know what those primers necessarily were. There's other things, though. You don't know if there is some generous researcher out there that collected those samples and put them in a refrigerator, and wants them to be available. Same topic: They may have been recruiting subjects or volunteers for a study, and those people may still be available to -- for downstream studies, and we don't know that. 
We don't -- when you're going to this list, you don't necessarily know if the data has a publication behind it, or if that publication was cited by other downstream publications, okay? Then this is a big one that concerns me: You simply don't know the quality of the data in one -- from one study to another. When you're presented with that list, okay, you're presented with a list, it came from stool, but it doesn't tell you if there were any types of errors along the way in terms of generating it. It's very difficult to pull out things like patient phenotype, and this is the one that really startles me a lot. This is true. Right now there are lots of data sets that are in SRA that were associated with a study, but you don't know which data set was associated with a disease, okay? And that's astonishing to me, okay? So there are issues in terms of data uncertainty, and I just want to call your attention to it. Okay, so I'm arguing that there are some benefits to increased coordination, and I'll talk a little bit about them, okay? I think that with improved submission standards and with provenance, it would be possible to learn more about your data. If you wanted to know if a sample was available, I would say that we might be able to do something like create an investigator registry and to track things like biosamples. That registry might also tell you about volunteers. With respect to publications, I'm going to show you an example of being able to track investigator publications. There's also ways that we could do large-scale QC on those different data, so you'd have information about it. We could get smarter about improving the way that people are submitting data to dbGaP, and deal with this bogey of essentially being able to track things like disease phenotype. So I'm arguing to you that maybe if we start pulling together and operating as a group, these things may happen with the existing data that's out there. We have a fighting chance of that happening. 
So you can imagine me: I stared the data management for the HMP in its ugly face. That was four centers that were getting together and trying to herd data, and it was difficult enough, as I imagined this diversity, this wonderful, beautiful, tremendous rainbow diversity of researchers out there that are going to be doing all these different types of studies, it really worries me if we aren't trying to be a bit more coordinated about this, and I just think that there are certain benefits to that. Has my accent changed to sound more like Rachel Maddow? [laughter] I really feel like I'm channeling her right now. [laughter] So with respect to sequence submission, I humbly suggest to you that it might be useful to have a data-coordinating center that was helping as a submission broker. That coordination center -- there may be many different models for that sequence submission. We could just give you improved submission tools. We could be giving you better submission tools for sequence and metadata standards, and things like provenance. I was working on an effort to help -- to get letters of support from different journals that were actually quite interested in enforcing things like metadata standards, and having people submit them at the same time as when they were getting their paper published. And also, they could describe things like -- would we -- we could require investigators to describe things like sample and volunteer availability. So let me put out there that there's lots of NIH staff here who are also putting out their RFAs. You may want to be considering requiring that people are making this happen, or at least describing if that's happening. That, to me, is a type of recommendation that I'd like to see emerge out of this meeting. The journals could also play a role in capturing protocols, but the nice thing about that is that all of this -- this could all be described as supplemented data that would be hosted at journals, which I think would be a good thing. 
It'd be sort of a distribution of where that data resided, in addition to being at a coordination center. Okay, now when I say the term investigator registry, I don't want you to think I'm using -- I'm going to slap a radio collar on you and then require you to do a bunch of things. I shouldn't really use the term investigator registry, but I would like to maybe say a participant -- consortium participant list, or something. And I would argue that it'd be useful to be tracking PIs based on, maybe, the literature that's out there, or the grant funding that's out there, and contacting them and asking them do they have volunteers, or do they have biological samples, do they have publications, do they have protocols? And I think that the investigator registry should -- could also play a role in helping with coordination of IRB approval. So I'll just make a point that this is something called the NIH RePORTER, which is actually a very nice system. It's obvious that there's a lot of resources that have gone into it. I've contacted them and asked them if it's possible for an independent entity like a coordinating center, or somebody else, to be adding more data to this, and they said, "Yes, there are ways." It didn't seem like the greatest model in the world, but they said that it would be possible. So you can imagine that this could basically be a good use of the taxpayers' money. Here's this great system, and we could use it as something of an investigator registry, and we could attach keywords to investigators and say, "Is this person a member of the consortium or not?" And then you could pull down all the people that were and get a lot of information. So that's just one thing that's out there. I think that there are ways to capture names of investigators. 
There's a brilliant young man named Illia [spelled phonetically], who works with Ari Forney [spelled phonetically], and he's developed methods for culling information from PubMed by capturing data from abstracts, and doing a little bit of text mining, and essentially creating networks of these investigators that are associated with these investigators, or depending on what keywords you might find, you might be able to pull out these are all the investigators or all the publications that are associated with a vaginum [spelled phonetically] microbiome study, and I think that that also could be the system, which is called CoPubNet could be capitalized on to start sort of seeding the investigator registry that's out there. This is a network that comes out of it of different people's names that are associated with a particular research project. I'll also make the point that there are plenty of people out there who have actually written software for things like volunteer registries. There are some great efforts out there, and we could be using software like this and using it to the benefit of this consortium if we wanted to, so that's out there. I'd also like to point out this effort. This is an effort by the CTSA program -- the Clinical and Translational Science Award program -- and it's called IRBshare. And the IRBshare is one creative solution to the way that we might be dealing with IRBs. There's probably people who have a lot more experience with IRBs than I do out in the audience, so I don't want to insult anybody. There's -- I'm sure that there're plenty of policy obstacles, and there might be some details of this that I'm not getting absolutely right. I'm just suggesting to people that maybe we could be a bit more creative about things. 
And the way that IRBshare works is that if there's a local IRB -- well, this -- first I should also say that this is strictly for multi-institutional sites -- multi-institutional studies when there's just one study and multiple institutions, okay? And in that scenario, normally each investigator has to go through their own IRB. What they do here at IRBshare is if there's one IRB and they've gone through a process, they could submit all their documents to this -- it's an actual server where they get all these documents, and then they've got different people that sit on a centralized IRB and go through the review process, okay? And so the common institutions can submit review documents. They sort of split up the review process, and the nice thing is that it promotes consistency and compliance, and kind of eases the process of IRB, because you can see, "Hey, this is how another institution handled it." And so I'm just offering to people that there are things that, maybe as a consortium, we could do. Again, you could be presented with this list, and there's a really big issue, and Rob talked about this, that you could have lots of samples, and you just don't know that much about the metadata behind them. You don't know that much about the quality or many other things. And we went through an exercise with all the demonstration projects for the HMP where Steve Sherry at dbGaP looked at patient variables; that's what these are. These are different patient variables that were collected and submitted to dbGaP, and he colored a column if a particular study had a variable that was in common to some other studies. And what -- the thing that was really astonishing is things, even like age, or height, or if the people were smoking; all those variables were tracked in a different way, okay? So it was very hard to retrieve: "Give me all the studies that were associated with people who had taken a certain antibiotic." 
And I found that surprising, and I just -- Rob already mentioned it: We have a standardized system for describing metadata, and I think it should be exploited. And we also have something called PhenX. PhenX is a study that's sponsored by the NIH that's meant to be dealing with the process of converging on common variables across studies. So the way that they do this is they establish a working group, the working group looks at a small number of measures -- a workable set of measures, just like I was showing you with that previous slide from Steve Sherry, and they get input from the research community, they review that data, and then, eventually, they make a final set of measures. And they also have published a tool called PhenX, which assists people with the process of marking up their data dictionary from their clinical studies -- it's actual software that helps you do this and helps you make sure that it's marked up the proper way, and then it helps you create the submission documents to dbGaP, okay? This is another thing that, as a group, as a consortium, we could be considering. Okay, so, you know, imagine, if you're a member of this registry, imagine if you decided, "Okay, I'm going to drink the Kool-Aid. I belong to this consortium." We could do things like help you with management of IRB forms. We could help you -- if we have that Party A is searching these researchers -- or these subjects with these variables, and you gave us your IRB and told us what variables you were interested in, we could alert you to the cases where there were study participants who would like to participate in your study. All -- we could help you with, you know, tracking publications. We could help you manage stuff into SRA. This is the sort of scenario we could be working towards. 
The last thing I just want to mention very quickly, and Rob already talked about this, this is a slide that was created by a group of people that were trying to standardize assays for a completely different domain. And the point was is that there's a lot of give and take, but they established working groups where eventually they made harmonized protocols, which does not mean identical protocols, harmonized protocols that meant lots of labs could be generating data that made -- the resulting data was comparable between studies. And I would argue that we should establish some working groups that were really setting about the business of doing this, and Rob gave some great examples of how, for specific domains, like with protocols having to do with stool, protocols having to do with different body sites, we could establish some harmonized protocols. Okay, so I'm just going to skate forward really quickly. I just want to say that at one point we performed an email poll asking the community what type of analysis methods they needed. The most popular response was metagenomic assembly. So we held a two-hour webinar given by two luminaries in the field on assembly, and 109 people sat in on that seminar to listen. It was a fantastic success, and I just want to bring that up. That's one of the things that the DACC is doing, but that gives you an idea of the hunger that's out there for training, and there needs to be more of it. Okay, so for -- very quickly, if I were going to identify gaps, I would say we have gaps in training, okay? You're experts coming from lots of different fields, and you may not know how to do some of the statistical analyses that you saw presented today, or you may not have the computational equipment to be able to do it. So we need more training. I would argue that some -- I shouldn't use the term PI Registry. We should have some type of consortium centralization site that is tracking lots of information of this. 
There are lots of different types of QC that we could be doing on all the data that's in SRA. Another big gap that we have is, and I only heard one person really mention it so far, is that there aren't a lot of resources out there to help people do processing or adding value to the data that exists at SRA. We clearly could be going a long way with harmonization of protocols. Imagine if all the data that you were looking at was stamped with a bunch of standardized protocols that it was derived from, and you could, with confidence, make some of the PCA comparisons that -- Rob was showing, and it would just simply be adding power to your own data. There are gaps in terms of how people go about being able to submit data to the different repositories that are out there, and I'd really like to see some progress there. Okay, so I don't have an attribution slide. I just want to give thanks to all of you and everybody in the audience for hearing me out, and we'll just take it from there. And there's the names, so... [applause]
Female Speaker: Okay, thank you. I think we have time for one or two questions. Don't be shy.
Owen White: Don't be shy.
Female Speaker: Don't be shy. Okay, there you go, Owen.
Male Speaker: So one of the things I thought was missing there was the ability of -- to get help in analysis of data. So when you generate a data set and have never seen it before, it would be very helpful if there were people that say, "Yeah, we could Skype with you one day, and help you go through the data and analyze it." And we just had a case where we hadn't done arrays for a long time. We got some IT people who showed us ways to analyze the arrays we never would have thought of, because we hadn't done that for a while. So I think setting up some type of technology analysis data set of people who would volunteer to do that would be very helpful.
Owen White: Well -- so, thank you. I'm sorry, I obviously didn't emphasize that enough. I agree strongly. I think that we could literally have help desks where people were able to contact some place and get assistance. At the DACC website -- I'm not saying it has to be us, but at the DACC website we have things called walkthroughs that are these step-by-step processes that tell users how to perform analyses. I think we need to -- at meetings like this, we need to have breakout meetings where, you know, the latest cool publication that's out there, you're just being walked through, and people are describing how they generated the figure -- the information in Figure 2. I think all of those things should be happening, so I agree emphatically there should be more training.
Male Speaker: I also wanted to state about the standardization, I think it's very important. I was involved with the Cytokine and Interferon Society, where we were worried that people were reporting cytokine levels in all different papers and using all different kits.
Owen White: You bet.
Male Speaker: And so what did it mean when you used a Bio-Rad kit, or you -- and so we went to the companies, and we went to the journals, and said, "They all have to be standardized against WHO standards if they're available," and that way you could interpret data between papers and -- to mean something.
Owen White: I'm totally with you. My concern is we have this wonderful -- we have a Cambrian explosion of RFAs that are coming out from the different ICs, the different institutes here at NIH. And I'm just really hoping that they start to each, as individuals, understand that it'd be nice to standardize across the entire NIH for the reasons -- exactly what you're saying. One more question.
Male Speaker: So I may be missing something, so maybe you can bring me up to speed. But my experience in trying to find a control spike to put into genomic or -- 16S or metagenomic data sets, is that there is one available from the BEI and the HMP. But, in practice, when you request, although it's free, you're told very specifically you can only have one, and that's got to last you for the year. And so if you actually start using the sample in all of your library preps, it doesn't really -- at least in my experience, wasn't really a practical solution to what I think is a very important problem. So I wonder if you could comment about the importance of a control spike that people would be putting into these data, into these experiments, and then also what it means that we can only request it once a year.
Owen White: So I'm with you 100 percent. I don't have first-hand experience to be able to account for the -- what sounds like a ridiculous policy associated with a reagent, but I will certainly use this bully pulpit to say I think that an -- just a small amount of resources should be put towards a -- some type of working group that's generating reagents like that, and they absolutely have to be made available without any encumbrance whatsoever, preferably with some very, very nice publications to go along with them to tell people about the -- what -- the value, and what they could be doing with reagents of that kind. So I can't quite account for that, what sounds like an odd story. Yeah, go right ahead.
Maria Giovanni: So, this is Maria from NIAID, and we're the ones who actually sponsor BEI. So I would be very interested in talking to you if there is an issue with the amount of reagent you get, because, I mean -- and we can talk offline about it, because I think that, you know, maybe it was set up that way, and maybe we can change things. We can be flexible with things like that. So I don't know what the issue is, but we can talk about it.
Owen White: Okay, I think we're now moving on to a question-and-answer period, and I'm saying this very sincerely: I think that there's a way in which we're all getting to know each other. We know that there's a lot of different people from a lot of different backgrounds in the room, but I really do strongly want to encourage people to speak frankly about the concerns you have for us to be able to go forward and just succeed at our job. And so, Lita, I think --
Female Speaker: Maybe we should thank the speakers.
Owen White: Oh, well, let's thank me.
Female Speaker: I think we should thank -- we should thank Owen and Rob for two very good talks. Thank you. [applause]