The Challenges to Aggregating and Analyzing Data Sets from Sequencing Studies - Francis Collins

Francis Collins: Boy, does this have déjà vu written all over it. It's great to be here to welcome all of you to what I think is a very important meeting, to talk about how we're going to make the most of some technological advances that have enormous potential for teaching us about human biology and genetics and a wide variety of exciting applications. I wanted the chance to come and say just a couple words of welcome because, from the perspective of all of NIH, I think the meeting you're initiating right now and going through tomorrow afternoon has great significance for our ability to make the most of this remarkable moment where we have, by various counts, and Adam Felsenfeld will give you more details about this, something in excess of 70,000 individuals for whom whole-exome or whole-genome sequence will be in hand by the end of this calendar year. It's pretty amazing to just say that and not have anybody fall off their chair, but -- because a few years ago that would have been almost unthinkable and now it's like, "Well, yeah, okay." Well, the "okay" part is how do we make sure that that data, which is at the moment scattered in quite a few places with lots of different ways that people can get access or not, could be assembled together, so that the whole would be greater than the sum of the parts, and we would have a chance collectively across NIH and across other scientific sectors to make the most of what we could learn from this exceptionally powerful data. There are of course many kinds of questions that could be posed if we had that data set in front of us.
Some of us, I'm thinking David *** and Teri Manolio and Eric Green and Lisa Brooks and David Altshuler and I, and maybe a few others, I'm not sure I got the whole roster, spent Thursday and Friday of last week at a meeting in Boston, which was co-sponsored by NIH and industry with multiple pharmaceutical companies represented there at a relatively high level, to talk about how one could utilize this kind of genome sequence data to do a better job of identifying the right targets for the next generation of therapeutic development, effectively imagining being able to take advantage of the human knockout project that nature has already carried out to identify individuals with heterozygous or homozygous loss of function of virtually all of the protein-coding loci, maybe some of the non-protein-coding loci as well, and figure out what their phenotypes are, in anticipation that that would be a great way of predicting what the consequence would be if you developed an antagonist against that particular product. It could not only tell you what the likely efficacy would be of such a drug development effort, but perhaps also whether there would be toxicity or not, since the natural experiment is already potentially in hand. That was an exciting conversation, driven of course by some examples like PCSK9, which everybody likes to refer to, where this has been reduced to practice already and clearly is presenting opportunities for therapeutic development that are quite powerful. And the question is how generalizable could that be, if one had the data in front of you, the opportunity to go back and carry out phenotyping on those individuals whose genotypes particularly strike you as interesting. That's one of the things one might talk about in terms of having access to this kind of data in a large-scale fashion in an accessible database.
There are many others. Certainly the ability to understand biology from the perspective of gene-gene and gene-environment interactions would be greatly assisted by having data of this sort assembled in one place with standardization, not only of the DNA sequence, which of course needs to be of the highest quality and isn't always, let's be honest, but also of the environmental information and phenotypic information, collected in a fashion that will be most useful. And frankly I think that's even harder but something that we should begin to tackle. Clearly this kind of approach is happening across NIH; almost all of the 27 institutes and centers have something going in the direction of utilizing genome sequence information to try to answer questions, but it has not been fully pulled together in the way that everybody now agrees it needs to be. And that's why we are grateful to all of you, and I guess especially to Mike Boehnke and Wylie Burke as your co-chairs who agreed to shepherd this enterprise, for being here, because we do expect this is not a meeting to give your standard talk, not at all. It's a genome meeting. I know that never happens at genome meetings. It is a meeting to really roll up your sleeves and try to figure out what are the barriers to getting this particular outcome to happen and how can we knock those down, and do so in a way that is both scientifically rigorous and highly respectful of the concerns about confidentiality and privacy, which we must pay attention to if we are going to maintain the confidence of those who have given biological samples for us to learn from. So the problems will be numerous. Certainly some of the hard issues you'll be talking about will be data access policies: how can they be maximized to benefit science while preserving privacy and confidentiality?
Comparability of variant data: how do you put together some of these datasets that are actually collected in different ways and have different ways of recording the information that's been collected on the participants? Particularly, what do we do about phenotype and environmental data? How is that best displayed in a fashion that you could compute across multiple datasets? What about the simple problem of the computing power? The kinds of questions being asked here are going to be very challenging, and some of them very expensive in terms of cycles. And what analysis tools do we not have that we are going to wish we did and how could we start down the pathway of getting them sooner rather than later? Those are just a few of the issues that I know you will be wrestling with. So again it's really wonderful to be able to be here to issue a word of welcome on behalf of all of NIH, but I do want to thank NHGRI and Eric and Lisa and others at the -- the NHGRI staff who have worked very hard to pull together the details. And I'm counting on this to be a meeting that has a lot of substantive outcomes that will lead us forward to get to that goal that I think we all can see ahead of us, but it's still a bit blurry and there are still some potholes in the road, and the challenge is to figure out how to get to that goal in a fashion that does the most to advance the cause of biomedical research. So that's all I wanted to say. I'm going to invite Eric Green to come up and say a few words, and if you have questions then for Eric and me, I'll stay for a little bit. Unfortunately there are other fires burning back on the NIH campus which I will need to go attend to in a bit. But Eric, why don't you come forth?