Male Speaker: So we were probably the smallest group, I think, of the breakout sessions, comparative genomics and evolution, but notwithstanding that, we had a pretty lively, sometimes confusing discussion which I think we finally cemented near the end. We started out by kind of revisiting, I think, some of the things we think are so important and foundational about this program, which has existed at NHGRI almost since its inception. And the fundamentals, which I think we were all in agreement on, are essentially that evolution is the unifying principle on which everything that we are doing rests. So, studies of variation and studies of mutation are essentially fundamental, and understanding how those processes occur really requires comparative genomics.
The idea of genotype-phenotype correlations, which is what most of the people in this room
are interested in, also benefits and I would say it probably makes most sense in the context
of evolution, and so it provides us an unbiased framework for discovery and prioritization
of regions, and I would argue that as we move perhaps into interrogating non-coding sequences, regulatory sequences, and trying to understand them, the comparative genomics and evolutionary aspects will become more and more important. And I guess the last important point is that NHGRI really has blazed a trail in terms of this research. In terms of mammals and vertebrates specifically, there's the expertise, we have the computation, we have the resources in terms of libraries and other types of things, and we have the consortia.
So, the track record and the ability to do this type of work really surpasses any other
institute at NIH.
So, we began by first, kind of -- and I’ll do this very quickly -- just reviewing the
accomplishments and sorry to those that we don’t list as an accomplishment but there
have been many in this area over the last 15 years. Sixty vertebrate genomes have been
sequenced in some form or fashion and aligned with the human data, revealing about 3 million evolutionarily conserved segments, so about 4.5 percent of our genome. I think the
important point to think about is that this is work in progress. In many cases, the genomes
are not assembled or they’re just used essentially to align to the human reference and so we
don’t, in many cases, have stand-alone, high-quality or even reasonable working draft
assemblies from many of the genomes. So, I just looked at the average N50 contig length
for primate genomes and it’s on the order of about 25 kilobases.
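Since N50 figures recur throughout this discussion, a quick definition may help: the N50 contig length is the length L such that contigs of length L or greater together cover at least half of the total assembly. A minimal sketch in Python (the contig lengths below are hypothetical, for illustration only):

```python
def n50(contig_lengths):
    """Return the N50: the contig length L such that contigs of length
    L or greater together cover at least half of the total assembly."""
    total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0

# Hypothetical contig lengths (bp), for illustration only.
contigs = [120_000, 60_000, 45_000, 25_000, 10_000, 5_000]
print(n50(contigs))  # 60000: the 120 kb + 60 kb contigs cover half the 265 kb total
```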
Point number two: one of the missions that we've had for many years is essentially to reconstruct the evolutionary history of every base in the human genome. We are not there yet. We've made some good strides in this area, but we're lacking critical species in terms of high-quality genomes. We're lacking prosimians; we don't have a high-quality tarsier reference genome, for example. There are one and a half million gaps in the tarsier
assembly right now. Less than 50 percent of that sequence can be aligned to the human
reference. So, if you think this is fait accompli, you’re wrong. We’re not done in terms
of this project, at least in terms of how we set out. We’ve done moderately well.
I changed this from "deep catalogues" to "begun" catalogues. There have been efforts by the Great Ape Genome Project, and for rhesus macaque and African green monkey, to begin to survey some of the genetic variation that exists within these species. There have been studies of inbred Drosophila strains to provide, really, a framework for quantitative genetic trait studies, and so these have been good. More could be done in this particular area.
And the last point -- which we had some debate about whether it was part of our purview, but we think it is fundamentally comparative in nature because it involves comparing genomes both past and present -- is the fact that we've had, and I think this is an achievement, the Genome Reference Consortium, whose mission has been to continually improve the reference human genome as we go forward. So, many of you might think of this as a housekeeping exercise to kind of finish off the gaps. It's much more than that. The regions being tackled right now -- and it's a targeted approach for specific regions -- are regions that are incredibly diverse. Think of the MHC, think of T cell receptor regions, think of regions around segmental duplications: highly dynamic, gene rich, important in terms of human health, and highly variable. There is more variation in that three-megabase stretch -- tens of kilobases, hundreds of kilobases of sequence variation between different haplotypes -- which has not been catalogued because of the complexity of that type of variation. So, as a result, some of the holes remain, and here we really echoed the first goal of the last group: we have not yet comprehensively assessed all genetic variation in any single genome.
So, it’s not just a question of allele frequency spectrum, it’s a question of getting all
the variation. All the indels, all the structural variants, all the copy number changes -- and there are more. Estimates are that four to five times more base pairs are affected by structural variation than by single base pair and indel events.
All right, so that's kind of the achievements. This is just to remind you -- this is taken from two papers -- kind of the phylogenies that have been tapped in terms of this. I highlight here a few. I'll just mention gorillas, for example. We've worked on many of the primate genomes. The gorilla genome was recently sequenced and assembled. Its average contig length is about 11 kilobases, I think, if my recollection is right. There are about half a million gaps in the gorilla genome. What that means is that when we did the four-way alignment with the apes, which includes humans, 30 percent of the genome could not be aligned in that four-way alignment. So, only 65-70 percent of the genome could be aligned. That's euchromatic sequence. That's genes. So, we have a great deal of heterogeneity in terms of the quality of the genomes that have been generated. Many of them have simply been used to align to the human reference. So, many of the mammalian genomes -- roughly the 34 that have been done, 29 depicted here -- are not particularly high-quality draft assemblies. The only high-quality assemblies on this slide, really, to be honest, are human and mouse. And many of the others are in various stages of working draft.
So, when we set out the goals from our group, we basically went to the -- we actually broke each one down into really four things. Essentially, what's the big question, and why is NHGRI relevant to this particular question? Second was the tactic or the approach. Third was details, and fourth was justification, not in that order. So, I think we agreed in our session that this was the single most important goal -- people who were in that breakout group can disagree with me: to move from aligning genomes to essentially doing de novo sequencing and assembly without guidance. To be able to take a genome, no matter what species, human or otherwise, and to be able to generate a high-quality de novo sequence and assembly of that genome. And so here's the specific: we would suggest, or argue, that what NHGRI should invest in -- and not be dependent solely on the commercial sector for this -- is advancing sequencing technologies to sequence and assemble a genome for $10,000.
So, this is not generating 40x sequence coverage with Illumina. This is actually assembling. The cost of assembling genomes has actually still been prohibitive. We have some statistics now, based on assembly of one human genome using long-read PacBio data, that suggest it would cost us about $60,000 [unintelligible] to assemble a genome with an N50 contig length of 4.4 megabases. That's a 150-fold improvement in terms of N50 contig length over just standard Illumina sequencing. I don't think it's unreasonable to think that we could have an order of magnitude drop in that cost to get us to a $10,000 genome assembly.
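As a back-of-envelope check of the arithmetic in that claim (the implied short-read baseline below is an inference from the stated 150-fold improvement, not a figure given in the talk):

```python
# Figures quoted above; the derived values are inferences, not from the talk.
pacbio_n50_bp = 4_400_000    # ~4.4 Mb N50 from the long-read PacBio assembly
fold_improvement = 150       # stated improvement over standard Illumina assembly
implied_illumina_n50 = pacbio_n50_bp / fold_improvement
print(f"implied short-read N50: ~{implied_illumina_n50 / 1000:.0f} kb")  # ~29 kb

cost_now_usd, cost_target_usd = 60_000, 10_000
# A 6x drop would reach the $10,000 target -- within the hoped-for order of magnitude.
print(f"required cost drop: {cost_now_usd / cost_target_usd:.0f}x")
```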
We suggest that one useful -- so, this is the specifics; I think Jeff Schloss asked for this specifically -- one specific approach or area that NHGRI could invest in would be to apply this to a finite number of human reference genomes. So, to generate human reference genomes, at the quality of the existing Human Reference Genome or better, for 50 different humans representing diversity, sampled broadly across humanity. So, we're thinking of these as what we call gold genomes: very high-quality genomes where most of the bases and the structural variation, copy number, and indel variation have all been resolved. We think it would be incredibly powerful because it would give us a comprehensive view of the types of genetic variation that exist in the sweet spot that we can't access very well right now.
So, as a member of the structural variation working group for the last x number of years, as part of the 1000 Genomes Project and, before that, earlier projects: we are not particularly good at detecting inversions. We are not particularly good at genotyping or detecting insertions. We are terrible in terms of complex structural variation events with [unintelligible] duplications. And so this is an area, if you could think about it, where we would have 50 references -- call them continental genomes if you will: Africans, Asians, Amerindians, Europeans -- but we would have very high-quality references at those positions. This would give us, I think, the first truly comprehensive view. We just ran some statistics recently and we think, based on what we've been able to do on one genome comparing Illumina and PacBio technology, that between 50 base pairs and 5,000 base pairs we are missing -- I should say, 62 percent of deletion variants and 90 percent of insertion variants. So, if we think we are completely understanding this variation in the human genome, we're really mistaken. I pushed for this, but there was pushback in the group, so I'll just mention it. I think the goal should be even better than this.
I think we should push to sequence every human chromosome from telomere to telomere, including the dark matter: the centromeric regions, the acrocentric regions. I think it can be achieved. It won't be achieved today, maybe not in the next couple of years, but no other institute will achieve this. And we know that variation within centromeres, we know that variation within telomeres, is important in terms of human health.
This one, this is a big, lofty goal: what makes us human? This requires an emphasis on primate genomes. We still have not achieved this -- something that we set out over 10 years ago -- which was to assign every human lineage-specific genomic change to a specific branch on the evolutionary tree of primates. Many in the group were most interested in the human-specific changes with functional consequences, including gene innovations, and so we are still discovering new genes, in 2012, 2013, that aren't in the Human Reference Genome. These are typically duplicated genes, but they are also important in terms of human health and human adaptation. So, in terms of specifics and concrete steps, we would argue that it's possible, with all the resources that have been generated now, to focus on generation of high-quality de novo assemblies of non-human primate genomes. We suggest as a straw man 16 primate genomes, including many which are already at working draft stage, and to assemble them at the quality of the Human Reference Genome. Sixteen is a number that we use based on looking at available resources, including BAC resources, but also on having at least two representatives from every major phylogenetic branch from the human lineage. This would provide us fundamental information on mutational processes, speciation, differences in lineage-specific sorting, gene flow, et cetera. And there was some discussion in our group -- and we think it's an interesting observation -- that many of the recurrent microdeletions that are actually mediating genomic disorders in the human population are caused by human-specific duplications in complex regions that have evolved over the last 5 to 10 million years of human evolution. There's remarkable genetic variability in those regions, which predisposes some individuals and certain haplotypes to recurrent microdeletions and others not.
The last point or goal that I'll mention was essentially this: to obtain nucleotide-level resolution of every conserved functional element in humans. We are not there yet. We heard some great stories yesterday about the power of comparative genomics in helping to identify regulatory elements -- the story from David Kingsley of finding that mutation in the regulatory element for KITLG, and how these weren't detected by ENCODE but were picked up based on comparative analyses. The data that's out there right now, which is roughly the 30 mammals, gets us down to about 12 base pair resolution. Simulations suggest that if you push this to 100 to 200 mammalian genomes sequenced deeply, you'd get down to single base pair resolution. And I think that's an easy target that could be generated right now, without any advances in sequencing technology. Some people said, "Well, maybe this will be done by, you know, the 10K diversity project or the Genome 10K project or other projects that are out there." I don't think so. It may be, but that's not their mission. The mission here should be to sequence genomes and make that data publicly available so everybody can analyze it as quickly and rapidly as possible. This would allow us to quantify the selective constraint on each element across mammals and integrate it with existing data sets like ENCODE in both mouse and human. And if advances in computational technologies and advances in sequencing technologies came along, it wouldn't, I think, be beyond the pale to think about doing additional mammalian genomes at high quality, like we have done for the mouse. All right, I'm going to turn this over to Andy.
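The scaling behind those resolution numbers can be made concrete with a toy model -- an assumption for illustration, not the simulation the speaker cites -- in which the detectable element size shrinks inversely with the number of genomes (a crude proxy for total branch length) in the alignment:

```python
# Toy model (assumed, for illustration): resolution of constraint detection
# scales inversely with total branch length, which grows with genome count.
GENOMES_BASELINE = 30        # the ~30-mammal alignment...
RESOLUTION_BASELINE = 12.0   # ...yields roughly 12 bp resolution, as quoted above

def projected_resolution(n_genomes):
    """Crude inverse-linear extrapolation; real gains saturate as newly
    added lineages overlap branches that are already sampled."""
    return RESOLUTION_BASELINE * GENOMES_BASELINE / n_genomes

for n in (30, 100, 200):
    print(f"{n} genomes -> ~{projected_resolution(n):.1f} bp resolution")
# 30 -> 12.0, 100 -> 3.6, 200 -> 1.8: approaching single-base resolution
```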
Male Speaker: I wish there was a button like that where
I could blank all your screens, too. If you think of the top causes of mortality and morbidity
in humans and consider cardiovascular disease, COPD, stroke, diabetes, the list that Mike
Banky [spelled phonetically] gave, they have two things in common. One is that they're adult onset, and the other is that they're remarkably sensitive to the environment, so that your risk is a function of your environmental stresses as well as some attribute of your
genome. So, we’re all unique. Some of us are more unique than others. And we’re unique
not only because of our genomes but because of that trajectory of environmental stresses
and environmental exposures that we’ve had during our lifetime. Now, when you’re faced
with trying to infer causality in a situation that's so high-dimensional, where the factors
are confounded so badly as they are with genotype and environment in the case of humans, you
have a very tough problem ahead of you, and we’re all familiar with that.
Now there are two major impediments, then, for studying and understanding causation in
the face of adult onset diseases of this sort. One is the fact that there isn’t really
a good, controlled experiment that any of us can do; there's no control for that experiment because it's observational. And the other is that we can't replicate the experiment. We can't take the same genotype and put it in a bunch of environments and
ask what happens.
So, there’s good news and bad news. The good news is that model organisms do precisely
this. They allow us to put the same genotype in multiple environments and take apart genotype
by environment interaction very carefully. The bad news, of course, is that when you
do the right experiment -- take a set of zebrafish or mice or flies or worms through a set of different environments -- the rank order of the phenotypes that you score almost always flips around. That is, genotype by environment interaction is almost universal.
So, we’re in the situation where we need to understand, how do genotypes respond to
different environments, and the answer is we need to figure out what’s the best way
to work with model organisms. Now, we all know of examples where model organisms are
terrible for modeling specific diseases. There’s a human disease where the mouse doesn’t
even have that gene. But we need to move beyond that. We need to ask the question now using
genomic technologies, what really is the best model for each specific disease. I think Aviv Regev's talk was particularly informative in thinking about how genomic technologies could really sharpen our ability to focus on model organisms for specific diseases.
If we had catalogues of the sort she described for mouse genes, for instance, there’s great
opportunity for taking this forward.
So, goal four, then, is to leverage the power of model organisms in functional genomics.
Of course, this resonates very well with goal one, where Eric Boerwinkle emphasized the fact that we need to understand basic biology before we can understand disease. I would argue that model organisms are an important path to that. And, of course, in goal two Rick Myers and Mark Gerstein argued quite well that model organisms do have the ability to allow us to infer function at the adult stage -- or the whole organismal stage.
So, some of the points for how we would proceed to do this: for instance, applying large-scale genomic and other omics technologies to reference panels. Many of the model organism communities are working on functional genomics. There are such panels as the Collaborative Cross, the Diversity Outbred, and so forth in mice, and many others in other model organisms. Then there's the idea of taking human mutations forward into model organisms and studying them both at the cellular and the organismal stage. This will scale beautifully now with CRISPR technology. We'll be able to make thousands of human mutations, put them in adult mice, and put them in different stressful environments. So, by doing this across different
environments then we’d have this really good handle on the sort of full degree of
genotype by environment interaction in those particular physiologies that are relevant
to these human disease states.
One other idea was that, as we study these other organisms and take the more comparative, evolutionary kind of view of the world -- we look at, for instance, the naked mole rat and find that they don't get tumors -- we might identify genes that we suspect are important in that process. What about doing the reverse experiment: taking those genes from these non-model organisms, putting them into human cells, and seeing how they behave? That's a kind of out-there idea that I won't take credit for.
Okay, so that’s the end of the ‘model organisms’ sermon. Getting back to the grand
scope of comparative and evolutionary genomics, all of the things that Eric told you about
couldn’t be done without serious improvements to the computational infrastructure that we
need. So we need to develop informatics infrastructure to produce, display, and quantify these multiple
species genome alignments. An alignment of species' genomes is a central tool for inferring where along the phylogeny particular changes occurred. When we layer functional information on top of that as well, we get terrific insight into the way genes and phenotypes have evolved. So this requires development of algorithms, software, alignment methods. It requires development of new browsers. Anybody who's been involved in any of these projects knows that you have a constantly shifting coordinate system for
the genomes as you discover huge insertions in one species that weren’t in the others
and so forth. We need to devise methods of analysis of complex chromosomal rearrangements,
methods of representing genomes in the face of those rearrangements, and finally, we need
to produce benchmarks, quality control metrics, and assessments of accuracy of these methods.
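To make the shifting-coordinate problem concrete, here is a minimal sketch of lifting a position from a reference genome into another species' genome through a pairwise alignment containing a large insertion; the CIGAR-like operation encoding is a simplification for illustration, not any browser's actual format:

```python
def map_position(ref_pos, ops):
    """Map a reference coordinate through alignment ops: a list of
    (op, length) pairs where 'M' is aligned sequence, 'I' is an
    insertion in the other genome, and 'D' is a deletion from it."""
    ref = other = 0
    for op, length in ops:
        if op == 'M':
            if ref + length > ref_pos:
                return other + (ref_pos - ref)
            ref += length
            other += length
        elif op == 'I':      # bases present only in the other genome
            other += length
        elif op == 'D':      # reference bases absent from the other genome
            if ref + length > ref_pos:
                return None  # position deleted; no equivalent coordinate
            ref += length
    return None

# A 10 kb insertion in the other species shifts every downstream coordinate.
alignment = [('M', 5_000), ('I', 10_000), ('M', 5_000)]
print(map_position(2_000, alignment))  # 2000  (upstream of the insertion)
print(map_position(7_000, alignment))  # 17000 (shifted by the insertion)
```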
So, I'll summarize just by kind of listing all of the goals, going through these quickly, reminding you that evolution is the single most powerful unifying principle in all of biology, and that the history of biology is that we have learned an enormous amount from it. I'll warn us against the arrogance that we are all somewhat subject to with the power of the tools of genomics: to think that, well, now that we can do this in human cells, we don't need to think about anything else anymore; we can just do all the manipulations in human cells. I think that evolution still has an enormous amount to teach us, and that model organisms have also marched forward in their technologies for manipulating and perturbing genomes.
We need to develop strategies and technologies to obtain high-quality de novo reference sequence.
This will be applicable throughout all of biology, not just the goals that Eric outlined -- I mean, Evan, sorry. We need to target multiple primate genomes to infer, with high confidence, all of the human-specific attributes. This is enormously useful in doing comparative biology: seeing what are uniquely human traits and how we are different from our closest relatives. That's also of tremendous intellectual interest, and I think we could bring in the public in sort of sharing the excitement over these sorts of aspects of the science that we do. The sort of fundamental question of how we evolved from our most recent ancestors is one that resonates very deeply with the public.
By sequencing multiple mammals -- we were told last night there are only 5,400 known mammals, so eventually we might be able to get there, but starting with the first few hundred using current technologies; this is not expensive -- we could easily get to
the point of being able to identify all the human-specific conserved elements from those
alignments and comparisons.
The fourth point, again, is the model organism thing. I'll beat that drum one last time. There's still enormous utility there for understanding many aspects of basic biology, including, particularly, these context-dependent variant functions, where those contexts include anything from diet to drug treatments and so forth. Those scale beautifully in experiments with
model organisms, and so sequencing reference panels of them would enable those studies
to proceed at a much greater pace.
The fifth goal was, again, this development of software tools for dealing with multiple
sequence alignments. And I'll close with just emphasizing again that all of the things we talked about -- none of them are actually in the purview of other institutes. The National Institute of General Medical Sciences does do a lot of evolutionary biology. They have funded some model organism work and so forth, but in the scale of the sequencing there is an aspect of this problem that is uniquely NHGRI, and we would like to see them do it.
So, with that I’ll take questions. Dave?
Male Speaker: So, I want to point out that this so resonates
with the goal that we heard from Heidi: detect all types of clinically relevant variation in a single genome-scale test. That's very, very consistent with the $10,000 genome goal. And I would say that I love these charts that we've been seeing with Moore's Law and approaching the $1,000 genome, but there's a lot of wishful thinking in there. Those aren't genomes, right? We really need to get back and do this the way that Evan described beautifully, so we can really say that we're sequencing the whole genome.
Male Speaker: I mean, I agree. I was really pleased. I was
hoping that one other group would come up with the importance of being able to do a
single genome --
Male Speaker: Yes.
Male Speaker: -- without alignment to the reference, because that is fundamental human genomics. A wise man once told me that our field is skin-deep in terms of its fundamental algorithm, which is to understand all the genetic variation comprehensively -- and once we do that, that's finite; we can assess links to human phenotype. And so this is where we will be in five, 10 years from now, whether NHGRI leads the charge or not. We won't be doing these alignments anymore; we'll be doing de novo assembly.
Male Speaker: And there has been a spectacular improvement, starting from $300 million in 2000 to what we can do now, in terms of getting up to where real high quality is, though.
Male Speaker: And it's not as if the private sector isn't going to play an important role. Companies like PacBio and Oxford Nanopore can continue, but if there were an incentive or push at some level to drive this even more, I think we could accelerate and get to that point of a single genome assembly -- you know, in five years instead of ten -- as part of a routine clinical test.
Male Speaker: Jim.
Jim Evans: To follow up on those statements, I think this is so important, particularly from the evolutionary aspect, since new mutation frequency is much greater on a locus-specific basis for copy number versus single nucleotide variants; we really need to get better structural information and assays for this. And it ties you right into an institute nobody I've heard talk about right now, and that's environmental health, NIEHS. We've already got evidence from Tom Glover's work that hydroxyurea, a chemical we use clinically, induces CNV mutations at high rates in both yeast and mammalian cells in vitro. So studying these mutational mechanisms -- I mean, the Ames test doesn't even test for copy number; it mostly gets at single nucleotide variants. So how the environment interacts with our genome with respect to copy number is a total black box.
Male Speaker: I mean, a related idea to this inference of
mutational processes is inference about differences in recombinational processes, which are also
fundamental to, well, evolution and everything else about the map. So, yes, I agree.
Jim Evans: And Richard, to follow up on that, I mean, the fact that Alec Jeffreys has nicely shown that PRDM9 alleles influence genomic disorder rates, and that you can change that in different environments -- I think we have to go at that in a big way for mutations.
Male Speaker: Richard.
Male Speaker: This seems an opportunity to move the center of gravity of disease models from mouse towards primates, and whilst you mentioned primates in the other context, I think you didn't strongly state that building and exploiting primate models for human disease should be a priority. Was that discussed?
Male Speaker: Well, we actually didn’t discuss it too
much. I think the impediment there is that the problems of working with primates are just -- there are limits to what we can do. They're not totally insurmountable, but that's a limit.
Male Speaker: Well, I think you could argue, too, that there have been technological changes in all those areas that really change the complexion of that.
Male Speaker: I was interested that in goal two you didn't mention archaic humans in this high-confidence list of human-specific genome attributes. I think that is -- I know it's not science fiction, obviously --
Male Speaker: We discussed this specifically, NAME, and this came up in the context of: yes, there will be more archaic hominins sequenced. Most of that will be done with short-read technology, largely because the fragment lengths are so short in these archaic hominins that it doesn't really lend itself to de novo assembly of Neanderthal, or Denisovan for that matter. It's generally something that we feel is going to happen whether NHGRI invests or not, so we were looking for those seven characteristics that, you know, they laid out in the beginning -- you know, high throughput, you know, consortia, technology advance --
Male Speaker: I mean, then that’s a focus on the data
generation component of it and not the data analysis component. I think it would be very
foolish in the data analysis component to ignore the archaic humans.
Male Speaker: Absolutely.
Male Speaker: Absolutely.
Male Speaker: I just want to push a bit -- on the model organisms, I think there is this huge gene-environment effect. It's very clear that you have opportunities there, especially with fixed genotypes, to explore things which are very, very hard to do observationally in humans, so I really believe that. A personal plea is that we don't narrow ourselves down to just mice and zebrafish as model organisms. I just think that we've got a much bigger repertoire of organisms out there than, just say, model organisms equals, you know, a very small list of species; there's a bigger diversity of useful organisms out there beyond those two.
Male Speaker: We’ll include badaca [spelled phonetically].
Mark.
Male Speaker: I just wanted to say that I really agree with Evan's point about the impact of structural variation, the importance of having high-quality genomes. I just want to mention that in the functional breakout group we also really did talk about the functional impact of structural variants. I mean, it's much more complicated and potentially much larger than for single nucleotide variants. We really have to think about ways of assessing this impact.
Male Speaker: I agree. Carlos.
Carlos Bustamante: Just to add to Ewan's point on the potential for archaic and ancient DNA: in fact, there's been a ton of technology development recently that is sort of upending this issue about, you know, how much you can really get out, right? So, when they do, for example, the single-stranded library prep, it turns out you get many more molecules than with the double-stranded, and that's why you're able to take these [unintelligible] to high coverage. So it's actually an area that the U.S. has invested almost no money in, right? All of the development has actually happened in Europe, and you could imagine that, you know, in fact, there may be some bones somewhere that have somewhat larger fragments that could be sequenced. So, I wouldn't totally rule it out, and nobody knows how far back you could go, right? We don't have a Homo erectus sequence yet; that doesn't mean it can't be done, but, you know, it just hasn't been prioritized in terms of what's been done, and there are basically two, three labs in Europe that are leading in that area.
Male Speaker: Yeah.
Female Speaker: So, just on goal five, point five: was there discussion on the committee about partnering with, say, NSF on advancing some of the alignment algorithms and comparative data analysis tools? Because they also expend a lot of money in this area. It'd be a good partnership.
Male Speaker: There wasn’t. That is a very good idea.
The NIH/NSF joint developments in quantitative biology have been very encouraging and that’s
a very good suggestion.
Male Speaker: I mean, I guess my feeling on this is that most of the genome browsers that are being used -- and that's just one way to do this, obviously -- have been driven largely by the genomics community, funded by Wellcome and NIH. And I think, I mean, we have a lot of experience in this, and there's been a lot of discussion -- I've been to several meetings -- about how you would, if you had 50 high-quality human references right now, display them so people could access the information and optimize their mapping, so they could find, you know, the right genome and still be able to communicate these ideas. It's not trivial at all. I mean, for sure, if there is value in adding NSF in partnership, we should take advantage of everything we've got. But I think my feeling on this is that we as a community have taken the leadership role in this, and we should continue to push on this because this is, again, not an easy or solved problem.
Male Speaker: Eric.
Eric Green: So, I wonder if it's okay to make a comment across the four sessions so far, because we're, I assume, coming to the end. It's a spectacular range of projects, and I want to share the general enthusiasm about the specific things that have been proposed. I think many of the things in this particular breakout are incredibly important, but I was trying to think about what's missing in what we've been talking about this morning. And I think it's probably there, but it's maybe hidden a little bit when we're spending a lot of attention on structure: What are the nucleotides? What are the variations in them? How do they correlate with disease? And then we mention disease-relevant functions, and we talk a lot about mutating the gene and seeing in an assay what effect it has.
What I'm wondering is missing is the connection with cellular circuitry. To accomplish our goal of interpreting variation in the context of disease, we may be able to interpret variation in the context of the protein that it affects, but to truly interpret it in the context of disease, there's a set of NHGRI-ish kinds of activities of systematically dissecting cellular circuitry: when this enhancer gets affected, when this protein gets affected, what are the consequences? When 108 loci in schizophrenia get identified, or 60 genes in heart disease get identified, how do we recognize what effect that is having on the cell?
And, so, I would not like NHGRI to have to pay for all of it, but I do think that there is a set of infrastructure -- it's a little related to the LINCS project, it's related to what Aviv was talking about yesterday, it's related to going from individual enhancers to whole circuits -- and somewhere NHGRI ought to be the intellectual leader of that. It ought to be paid for maybe by the Common Fund or others, but we're not going to be able to interpret disease without it. We're going to get, based on everything that's been laid out here, a great description of the structural problems -- I agree, down to completeness. We're going to get great correlations. We're going to get protein structure responses. But there is a piece we haven't defined here that we'd better define, and it's a set of databases about circuitry, circuitry responses, cells. And I was unclear whether it was in bounds or out of bounds, but as I think about what we're doing, if we don't make sure that that piece gets done, it's going to be really hard to interpret these subtle mutations, even with all the assays and all the other things we're doing. So, I don't mean to destabilize anything here by arguing -- I think we should go forward with these things -- but somewhere we'd better also launch a process that's doing that.
Male Speaker: But I think to make sense of those, Eric, you really have to start by making sure that you've got the finite aspect of our universe -- which is our genome sequence -- totally understood, because all of those things really only make sense in the context of the variation in which those mutations are found.
Eric Green: Look, I'm agreeing that we want to get that sequence; where I'm disagreeing is that you first must do that. To meet our goal, which is to understand disease, we must do that and then we're going to have to interpret it in terms of physiology, and all of this structural stuff -- very important -- isn't going to get us the physiology that we owe for disease. So it's not at the expense of it; we just, also, in addition -- not separately, not competing -- had better be doing it, that's all.
Male Speaker: It reminds me of something 10 years ago, when we were actually analyzing Venter's genome for the first time and comparing it to the human reference. And Venter's genome was, if you remember -- I'm sure you do -- significantly shorter. And where it was shorter was in all the genes and all the segmental duplications which were not in Venter's genome. So, we could not, as a community, have begun to understand breakpoints of genomic disorders, or copy number variation, without the investment of NHGRI in building a better reference, because what you can't see, you can't assay. So, I think --
Eric Green: You’re defending the need to get complete
structure. I’m totally agreeing; my comment is independent of your comment. Let me totally
endorse all of what you're saying and then add that we still are not going to be able to reach our goal of interpreting disease without additional things. All of it is great, but we have an obligation to go all the way, and I was just trying to figure out what piece feels like it's not been discussed in this meeting, that's all.
Ewan Birney: [unintelligible] Rick Myers, and --
Eric Green: It does, and I was going back through the slides, and, again, a lot of the emphasis was per variant: understanding the effect of this variant. And my concern, Ewan, is exactly that: when you take the variant-centered point of view, you don't, for example, have a circuitry center. You don't see what happens across large numbers of interacting things. For example, take cellular circuitry and break it down into a catalogue of 2,337 processes, so that we have a finite list of those processes and we understand the context in which that variant functions. A lot of what we have is bottom-up construction from individual variants, interpreting them for patients in the clinic. Again, incredibly important; I'm not arguing against it. I'm just saying that the bottom-up inference is going to miss things if we don't also have a sort of top-down completeness of a wiring diagram, and many of the things, even in presentation number two, didn't really get at that higher-order picture. So, again, not that we shouldn't do all those things; I'm just concerned.
Male Speaker: We do have some key players in systems biology
--
Eric Green: I agree, in this room.
Male Speaker: -- so let's start with Manolis Kellis and then Mark.
Male Speaker: So, first of all, I just want to briefly second Eric's point and basically say we had this fifth panel that was never created, and I think systems biology could have been one of them. I think this paradigm of learning what's common across all of the variants that are associated with a disease, and then learning common properties of those variants -- what tissue are they in, what type of enhancer are they in, what motifs are nearby -- and then taking that knowledge and applying it back to individual variants, is something that's emerging a lot in our community. And that has been a paradigm of genomics: the fact that you have the whole genome allows you to learn global properties and then go back to the individual regions, armed with those properties, and interpret them better than you could do in isolation. I think that paradigm is pervasive, and I think systems biology approaches and regulatory genomics approaches could be one of these sort of fifth-panel-type recommendations. Going back to the comparative genomics -- and this is actually related to the comment that I wanted to make -- the heroic effort that we saw for the, you know, blonde hair variant, where that particular nucleotide could only be interpreted through not just sequence-level conservation but understanding exactly which regulatory regions are active in these other genomes, and what the motifs are, and how they've moved and how they've changed: I think that's something that should be routine. It's something that rests upon comparative genomics to provide as a resource to the whole community. Just as ENCODE has provided a set of regulatory annotations at varying degrees of resolution and varying degrees of sophistication, I think comparative genomics should have a mandate to provide such a list, so that the next time we find such a motif we don't have to go through, you know, years of experimentation -- we have the catalogue of exactly how all these motifs have changed.
And the recent realization -- from mouse ENCODE, which is not yet published -- that there's a huge amount of regulatory conservation between human and mouse which is actually not reflected in the nucleotide-level conservation means that it's imperative for us to develop better methods for understanding regulatory evolution and regulatory conservation. There's bound to be much more conservation -- that's what we're starting to realize with mouse ENCODE: much more regulatory conservation than sequence models allow us to infer -- and if we had a better way of detecting that, and that goes back to the proposal with NSF, then I think we would provide a great resource for understanding disease.
Male Speaker: You're straying a bit from Eric's primary point, which I completely agree with, which is that there was an awful lot of language in this meeting that took the view that the effect of a SNP is sort of this unitary thing that you can study outside of the context of the rest of the genome, as well as of different environments. The genotype by environment thing was one way that we're sort of stepping away from that view, but there's also the idea that that genetic variant is embedded in other genetic variants -- there's gene by gene and other sorts of things -- and the way we view this now is to think of human disease as perturbations of these networks of genes. In metabolic disorders, particularly, there's a lot of literature there. I do agree completely.
Male Speaker: Andy, I'm sorry -- and everyone, I'm sorry to stop the discussion -- but we are well over time at this point and we do need to break. Thanks.
[end of transcript]