Lisa Brooks: Yes. Okay, and Mike Pazin, who I'm also working
on this with, is here as well. So yes, we're -- we've talked with some of you, and we're
-- okay, so this is the concept clearance. Okay, so this is actually an extremely simple-minded
justification. As Adam had discussed about the reason why we look at exomes or why we
look at whole genome sequence, obviously, there's a cost issue. But the other reason
is we don't know how to interpret variation in non-coding regions, which is a serious
issue. But that's really, pretty much, the simple-minded justification for this whole
effort.
We have a whole genome. Jeff Schloss's program has been very effective at generating ways
of sequencing the whole genome. And as I'll say a little bit later, we know that there's
lots of stuff that affects phenotype and disease in the non-coding parts of the genome. So
exomes are so 2010. [laughs] So the question is, we know that many genes and variants are
associated with a disease; which ones are actually causal? And, as you know, function
is complicated, causation is complicated. We'll talk a little bit about how we're going
to sort of get away from the word causal.
But the point is, as you well know, and you've seen these things all the time, that you get
a region of the genome, and some of you even know how to interpret this sort of diagram.
But you get a whole bunch of variants associated with each other, with the disease. And there's
clearly something going on there; some genetic variant really is mechanistically
Male Speaker: Stay tethered to the mic --
Lisa Brooks: Sorry.
Pathogenically, causally, mechanistically related to disease. There's something that's
really there that is contributing to the mechanism, how disease happens. But they've got a whole
bunch of buddies along for the ride. And it's varying -- I mean, LD does exist. It's very
non-trivial to figure out, okay, there is a region, there's something real in there.
But out of a whole bunch of variants in that region, which is the variant or variants that's
really, really causing the phenotypic effect?
So we know about the genetic code. Coding regions have the genetic code, so we understand
much as it's -- there can be a lot of oversimplifications here. But it's a good place
to start, that you know what synonymous, non-synonymous, and stop codon variants are. So
in the coding regions, we have some good information that helps us interpret
them. The exome is about 1.5 percent of the genome. If you only focus on
exonic regions, it's like looking under the lamp post for your keys.
We know that non-coding DNA variants affect human diseases. There's a bunch of diseases,
and there are many more. We know they affect response to drugs. You know, the
GWAS catalog is full of these associations, 90 percent or so of which are not in exons.
We know from both GWAS and from scans of the genome for natural selection
that a lot of these adaptation signatures are outside of protein-coding regions. So
there's a lot of the genome that clearly has functional effects that's not in exonic regions.
Lots of interesting things that the sequence does.
So we're getting to the concept, interpreting variation in human non-coding genomic regions,
using computational approaches with experimental validation. Just -- so what we're trying to
do -- we're actually trying to address the really hardest questions here. As you know,
function is complicated. There's sort of easier function and harder function. The easier function
is things like looking for transcription factor binding sites. You know, that's not trivial,
that's an ENCODE type project where you go through the genome and you look for these
elements. The question is, though, which of these elements actually affect organismal
function? Just as many variants probably have zero effect on function. It's -- Adam likes
the metaphor of a perfectly functioning door that goes nowhere. It can work at the molecular
level, but really not have -- make a difference at the organismal level. And so figuring out
which variants actually cause organismal effect is a hard problem. And so we thought it would
be worthwhile to stimulate research in this area, which is to say we still need all those
molecular studies, those are hugely important, but we thought we would focus on the harder
problem.
And so we're -- the other thing, of course, is that computational approaches, there's
a huge dataset base that's needed here. It would be great to get a lot more of these
data. So that's a separate discussion. There certainly are datasets that already exist
that can be used. So we want to -- computational approaches, highly innovative, to identify
or narrow the set of potential variants. Causality is a very hard problem, so we're not -- especially
we're trying to stimulate this area with the validation, the experimental validation; we're
not trying to have groups absolutely prove that this variant causes that disease. But
we want to narrow the set of variants to a set that potentially contains the causal
variant.
Female Speaker: [inaudible] what do you mean by experimental
validation?
Male Speaker: Mic.
Female Speaker: Oh, I'm sorry. So I'm confused, then what
do you mean by --
Lisa Brooks: Oh, experimental --
Female Speaker: -- experimental validation --
Lisa Brooks: Yes.
Female Speaker: -- if it's not to show that the variant causes
the phenotype of interest?
Lisa Brooks: Well, there's show -- so, first off, we're
talking about computational approaches, so computational predictions. But then we want
to have some ground truth with experimental validation. There's a whole range of validation
from low-throughput, gold-plated validation that really does show that that variant causes
that disease. But that's very expensive. To the extent there are ways of doing experimental
validation that maybe don't show it completely, but give you an indication that narrows the
set of variants, that seems to be okay, or we're proposing that that would be okay. You
want to --
Male Speaker: Just be careful with the language you're using,
the experimental validation that shows a variant causes the disease -- the association is just
a probability that the variant is associated with the disease.
Lisa Brooks: Absolutely.
Male Speaker: I can't believe that you'd have an experimental
validation that would prove it was causative. You might prove that it'd change the expression
of that gene or something else. But the idea that you could make a direct correlation to
disease is not that simple.
Lisa Brooks: Oh, absolutely, and so that's -- that exactly
gets to Jill's question, that at the highest levels, these are associations. Then when
one is doing something else experimentally, there certainly are cases where the pathway
has been worked on. So you start with associations, you narrow down a set of variants, you then
do experimental work, clinical work, and you think, or you more or less prove, that that
variant causes the phenotypic effect. It's not based on just a whole bunch of associations.
So absolutely, we're trying to move beyond GWAS associations. We're trying to get through
that middle ground, where it's more than just associations, and yet it's not one variant,
one huge research project. You know, because we want to eventually be able to use these
methods to at least narrow down the set of variants that then can have experimental studies,
can be studied in much more detail.
Does that help?
Male Speaker: I think the phrase "experimental validation"
was also left intentionally open, in hopes, in part, of stimulating good ideas in that
area. And as written, it could go as far as what you described as the gold-plated experiment
where --
Male Speaker: Do you mean gold standard experiment or gold
plated?
Male Speaker: Well, if it's a mouse, it's gold plated as
well.
[laughter]
No, I'm just -- the -- that's the problem with the true recreation, where you generate
-- you know, this was recently published, for example, for the coding region variant
in the EDAR gene, one of the regions that shows a strong signature of selection in Asian populations.
This specific human amino acid variant that shows one of the strong associations was recreated
in the mouse, an expensive experiment, because it was a knock-in that made the switch to the
very same residue in the endogenous gene, and they verified a whole series of animal phenotypes
that were generated by that single amino acid change.
Now that's a very expensive whole animal experiment. You're not going to be able to do very many
of them, if you try to do that for all of the interesting things that have come from
GWAS, and there may be a variety of situations where you could have cell culture models of
phenotypes that are in vitro surrogates of things that you would like to be able to score
in a whole animal.
So there's a range of possibilities that I think vary in how expensive they are per variant,
how whole-animal-like they are, and how much of a surrogate they are. And all of those could be put together
or proposed by investigators in what will then be looked at to see what's the most compelling
combination of prediction and some sort of experimental test of whether the predictions
are finding things that have functional effects.
Male Speaker: Lisa. Sorry.
Male Speaker: Yeah, I was going to ask, this is a really
important point because I think the -- this could be the poster child for what Jim said
and David reiterated about the loop of trying to connect back to the biology of the disease
through -- I mean, back to domains one and two. But it depends on what percentage of
the effort goes into that. Some of it might be very high-throughput, but there might be
good reason to do a significant number of those -- and significant is a question related
to the budget -- of those gold plated, gold standard, where it's warranted. So did -- have
you -- is there a -- in the concept clearance, I didn't see a kind of ratio of effort on
the computational versus the validation.
Lisa Brooks: Yes, well, we initially had suggested one,
but the small group of council members we'd initially consulted about this said, "Don't
put in a specific limit; it really depends on the expense of the method." And so that
-- so there's no limit there. I mean, what you're talking about, in a sense, is validating
the validation method, that you -- if you have some gold standard methods that can validate
that your bronze standard methods actually work well, then that becomes -- you have sort
of a few very expensive assays that will validate a larger number of less expensive --
Male Speaker: They may not be so expensive. You know, it
was mentioned by Eric that CRISPR-Cas9 technology may be able, very rapidly, to create mice
that have, you know, both alleles replaced. And, you know, Rudolf Jaenisch
had a beautiful paper just demonstrating that.
Lisa Brooks: [affirmative], yes.
Male Speaker: Exactly, and if that sort of proposal came
in as an innovative way to be able to test function, I think that would do great in this
sort of RFA.
Lisa Brooks: Yeah.
Male Speaker: Yeah, I mean, I think what you want to
get across, right, is that you want the community to kind of hit the sweet spot. You can't
set the bar too high, but talking about experimental support or orthogonal types of support or
validation seems to me to be what will provide the most coherent message. The other thing
I would just throw in there is I'm very much in support of looking at the non-coding regions
if for no other reason than the fact that so many important things seem to land in there.
But I think it's very important to remember that we still are clueless as far as interpreting
those variants in the coding regions, too.
Lisa Brooks: Yeah, the causality issue is completely true
for coding as well as non-coding.
Male Speaker: Yeah, we could have the genetic code, and
that's a help for a minority of changes --
Lisa Brooks: It's a help. Right.
Male Speaker: -- but it by no means gets us out of the woods,
Lisa Brooks: Absolutely.
Male Speaker: -- I just -- I don't want, you know, people
thinking that NHGRI thinks that we've solved that problem, and now we're on to the next
problem.
Lisa Brooks: Absolutely, and, actually, we say somewhere
in here -- okay, focus on non-coding variants for the reasons we've discussed, but I mean
if some method gets you a region and there's some coding variants in there, that's fine.
Male Speaker: And many of the techniques will --
Lisa Brooks: Be agnostic.
Female Speaker: -- work regardless of whether they're in coding
or not --
Lisa Brooks: That's right, and that's completely fine.
Male Speaker: You don't want people to forget that as they
write these and think about these.
Lisa Brooks: That's right. So what we're not looking for
are sort of improvements to ways of inferring that, because of a non-synonymous change --
Male Speaker: Expression and functional, right, right --
Lisa Brooks: You know, a non-synonymous change, you know,
affects protein structure this way, and therefore, it's more likely to be causal or something
like that. So that's a real focus on coding variants. But as you said, you know, there
are certainly methods that are agnostic to coding this or not, and those are completely
acceptable. Do you want to --
Male Speaker: Yeah, to follow up, on what Jim's saying,
one thing we would like is if people are going to follow up protein coding variants, that
they should do it in an agnostic way. Some of the more interesting examples of non-coding
variants were found because they were initially coding variants, and upon further study, it
turned out they were tag SNPs for a nearby non-coding variant.
Lisa Brooks: Okay, so: which variants potentially affect
organismal function? Sometimes this will show how the effect is brought about, or the genetic
architecture, if you have things like gene-gene or gene-environment interactions. So we expect
applications will include the computational approaches, as well as the experimental validation
of these approaches. We're not looking for large-scale production of functional data,
aside from the validation data. And we're not looking for things simply like databases
or just aggregation of information on variants.
There's a lot of datasets that are available to use. So the initiative focus is on genome-wide
interpretation, rather than somebody saying I have a very interesting region and I really
want to study the variants in that region. What we're looking for
is approaches that can be applied to a lot of datasets, so that you start with the entire
genome. Take GWAS: it starts with the entire genome and, based on association, comes
down to particular regions, but it's not saying I just want to, a priori, look at a particular
region. It doesn't have to be GWAS; something like genome scans can also start with a whole
genome and find regions.
Male Speaker: I was going to suggest, I wonder if you could
add to this concept clearance, the idea of having a coordinating center whose job it'll
be to run a contest, where you would provide variants to groups that say they've developed
a method, and then have them all analyze those variants, and see how they do. Sort of similar
to what Brenner does with CAGI.
Lisa Brooks: Yeah, I was just thinking of CAGI.
Male Speaker: Yeah, exactly.
Lisa Brooks: That's interesting. Let me get to one more --
Female Speaker: But then don't you need a gold standard to
judge them?
Male Speaker: You would have to -- you would ask the coordinating
center to try and develop such a gold standard, but it would have to be something that's not
in the public domain so that they couldn't cheat.
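The blinded-contest idea above -- a coordinating center scoring each group's predicted variants against a held-out gold standard -- could in outline be as simple as computing precision and recall per submission. This is a minimal sketch; the function and variant identifiers are hypothetical, not part of any proposed protocol:

```python
def score_submission(predicted, gold_standard):
    """Score a group's predicted causal-variant set against a held-out
    gold standard, returning (precision, recall)."""
    predicted, gold_standard = set(predicted), set(gold_standard)
    true_positives = predicted & gold_standard
    # Precision: what fraction of the predicted variants were correct?
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    # Recall: what fraction of the gold-standard variants were found?
    recall = len(true_positives) / len(gold_standard) if gold_standard else 0.0
    return precision, recall
```

For example, a submission of {"rs1", "rs2"} scored against a hidden gold set {"rs1", "rs3"} would yield precision 0.5 and recall 0.5. Keeping the gold standard out of the public domain, as suggested, is what makes these numbers meaningful.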
Lisa Brooks: Exactly. That's an interesting idea. It's
quite related to this. I'll also point out, the focus -- even though we want them to focus
-- start with a whole genome and go down, different classes of variants may have different
properties, so that CNVs, say, or transcription
signals of which variants are actually contributing to the organismal phenotype may differ according
to the class of variant. So we're not trying to say you have to -- again, this is very
hard, and it's kind of early days, so we're not saying, here's a genome, give me all functional
variants.
Male Speaker: So, Lisa, would you say that kind of the driving
idea behind this is to sort of flesh out -- you know the best computational methods that are
out there. Somebody who might be saying, well, I've got this theory that knowing something
about the network structure would really help me predict which enhancers would be really
important. And so I'm going to make some predictions, and then I think I can test that using this
cellular phenotype, and I'll read it out and see. And so I viewed it as that way, right,
sort of saying, okay -- and then once -- and you would want to fund sort of a portfolio
of maybe a couple of network approaches, maybe somebody who says, what I really think is
important is to take all the ENCODE data and put it through some prediction algorithm that's
actually totally agnostic, it uses machine learning or something to make predictions.
And then I'll run that through and see how well that does.
Female Speaker: That's the training segment.
Male Speaker: Well, I mean, no. So you could imagine that
a series of different approaches will be put forth, and then by, you know, having all these
folks liaise with one another, you'd get some best practices, and maybe they'd be even sharing
some of their gold-standard data, for example. I don't know, but it seems to me that was
sort of the direction, or did I get that wrong?
Lisa Brooks: Yeah, no, that's a very good description.
And, of course, the reviewers would like to see some evidence that a method being proposed
can actually work. And I'll get to that issue towards the end. We also figure that these
people -- these groups will be meeting, like, once a year, exactly to exchange ideas, and
possibly validation data sets and approaches.
Okay, and we want the methods to generalize beyond the specific datasets and diseases
studied. So, basically, the idea is that you start with a whole genome, and go through
a series of approaches. And this is just a -- kind of a very straightforward, simple-minded
example where you have the whole genome, you do GWAS, you come down to regions. Then you
look at say, transcription and cell types related to the disease, and it gets you down
to certain ones, and then you use ENCODE, and regulation, and pathway and other datasets
to get to a smaller set of variants.
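The successive narrowing just described -- whole genome, down to GWAS regions, down to variants active in disease-relevant cell types, down to variants overlapping regulatory annotations -- can be sketched as a filtering pipeline. Everything here is illustrative and hypothetical: the data structures are invented, and a real analysis would draw on tools and resources like PLINK and the ENCODE annotations rather than hand-built dictionaries:

```python
def narrow_candidates(variants, gwas_regions, disease_cell_types, regulatory_annotations):
    """Successively filter a genome-wide variant set down to candidates.

    Each variant is a dict with "chrom", "pos", a set of "active_cell_types",
    and a set of regulatory "annotations" (all hypothetical fields).
    """
    # Step 1: keep variants that fall inside GWAS-associated regions.
    in_regions = [v for v in variants if any(
        r["chrom"] == v["chrom"] and r["start"] <= v["pos"] <= r["end"]
        for r in gwas_regions)]
    # Step 2: keep variants active (e.g., transcribed or in open chromatin)
    # in cell types relevant to the disease.
    in_cell_types = [v for v in in_regions
                     if v["active_cell_types"] & disease_cell_types]
    # Step 3: keep variants overlapping regulatory annotations
    # (ENCODE elements, pathway genes, and so on).
    return [v for v in in_cell_types
            if v["annotations"] & regulatory_annotations]
```

The point of the sketch is the shape of the computation, not any particular filter: each stage shrinks the candidate set, and the stages themselves are where applicants are expected to innovate.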
Other examples are things like, you know, instead of starting with GWAS, you can start
with a genome scan of natural selection, or with chromatin -- you know, there's an example
where you have an indel that affects the open chromatin structure there.
And we already know examples where those indels actually change
the chromatin structure, and therefore affect things like the persistence of fetal hemoglobin
in thalassemia. So there are examples like that. A very simple-minded thing
is promoter binding: knowing which variants actually affect the promoter can help you
interpret the variant. And epigenomic variability -- the variability itself gives you a clue
as to importance.
So there's a set of types of things, and we certainly hope the applicants will be quite
imaginative and come up with good methods for the computational approach. For the validation
methods, you know, there's a range of types of validation as we discussed. They can use
model organism data. The concept clearance says that we encourage innovation in methods
for validation. I mean what Joe Acker [spelled phonetically] was saying about CRISPR methods
or those sorts of -- zinc finger, you know, very specific things -- may be very nice validation,
and maybe not too expensive; that would be terrific.
Okay, there are some other initiatives. NIGMS had an RFA that was related to figuring out
everything you can about function of variants, both experimental and computational, and they
included things like databases. So they made about eight awards or so, only one or two
of which are related to this at all, so they haven't solved the problem. Other institutes,
including ours, are developing some datasets -- experimental datasets
from, you know, functional methods. So those will be
good datasets to use.
So the timeline we're talking about, we're talking about two rounds here, and we actually
think this is quite important, sort of partly getting at Carlos's point. So receipt dates
in January 2014 and in January 2015. So with two receipt dates, anybody who's kind of
ready to go can put in an application. But because this is difficult, because there's
a lot of moving parts here, that they have to have the computational approaches, they
have to have experimental approaches, it'd be really nice if they had some preliminary
data showing that their computational approaches actually work a bit. By having a second receipt
date a year later, we give groups some confidence to actually put in the work to pull
together the experimental and the computational side. And so we think having two rounds kind
of defined ahead of time will actually help stimulate the field, it will actually put
together collaborations, those groups will have a chance to get some preliminary data,
and put in good applications.
Because of the experimental side especially, we think these are reasonably large grants,
that we really want to -- especially, again, this is a very difficult topic, we really
want to have the validation in there so it's not just association based. So we figure 500K
direct cost per year, make about five to six in each round.
Okay, so, actually, we're hoping we'll be able to start interpreting the non-coding
part of the genome.
So any other comments? Mike, did you want to say anything more? Any other comments on
this?
Male Speaker: Yeah, the -- first of all, thank you for this
very thorough and clear presentation, and I'm very excited about this concept clearance.
It's clearly a high-priority area; it's clearly one that really generates a lot of excitement.
And you can see -- I mean, the council couldn't let you finish your presentation, they had
to keep jumping in. And Carlos even started designing the proposal [laughter] right then,
I mean, it's that good. It is so important, it is really exciting, and I'm glad that we're
-- well, I really encourage moving this forward to an RFA, you know, that way that you've
got it set up.
Some of the -- this issue about -- so the experimental tests are vitally important,
and they're going to be part of this -- I hope you can come up with a shorter title that
will still convey that. But using the term "validation" is tricky, and it does have specific
meanings to a lot of people. To me it means that you're going to take a conclusion that
you've inferred from one set of data or one technology, and you're going to test it with
an orthogonal technology.
That's really validation, and given -- and this is really back to Anthony's point -- given
that the initial idea is that these -- something in a region is associated with and potentially
causative of a disease, then validation has a very high mark there. But there's a -- there
are other ways to define -- because I can get a lot of mileage out of just experimental
tests. I mean, is this a broader term --
Male Speaker: Or support.
Male Speaker: -- and the idea is to --
Lisa Brooks: Yeah, okay, that's --
Female Speaker: Sounds good.
Male Speaker: The idea is that it's not all computational -- it's
great that it's computational because it has to be genome-wide -- and I might also emphasize
something that you did point out.
Well, for -- we can see the epigenetic signals giving us strong information, but there are
lots of them. There's actually many, many variables to bring in. Now that's another
thing that I'm very excited about because nobody knows how best to do this. We all kind
of have all a little intuition about how to go about it, and many people are already active
in it, but we don't really know what the best way is. It has to be computational, it has
to go genome-wide, but computation in the absence of experimental feedback is of limited
use. So I think this could really work well.
I also like the fact that you -- it's set up, I think it's set up to not overengineer,
not overdesign the RFA, and to do -- I think, was it Carlos, you said, let the best of the
community -- the community's best ideas really come to bear. So I think it's great. I also
really like the two rounds, because not everybody's ready for primetime on this,
but there's so much excitement, give people a chance, give many people a chance to try.
We just wish there was more money to put into it.
Lisa Brooks: Thank you.
Male Speaker: On that, I agree, and I'd even be
content with plausibility, you know, biologic plausibility --
Male Speaker: I think you set the bar too low.
Male Speaker: Well, no, but I mean, that's how far off we
are. So just some form of words that make it clear that we don't need to -- we want
to be towards an understanding.
Lisa Brooks: Right, the actual RFA, of course, can have
a discussion of this, so take your point that validation may be too strong, plausibility
may be too weak, support may be in the middle. But there will be discussion of it, so hopefully
people will kind of understand what we're going for.
Eric Green: Any other comments, discussion? So if there
are no other comments, we take a vote on concept clearance matters.
Female Speaker: Can we just have this friendly amendment about
support versus validation?
Lisa Brooks: Oh, yeah, sure.
Male Speaker: Would you like to -- would one of you like
to state the amendment so that we have clarity?
Female Speaker: Oh, just to change the title from experimental
validation to experimental support.
Lisa Brooks: Yes.
Male Speaker: We can go with that. All right.
Eric Green: So can I --
Male Speaker: Motion to accept.
Eric Green: Thank you. And a second. All in favor? Any
opposed? Thank you.
Lisa Brooks: Okay, thank you. Good discussion.