Lisa Brooks: Yes. Okay, and Mike Pazin, who I'm also working
on this with, is here as well. So yes, we're -- we've talked with some of you, and we're
-- okay, so this is the concept clearance. Okay, so this is actually an extremely simple-minded
justification. As Adam had discussed about the reason why we look at exomes or why we
look at whole genome sequence, obviously, there's a cost issue. But the other reason
is we don't know how to interpret variation in non-coding regions, which is a serious
issue. But that's really, pretty much, the simple-minded justification for this whole
effort.
We have a whole genome. Jeff Schloss's program has been very effective at generating ways
of sequencing the whole genome. And as I'll say a little bit later, we know that there's
lots of stuff that affects phenotype and disease in the non-coding parts of the genome. So
exomes are so 2010. [laughs] So the question is, we know that many genes and variants are
associated with a disease; which ones are actually causal? And, as you know, function
is complicated, causation is complicated. We'll talk a little bit about how we're going
to sort of get away from the word causal.
But the point is, as you well know, and you've seen these things all the time, that you get
a region of the genome, and some of you even know how to interpret this sort of diagram.
But you get a whole bunch of variants associated with each other, with the disease. And there's
clearly something going on there; some genetic variant really is mechanistically
Male Speaker: Stay tethered to the mic --
Lisa Brooks: Sorry.
Pathogenically, causally, mechanistically related to disease. There's something that's
really there that is contributing to the mechanism, how disease happens. But they've got a whole
bunch of buddies along for the ride. And it's varying -- I mean, LD does exist. It's very
non-trivial to figure out, okay, there is a region, there's something real in there.
But out of a whole bunch of variants in that region, which is the variant or variants that's
really, really causing the phenotypic effect?
So we know about the genetic code. Coding regions have the genetic code, so we understand
much as it's -- there can be a lot of oversimplifications here. But it's a good place
to start, that you know what synonymous, non-synonymous, and stop codon variants are. So
in the coding regions, we have some good information that helps us interpret
them. The exome is about 1.5 percent of the genome. If you only focus on
exonic regions, it's like looking under the lamp post for your keys.
We know that non-coding DNA variants affect human diseases. There's a bunch of diseases,
and there are many more. We know they affect response to drugs. You know, the
GWAS catalog is full of these associations, 90 percent or so of which are not in exons.
We know from both GWAS and from scans of the genome for natural selection
that a lot of these adaptation signatures are outside of protein-coding regions. So
there's a lot of the genome that clearly has functional effects that's not in exonic regions.
Lots of interesting things that the sequence does.
So we're getting to the concept, interpreting variation in human non-coding genomic regions,
using computational approaches with experimental validation. Just -- so what we're trying to
do -- we're actually trying to address the really hardest questions here. As you know,
function is complicated. There's sort of easier function and harder function. The easier function
is things like looking for transcription factor binding sites. You know, that's not trivial,
that's an ENCODE type project where you go through the genome and you look for these
elements. The question is, though, which of these elements actually affect organismal
function? Just as many variants probably have zero effect on function. It's -- Adam likes
the metaphor of a perfectly functioning door that goes nowhere. It can work at the molecular
level, but really not have -- make a difference at the organismal level. And so figuring out
which variants actually cause organismal effect is a hard problem. And so we thought it would
be worthwhile to stimulate research in this area, which is to say we still need all those
molecular studies, those are hugely important, but we thought we would focus on the harder
problem.
And so we're -- the other thing, of course, is that computational approaches, there's
a huge dataset base that's needed here. It would be great to get a lot more of these
data. So that's a separate discussion. There certainly are datasets that already exist
that can be used. So we want to -- computational approaches, highly innovative, to identify
or narrow the set of potential variants. Causality is a very hard problem, so we're not -- especially
we're trying to stimulate this area with the validation, the experimental validation; we're
not trying to have groups absolutely prove that this variant causes that disease. But
we want to narrow the set of variants to a set that potentially contains the causal
variant.
Female Speaker: [inaudible] what do you mean by experimental
validation?
Male Speaker: Mic.
Female Speaker: Oh, I'm sorry. So I'm confused, then what
do you mean by --
Lisa Brooks: Oh, experimental --
Female Speaker: -- experimental validation --
Lisa Brooks: Yes.
Female Speaker: -- if it's not to show that the variant causes
the phenotype of interest?
Lisa Brooks: Well, there's show -- so, first off, we're
talking about computational approaches, so computational predictions. But then we want
to have some ground truth with experimental validation. There's a whole range of validation
from low-throughput, gold-plated validation that really does show that that variant causes
that disease. But that's very expensive. To the extent there are ways of doing experimental
validation that maybe don't show it completely, but give you an indication that narrows the
set of variants, that seems to be okay, or we're proposing that that would be okay. You
want to --
Male Speaker: Just be careful with the language you're using,
the experimental validation that shows a variant causes the disease -- the association is just
a probability that the variant is associated with the disease.
Lisa Brooks: Absolutely.
Male Speaker: I can't believe that you'd have an experimental
validation that would prove it was causative. You might prove that it'd change the expression
of that gene or something else. But the idea that you could make a direct correlation to
disease is not that simple.
Lisa Brooks: Oh, absolutely, and so that's -- that exactly
gets to Jill's question, that at the highest levels, these are associations. Then when
one is doing something else experimentally, there certainly are cases where the pathway
has been worked on. So you start with associations, you narrow down a set of variants, you then
do experimental work, clinical work, and you think, or you more or less prove, that that
variant causes the phenotypic effect. It's not based on just a whole bunch of associations.
So absolutely, we're trying to move beyond GWAS associations. We're trying to get through
that middle ground, where it's more than just associations, and yet it's not one variant,
one huge research project. You know, because we want to eventually be able to use these
methods to at least narrow down the set of variants that then can have experimental studies,
can be studied in much more detail.
Does that help?
Male Speaker: I think the phrase "experimental validation"
was also left intentionally open, in hopes, in part, of stimulating good ideas in that
area. And as written, it could go as far as what you described as the gold-plated experiment
where --
Male Speaker: Do you mean gold standard experiment or gold
plated?
Male Speaker: Well, if it's a mouse, it's gold plated as
well.
[laughter]
No, I'm just -- the -- that's the problem with the true recreation, where you generate
-- you know, this was recently published, for example, for the coding region variant
in the EDAR gene, one of the regions that shows a strong signature of selection in Asian populations.
This specific human amino acid variant that shows one of the strong associations was recreated
in the mouse, an expensive experiment, because it was a knock-in that made the switch to the
very same residue in the endogenous gene, and they verified a whole series of animal phenotypes
that were generated by that single amino acid change.
Now that's a very expensive whole animal experiment. You're not going to be able to do very many
of them, if you try to do that for all of the interesting things that have come from
GWAS, and there may be a variety of situations where you could have cell culture models of
phenotypes that are in vitro surrogates of things that you would like to be able to score
in a whole animal.
So there's a range of possibilities that I think vary in how expensive they are per variant,
how whole-animal-like they are, and how much of a surrogate they are. And all of those could be put together
or proposed by investigators in what will then be looked at to see what's the most compelling
combination of prediction and some sort of experimental test of whether the predictions
are finding things that have functional effects.
Male Speaker: Lisa. Sorry.
Male Speaker: Yeah, I was going to ask, this is a really
important point because I think the -- this could be the poster child for what Jim said
and David reiterated about the loop of trying to connect back to the biology of the disease
through -- I mean, back to domains one and two. But it depends on what percentage of
the effort goes into that. Some of it might be very high-throughput, but there might be
good reason to do a significant number of those -- and significant is a question related
to the budget -- of those gold plated, gold standard, where it's warranted. So did -- have
you -- is there a -- in the concept clearance, I didn't see a kind of ratio of effort on
the computational versus the validation.
Lisa Brooks: Yes, well, we initially had suggested one,
but the small group of council members we'd initially consulted about this said, "Don't
put in a specific limit; it really depends on the expense of the method." And so that
-- so there's no limit there. I mean, what you're talking about, in a sense, is validating
the validation method, that you -- if you have some gold standard methods that can validate
that your bronze standard methods actually work well, then that becomes -- you have sort
of a few very expensive assays that will validate a larger number of less expensive --
Male Speaker: They may not be so expensive. You know, it
was mentioned by Eric that CRISPR-Cas9 technology may be able, very rapidly, to create mice
that have, you know, both alleles replaced. And, you know, Rudolf Jaenisch
had a beautiful paper just demonstrating that.
Lisa Brooks: [affirmative], yes.
Male Speaker: Exactly, and if that sort of proposal came
in as an innovative way to be able to test function, I think that would do great in this
sort of RFA.
Lisa Brooks: Yeah.
Male Speaker: Yeah, I mean, I think what you want to
get across, right, is that you want the community to kind of hit the sweet spot. You can't
set the bar too high, but talking about experimental support or orthogonal types of support or
validation seems to me to be what will provide the most coherent message. The other thing
I would just throw in there is I'm very much in support of looking at the non-coding regions
if for no other reason than the fact that so many important things seem to land in there.
But I think it's very important to remember that we still are clueless as far as interpreting
those variants in the coding regions, too.
Lisa Brooks: Yeah, the causality issue is completely true
for coding as well as non-coding.
Male Speaker: Yeah, we could have the genetic code, and
that's a help for a minority of changes --
Lisa Brooks: It's a help. Right.
Male Speaker: -- but it by no means gets us out of the woods,
Lisa Brooks: Absolutely.
Male Speaker: -- I just -- I don't want, you know, people
thinking that NHGRI thinks that we've solved that problem, and now we're on to the next
problem.
Lisa Brooks: Absolutely, and, actually, we say somewhere
in here -- okay, focus on non-coding variants for the reasons we've discussed, but I mean
if some method gets you a region and there's some coding variants in there, that's fine.
Male Speaker: And many of the techniques will --
Lisa Brooks: Be agnostic.
Female Speaker: -- work regardless of whether they're in coding
or not --
Lisa Brooks: That's right, and that's completely fine.
Male Speaker: You don't want people to forget that as they
write these and think about these.
Lisa Brooks: That's right. So what we're not looking for
are sort of improvements to ways of inferring that, because of a non-synonymous change --
Male Speaker: Expression and functional, right, right --
Lisa Brooks: You know, a non-synonymous change, you know,
affects protein structure this way, and therefore, it's more likely to be causal or something
like that. So that's a real focus on coding variants. But as you said, you know, there
are certainly methods that are agnostic to coding this or not, and those are completely
acceptable. Do you want to --
Male Speaker: Yeah, to follow up, on what Jim's saying,
one thing we would like is if people are going to follow up protein coding variants, that
they should do it in an agnostic way. Some of the more interesting examples of non-coding
variants were found because they were initially coding variants, and upon further study, it
turned out they were tag SNPs for a nearby non-coding variant.
Lisa Brooks: Okay, so: which variants potentially affect
organismal function? Sometimes this will show how the effect is brought about, or the genetic
architecture, if you have things like gene-gene or gene-environment interactions. So we expect
applications will include the computational approaches, as well as the experimental validation
of these approaches. We're not looking for large-scale production of functional data,
aside from the validation data. And we're not looking for things simply like databases
or just aggregation of information on variants.
There's a lot of datasets that are available to use. So the initiative focus is on genome-wide
interpretation, rather than somebody saying I have a very interesting region and I really
want to study the variants in that region. What we're looking for
is approaches that can be applied to a lot of datasets, so that you start with the entire
genome. Take GWAS: it starts with the entire genome and, based on association, comes
down to particular regions, but it's not saying I just want to, a priori, look at a particular
region. It doesn't have to be GWAS; something like genome scans can also start with a whole
genome and find regions.
Male Speaker: I was going to suggest, I wonder if you could
add to this concept clearance, the idea of having a coordinating center whose job it'll
be to run a contest, where you would provide variants to groups that say they've developed
a method, and then have them all analyze those variants, and see how they do. Sort of similar
to what Brenner does with CAGI.
Lisa Brooks: Yeah, I was just thinking of CAGI.
Male Speaker: Yeah, exactly.
Lisa Brooks: That's interesting. Let me get to one more --
Female Speaker: But then don't you need a gold standard to
judge them?
Male Speaker: You would have to -- you would ask the coordinating
center to try and develop such a gold standard, but it would have to be something that's not
in the public domain so that they couldn't cheat.
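The blinded-contest idea above -- a coordinating center scoring each group's predicted variants against a held-out gold standard -- could in outline be as simple as computing precision and recall per submission. This is a minimal sketch; the function and variant identifiers are hypothetical, not part of any proposed protocol:

```python
def score_submission(predicted, gold_standard):
    """Score a group's predicted causal-variant set against a held-out
    gold standard, returning (precision, recall)."""
    predicted, gold_standard = set(predicted), set(gold_standard)
    true_positives = predicted & gold_standard
    # Precision: what fraction of the predicted variants were correct?
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    # Recall: what fraction of the gold-standard variants were found?
    recall = len(true_positives) / len(gold_standard) if gold_standard else 0.0
    return precision, recall
```

For example, a submission of {"rs1", "rs2"} scored against a hidden gold set {"rs1", "rs3"} would yield precision 0.5 and recall 0.5. Keeping the gold standard out of the public domain, as suggested, is what makes these numbers meaningful.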
Lisa Brooks: Exactly. That's an interesting idea. It's
quite related to this. I'll also point out, the focus -- even though we want them to focus
-- start with a whole genome and go down, different classes of variants may have different
properties, so that CNVs, say, or transcription
signals of which variants are actually contributing to the organismal phenotype may differ according
to the class of variant. So we're not trying to say you have to -- again, this is very
hard, and it's kind of early days, so we're not saying, here's a genome, give me all functional
variants.
Male Speaker: So, Lisa, would you say that kind of the driving
idea behind this is to sort of flesh out -- you know the best computational methods that are
out there. Somebody who might be saying, well, I've got this theory that knowing something
about the network structure would really help me predict which enhancers would be really
important. And so I'm going to make some predictions, and then I think I can test that using this
cellular phenotype, and I'll read it out and see. And so I viewed it as that way, right,
sort of saying, okay -- and then once -- and you would want to fund sort of a portfolio
of maybe a couple of network approaches, maybe somebody who says, what I really think is
important is to take all the ENCODE data and put it through some prediction algorithm that's
actually totally agnostic, it uses machine learning or something to make predictions.
And then I'll run that through and see how well that does.
Female Speaker: That's the training segment.
Male Speaker: Well, I mean, no. So you could imagine that
a series of different approaches will be put forth, and then by, you know, having all these
folks liaise with one another, you'd get some best practices, and maybe they'd be even sharing
some of their gold-standard data, for example. I don't know, but it seems to me that was
sort of the direction, or did I get that wrong?
Lisa Brooks: Yeah, no, that's a very good description.
And, of course, the reviewers would like to see some evidence that a method being proposed
can actually work. And I'll get to that issue towards the end. We also figure that these
people -- these groups will be meeting, like, once a year, exactly to exchange ideas, and
possibly validation data sets and approaches.
Okay, and we want the methods to generalize beyond the specific datasets and diseases
studied. So, basically, the idea is that you start with a whole genome, and go through
a series of approaches. And this is just a -- kind of a very straightforward, simple-minded
example where you have the whole genome, you do GWAS, you come down to regions. Then you
look at say, transcription and cell types related to the disease, and it gets you down
to certain ones, and then you use ENCODE, and regulation, and pathway and other datasets
to get to a smaller set of variants.
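The successive narrowing just described -- whole genome, down to GWAS regions, down to variants active in disease-relevant cell types, down to variants overlapping regulatory annotations -- can be sketched as a filtering pipeline. Everything here is illustrative and hypothetical: the data structures are invented, and a real analysis would draw on tools and resources like PLINK and the ENCODE annotations rather than hand-built dictionaries:

```python
def narrow_candidates(variants, gwas_regions, disease_cell_types, regulatory_annotations):
    """Successively filter a genome-wide variant set down to candidates.

    Each variant is a dict with "chrom", "pos", a set of "active_cell_types",
    and a set of regulatory "annotations" (all hypothetical fields).
    """
    # Step 1: keep variants that fall inside GWAS-associated regions.
    in_regions = [v for v in variants if any(
        r["chrom"] == v["chrom"] and r["start"] <= v["pos"] <= r["end"]
        for r in gwas_regions)]
    # Step 2: keep variants active (e.g., transcribed or in open chromatin)
    # in cell types relevant to the disease.
    in_cell_types = [v for v in in_regions
                     if v["active_cell_types"] & disease_cell_types]
    # Step 3: keep variants overlapping regulatory annotations
    # (ENCODE elements, pathway genes, and so on).
    return [v for v in in_cell_types
            if v["annotations"] & regulatory_annotations]
```

The point of the sketch is the shape of the computation, not any particular filter: each stage shrinks the candidate set, and the stages themselves are where applicants are expected to innovate.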
Other examples are things like, you know, instead of starting with GWAS, you can start
with a genome scan of natural selection, or with chromatin -- you know, there's an example
where you have an indel that affects the open chromatin structure there.
And we already know examples where those indels actually change
the chromatin structure, and therefore affect things like the persistence of fetal hemoglobin
in thalassemia. So there are examples like that. A very simple-minded thing
is promoter binding: knowing which variants actually affect the promoter can help you
interpret the variant. And epigenomic variability -- the variability itself gives you a clue
as to importance.
So there's a set of types of things, and we certainly hope the applicants will be quite
imaginative and come up with good methods for the computational approach. For the validation
methods, you know, there's a range of types of validation as we discussed. They can use
model organism data. The concept clearance says that we encourage innovation in methods
for validation. I mean what Joe Acker [spelled phonetically] was saying about CRISPR methods
or those sorts of -- zinc finger, you know, very specific things -- may be very nice validation,
and maybe not too expensive; that would be terrific.
Okay, there are some other initiatives. NIGMS had an RFA that was related to figuring out
everything you can about function of variants, both experimental and computational, and they
included things like databases. So they made about eight awards or so, only one or two
of which are related to this at all, so they haven't solved the problem. Other institutes,
including ours, are developing some datasets -- experimental datasets
from, you know, functional methods. So those will be
good datasets to use.
So the timeline we're talking about, we're talking about two rounds here, and we actually
think this is quite important, sort of partly getting at Carlos's point. So receipt dates
in January 2014 and in January 2015. So with two receipt dates, anybody who's kind of
ready to go can put in an application. But because this is difficult, because there's
a lot of moving parts here, that they have to have the computational approaches, they
have to have experimental approaches, it'd be really nice if they had some preliminary
data showing that their computational approaches actually work a bit. By having a second receipt
date a year later, we give groups some confidence to actually put in the work to pull
together the experimental and the computational side. And so we think having two rounds kind
of defined ahead of time will actually help stimulate the field, it will actually put
together collaborations, those groups will have a chance to get some preliminary data,
and put in good applications.
Because of the experimental side especially, we think these are reasonably large grants,
that we really want to -- especially, again, this is a very difficult topic, we really
want to have the validation in there so it's not just association based. So we figure 500K
direct cost per year, make about five to six in each round.
Okay, so, actually, we're hoping we'll be able to start interpreting the non-coding
part of the genome.
So any other comments? Mike, did you want to say anything more? Any other comments on
this?
Male Speaker: Yeah, the -- first of all, thank you for this
very thorough and clear presentation, and I'm very excited about this concept clearance.
It's clearly a high-priority area; it's clearly one that really generates a lot of excitement.
And you can see -- I mean, the council couldn't let you finish your presentation, they had
to keep jumping in. And Carlos even started designing the proposal [laughter] right then,
I mean, it's that good. It is so important, it is really exciting, and I'm glad that we're
-- well, I really encourage moving this forward to an RFA, you know, that way that you've
got it set up.
Some of the -- this issue about -- so the experimental tests are vitally important,
and they're going to be part of this -- I hope you can come up with a shorter title that
will still convey that. But using the term "validation" is tricky, and it does have specific
meanings to a lot of people. To me it means that you're going to take a conclusion that
you've inferred from one set of data or one technology, and you're going to test it with
an orthogonal technology.
That's really validation, and given -- and this is really back to Anthony's point -- given
that the initial idea is that these -- something in a region is associated with and potentially
causative of a disease, then validation has a very high mark there. But there's a -- there
are other ways to define -- because I can get a lot of mileage out of just experimental
tests. I mean, is this a broader term --
Male Speaker: Or support.
Male Speaker: -- and the idea is to --
Lisa Brooks: Yeah, okay, that's --
Female Speaker: Sounds good.
Male Speaker: The idea is that it's not all computational -- it's
great that it's computational because it has to be genome-wide -- and I might also emphasize
something that you did point out.
Well, for -- we can see the epigenetic signals giving us strong information, but there are
lots of them. There's actually many, many variables to bring in. Now that's another
thing that I'm very excited about because nobody knows how best to do this. We all kind
of have all a little intuition about how to go about it, and many people are already active
in it, but we don't really know what the best way is. It has to be computational, it has
to go genome-wide, but computation in the absence of experimental feedback is of limited
use. So I think this could really work well.
I also like the fact that you -- it's set up, I think it's set up to not overengineer,
not overdesign the RFA, and to do -- I think, was it Carlos, you said, let the best of the
community -- the community's best ideas really come to bear. So I think it's great. I also
really like the two rounds, because not everybody's ready for primetime on this,
but there's so much excitement, give people a chance, give many people a chance to try.
We just wish there was more money to put into it.
Lisa Brooks: Thank you.
Male Speaker: On that, I agree, and I'd even be
content with plausibility, you know, biologic plausibility --
Male Speaker: I think you set the bar too low.
Male Speaker: Well, no, but I mean, that's how far off we
are. So just some form of words that make it clear that we don't need to -- we want
to be towards an understanding.
Lisa Brooks: Right, the actual RFA, of course, can have
a discussion of this, so take your point that validation may be too strong, plausibility
may be too weak, support may be in the middle. But there will be discussion of it, so hopefully
people will kind of understand what we're going for.
Eric Green: Any other comments, discussion? So if there
are no other comments, we take a vote on concept clearance matters.
Female Speaker: Can we just have this friendly amendment about
support versus validation?
Lisa Brooks: Oh, yeah, sure.
Male Speaker: Would you like to -- would one of you like
to state the amendment so that we have clarity?
Female Speaker: Oh, just to change the title from experimental
validation to experimental support.
Lisa Brooks: Yes.
Male Speaker: We can go with that. All right.
Eric Green: So can I --
Male Speaker: Motion to accept.
Eric Green: Thank you. And a second. All in favor? Any
opposed? Thank you.
Lisa Brooks: Okay, thank you. Good discussion.