Male Speaker: So we were probably the smallest group, I think, of the breakout sessions, comparative genomics and evolution, but notwithstanding that, we had a pretty lively, sometimes confusing discussion which I think we finally cemented near the end. We started out by kind of revisiting, I think, some of the things we think are so important and foundational about this program, which has existed at NHGRI almost since its inception. And the fundamentals, which I think we were all in agreement on, are essentially that evolution is the unifying principle on which everything that we are doing rests. So, studies of variation and studies of mutation are essentially fundamental, and understanding how those processes occur really requires comparative genomics.
The idea of genotype-phenotype correlations, which is what most of the people in this room
are interested in, also benefits and I would say it probably makes most sense in the context
of evolution, and so it provides us an unbiased framework for discovery and prioritization
of regions, and I would argue that as we move perhaps into interrogating non-coding sequences, regulatory sequences, and trying to understand them, the comparative genomics and evolutionary aspects will become more and more important. And I guess the last important point is that NHGRI really has blazed a trail in terms of this research. In terms of mammals and vertebrates specifically, there's the expertise, we have the computation, we have the resources in terms of libraries and other types of things, and we have the consortia.
So, the track record and the ability to do this type of work really surpasses any other
institute at NIH.
So, we began by first, kind of -- and I’ll do this very quickly -- just reviewing the
accomplishments and sorry to those that we don’t list as an accomplishment but there
have been many in this area over the last 15 years. Sixty vertebrate genomes have been
sequenced in some form or fashion and aligned with the human data, revealing about 3 million evolutionarily conserved segments, so about 4.5 percent of our genome. I think the
important point to think about is that this is work in progress. In many cases, the genomes
are not assembled or they’re just used essentially to align to the human reference and so we
don’t, in many cases, have stand-alone, high-quality or even reasonable working draft
assemblies from many of the genomes. So, I just looked at the average N50 contig length
for primate genomes and it’s on the order of about 25 kilobases.
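Since N50 figures recur throughout this discussion, a quick definition may help: the N50 contig length is the length L such that contigs of length L or greater together cover at least half of the total assembly. A minimal sketch in Python (the contig lengths below are hypothetical, for illustration only):

```python
def n50(contig_lengths):
    """Return the N50: the contig length L such that contigs of length
    L or greater together cover at least half of the total assembly."""
    total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0

# Hypothetical contig lengths (bp), for illustration only.
contigs = [120_000, 60_000, 45_000, 25_000, 10_000, 5_000]
print(n50(contigs))  # 60000: the 120 kb + 60 kb contigs cover half the 265 kb total
```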
Point number two: one of the missions that we've had for many years is essentially to reconstruct the evolutionary history of every base in the human genome. We are not there yet. We've made some good strides in this area, but we're lacking critical species in terms of high-quality genomes. We're lacking prosimians; we don't have a high-quality tarsier reference genome, for example. There are one and a half million gaps in the tarsier
assembly right now. Less than 50 percent of that sequence can be aligned to the human
reference. So, if you think this is fait accompli, you’re wrong. We’re not done in terms
of this project, at least in terms of how we set out. We’ve done moderately well.
I changed this from "deep catalogues" to "begun" catalogues. There have been efforts by the Great Ape Genome Project, and for rhesus macaque and African green monkey, to begin to survey some of the genetic variation that exists within these species. There have been studies of inbred Drosophila strains to provide, really, a framework for quantitative genetic trait studies, and so these have been good. More could be done in this particular area.
And the last point -- which we had some debate about whether it was part of our purview, but we think it is fundamentally comparative in nature because it involves comparing genomes both past and present -- is the fact that we've had, and I think this is an achievement, the Genome Reference Consortium, whose mission has been to continually improve the reference human genome as we go forward. So, many of you might think of this as a housekeeping exercise to kind of finish off the gaps. It's much more than that. The regions being tackled right now -- and it's a targeted approach for specific regions -- are regions that are incredibly diverse. Think of the MHC, think of T cell receptor regions, think of regions around segmental duplications: highly dynamic, gene rich, important in terms of human health, and highly variable. There is more variation in that three-megabase stretch -- tens of kilobases, hundreds of kilobases of sequence variation between different haplotypes -- which has not been catalogued because of the complexity of that type of variation. So, as a result, some of the holes remain, and here we really echoed the first goal of the last group: we have not yet comprehensively assessed all genetic variation in any single genome.
So, it’s not just a question of allele frequency spectrum, it’s a question of getting all
the variation. All the indels, all the structural variants, all the copy number changes -- and there are more. Estimates are that four to five times more base pairs are affected by structural variation than by single base pair and indel events.
All right, so that's kind of the achievements. This is just to remind you -- this is taken from two papers -- kind of the phylogenies that have been tapped in terms of this. I highlight here a few. I'll just mention gorillas, for example. We've worked on many of the primate genomes. The gorilla genome was recently sequenced and assembled. Its average contig length is about 11 kilobases, I think, if my recollection is right. There are about half a million gaps in the gorilla genome. What that means is that when we did the four-way alignment with the apes, which includes humans, 30 percent of the genome could not be aligned in that four-way alignment. So, only 65-70 percent of the genome could be aligned. That's euchromatic sequence. That's genes. So, we have a great deal of heterogeneity in terms of the quality of the genomes that have been generated. Many of them have simply been used to align to the human reference. So, many of the mammalian genomes -- roughly the 34 that have been done, 29 depicted here -- are not particularly high-quality draft assemblies. The only high-quality assemblies on this slide, really, to be honest, are human and mouse. And many of the others are in various stages of working draft.
So, when we set out the goals from our group, we basically went to the -- we actually broke each one down into really four things. Essentially, what's the big question, and why is NHGRI relevant to this particular question? Second was the tactic or the approach. Third was details, and fourth was justification, not in that order. So, I think we agreed in our session that this was the single most important goal -- people who were in that breakout group can disagree with me: to move from aligning genomes to essentially doing de novo sequencing and assembly without guidance. To be able to take a genome, no matter what species, human or otherwise, and to be able to generate a high-quality de novo sequence and assembly of that genome. And so here's the specific: we would suggest, or argue, that what NHGRI should invest in -- and not be dependent solely on the commercial sector for this -- is advancing sequencing technologies to sequence and assemble a genome for $10,000.
So, this is not generating 40x sequence coverage with Illumina. This is actually assembling. The cost of assembling genomes has actually still been prohibitive. We have some statistics now, based on assembly of one human genome using long-read PacBio data, that suggest it would cost us about $60,000 [unintelligible] to assemble a genome with an N50 contig length of 4.4 megabases. That's a 150-fold improvement in terms of N50 contig length over just standard Illumina sequencing. I don't think it's unreasonable to think that we could have an order of magnitude drop in that cost to get us to a $10,000 genome assembly.
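As a back-of-envelope check of the arithmetic in that claim (the implied short-read baseline below is an inference from the stated 150-fold improvement, not a figure given in the talk):

```python
# Figures quoted above; the derived values are inferences, not from the talk.
pacbio_n50_bp = 4_400_000    # ~4.4 Mb N50 from the long-read PacBio assembly
fold_improvement = 150       # stated improvement over standard Illumina assembly
implied_illumina_n50 = pacbio_n50_bp / fold_improvement
print(f"implied short-read N50: ~{implied_illumina_n50 / 1000:.0f} kb")  # ~29 kb

cost_now_usd, cost_target_usd = 60_000, 10_000
# A 6x drop would reach the $10,000 target -- within the hoped-for order of magnitude.
print(f"required cost drop: {cost_now_usd / cost_target_usd:.0f}x")
```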
We suggest that one useful -- so, this is the specifics; I think Jeff Schloss asked for this specifically -- one specific approach or area that NHGRI could invest in would be to apply this to a finite number of human reference genomes. So, to generate human reference genomes, at the quality of the existing Human Reference Genome or better, for 50 different humans representing diversity, sampled broadly across humanity. So, we're thinking of these as what we call gold genomes: very high-quality genomes where most of the bases and the structural variation, copy number, and indel variation have all been resolved. We think it would be incredibly powerful because it would give us a comprehensive view of the types of genetic variation that exist in the sweet spot that we can't access very well right now.
So, as a member of the structural variation working group for the last x number of years, as part of the 1000 Genomes Project and, before that, earlier projects: we are not particularly good at detecting inversions. We are not particularly good at genotyping or detecting insertions. We are terrible in terms of complex structural variation events with [unintelligible] duplications. And so this is an area, if you could think about it, where we would have 50 references -- call them continental genomes if you will: Africans, Asians, Amerindians, Europeans -- but we would have very high-quality references at those positions. This would give us, I think, the first truly comprehensive view. We just ran some statistics recently and we think, based on what we've been able to do on one genome comparing Illumina and PacBio technology, that between 50 base pairs and 5,000 base pairs we are missing -- I should say, 62 percent of deletion variants and 90 percent of insertion variants. So, if we think we are completely understanding this variation in the human genome, we're really mistaken. I pushed for this, but there was pushback in the group, so I'll just mention it. I think the goal should be even better than this.
I think we should push to sequence every human chromosome from telomere to telomere, including the dark matter: the centromeric regions, the acrocentric regions. I think it can be achieved. It won't be achieved today, maybe not in the next couple of years, but no other institute will achieve this. And we know that variation within centromeres, we know that variation within telomeres, is important in terms of human health.
This one, this is a big, lofty goal: what makes us human? This requires an emphasis on primate genomes. We still have not achieved this -- something that we set out over 10 years ago -- which was to assign every human lineage-specific genomic change to a specific branch on the evolutionary tree of primates. Many in the group were most interested in the human-specific changes with functional consequences, including gene innovations, and so we are still discovering new genes, in 2012, 2013, that aren't in the Human Reference Genome. These are typically duplicated genes, but they are also important in terms of human health and human adaptation. So, in terms of specifics and concrete steps, we would argue that it's possible, with all the resources that have been generated now, to focus on generation of high-quality de novo assemblies of non-human primate genomes. We suggest as a straw man 16 primate genomes, including many which are already at working draft stage, and to assemble them at the quality of the Human Reference Genome. Sixteen is a number that we use based on looking at available resources, including BAC resources, but also on having at least two representatives from every major phylogenetic branch from the human lineage. This would provide us fundamental information on mutational processes, speciation, differences in lineage-specific sorting, gene flow, et cetera. And there was some discussion in our group -- and we think it's an interesting observation -- that many of the recurrent microdeletions that are actually mediating genomic disorders in the human population are caused by human-specific duplications in complex regions that have evolved over the last 5 to 10 million years of human evolution. There's remarkable genetic variability in those regions, which predisposes some individuals and certain haplotypes to recurrent microdeletions and others not.
The last point or goal that I'll mention was essentially this: to obtain nucleotide-level resolution of every conserved functional element in humans. We are not there yet. We heard some great stories yesterday about the power of comparative genomics in helping to identify regulatory elements -- the story from David Kingsley of finding that mutation in the regulatory element for KITLG, and how these weren't detected by ENCODE but were picked up based on comparative analyses. The data that's out there right now, which is roughly the 30 mammals, gets us down to about 12 base pair resolution. Simulations suggest that if you push this to 100 to 200 mammalian genomes sequenced deeply, you'd get down to single base pair resolution. And I think that's an easy target that could be generated right now, without any advances in sequencing technology. Some people said, "Well, maybe this will be done by, you know, the 10K diversity project or the Genome 10K project or other projects that are out there." I don't think so. It may be, but that's not their mission. The mission here should be to sequence genomes and make that data publicly available so everybody can analyze it as quickly and rapidly as possible. This would allow us to quantify the selective constraint on each element across mammals and integrate it with existing data sets like ENCODE in both mouse and human. And if advances in computational technologies and advances in sequencing technologies came along, it wouldn't, I think, be beyond the pale to think about doing additional mammalian genomes at high quality, like we have done for the mouse. All right, I'm going to turn this over to Andy.
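The scaling behind those resolution numbers can be made concrete with a toy model -- an assumption for illustration, not the simulation the speaker cites -- in which the detectable element size shrinks inversely with the number of genomes (a crude proxy for total branch length) in the alignment:

```python
# Toy model (assumed, for illustration): resolution of constraint detection
# scales inversely with total branch length, which grows with genome count.
GENOMES_BASELINE = 30        # the ~30-mammal alignment...
RESOLUTION_BASELINE = 12.0   # ...yields roughly 12 bp resolution, as quoted above

def projected_resolution(n_genomes):
    """Crude inverse-linear extrapolation; real gains saturate as newly
    added lineages overlap branches that are already sampled."""
    return RESOLUTION_BASELINE * GENOMES_BASELINE / n_genomes

for n in (30, 100, 200):
    print(f"{n} genomes -> ~{projected_resolution(n):.1f} bp resolution")
# 30 -> 12.0, 100 -> 3.6, 200 -> 1.8: approaching single-base resolution
```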
Male Speaker: I wish there was a button like that where
I could blank all your screens, too. If you think of the top causes of mortality and morbidity
in humans and consider cardiovascular disease, COPD, stroke, diabetes, the list that Mike
Banky [spelled phonetically] gave, they have two things in common. One is that they're adult onset, and the other is that they're remarkably sensitive to the environment, so that your risk is a function of your environmental stresses as well as some attribute of your
genome. So, we’re all unique. Some of us are more unique than others. And we’re unique
not only because of our genomes but because of that trajectory of environmental stresses
and environmental exposures that we’ve had during our lifetime. Now, when you’re faced
with trying to infer causality in a situation that's so high-dimensional, where the factors
are confounded so badly as they are with genotype and environment in the case of humans, you
have a very tough problem ahead of you, and we’re all familiar with that.
Now there are two major impediments, then, for studying and understanding causation in
the face of adult onset diseases of this sort. One is the fact that there isn’t really
a good, controlled experiment that any of us can do; there's no control for that experiment because it's observational. And the other is that we can't replicate the experiment. We can't take the same genotype and put it in a bunch of environments and
ask what happens.
So, there’s good news and bad news. The good news is that model organisms do precisely
this. They allow us to put the same genotype in multiple environments and take apart genotype
by environment interaction very carefully. The bad news, of course, is that when you
do the right experiment -- take a set of zebrafish or mice or flies or worms through a set of different environments -- the rank order of the phenotypes that you score almost always flips around. That is, genotype by environment interaction is almost universal.
So, we’re in the situation where we need to understand, how do genotypes respond to
different environments, and the answer is we need to figure out what’s the best way
to work with model organisms. Now, we all know of examples where model organisms are
terrible for modeling specific diseases. There’s a human disease where the mouse doesn’t
even have that gene. But we need to move beyond that. We need to ask the question now using
genomic technologies, what really is the best model for each specific disease. I think Aviv Regev's talk was particularly informative in thinking about how genomic technologies could really sharpen our ability to focus on model organisms for specific diseases.
If we had catalogues of the sort she described for mouse genes, for instance, there’s great
opportunity for taking this forward.
So, goal four, then, is to leverage the power of model organisms in functional genomics.
Of course, this resonates very well with goal one, where Eric Boerwinkle emphasized the fact that we need to understand basic biology before we can understand disease. I would argue that model organisms are an important path to that. And, of course, in goal two Rick Myers and Mark Gerstein argued quite well that model organisms do have the ability to allow us to infer function at the adult stage -- or the whole organismal stage.
So, some of the points for how we would proceed to do this: for instance, applying large-scale genomic and other omics technologies to reference panels. Many of the model organism communities are working on functional genomics. There are such panels as the Collaborative Cross, the Diversity Outbred, and so forth in mice, and many others in other model organisms. Then there's the idea of taking human mutations forward into model organisms and studying them both at the cellular and the organismal stage. This will scale beautifully now with CRISPR technology. We'll be able to make thousands of human mutations, put them in adult mice, and put them in different stressful environments. So, by doing this across different
environments then we’d have this really good handle on the sort of full degree of
genotype by environment interaction in those particular physiologies that are relevant
to these human disease states.
One other idea was that, as we study these other organisms and take the more comparative, evolutionary kind of view of the world -- we look at, for instance, the naked mole rat and find that they don't get tumors -- we might identify genes that we suspect are important in that process. What about doing the reverse experiment: taking those genes from these non-model organisms, putting them into human cells, and seeing how they behave? That's a kind of out-there idea that I won't take credit for.
Okay, so that’s the end of the ‘model organisms’ sermon. Getting back to the grand
scope of comparative and evolutionary genomics, all of the things that Eric told you about
couldn’t be done without serious improvements to the computational infrastructure that we
need. So we need to develop informatics infrastructure to produce, display, and quantify these multiple
species genome alignments. An alignment of species' genomes is a central tool for inferring where along the phylogeny particular changes occurred. When we layer functional information on top of that as well, we get terrific insight into the way genes and phenotypes have evolved. So this requires development of algorithms, software, alignment methods. It requires development of new browsers. Anybody who's been involved in any of these projects knows that you have a constantly shifting coordinate system for
the genomes as you discover huge insertions in one species that weren’t in the others
and so forth. We need to devise methods of analysis of complex chromosomal rearrangements,
methods of representing genomes in the face of those rearrangements, and finally, we need
to produce benchmarks, quality control metrics, and assessments of accuracy of these methods.
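To make the shifting-coordinate problem concrete, here is a minimal sketch of lifting a position from a reference genome into another species' genome through a pairwise alignment containing a large insertion; the CIGAR-like operation encoding is a simplification for illustration, not any browser's actual format:

```python
def map_position(ref_pos, ops):
    """Map a reference coordinate through alignment ops: a list of
    (op, length) pairs where 'M' is aligned sequence, 'I' is an
    insertion in the other genome, and 'D' is a deletion from it."""
    ref = other = 0
    for op, length in ops:
        if op == 'M':
            if ref + length > ref_pos:
                return other + (ref_pos - ref)
            ref += length
            other += length
        elif op == 'I':      # bases present only in the other genome
            other += length
        elif op == 'D':      # reference bases absent from the other genome
            if ref + length > ref_pos:
                return None  # position deleted; no equivalent coordinate
            ref += length
    return None

# A 10 kb insertion in the other species shifts every downstream coordinate.
alignment = [('M', 5_000), ('I', 10_000), ('M', 5_000)]
print(map_position(2_000, alignment))  # 2000  (upstream of the insertion)
print(map_position(7_000, alignment))  # 17000 (shifted by the insertion)
```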
So, I'll summarize just by kind of listing all of the goals, going through these quickly, reminding you that evolution is the single most powerful unifying principle in all of biology, and that the history of biology is that we have learned an enormous amount from it. I'll warn us against the arrogance that we are all somewhat subject to with the power of the tools of genomics: to think that, well, now that we can do this in human cells, we don't need to think about anything else anymore; we can just do all the manipulations in human cells. I think that evolution still has an enormous amount to teach us, and that model organisms have also marched forward in their technologies for manipulating and perturbing genomes.
We need to develop strategies and technologies to obtain high-quality de novo reference sequence.
This will be applicable throughout all of biology, not just the goals that Eric outlined -- I mean, Evan, sorry. We need to target multiple primate genomes to infer, with high confidence, all of the human-specific attributes. This is enormously useful in doing comparative biology: seeing what are uniquely human traits and how we are different from our closest relatives. That's also of tremendous intellectual interest, and I think we could bring in the public in sort of sharing the excitement over these sorts of aspects of the science that we do. The sort of fundamental question of how we evolved from our most recent ancestors is one that resonates very deeply with the public.
By sequencing multiple mammals -- we were told last night there are only 5,400 known mammals, so eventually we might be able to get there, but starting with the first few hundred using current technologies; this is not expensive -- we could easily get to
the point of being able to identify all the human-specific conserved elements from those
alignments and comparisons.
The fourth point, again, is the model organism thing. I'll beat that drum one last time. There's still enormous utility there for understanding many aspects of basic biology, including, particularly, these context-dependent variant functions, where those contexts include anything from diet to drug treatments and so forth. Those scale beautifully in experiments with
model organisms, and so sequencing reference panels of them would enable those studies
to proceed at a much greater pace.
The fifth goal was, again, this development of software tools for dealing with multiple
sequence alignments. And I'll close with just emphasizing again that all of the things we talked about -- none of them are actually in the purview of other institutes. The National Institute of General Medical Sciences does do a lot of evolutionary biology. They have funded some model organism work and so forth, but in the scale of the sequencing there is an aspect of this problem that is uniquely NHGRI, and we would like to see them do it.
So, with that I’ll take questions. Dave?
Male Speaker: So, I want to point out that this so resonates
with the goal that we heard from Heidi: detect all types of clinically relevant variation in a single genome-scale test. That's very, very consistent with the $10,000 genome goal. And I would say that I love these charts that we've been seeing with Moore's Law and approaching the $1,000 genome, but there's a lot of wishful thinking in there. Those aren't genomes, right? We really need to get back and do this the way that Evan described beautifully, so we can really say that we're sequencing the whole genome.
Male Speaker: I mean, I agree. I was really pleased. I was
hoping that one other group would come up with the importance of being able to do a
single genome --
Male Speaker: Yes.
Male Speaker: -- without alignment to the reference, because that is fundamental human genomics. A wise man once told me that our field is skin-deep in terms of its fundamental algorithm, which is to understand all the genetic variation comprehensively -- and once we do that, that's finite; we can assess links to human phenotype. And so this is where we will be in five, 10 years from now, whether NHGRI leads the charge or not. We won't be doing these alignments anymore; we'll be doing de novo assembly.
Male Speaker: And there has been a spectacular improvement, starting from $300 million in 2000 to what we can do now, in terms of getting up to where real high quality is, though.
Male Speaker: And it's not as if the private sector isn't going to play an important role. Companies like PacBio and Oxford Nanopore can continue, but if there were an incentive or push at some level to drive this even more, I think we could accelerate and get to that point of a single genome assembly -- you know, in five years instead of ten -- as part of a routine clinical test.
Male Speaker: Jim.
Jim Evans: To follow up on those statements, I think this is so important, particularly from the evolutionary aspect, since new mutation frequency is much greater on a locus-specific basis for copy number versus single nucleotide variants; we really need to get better structural information and assays for this. And it ties you right into an institute nobody I've heard talk about right now, and that's environmental health, NIEHS. We've already got evidence from Tom Glover's work that hydroxyurea, a chemical we use clinically, induces CNV mutations at high rates in both yeast and mammalian cells in vitro. So studying these mutational mechanisms -- I mean, the Ames test doesn't even test for copy number; it mostly gets at single nucleotide variants. So how the environment interacts with our genome with respect to copy number is a total black box.
Male Speaker: I mean, a related idea to this inference of
mutational processes is inference about differences in recombinational processes, which are also
fundamental to, well, evolution and everything else about the map. So, yes, I agree.
Jim Evans: And Richard, to follow up on that, I mean, the fact that Alec Jeffreys has nicely shown that PRDM9 alleles influence genomic disorder rates, and that you can change that in different environments -- I think we have to go at that in a big way for mutations.
Male Speaker: Richard.
Male Speaker: This seems an opportunity to move the center of gravity of disease models from mouse towards primates, and whilst you mentioned primates in the other context, I think you didn't strongly state that building and exploiting primate models for human disease should be a priority. Was that discussed?
Male Speaker: Well, we actually didn’t discuss it too
much. I think the impediment there is that the problems of working with primates are just -- there are limits to what we can do. They're not totally insurmountable, but that's a limit.
Male Speaker: Well, I think you could argue, too, that there have been technological changes in all those areas that really change the complexion of that.
Male Speaker: I was interested that in goal two you didn't mention archaic humans in this high-confidence list of human-specific genome attributes. I think that is -- I know it's not science fiction, obviously --
Male Speaker: We discussed this specifically, NAME, and this came up in the context of: yes, there will be more archaic hominins sequenced. Most of that will be done with short-read technology, largely because the fragment lengths are so short in these archaic hominins that it doesn't really lend itself to de novo assembly of Neanderthal, or Denisovan for that matter. It's generally something that we feel is going to happen whether NHGRI invests or not, so we were looking for those seven characteristics that, you know, they laid out in the beginning -- you know, high throughput, you know, consortia, technology advance --
Male Speaker: I mean, then that’s a focus on the data
generation component of it and not the data analysis component. I think it would be very
foolish in the data analysis component to ignore the archaic humans.
Male Speaker: Absolutely.
Male Speaker: Absolutely.
Male Speaker: I just want to push a bit -- on the model organisms, I think there is this huge gene-environment effect. It's very clear that you have opportunities there, especially with fixed genotypes, to explore things which are very, very hard to do observationally in humans, so I really believe that. A personal plea is that we don't narrow ourselves down to just mice and zebrafish as model organisms. I just think that we've got a much bigger repertoire of organisms out there than, just say, model organisms equals, you know, a very small list of species; there's a bigger diversity of useful organisms out there beyond those two.
Male Speaker: We’ll include badaca [spelled phonetically].
Mark.
Male Speaker: I just wanted to say that I really agree with Evan's point about the impact of structural variation, the importance of having high-quality genomes. I just want to mention that in the functional breakout group we also really did talk about the functional impact of structural variants. I mean, it's much more complicated and potentially much larger than for single nucleotide variants. We really have to think about ways of assessing this impact.
Male Speaker: I agree. Carlos.
Carlos Bustamante: Just to add to Ewan's point on the potential for archaic and ancient DNA: in fact, there's been a ton of technology development recently that is sort of upending this issue about, you know, how much you can really get out, right? So, when they do, for example, the single-stranded library prep, it turns out you get many more molecules than with the double-stranded, and that's why you're able to take these [unintelligible] to high coverage. So it's actually an area that the U.S. has invested almost no money in, right? All of the development has actually happened in Europe, and you could imagine that, you know, in fact, there may be some bones somewhere that have somewhat larger fragments that could be sequenced. So, I wouldn't totally rule it out, and nobody knows how far back you could go, right? We don't have a Homo erectus sequence yet; that doesn't mean it can't be done, but, you know, it just hasn't been prioritized in terms of what's been done, and there are basically two, three labs in Europe that are leading in that area.
Male Speaker: Yeah.
Female Speaker: So, just on goal five, point five: was there discussion on the committee about partnering with, say, NSF on advancing some of the alignment algorithms and comparative data analysis tools? Because they also expend a lot of money in this area. It'd be a good partnership.
Male Speaker: There wasn’t. That is a very good idea.
The NIH/NSF joint developments in quantitative biology have been very encouraging and that’s
a very good suggestion.
Male Speaker: I mean, I guess my feeling on this is that most of the genome browsers that are being used -- and that's just one way to do this, obviously -- have been driven largely by the genomics community, funded by Wellcome and NIH. And I think, I mean, we have a lot of experience in this, and there's been a lot of discussion -- I've been to several meetings -- about how you would, if you had 50 high-quality human references right now, display them so people could access the information and optimize their mapping, so they could find, you know, the right genome and still be able to communicate these ideas. It's not trivial at all. I mean, for sure, if there is value in adding NSF in partnership, we should take advantage of everything we've got. But I think my feeling on this is that we as a community have taken the leadership role in this, and we should continue to push on this because this is, again, not an easy or solved problem.
Male Speaker: Eric.
Eric Green: So, I wonder if it's okay to make a comment across the four sessions so far, because we're, I assume, coming to the end. It's a spectacular range of projects, and I want to share the general enthusiasm about the specific things that have been proposed. I think many of the things in this particular breakout are incredibly important, but I was trying to think about what's missing in what we've been talking about this morning. And I think it's probably there, but it's maybe hidden a little bit when we're spending a lot of attention on structure: What are the nucleotides? What are the variations in them? How do they correlate with disease? And then we mention disease-relevant functions, and we talk a lot about mutating the gene and seeing in an assay what effect it has.
What I'm wondering is missing is the connection with cellular circuitry. To accomplish our goal of interpreting variation in the context of disease, we may be able to interpret variation in the context of the protein that it affects, but to truly interpret it in the context of disease, there's a set of NHGRI-ish kinds of activities of systematically dissecting cellular circuitry: when this enhancer gets affected, when this protein gets affected, what are the consequences? When 108 loci in schizophrenia get identified, or 60 genes in heart disease get identified, how do we recognize what effect that is having on the cell?
And, so, I would not like NHGRI to have to pay for all of it, but I do think that there is a set of infrastructure -- it's a little related to the LINCS project, it's related to what Aviv was talking about yesterday, it's related to going from individual enhancers to whole circuits -- and somewhere NHGRI ought to be the intellectual leader of that. It ought to be paid for maybe by the Common Fund or others, but we're not going to be able to interpret disease without it. We're going to get, based on everything that's been laid out here, a great description of the structural problems -- I agree, down to completeness. We're going to get great correlations. We're going to get protein structure responses. But there is a piece we haven't defined here that we'd better define, and it's a set of databases about circuitry, circuitry responses, cells. And I was unclear whether it was in bounds or out of bounds, but as I think about what we're doing, if we don't make sure that that piece gets done, it's going to be really hard to interpret these subtle mutations, even with all the assays and all the other things we're doing. So, I don't mean to destabilize anything here by arguing -- I think we should go forward with these things -- but somewhere we'd better also launch a process that's doing that.
Male Speaker: But I think to make sense of those, Eric, you really have to start by making sure that you've got the finite aspect of our universe -- which is our genome sequence -- totally understood, because all of those things really only make sense in the context of the variation in which those mutations are found.
Eric Green: Look, I'm agreeing that we want to get that sequence; where I'm disagreeing is that you first must do that. To meet our goal, which is to understand disease, we must do that and then we're going to have to interpret it in terms of physiology, and all of this structural stuff -- very important -- isn't going to get us the physiology that we owe for disease. So it's not at the expense of it; we just, also, in addition -- not separately, not competing -- had better be doing it, that's all.
Male Speaker: It reminds me of something 10 years ago, when we were actually analyzing Venter's genome for the first time and comparing it to the human reference. And Venter's genome was, if you remember -- I'm sure you do -- significantly shorter. And where it was shorter was in all the genes and all the segmental duplications which were not in Venter's genome. So, we could not, as a community, have begun to understand breakpoints of genomic disorders, or copy number variation, without the investment of NHGRI in building a better reference, because what you can't see, you can't assay. So, I think --
Eric Green: You’re defending the need to get complete
structure. I’m totally agreeing; my comment is independent of your comment. Let me totally
endorse all of what you're saying and then add that we still are not going to be able to reach our goal of interpreting disease without additional things. All of it is great, but we have an obligation to go all the way, and I was just trying to figure out what piece feels like it's not been discussed in this meeting, that's all.
Ewan Birney: [unintelligible] Rick Myers, and --
Eric Green: It does, and I was going back through the slides, and, again, a lot of the emphasis was per variant: understanding the effect of this variant. And my concern, Ewan, is exactly that: when you take the variant-centered point of view, you don't, for example, have a circuitry center. You don't see what happens across large numbers of interacting things. For example, take cellular circuitry and break it down into a catalogue of 2,337 processes, so that we have a finite list of those processes and we understand the context in which that variant functions. A lot of what we have is bottom-up construction from individual variants, interpreting them for patients in the clinic. Again, incredibly important; I'm not arguing against it. I'm just saying that the bottom-up inference is going to miss things if we don't also have a sort of top-down completeness of a wiring diagram, and many of the things, even in presentation number two, didn't really get at that higher-order picture. So, again, not that we shouldn't do all those things; I'm just concerned.
Male Speaker: We do have some key players in systems biology
--
Eric Green: I agree, in this room.
Male Speaker: -- so let's start with Manolis Kellis and then Mark.
Male Speaker: So, first of all, I just want to briefly second Eric's point and basically say we had this fifth panel that was never created, and I think systems biology could have been one of them. I think this paradigm of learning what's common across all of the variants that are associated with a disease, and then learning common properties of those variants -- what tissue are they in, what type of enhancer are they in, what motifs are nearby -- and then taking that knowledge and applying it back to individual variants, is something that's emerging a lot in our community. And that has been a paradigm of genomics: the fact that you have the whole genome allows you to learn global properties and then go back to the individual regions, armed with those properties, and interpret them better than you could do in isolation. I think that paradigm is pervasive, and I think systems biology approaches and regulatory genomics approaches could be one of these sort of fifth-panel-type recommendations. Going back to the comparative genomics -- and this is actually related to the comment that I wanted to make -- the heroic effort that we saw for the, you know, blonde hair variant, where that particular nucleotide could only be interpreted through not just sequence-level conservation but understanding exactly which regulatory regions are active in these other genomes, and what the motifs are, and how they've moved and how they've changed: I think that's something that should be routine. It's something that rests upon comparative genomics to provide as a resource to the whole community. Just as ENCODE has provided a set of regulatory annotations at varying degrees of resolution and varying degrees of sophistication, I think comparative genomics should have a mandate to provide such a list, so that the next time we find such a motif we don't have to go through, you know, years of experimentation -- we have the catalogue of exactly how all these motifs have changed.
And the recent realization -- from mouse ENCODE, which is not yet published -- that there's a huge amount of regulatory conservation between human and mouse which is actually not reflected in the nucleotide-level conservation means that it's imperative for us to develop better methods for understanding regulatory evolution and regulatory conservation. There's bound to be much more conservation -- that's what we're starting to realize with mouse ENCODE: much more regulatory conservation than sequence models allow us to infer -- and if we had a better way of detecting that, and that goes back to the proposal with NSF, then I think we would provide a great resource for understanding disease.
Male Speaker: You're straying a bit from Eric's primary point, which I completely agree with, which is that there was an awful lot of language in this meeting that took the view that the effect of a SNP is sort of this unitary thing that you can study outside of the context of the rest of the genome, as well as of different environments. The genotype by environment thing was one way that we're sort of stepping away from that view, but there's also the idea that that genetic variant is embedded in other genetic variants -- there's gene by gene and other sorts of things -- and the way we view this now is to think of human disease as perturbations of these networks of genes. In metabolic disorders, particularly, there's a lot of literature there. I do agree completely.
Male Speaker: Andy, I'm sorry -- and everyone, I'm sorry to stop the discussion -- but we are well over time at this point and we do need to break. Thanks.
[end of transcript]