[Dr. Marta Gwinn] -- and so, what I have to say is really not new.
In fact, I'll be mentioning things that others have
discussed in a lot more detail during earlier talks,
but I hope that by offering what might be considered something
of a consumer perspective, I can help synthesize and
integrate what we've been hearing about today into
the larger picture of our knowledge of how genetics
and environmental factors contribute to human health
and disease.
Now, this is a slide from the Department of Energy's public
slide gallery, which was made available early on in the
Human Genome Project.
And the title says, "Gene chips reveal susceptibilities."
If only it were that easy, because what they really
reveal is data.
And, by dint of effort and cleverness and high technology
we can transform those data into information.
This is the very same image that Teri showed earlier of results
of a genome-wide association study in type 2 diabetes.
But we don't really want information either.
What we really want is knowledge, and this is
what's being touted as the key to personalized medicine.
So, where's the knowledge?
That's what we really want.
Right now, what we mostly have is a lot of data.
We have more data than any other thing.
We have more data than information, and certainly
more than we have knowledge.
So, in trying to offer some comments on synthesizing and
integrating the results of genome-wide association studies,
I'm going to first review a little bit of the experience
in replicating genetic associations in general.
And in doing so, I'll make a contrast in the aims of such
studies between identifying novel associations and
measuring their effects in populations, and mention a few
methodologic issues that pertain to genome-wide associations.
Then I'll describe some network approaches, many of which have
already been discussed in great detail by other speakers.
I'll focus on the Human Genome Epidemiology Network,
which is a network of epidemiologists, actually,
and is the one that I work on.
And I'll give a few other examples.
And finally, I'm going to discuss two important results
that are well known as success stories in genetic association
studies, to try to show how the results of candidate gene and
genome-wide association studies can fit together.
So, thanks to that wonderful scientific resource, PubMed,
we can actually monitor the growth of science in this area.
And these data are from a database that we have
produced from PubMed, on an ongoing basis since 2001,
by conducting a sweep, weekly, of the new scientific
publications added to the PubMed database, and identifying
the ones that are genetic association studies,
mostly of unrelated persons.
And you can see that the number of published
gene-disease association studies has grown tremendously just
over the last five or six years, to the point that we now have
over 5,000 such publications entered into PubMed annually.
Studies of genetic association that actually examine some
other factor from the environment have grown
at a much slower pace.
Those are in green, the green subset.
And then, there are also a small but growing number of
meta-analyses to synthesize the results of these candidate gene
association studies.
Now, as early as 2001 it was clear that there were problems
in replicating the results of genetic association studies
of candidate genes.
And this is a rather famous graphic from a paper by
John Ioannidis that was published in Nature Genetics
in 2001 showing that often the first publication of a
particular gene-disease association had the most
extreme outcome, or odds ratio, and that over time,
as the same association was studied by other investigators,
the effects tended to converge either to a very small or to
a null result.
And John Ioannidis called this the "Proteus Phenomenon"
after the Greek god who could metamorphose himself into many
different shapes.
And so, basically -- you know, there has been a lot of jousting
with results of these scattered, non-replicating genetic
association studies.
And in this article, John, who has written extensively
on the topic, recommended a systematic approach using
meta-analysis.
Now, around the same time, we established what is known as
the Human Genome Epidemiology Network, or HuGENet, collaboration,
which now has four coordinating centers.
In addition to the one at CDC, there is a HuGENet Canada --
its headquarters is at the University of Ottawa --
also in Cambridge, in the UK, and at the University of Ioannina,
for obvious reasons.
And the main functions of this particular network are the
published literature scan and review that I mentioned,
the production of systematic reviews, methodologic work to
strengthen reporting of associations, and also promotion
of network collaboration.
And just a schematic to show how we have pursued this.
I have here something -- it's a figure that was published in
January 2006 in a commentary that described our approach.
And, the first workshop that we had to discuss this model was
to focus on a network of networks, which I'll describe
in a minute.
That was in November of 2005.
Since then we've had others that, first of all, devised
standardized procedures for reviewing and conducting
meta-analysis of such associations.
There is an online handbook that was published in 2006,
and an addendum is being developed currently for
genome-wide associations -- the results of genome-wide
association studies.
A workshop last summer in Canada focused on strengthening the
reporting of genetic associations with
some guidance.
Last fall, there was another group that met to discuss --
this says, "grading" -- but basically the evaluation
of evidence for an association.
And in Atlanta next year we hope to gather people back together
to discuss this model and the body of evidence to date.
Now, reasons why replication of genetic associations has been
challenging can be divided roughly into three categories.
First of all, there is heterogeneity that we've
discussed quite a bit already today.
And there are many different reasons why heterogeneity may
occur within the context of different studies, including
differences in phenotypic measures, perhaps true
differences in underlying genetic factors.
But there are also many unmeasured factors, including
exposures that might play an important role.
The second major category has to do with statistical uncertainty,
and basically the usual problems, including type
one error which can occur when -- just based on sampling
variability, when many, many comparisons are made.
And also, the problem of low power, which -- you know,
many of the early studies were quite small.
And you know, even genome-wide association studies may be too
small to detect small effects.
And this is another reason that's already been presented
for pooling data and collaborating in analysis.
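The type one error problem described above can be made concrete with a small calculation. This is only a sketch under stated assumptions: it treats the comparisons as independent tests of true null hypotheses, and the figure of 500,000 markers is an illustrative number, not one taken from the talk. The function names are hypothetical.

```python
import math


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n independent null tests."""
    # Equivalent to 1 - (1 - alpha) ** n_tests, written with log1p/expm1
    # so the result stays accurate when n_tests is very large.
    return -math.expm1(n_tests * math.log1p(-alpha))


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise rate near alpha."""
    return alpha / n_tests


# 500_000 is an assumed, illustrative number of markers on a genotyping chip.
for n_tests in (1, 20, 500_000):
    fwer = familywise_error_rate(0.05, n_tests)
    per_test = bonferroni_threshold(0.05, n_tests)
    print(f"{n_tests:>7} tests: family-wise error {fwer:.4f}, "
          f"Bonferroni per-test threshold {per_test:.1e}")
```

The Bonferroni correction is used here only because it is the simplest adjustment to write down; it is conservative when markers are correlated, and other approaches exist.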
And finally, there are biases that can affect the results,
including all the usual epidemiologic biases,
and perhaps particularly important in this field,
publication bias, where -- you know, another very likely
explanation for this Proteus Phenomenon is that positive
results, especially initially, are much more likely to be
published than those that are negative.
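The publication bias explanation above can be illustrated with a small simulation. It is a sketch, not a model of any real literature: the true log odds ratio, the standard error, the number of studies, and the rule that only significantly positive results get published are all assumptions chosen for illustration.

```python
import random
import statistics

TRUE_LOG_OR = 0.10   # assumed small true effect (log odds ratio)
STD_ERROR = 0.20     # assumed standard error of each small study
N_STUDIES = 2000     # assumed number of independent small studies


def simulate_estimates(true_log_or, std_error, n_studies, seed=0):
    """Draw one noisy effect estimate per study around the true effect."""
    rng = random.Random(seed)
    return [rng.gauss(true_log_or, std_error) for _ in range(n_studies)]


def published_only(estimates, std_error, z_crit=1.96):
    """Keep only estimates that are significantly positive (a simplification)."""
    return [b for b in estimates if b / std_error > z_crit]


all_estimates = simulate_estimates(TRUE_LOG_OR, STD_ERROR, N_STUDIES)
published = published_only(all_estimates, STD_ERROR)

print(f"true effect:                 {TRUE_LOG_OR:.2f}")
print(f"mean of all studies:         {statistics.mean(all_estimates):.2f}")
print(f"mean of 'published' studies: {statistics.mean(published):.2f}")
print(f"share published:             {len(published) / N_STUDIES:.1%}")
```

Under these assumptions every published estimate has to exceed 1.96 times the standard error, so the published average sits well above the true effect even though the full set of studies is unbiased. Later, larger studies would then appear to shrink the effect.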
So, how do these concerns differ when we're talking
about genome-wide association studies?
The same problems are still there.
There are perhaps a few advantages here and there,
and additional kinds of information one can use to
get at some of them.
For example, as has already been discussed quite
extensively here, we can address at least one of the
unmeasured factors, which is the different genetic background,
especially among different ethnic groups, that could result
in population stratification.
With respect to sampling variability, a number of
statistical techniques are being explored for addressing this.
And in terms of low power, there is the use of meta-analysis that
-- David Hunter already showed several large ones,
and the use of prior information from candidate genes,
which can be used to inform the analysis.
Now, we still have all the usual epidemiologic biases.
But, to the extent that the data collection methods and
protocols can be made available for other
investigators to peruse, the greater transparency can at
least provide insight into what those biases might be.
So, having that kind of information available,
for example, in the dbGaP resource, along with the
study data, really has great potential to at least provide
some information to address the problem.
Publication bias, of course, still remains a problem,
although by enhancing access to data, as has just been discussed
through either dbGaP or other data sharing mechanisms,
people will be able to perhaps interrogate these other sources
for the same association and demonstrate variation in the
results of that.
So as I mentioned, our particular network has done
some work in the area of systematic reviews and
meta-analysis and has made this handbook for systematic
reviews available online.
I'm providing the CDC link, although it actually resides
on the University of Ottawa Web site.
We also maintain a database of systematic reviews and
meta-analyses, which has -- we currently have sponsored
about 50 such reviews that are published in collaboration with
about 10 journals that allow us to publish those reviews
simultaneously online.
And we also have a citation database of about 550
meta-analyses that have been conducted so far.
Also in progress is some guidance for reporting
association data in publications,
and as I mentioned, criteria for evaluating the evidence.
And more information about all of these things can be found
on the HuGENet Web site.
So, is synthesizing information from genome-wide association
studies any different from that collected in candidate
gene studies?
Well, one thing to mention, I think, is that the
priorities of such studies may differ.
I mean, an important goal of the genome-wide association
studies is to identify novel associations, whereas,
at least now, a predominant goal of candidate gene studies is
to measure the size of the effect.
Now in principle, you know, both approaches can be used
for both things, but currently a lot of the excitement about the
genome-wide association study results stems from the discovery
of novel associations that remain to be tested.
Most differences between these types of studies are really
a matter of degree.
We still have to consider type one error.
We have to consider type two errors.
We still have the issue of harmonization among studies,
especially of phenotypic information, also, among
different genotyping platforms -- this has already been
discussed quite a bit.
And there are methods to deal with all of these things.
Likewise, population stratification is still
an issue.
So, the more information that's available about each of the
studies, the more transparent they are, the better the
information obtained from synthesis.
So, what's the purpose of conducting meta-analysis of
data from genome-wide association studies?
We've seen some examples.
This approach can improve the power to measure small effects
and to assess heterogeneity among genome-wide association studies.
There are methodological challenges also discussed
earlier, such as the use of different genotyping platforms,
the harmonization of data, especially when different
criteria are used to define the phenotype of interest.
And, also the treatment of replication samples that are
within the same genome-wide association study, a phenomenon
that is quite typical.
But I think, you know, to me anyway, the meta-analysis has
its limits.
I mean, it's definitely a good way to start, but it really
is not the end-all of data integration, because it's
really only good for synthesizing data in
one dimension.
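As a sketch of what such a meta-analysis computes for a single association, the following pools log odds ratios with inverse-variance weights and reports Cochran's Q and I-squared as measures of heterogeneity. The fixed-effect model is chosen only for brevity and is not stated in the talk as the method used; the three input studies are hypothetical numbers.

```python
import math


def fixed_effect_meta(log_ors, std_errors):
    """Inverse-variance fixed-effect pooling of per-study log odds ratios.

    Returns (pooled_log_or, pooled_se, cochran_q, i_squared).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_ors))
    df = len(log_ors) - 1
    # I-squared: share of variation beyond what chance alone would produce.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_se, q, i_squared


# Hypothetical studies: log odds ratios and their standard errors.
study_log_ors = [0.30, 0.10, 0.18]
study_ses = [0.15, 0.10, 0.12]

pooled, pooled_se, q, i_squared = fixed_effect_meta(study_log_ors, study_ses)
low = math.exp(pooled - 1.96 * pooled_se)
high = math.exp(pooled + 1.96 * pooled_se)
print(f"pooled OR {math.exp(pooled):.2f} (95% CI {low:.2f} to {high:.2f})")
print(f"Cochran's Q {q:.2f}, I-squared {i_squared:.0%}")
```

The pooled standard error is smaller than that of any single study, which is where the gain in power comes from. The output is still a single number per association, which is the one-dimensional limitation noted above.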
So this is just a draft of some proposed evaluation
criteria for considering individual gene-disease
associations -- I guess a proposal rather than guidance.
And basically there are five main categories that tend to
span not only validity, but I guess to a certain extent,
utility of the discoveries.
And they are: effect size, the amount of evidence in
replication, protection from bias, biological plausibility,
and relevance to health conditions.
And really, only the first two can be addressed
by meta-analysis.
The other things are somewhat subtle in many ways and can't
be assessed in any automatic way.
So I may have failed to point it out, but at the center of
my big wagon wheel image was the expression
"Network of networks."
Why network of networks?
What's the utility of this approach?
Well, the way we think of this is as a way to bridge cottage
industry with big science, to quote Bob Hoover who has
talked about this at SER [spelled phonetically]
last year.
And a way to -- prior to trying to combine everything in one
final repository, like dbGaP, there is really a great deal
that can be done by investigators working
together within a particular domain, and we've already heard
numerous examples of that, because people who are
working on the same problem tend to share not only specific
knowledge and -- for example, there are, within fields,
groups that devise phenotypic criteria that can be used to
standardize the collection of clinical data and phenotypic
data in epidemiologic studies.
So there's specific knowledge.
There is awareness of current research problems, so that
the publication of the results provides a feedback mechanism
to the research agenda.
And they tend to share funding sources.
So you see, for example, in the National Cancer Institute,
which has had a consortium model in place for many years,
this network of networks idea is already in place.
And in other places, as Andy Singleton mentioned in his talk,
you know, there are various kinds of consortia and
collaborations that can come together for a single purpose
in an ad hoc way, or for a prolonged collaboration
in a research area.
And many networks already exist.
Some of these were mentioned earlier.
The first two are NIH-sponsored.
There are also international collaborations that tend to
overlap with some of the NIH-funded projects.
Some are independent.
There are big ones, like this one on genetic susceptibility
to environmental carcinogens, but then there are also
very small ones, nascent ones, that have been formed to
address smaller topics such as the PREBIC collaborative
to study pre-term birth.
[Dr. Teri Manolio] Two minutes more.
[Dr. Marta Gwinn] Okay, now, here's a crazy network image, but I do love it
because it shows just what can be done when data are
made available.
This is actually based on OMIM, a network model that
connects genes that have been studied in associations with
diseases and where associations have been found.
And the top one is disease-centered and the bottom one
is gene-centered.
And you can see these are not random.
Of course, to a certain extent, it's a looking-under-the-lamppost
phenomenon, but there probably are true relations
in there.
And this is based entirely on data in OMIM, and was done
by physicists, by the way.
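The two views described above, disease-centered and gene-centered, can be sketched as two projections of one list of gene-disease associations. This is a minimal illustration, not the method of the published network: the first three pairs are associations named in this talk, while GENE_X and "hypothetical disease Y" are made-up placeholders added so that the disease-centered view has a link to show.

```python
from collections import defaultdict
from itertools import combinations

# (gene, disease) pairs. The last two rows are hypothetical placeholders.
associations = [
    ("CARD15", "Crohn's disease"),
    ("IL23R", "Crohn's disease"),
    ("CFH", "age-related macular degeneration"),
    ("GENE_X", "Crohn's disease"),
    ("GENE_X", "hypothetical disease Y"),
]


def shared_partner_edges(members_by_hub):
    """Link every pair of members that appear together under the same hub."""
    edges = set()
    for members in members_by_hub.values():
        for a, b in combinations(sorted(members), 2):
            edges.add((a, b))
    return edges


genes_by_disease = defaultdict(set)
diseases_by_gene = defaultdict(set)
for gene, disease in associations:
    genes_by_disease[disease].add(gene)
    diseases_by_gene[gene].add(disease)

# Gene-centered view: two genes are linked if they share a disease.
gene_edges = shared_partner_edges(genes_by_disease)
# Disease-centered view: two diseases are linked if they share a gene.
disease_edges = shared_partner_edges(diseases_by_gene)

print("gene-centered edges:", sorted(gene_edges))
print("disease-centered edges:", sorted(disease_edges))
```

Because links exist only where an association has been studied and reported, a projection like this inherits whatever reporting bias is in the underlying catalog.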
So here's another model of a network that I think
is worth showing.
It's the AlzGene database, which is embedded in the
Alzheimer Research Forum, which is a collaborative
group to promote research on Alzheimer disease.
And again, the data are obtained by sweeping PubMed
for publications, and are curated in this database,
which can also perform online meta-analyses.
Lars Bertram at Harvard is the founder and curator of that.
Here's the P3G Observatory.
It's from Montreal, where they are also trying to create a
repository of questionnaires and comparison tools,
and they have compiled a number of them from 11 studies in the
U.S. and other countries.
I think they should connect up with dbGaP.
So, in two minutes I may not have time to tell my tale of
two associations.
[Dr. Teri Manolio] [Inaudible].
You don't believe me.
It's 3:40.
[Dr. Marta Gwinn] I don't believe you, no, but anyway -- okay,
I'll hit the buttons fast and you will get an
impressionistic image.
Okay, so this association between CARD15 and Crohn's
disease is a huge success of the candidate gene era.
It was discovered in 2001.
And as we've already heard, complement factor H in
age-related macular degeneration is a huge success of the
genome-wide association study era.
Here's the natural history of a big discovery.
The pink is CARD15 -- lots and lots of replications.
It's an early success, has offered key insights into
pathogenesis and phenotype, but six years later we're not
entirely sure how to use this.
It hasn't replicated in all populations, and it was hoped
at the time that it would be useful in identifying patients
who could benefit from infliximab, which was at
the time a big new treatment intervention,
but it didn't work.
However, genome-wide association has been helpful, and since
I don't have time to discuss it I would suggest that everyone
who hasn't looked at this do so.
It is a commentary by Lon Cardon following the publication of
the IL23R association with Crohn's disease,
which shows just how a genome-wide association, in combination
with candidate gene data, can be used to expand the
knowledge horizon.
This is the macular degeneration.
You see CFH dropped on the scene in 2005 -- been replicated many,
many times.
And there already have been three meta-analyses.
Another early success provided great insight into pathogenesis
and progression.
There was a recent study examining interaction with
smoking and BMI.
Direction for translation isn't clear.
It doesn't currently have any utility for screening.
And, in fact, there was no interaction in that same
environmental factors study with the treatment assignment
in the AREDS trial, although I was very disappointed,
even though the authors said there was no interaction,
I was very disappointed the data weren't presented.
Isn't that the thing you would most want to know?
[Dr. Teri Manolio] You need to finish up, Marta.
[Dr. Marta Gwinn] Okay, so, I won't repeat this, instead I'm going to use
Teri's slide.
And this -- you know, she called it the wave; that's good.
Waves can be good or bad.
I've heard it called a tsunami; let's not call it that.
It's a rising tide that lifts all boats.
That's what we want, right?