Synthesizing Findings across Multiple GWA Studies and Integrating

[Dr. Marta Gwinn] -- and so, what I have to say is really not new. In fact, I'll be mentioning things that others have discussed in a lot more detail during earlier talks, but I hope that by offering what might be considered something of a consumer perspective, I can help synthesize and integrate what we've been hearing about today into the larger picture of our knowledge of how genetics and environmental factors contribute to human health and disease. Now, this is a slide from the Department of Energy's public slide gallery, which was made available early on in the Human Genome Project. And the title says, "Gene chips reveal susceptibilities." If only it were that easy, because what they really reveal is data. And, by dint of effort and cleverness and high technology, we can transform those data into information. This is the very same image that Teri showed earlier of results of a genome-wide association study in type 2 diabetes. But we don't really want information either. What we really want is knowledge, and this is what's being touted as the key to personalized medicine. So, where's the knowledge? That's what we really want. Right now, what we mostly have is a lot of data. We have more data than any other thing. We have more data than information, and certainly more than we have knowledge. So, in trying to offer some comments on synthesizing and integrating the results of genome-wide association studies, I'm going to first review a little bit of the experience in replicating genetic associations in general. And in doing so, I'll make a contrast in the aims of such studies between identifying novel associations and measuring their effects in populations, and mention a few methodologic issues that pertain to genome-wide associations. Then I'll describe some network approaches, many of which have already been discussed in great detail by other speakers.
I'll focus on the Human Genome Epidemiology Network, which is a network of epidemiologists, actually, and is the one that I work on. And I'll give a few other examples. And finally, I'm going to discuss two important results, both well known as success stories in genetic association studies, to try to show how the results of candidate gene and genome-wide association studies can fit together. So, thanks to that wonderful scientific resource, PubMed, we can actually monitor the growth of science in this area. And these data are from a database that we have produced from PubMed, on an ongoing basis since 2001, by conducting a sweep, weekly, of the new scientific publications added to the PubMed database, and identifying the ones that are genetic association studies, mostly of unrelated persons. And you can see that the number of published gene-disease association studies has grown tremendously just over the last five or six years, to the point that we now have over 5,000 such publications entered into PubMed annually. Studies of genetic association that actually examine some other factor from the environment have grown at a much slower pace. Those are in green, the green subset. And then, there is also a small but growing number of meta-analyses to synthesize the results of these candidate gene association studies. Now, as early as 2001 it was clear that there were problems in replicating the results of genetic association studies of candidate genes. And this is a rather famous graphic from a paper by John Ioannidis that was published in Nature Genetics in 2001 showing that often the first publication of a particular gene-disease association had the most extreme outcome, or odds ratio, and that over time, as the same association was studied by other investigators, the effects tended to converge either to a very small or to a null result.
And John Ioannidis called this the "Proteus Phenomenon" after the Greek god who could metamorphose himself into many different shapes. And so, basically -- you know, there has been a lot of jousting with results of these scattered, non-replicating genetic association studies. And in this article, John, who has written extensively on the topic, recommended a systematic approach using meta-analysis. Now, around the same time, we established what is known as the Human Genome Epidemiology Network, or HuGENet, collaboration, which now has four coordinating centers. In addition to the one at CDC, there is a HuGENet Canada -- its headquarters is at the University of Ottawa -- also one in Cambridge, in the UK, and one at the University of Ioannina, for obvious reasons. And the main functions of this particular network are the published literature scan and review that I mentioned, the production of systematic reviews, methodologic work to strengthen reporting of associations, and also promotion of network collaboration. And just a schematic to show how we have pursued this. I have here something -- it's a figure that was published in January 2006 in a commentary that described our approach. And, the first workshop that we had to discuss this model was to focus on a network of networks, which I'll describe in a minute. That was in November of 2005. Since then we've had others that, first of all, devised standardized procedures for reviewing and conducting meta-analysis of such associations. There is an online handbook that was published in 2006, and an addendum is being developed currently for genome-wide associations -- the results of genome-wide association studies. A workshop last summer in Canada focused on strengthening the reporting of genetic associations with some guidance. Last fall, there was another group that met to discuss -- this says, "grading" -- but basically the evaluation of evidence for an association.
And in Atlanta next year we hope to gather people back together to discuss this model and the body of evidence to date. Now, reasons why replication of genetic associations has been challenging can be divided roughly into three categories. First of all, there is heterogeneity that we've discussed quite a bit already today. And there are many different reasons why heterogeneity may occur within the context of different studies, including differences in phenotypic measures, perhaps true differences in underlying genetic factors. But there are also many unmeasured factors, including exposures that might play an important role. The second major category has to do with statistical uncertainty, and basically the usual problems, including type one error which can occur when -- just based on sampling variability, when many, many comparisons are made. And also, the problem of low power, which -- you know, many of the early studies were quite small. And you know, even genome-wide association studies may be too small to detect small effects. And this is another reason that's already been presented for pooling data and collaborating in analysis. And finally, there are biases that can affect the results, including all the usual epidemiologic biases, and perhaps particularly important in this field, publication bias, where -- you know, another very likely explanation for this Proteus Phenomenon is that positive results, especially initially, are much more likely to be published than those that are negative. So, how do these concerns differ when we're talking about genome-wide association studies? The same problems are still there. There are perhaps a few advantages here and there, and additional kinds of information one can use to get at some of them. 
For example, as has already been discussed quite extensively here, we can address at least one of the unmeasured factors, which is the different genetic background, especially among different ethnic groups, that could result in population stratification. With respect to sampling variability, a number of statistical techniques are being explored for addressing this. And in terms of low power, there is the use of meta-analysis that -- David Hunter already showed several large ones, and the use of prior information from candidate genes, which can be used to inform the analysis. Now, we still have all the usual epidemiologic biases. But, to the extent that the data collection methods and protocols can be made available for other investigators to peruse, the greater transparency can at least provide insight into what those biases might be. So, having that kind of information available, for example, in the dbGaP resource, along with the study data, really has great potential to at least provide some information to address the problem. Publication bias, of course, still remains a problem, although by enhancing access to data, as has just been discussed through either dbGaP or other data sharing mechanisms, people will be able to perhaps interrogate these other sources for the same association and demonstrate variation in the results of that. So as I mentioned, our particular network has done some work in the area of systematic reviews and meta-analysis and has made this handbook for systematic reviews available online. I'm providing the CDC link, although it actually resides on the University of Ottawa Web site. We also maintain a database of systematic reviews and meta-analyses, which has -- we currently have sponsored about 50 such reviews that are published in collaboration with about 10 journals that allow us to publish those reviews simultaneously online. And we also have a citation database of about 550 meta-analyses that have been conducted so far. 
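As a rough illustration of the kind of synthesis a meta-analysis of association studies performs, the following is a minimal sketch of fixed-effect, inverse-variance pooling of odds ratios, with Cochran's Q and I-squared as a simple summary of heterogeneity among studies. It is not taken from the HuGENet handbook; the function name `pool_odds_ratios` is hypothetical, the study values are made up purely for demonstration, and a real review would likely also consider random-effects models and bias assessment.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% quantile of the standard normal


def pool_odds_ratios(studies):
    """Fixed-effect, inverse-variance pooling of odds ratios.

    `studies` is a list of (odds_ratio, ci_lower, ci_upper) tuples,
    where each interval is assumed to be a 95% confidence interval.
    """
    log_ors, weights = [], []
    for odds_ratio, ci_lower, ci_upper in studies:
        # Recover the standard error of log(OR) from the width of the CI.
        se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * Z_95)
        log_ors.append(math.log(odds_ratio))
        weights.append(1.0 / se ** 2)  # larger, more precise studies weigh more

    total_weight = sum(weights)
    pooled_log_or = sum(w * y for w, y in zip(weights, log_ors)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # Cochran's Q: weighted squared deviations of each study from the pooled value.
    q = sum(w * (y - pooled_log_or) ** 2 for w, y in zip(weights, log_ors))
    df = len(studies) - 1
    # I^2: share of total variation attributed to heterogeneity, floored at zero.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0

    return {
        "pooled_or": math.exp(pooled_log_or),
        "ci_lower": math.exp(pooled_log_or - Z_95 * pooled_se),
        "ci_upper": math.exp(pooled_log_or + Z_95 * pooled_se),
        "q": q,
        "i_squared": i_squared,
    }


if __name__ == "__main__":
    # Made-up numbers: an extreme first report followed by smaller later estimates.
    hypothetical_studies = [
        (2.10, 1.20, 3.68),
        (1.30, 0.95, 1.78),
        (1.10, 0.90, 1.34),
        (1.05, 0.92, 1.20),
    ]
    for name, value in pool_odds_ratios(hypothetical_studies).items():
        print(f"{name}: {value:.3f}")
```

Because the weights are inverse variances, a small early study with an extreme odds ratio contributes little to the pooled estimate once larger studies are added, which is one way to picture how early results can shrink toward a small or null effect as evidence accumulates.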
Also in progress is some guidance for reporting association data in publications, and as I mentioned, criteria for evaluating the evidence. And more information about all of these things can be found on the HuGENet Web site. So, is synthesizing information from genome-wide association studies any different from that collected in candidate gene studies? Well, one thing to mention, I think, is that the priorities of such studies may differ. I mean, an important goal of the genome-wide association studies is to identify novel associations, whereas, at least now, a predominant goal of candidate gene studies is to measure the size of the effect. Now in principle, you know, both approaches can be used for both things, but currently a lot of the excitement about the genome-wide association study results stems from the discovery of novel associations that remain to be tested. Most differences between these types of studies are really a matter of degree. We still have to consider type one error. We have to consider type two errors. We still have the issue of harmonization among studies, especially of phenotypic information, also, among different genotyping platforms -- this has already been discussed quite a bit. And there are methods to deal with all of these things. Likewise, population stratification is still an issue. So, the more information that's available about each of the studies, the more transparent they are, the better the information obtained from synthesis. So, what's the purpose of conducting meta-analysis of data from genome-wide association studies? We've seen some examples. This approach can improve the power to measure small effects and to assess heterogeneity among genome-wide association studies. There are methodological challenges also discussed earlier, such as the use of different genotyping platforms, the harmonization of data, especially when different criteria are used to define the phenotype of interest.
And, also the treatment of replication samples that are within the same genome-wide association study, a phenomenon that is quite typical. But I think, you know, to me anyway, the meta-analysis has its limits. I mean, it's definitely a good way to start, but it really is not the end-all of data integration, because it's really only good for synthesizing data in one dimension. So this is just a draft of some proposed evaluation criteria for considering individual gene disease associations -- I guess a proposal rather than guidance. And basically there are five main categories that tend to span not only validity, but I guess to a certain extent, utility of the discoveries. And they are: effect size, the amount of evidence in replication, protection from bias, biological plausibility, and relevance to health conditions. And really, only the first two can be addressed by meta-analysis. The other things are somewhat subtle in many ways and can't be assessed in any automatic way. So I may have failed to point it out, but at the center of my big wagon wheel image was the expression "Network of networks." Why network of networks? What's the utility of this approach? Well, the way we think of this is as a way to bridge cottage industry with big science, to quote Bob Hoover who has talked about this at SER [spelled phonetically] last year. And a way to -- prior to trying to combine everything in one final repository, like dbGaP, there is really a great deal that can be done by investigators working together within a particular domain, and we've already heard numerous examples of that, because people who are working on the same problem tend to share not only specific knowledge and -- for example, there are, within fields, groups that devise phenotypic criteria that can be used to standardize the collection of clinical data and phenotypic data in epidemiologic studies. So there's specific knowledge. 
There is awareness of current research problems, so that the publication of the results provides a feedback mechanism to the research agenda. And they tend to share funding sources. So you see, for example, in the National Cancer Institute, which has had a consortium model in place for many years, this network of networks idea is already in place. And in other places, as Andy Singleton mentioned in his talk, you know, there are various kinds of consortia and collaborations that can come together for a single purpose in an ad hoc way, or for a prolonged collaboration in a research area. And many networks already exist. Some of these were mentioned earlier. The first two are NIH-sponsored. There are also international collaborations that tend to overlap with some of the NIH-funded projects. Some are independent. There are big ones, like this one on genetic susceptibility to environmental carcinogens, but then there are also very small ones, nascent ones, that have been formed to address smaller topics such as the PREBIC collaborative to study pre-term birth.
[Dr. Teri Manolio] Two minutes more.
[Dr. Marta Gwinn] Okay, now, here's a crazy network image, but I do love it because it shows just what can be done when data are made available. This is actually based on OMIM, a network model that connects genes that have been studied in associations with diseases and where associations have been found. And the top one is disease-centered and the bottom one is gene-centered. And you can see these are not random. Of course, to a certain extent, it's a looking-under-the-lamppost phenomenon, but there probably are true relations in there. And this is based entirely on data in OMIM, and was done by physicists, by the way. So here's another model of a network that I think is worth showing. It's the AlzGene database, which is embedded in the Alzheimer Research Forum, which is a collaborative group to promote research on Alzheimer disease.
And again, the data are obtained by sweeping PubMed for publications, and are curated in this database, which can also perform online meta-analyses. Lars Bertram at Harvard is the founder and curator of that. Here's the P3G Observatory. It's from Montreal, where they are also trying to create a repository of questionnaires and comparison tools, and they have compiled a number of them from 11 studies in the U.S. and other countries. I think they should connect up with dbGaP. So, in two minutes I may not have time to tell my tale of two associations.
[Dr. Teri Manolio] [Inaudible]. You don't believe me. It's 3:40.
[Dr. Marta Gwinn] I don't believe you, no, but anyway -- okay, I'll hit the buttons fast and you will get an impressionistic image. Okay, so this association between CARD15 and Crohn's disease is a huge success of the candidate gene era. It was discovered in 2001. And as we've already heard, complement factor H in age-related macular degeneration is a huge success of the genome-wide association study era. Here's the natural history of a big discovery. The pink is CARD15 -- lots and lots of replications. It's an early success, has offered key insights into pathogenesis and phenotype, but six years later we're not entirely sure how to use this. It hasn't replicated in all populations, and it was hoped at the time that it would be useful in identifying patients who could benefit from infliximab, which was at the time a big new treatment intervention, but it didn't work. However, genome-wide association has been helpful, and since I don't have time to discuss it I would suggest that everyone who hasn't looked at this do so. It is a commentary by Lon Cardon following the publication of the IL23R association with Crohn's disease, which shows just how a genome-wide association, in combination with candidate gene data, can be used to expand the knowledge horizon. This is the macular degeneration.
You see CFH dropped on the scene in 2005 -- been replicated many, many times. And there already have been three meta-analyses. Another early success, it provided great insight into pathogenesis and progression. There was a recent study examining interaction with smoking and BMI. Direction for translation isn't clear. It doesn't currently have any utility for screening. And, in fact, there was no interaction in that same environmental factors study with the treatment assignment in the AREDS trial, although, even though the authors said there was no interaction, I was very disappointed the data weren't presented. Isn't that the thing you would most want to know?
[Dr. Teri Manolio] You need to finish up, Marta.
[Dr. Marta Gwinn] Okay, so, I won't repeat this, instead I'm going to use Teri's slide. And this -- you know, she called it the wave; that's good. Waves can be good or bad. I've heard it called a tsunami; let's not call it that. It's a rising tide that lifts all boats. That's what we want, right?