>> Good Morning everyone.
Thank you for joining us today for the NCI CBIIT Speaker Series, a knowledge sharing forum featuring internal and external speakers
on topics of interest to the biomedical informatics and research communities.
My name is Tony Kerlavage. I'm the head of the Informatics Program here at CBIIT.
Just to remind you, today's presentation is being recorded and will be available
on the wiki for the Speaker Series as a screencast with voiceover.
And it will also be posted on the Speaker Series YouTube playlist.
And if you Google NCI Speaker Series, you'll find that Wiki page.
There's also information about future speakers available via Twitter and our blog, so check out those sites for the latest info.
To find the blog, you can Google NCI blog, it should be the first search result. Our Twitter handle is @NCI_NCIP.
Today we're very happy to welcome Dr. Andrew Su who is associate professor at the Scripps Research Institute in the Departments
of Molecular and Experimental Medicine and Integrative Structural and Computational Biology.
Dr. Su received his Ph.D. in Chemistry from the Scripps Research Institute and a B.A. in Chemistry, Computing and
Information Systems, and Integrated Science from Northwestern University.
The title of his presentation is "Crowdsourcing Biology: The Gene Wiki, BioGPS and" -- looks like he's changed the title already to "Citizen Science."
So with that, I'll turn over the floor to Dr. Su.
>> Great. Thank you Tony for the introduction and for the invitation to speak.
So I did want to tell you about some of our group's work in Crowdsourcing Biology. I want to give three vignettes.
And as Tony alluded to, I made a last minute change on what that third vignette was.
Instead of our work in GeneGames.org, I want to tell you about our latest, newest, greatest work with Citizen Science.
So unfortunately, Barney, our icon here for GeneGames.org, is going to need to go away.
But in its place, I hope you'll find our work in Citizen Science -- very early work in Citizen Science -- to be interesting.
OK. So I want to start with an analysis that motivates a lot of my lab's interests, essentially an analysis of
gene ontology annotations for human genes. If you sort all 20,000 or so human protein-coding genes by the number of linked GO annotations,
what you find over here at the top is not surprising. You have many genes that have hundreds of linked GO annotations
and this is a who's who list of biomedically relevant and scientifically interesting genes.
But what we found very interesting is the rapid degree to which this level of annotation fell off.
To the point where 65% of human genes have 5 or fewer linked
GO annotations, and 41% of human genes have 1 or 0 linked GO annotations.
So it really says that in our efforts to systematically and comprehensively annotate the human genome,
we're far off from that grand vision. Now, there are a lot of reasons why this pattern exists.
And I'm going to suggest one of these reasons is that the literature is sparsely curated.
I don't need to tell anybody here that the biomedical literature is exploding.
There are over 1 million new PubMed indexed articles published every year -- that corresponds to approximately one every 30 seconds.
In the face of this explosive growth of the biomedical literature, our ability as individuals to consume that literature hasn't quite kept pace.
Generally speaking, we as individuals read the same number of articles as we did last year and the year before that.
And what it means for our biocuration efforts -- our efforts to really translate the knowledge in the biomedical literature into databases and
into structured content -- is that those biocuration efforts are simply being overwhelmed.
So if you look across all Gene Ontology annotations, there have been an incredibly impressive
300,000-plus articles cited in support of gene ontology annotations.
But when you think that's only 1.5% of PubMed, if you believe that greater than 1.5% of PubMed is relevant
to understanding human gene function, then it says that we have a bottleneck here.
It says that we're under annotating the literature. This is not unappreciated within the biocuration community.
Several years ago, this review of the biocuration field came out and it described this problem among others and several solutions.
One of the solutions that they proposed is described in this quote.
It says, "Sooner or later, the research community will need to be involved
in the annotation effort to scale up to the rate of data generation."
So this idea of directly engaging the research community is reminiscent
of a principle that is well-known in sort of the internet community and that's called the Long Tail.
So let me explain the Long Tail, first by describing its opposite, the Short Head.
So the Short Head is when you have a small number of producers, each producing a lot of content.
OK. And so the classic example here is a daily newspaper.
The Long Tail equivalent of that is when you have lots of contributors,
each of whom produces a relatively small amount of content.
And the analogy here would be the blogosphere. The interesting thing about the Long Tail is
that while individually their contributions may be small, in aggregate their contributions can be quite substantial.
So take newspapers, for example: there are 1,500 or so daily newspapers left in the United States, while there are over a hundred million blogs online now.
So clearly, just by the sheer volume of contributors, the blogosphere rivals newspapers in the amount of content being generated.
So the Long Tail argument can be applied to newspapers and blogs. In the video space, it applies to TV and Hollywood versus YouTube.
It can be applied to a wide variety of other areas from Consumer Reports
versus Amazon reviews, food critics versus Yelp and so on and so forth.
OK. So, undoubtedly the most successful Long Tail application is the online encyclopedia, Wikipedia.
And I won't go into too much on Wikipedia because I assume everybody is familiar with it.
But I'll just point out two facts that we'll take as given. First, Wikipedia is reasonably accurate.
Nature did a study several years ago now on Wikipedia versus Britannica Online for science related topics.
And the underlying conclusion was that they had similar error rates.
And there have been a number of studies since then that have backed this up.
So with comparable accuracy, the value of having an open platform is that you get incredible breadth and depth.
And these are very old statistics now, and the gap, I'm sure, has only grown.
But you can see both in terms of the number of articles and the number of words
represented in total, you can see Wikipedia far outpaces Britannica Online.
Wikipedia being the Long Tail encyclopedia and Britannica Online being the Short Head one.
So our fundamental hypothesis in all of our crowdsourcing work is: can we harness the Long Tail
to directly participate in our goal as scientists of annotating human gene function?
So, I'm going to again give you three vignettes and I'm going to start with the first one, the Gene Wiki, which directly harnesses Wikipedia actually.
OK. So what are we trying to do with the Gene Wiki? Well, we are trying to essentially --
well, let's frame this in the context of a use case that many scientists face. Suppose you've done some high-throughput genome study
and you come out with a gene with which you're not familiar. The first thing you do is probably plug that gene into PubMed.
And if you take a very extreme example like fibronectin, you can get back upwards of 30,000 articles.
So, if you want to really get up to speed on the state of the art of what's known about fibronectin, it's a pretty daunting prospect to go in and attempt to read
30,000 articles. Of course, scientists have recognized this and they have a system of creating review articles, where you have essentially some
esteemed researcher summarizing the field in the form of one review article that you can then read.
So, we're going from a document-centric view of the world to a concept-centric view of the world.
So the goal of the Gene Wiki is essentially to do this en masse.
Not to do this for a few genes in sort of an ad hoc basis. But to do this in a systematic way using the
crowd and making a gene specific review article for every human gene.
And more importantly, right, this review article would be collaboratively written,
it would be continuously updated, and it would be community reviewed. So, this is the overarching vision.
And so, how do we assess whether or not we're being successful?
And I put this framework in terms of how we assess that.
So on the one hand, we want to create a resource that has some level of utility to a community.
And by being somewhat useful, it will draw some number of users over time.
With any open platform, the hope is that some number of those users will stay
around and actually make a contribution -- to the wiki effort, in our case.
And that contribution could be fixing a typo, or it could be something like summarizing a recent finding from the literature.
But regardless, in making that contribution, they make the page that much more useful,
which will draw that many more users, which will in turn draw more contributors, and it's this positive feedback loop
that I think any crowdsourcing effort, any wiki-type effort, needs to nucleate.
So, this is sort of the framework by which we're going to judge sort of our work in the Gene Wiki.
So, with that context, right, what do we do in terms of actually starting this project and establishing this minimal level of utility?
Well, what we did is we created about 10,000 gene stubs. These stubs are essentially small placeholder frameworks for Wikipedia pages that
give some basic -- very basic information.
And what we did to create these stubs is we essentially mined content from bioinformatics databases and genomics databases that we
and the community already know and love. And we simply integrated that into one system, reformatted that, and put that in Wikipedia.
The important part is that we can do all of that computationally
and in an automated way, and we can do that for 10,000 genes with no real problem.
And so, we have things like protein structures, symbols and identifiers, gene ontology annotation and so on and so forth.
These are all mined largely from NCBI resources.
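Just to make that concrete, here is a minimal sketch, under stated assumptions, of what that kind of automated stub generation can look like: a gene record pulled from a structured database gets reformatted into wiki markup. The template name and field names below are illustrative placeholders, not the actual Gene Wiki bot or its templates.

```python
def gene_record_to_stub(gene):
    """Illustrative only: turn a structured gene record into a small wiki stub.

    'gene' is a dict of fields mined from databases (e.g. NCBI Entrez Gene);
    the template and field names here are hypothetical, not the real Gene Wiki ones.
    """
    lines = [
        "{{Infobox_gene_stub",                      # hypothetical infobox template
        "| symbol    = " + gene["symbol"],
        "| name      = " + gene["name"],
        "| entrez_id = " + str(gene["entrez_id"]),
        "}}",
        "",
        "'''" + gene["name"] + "''' is a protein that in humans is encoded by the "
        "''" + gene["symbol"] + "'' gene.",
    ]
    return "\n".join(lines)

stub = gene_record_to_stub({
    "symbol": "APLP2",
    "name": "amyloid beta precursor like protein 2",
    "entrez_id": 334,
})
print(stub)  # wikitext that an automated bot account could post, repeated for ~10,000 genes
```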
So, we hope that by bringing this all into one place, in a systematic layout and continuously updated,
we're satisfying this first axis of utility. So, how are we actually doing in terms of users?
Well in aggregate, these 10,000 pages are viewed a total of between 4 and 5 million times per month.
And that's in large part because we're simply within the very popular Wikipedia user community, not because of anything particularly unique or
innovative that we've done. Of course there's a wide distribution, so there are some articles that are very highly viewed, tens of thousands of times
a month. The median is somewhere in the hundreds of views per month.
The bottom chart shows the average Google rank for each one of these 10,000 pages and it shows that greater than 80%
of these Gene Wiki/Wikipedia pages are on the first page of Google --
Google search results, which really suggests that again this is a resource that's not going to go away and it's poised for growth.
So, that's where we are in terms of our usage metrics, and the third axis is how we're actually doing in terms of contributions.
Currently, we see an increase of about 10,000 words per month from over 1,000 edits per month, and that roughly corresponds to 230 full-length
research articles. So it's continuing to grow, and we have quite a bit of critical mass, we think, both around the usage and the contributions.
So, I think it's useful to show one of these pages just to give you an idea of what this could become, right.
So, this is the page for a protein called reelin. This was largely done prior to our involvement.
We systematized some of the presentation of what's shown in the background here.
But by and large, this was the effort of a community that existed prior to our involvement.
But just to walk through, right, this page starts with a very nice table of contents that shows what's covered in this article.
It has figures and diagrams of reelin in the mouse brain. It has a detailed section here on pathological roles of reelin.
Importantly, Wikipedia overall, but especially on scientific articles, encourages inline references
to the literature, just like primary research articles. So, for example, for any given statement,
you're one click away from actually getting the reference and the corresponding link to the PubMed record.
Also importantly, these Wikipedia articles don't exist in a vacuum; they sit in a network of links to highly related Wikipedia articles.
So, if you have no idea what Norman-Roberts Syndrome is, that information is one click away.
OK. So, for reelin, these statistics are a bit old, but there have certainly been over a hundred editors
and hundreds of edits over many years, and this pattern applies to other biomedically relevant articles on Wikipedia.
So, I think again it really shows that this is a community-based effort.
So, we could almost leave the Gene Wiki alone; we think it has the critical mass, it has that positive feedback loop,
and it will just continue to grow. Our efforts in trying to poke and prod the Gene Wiki now are twofold.
So, first we want to make the Gene Wiki more computable. And what does that mean?
Well, I told you that we started with structured annotations from gene annotation databases.
We integrated and reformatted those and put them into Wikipedia pages.
And we use that as a forum to encourage contributions from the community, where these contributions are largely in the form of free text.
And our goal now is actually to take that community-contributed free text and circle it back into structured annotations that can then be fed back
into other databases. And why is it that the structured annotations are important?
Well, these structured annotations are important for all sorts of analyses from network-based analyses to enrichment analyses
to pathway analyses and so on and so forth. When we have structured annotation, that's the point at which informatics scientists
and genomic scientists can start doing computations and statistics on those data.
So, I'm going to tell you about one very simplistic strategy we have for mining structured annotations out of the Gene Wiki.
And I'm going to show that with a very, very simplistic example. So, here we have a
snapshot of the Gene Wiki page for APLP2, which is a gene in the human genome.
It starts with this line: APLP2 associates with antigen presenting molecules like MHC class II, yada, yada, yada,
and regulates surface expression by enhancing endocytosis. And that link to endocytosis is actually a wiki link.
If you click on it, you get to the Wikipedia page for endocytosis, which describes what endocytosis is.
OK. So, everything here is what I would say is in the unstructured data world.
But because we had a role in helping to create that APLP2 page, we know that APLP2 actually corresponds to this NCBI Entrez Gene ID, ID 334.
And just by basic text matching, we can strongly guess that endocytosis,
just from that string match, corresponds to this gene ontology ID.
And so there, you might assume that this is a candidate assertion between this Entrez Gene ID and this gene ontology term.
So, this is a very basic strategy. And because we again have the inline references over here,
we can fish out the exact article that the Wikipedia community cited in support of this assertion.
Here, an article that simply says APLP2 increases the endocytosis, instability, and turnover of the MHC class I molecule.
So this is a simple strategy for mining annotations. Using an approach like this, we were able to mine out thousands
of gene ontology annotations and thousands of novel disease ontology annotations.
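Just as a minimal sketch of that string-matching idea -- not the actual Gene Wiki mining pipeline, and with illustrative function and input names -- the strategy boils down to something like this:

```python
import re

def mine_candidate_go_annotations(entrez_gene_id, wikitext, go_terms_by_name):
    """Illustrative sketch: mine candidate gene-to-GO assertions from a Gene Wiki page.

    entrez_gene_id   -- NCBI Entrez Gene ID known for this page (e.g. "334" for APLP2)
    wikitext         -- raw wiki markup of the Gene Wiki article
    go_terms_by_name -- dict mapping lower-cased GO term names to GO IDs,
                        e.g. {"endocytosis": "GO:0006897"}
    """
    candidates = []
    # Wiki links look like [[target]] or [[target|display text]]
    for match in re.finditer(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", wikitext):
        link_target = match.group(1).strip().lower()
        go_id = go_terms_by_name.get(link_target)
        if go_id is not None:
            candidates.append((entrez_gene_id, go_id, link_target))
    return candidates

# Toy usage, echoing the APLP2 / endocytosis example from the talk:
go_terms = {"endocytosis": "GO:0006897"}
text = "APLP2 ... regulates surface expression by enhancing [[endocytosis]]."
print(mine_candidate_go_annotations("334", text, go_terms))
# [('334', 'GO:0006897', 'endocytosis')]
```

In practice you would want smarter matching than exact strings, but that is the essence of the simplistic approach described here.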
So, of course, the first and most natural question is how do we know whether these annotations are trustworthy at all?
That's a perfectly legitimate question when annotations come from a crowd-based resource,
as opposed to an expert-curation-based resource.
We developed an evaluation strategy, the details of which I'm going to skip today because it's published work.
But it's essentially based on a gene set enrichment approach where we know the gold standard.
OK. And essentially, I'm plotting the p-values from PubMed plus the Gene Wiki relative to PubMed alone.
And if the Gene Wiki is adding value, we should expect to see many more dots in this quadrant, and that is in fact what we see.
So systematically, across all the gold standards we can find, we see large amounts of enrichment that are aided
by having access to the Gene Wiki results. OK. So, that covers our efforts essentially to make the Gene Wiki more computable.
The second main area in which we're trying to expand the Gene Wiki is through outreach and incentives.
OK. So, a common question is: why would a professional researcher in the field want to contribute to the Gene Wiki?
Because there are no aligned incentives -- it's not something that they can put on their CV.
So we started this collaboration with the journal Gene that's described in this editorial here and we're calling these Gene Wiki reviews,
where we are doing invited review articles on genes of interest.
We pick genes that fit the criteria of there being a lot known about the gene in the biomedical literature but the Wikipedia article being nonexistent
or relatively underdeveloped. So, in these invited reviews,
the authors contribute one article that is peer reviewed and published in the journal --
that is the review article of record. And they write a second, perhaps shorter, review article that goes into Wikipedia.
And this is the living document that can evolve as the field evolves. So, it's a new initiative.
We have five published submissions so far, shown on these genes here, and we're excited about this potential moving forward. Just as one brief plug:
We're expanding a focused effort now in the cardiovascular space, in collaboration with Peipei Ping at UCLA and her NHLBI Proteomics Center
where we want to identify genes involved in cardiovascular health and really develop those articles by the same criteria I mentioned before.
This is the current target list now; if anybody has suggestions on appropriate people, that would be great,
and suggestions on appropriate people in any other space are welcome as well -- certainly in the cancer space they would be
highly appreciated. OK. So I'm going to wrap up that piece on the Gene Wiki, and I hope I've convinced you that the Long Tail
of scientists can be a valuable resource for information on gene function.
So the next piece is on BioGPS. And BioGPS has a similar but complementary set of use cases, and let me walk through what that is.
So again, suppose you've done a genome scale profiling experiment and you have a new gene of interest,
how are you going to learn about what's known about that gene? Well clearly, you could search NCBI.
The Entrez Gene resource has great information on gene ontology annotations, genomic context, and things like that.
You might also want to search the MGI site, which has information on mouse models and mouse knockout phenotypes.
And you might want to look at the Wikipedia page to see what has been added in the most recent edits.
You might search GeneCards, which has great information on tool compounds and drugs.
The point being, there are hundreds, if not thousands, more resources online that all claim to tell you something about gene function.
And as a user, visiting them all is pretty impractical. OK. So a first question is why is there so much redundancy?
There are, again, probably many reasons why this exists, and I'm going to tell you one reason why I think it exists.
So I think for any successful resource over time you see an increase in the number of users of that resource.
And with the number of users, you see an increase in number of requests. Unfortunately, especially in this day and age,
the resources we have access to are largely static.
And so you see this growing disconnect between the number of requests and your ability as a developer to fulfill those requests.
So our proposal is if we create a community platform that enables and empowers the community to continue to add new content and
new features into this resource, then that community development will naturally scale with the growth in the user community.
And so, one main emphasis that we really put into BioGPS is this idea of community extensibility.
The second main emphasis essentially relates to this idea here: suppose I were a domain expert and I got a lot of value out of the NCBI Entrez Gene
page, or any other page I was looking at, but I said, you know, this is great, but here's another nugget of knowledge
that I know that isn't listed here. I essentially have no way to add my knowledge, or if I'm a data provider,
I have no way to add my data to these locked gene report views. So, I also can't define the set of data that I want to see.
So a structural biologist may have a different set of needs and interests than a geneticist.
So the second main principle that BioGPS emphasizes is user customizability.
OK. So, community extensibility and user customizability, this is our webpage here.
I was going to just do a very quick demo, just to give you an idea of what BioGPS allows you to do.
So it allows you to search by any number of terms: you can search by gene symbol, wild-card queries,
gene ontology terms, Affymetrix IDs, any number of search terms.
For here, I'm just going to search for all cyclin-dependent kinases and I'm going to click on cyclin-dependent kinase 2. So essentially, what we have
here is a view of the expression pattern in normal human tissue.
We have some structured views from databases where we have gene ontology terms,
aliases, symbols and so on and so forth. And if I scroll down there, I have the Wikipedia entry.
If I click on other genes over here, for CDK8, this page just refreshes and we see that information now for CDK8, and so on and so forth.
OK. So, this is essentially, you know, our view of what we think you should know about your gene of interest.
But we have this concept of layouts. And in this menu over here, you can go down
and look for pathways. If I'm very interested in the pathway context of CDK2,
I can see what pathways CDK2 is involved in, from Pathway Commons, from Wiki Pathways or from Reactome.
And again, as I click over to other resources, you see these pages update for CDK5 now.
If I'm interested in, not pathways, but perhaps I'm interested in the model organism databases.
So what's known about CDK8 from, for example, the Rat Genome Database, the Mouse Genome Informatics site or FlyBase, right.
Again, an easy way to browse related content corresponding to different use cases. That's what we have here.
We have other layouts, ones around the literature and others around expression data, and so on and so forth.
From here, you can also add other resources that you might be interested in.
So if you, for example, know you're interested in dbSNP, you can simply search the Plugin Library for dbSNP.
If I click on that, that is a way to now see the report for CDK8 in dbSNP. It's also a way to discover resources you may not have been aware of.
So, for example, suppose I'm interested in splicing
but don't know a specific resource; I can search for what splicing resources are available by keyword
within the BioGPS Plugin Library, click one, and it will simply be added over here,
and I can see essentially the splicing patterns that are known for CDK8.
And again this layout is preserved as you go back and forth between different genes of interest.
So if you sign up for a user account, you can save this particular layout
and other layouts like it so that you can have easy access to the ones that are most relevant to you.
OK. So this, in a very brief nutshell, is what BioGPS does.
So let's go back to the presentation. I want to assess BioGPS using the same metrics I outlined
for the Gene Wiki -- utility, users, and contributors -- since this is also a crowdsourcing project.
We define utility based on the resources that end users are able to access through the BioGPS interface.
So this is just a smattering of the model organism databases and the gene portals that you can access within BioGPS.
Here are some of the many genetics resources that you can access.
Here are the literature resources that you can access.
Protein-based resources on protein annotation and protein function, pathway databases and expression databases,
so we have a wide variety of different resources. We try to categorize essentially all the gene-centric databases that are available online.
In total we have over 540 gene-centric online databases registered and easily accessible through this BioGPS interface.
So we think that establishes the first axis here, Utility. How are we doing in critical mass?
Well, we have over 6,400 registered users. I should note that the vast majority of our features can be accessed anonymously, without creating an
account, as we were doing in that short demo. On a yearly basis, we get
almost 2 million page views from over 100,000 unique visitors.
So we think we're doing very well in terms of user base as well. So how do people actually contribute knowledge back into the
BioGPS community? So again, this references that idea of community extensibility.
So our Plugin Library is actually community curated, anybody who has an account can register a new Plugin.
It's pretty simple, if you have any sort of web-savvy skills, to figure out how to add new resources.
And so, over 120 users have registered at least 540 plugins that span over 280 unique domains.
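As a rough illustration of why registering a plugin takes only basic web skills: a plugin is essentially a URL pointing at an existing gene-centric resource, with a placeholder that gets filled in with the identifier of whichever gene the user is viewing. The placeholder keyword and registration fields below are assumptions for illustration, not the exact BioGPS plugin specification.

```python
# Illustrative sketch of a plugin as a URL template; the "{{EntrezGene}}"
# placeholder keyword and the plugin fields are assumptions here,
# not the exact BioGPS plugin specification.

def render_plugin_url(url_template, gene_ids):
    """Fill the plugin's URL template with identifiers for the current gene."""
    url = url_template
    for keyword, value in gene_ids.items():
        url = url.replace("{{" + keyword + "}}", str(value))
    return url

dbsnp_plugin = {
    "title": "dbSNP (example plugin)",
    "url_template": "https://www.ncbi.nlm.nih.gov/snp/?term={{EntrezGene}}",
}

# Viewing CDK2 (Entrez Gene ID 1017) would load this plugin's content from:
print(render_plugin_url(dbsnp_plugin["url_template"], {"EntrezGene": 1017}))
```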
So this is one way people explicitly contribute to BioGPS; the other way is through more implicit contributions.
So again I showed you, if you search the BioGPS Plugin Library for splicing you get the list of resources that we have available.
You can see that we actually sort them by this concept of layout popularity: very simply put, we sort these resources based on how often
they're used within the BioGPS community. So if you have no idea which of these is most reliable or most useful, well,
a good place to start is simply the one that is most popular within the BioGPS community, and that's a way
of harnessing the implicit knowledge within our user community. OK. So that is BioGPS as a crowdsourcing initiative.
Since this is a bioinformatics community that I'm talking to here,
I want to do one plug for one really critical piece of back-end infrastructure behind BioGPS, which we call MyGene.info.
And the tag line for MyGene.info is Gene Annotation Query as a Service,
following the pattern of many cloud computing resources.
So this has all of our gene-centric annotation information that powers the front end of BioGPS.
We expose this as a community resource.
So it powers not only BioGPS but other community initiatives as well. It's high performance;
we get roughly 3 million hits per month. It's highly scalable:
we have all species indexed by NCBI and all genes in those species, so 16 million genes.
These get updated automatically on a weekly basis.
We have JSON output and a REST-based web services interface, together with some programmatic libraries.
So for people who are developing web applications or doing bioinformatics analyses that need gene-centric information, the URL to go to is
MyGene.info. And there are some really interesting extensions of this that I'm not going
to talk about here that go into crowdsourcing among bioinformatics developers about how to maintain a resource like this.
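Just to give a flavor of that REST/JSON interface, here is a minimal sketch of querying MyGene.info for gene-centric annotation. The endpoint path and field names follow the service's public documentation, but treat the specifics (API version, available fields) as assumptions rather than a definitive reference.

```python
import json
import urllib.request

def query_gene(symbol, species="human"):
    """Look up a gene by symbol on MyGene.info and return the JSON result.

    The endpoint path and field names are taken from the public docs and may
    differ across API versions; this is an illustrative sketch only.
    """
    url = ("https://mygene.info/v3/query?"
           "q=symbol:{}&species={}&fields=entrezgene,symbol,name".format(symbol, species))
    with urllib.request.urlopen(url) as response:
        return json.load(response)

result = query_gene("CDK2")
for hit in result.get("hits", []):
    print(hit.get("entrezgene"), hit.get("symbol"), hit.get("name"))
# e.g. 1017 CDK2 cyclin dependent kinase 2
```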
OK. But in any case, I want to end this little vignette just by saying that I hope I've convinced you that the Long Tail
of bioinformaticians can be quite useful in collaboratively building a gene portal, in the form of BioGPS.
OK. That hits two of the three vignettes. So the last vignette I want to talk about here is our efforts
on Citizen Science, a topic that's becoming quite popular these days.
OK. So the fundamental challenge that we're trying to address harkens back to this explosive growth of the
biomedical literature that I described before. This is not just growth in the biomedical literature; it's growth in biomedical knowledge.
So getting the knowledge in PubMed to be very accessible is a process
that's referred to as information extraction in the text mining and NLP
(Natural Language Processing) community. Information extraction into a structured form can be broken down into three steps.
First, finding mentions of concepts within the text -- concept recognition.
Second, mapping those concept mentions to specific terms in ontologies, controlled vocabularies, and databases -- concept normalization.
And the third piece is identifying the relationships between those concepts. So, in our Citizen Science efforts, we wanted to start small.
And so this initial effort focuses simply on the first piece: finding mentions of concepts in biomedical text.
Specifically, we wanted to identify the disease mentions in PubMed abstracts.
OK. So we're building on a corpus that was actually built within the NCBI Intramural Program by Zhiyong Lu and his colleagues.
Essentially they annotated 793 PubMed abstracts specifically for mentions of diseases in those abstracts.
They used 12 expert annotators over several months, and each abstract was annotated by two of these expert annotators.
In total they found almost 7,000 disease mentions. What are the types of mentions?
Well, there are actually four types of mentions that they went after, so this isn't quite as simple as it might seem at the outset.
Not only are we looking at specific diseases, like diastrophic dysplasia here,
but we're also looking at disease classes, like cancer. They also wanted to look at compound or composite mentions of different diseases,
so prostatic, skin and lung cancer as a phrase would be a mention of a disease.
And also modifiers of diseases, where familial breast cancer in this phrase, even though it modifies this mention of a gene, BRCA2,
would also be a disease mention that they would want to highlight in this particular corpus.
So the fundamental question we wanted to ask in this section of the talk -- our Citizen Science effort -- is: can a group of non-experts,
non-expert curators, non-scientists, collectively perform concept recognition in biomedical text?
This is our Citizen Science effort, and as a sort of sandbox for prototyping the interfaces to do Citizen Science,
we actually leveraged a system called Amazon Mechanical Turk.
So for those who aren't familiar with Amazon Mechanical Turk, let me explain.
So it harkens back to a machine that was very famous in the 18th century: a chess-playing machine called the Turk.
You can read more about it on the Wikipedia page. Essentially this toured around Europe.
It was hailed as a great invention of artificial intelligence, where you could actually have a chess-playing machine.
Of course, it was later revealed that this wasn't a machine at all but simply a housing for a cleverly hidden human.
So people who were playing against what they thought was this machine were actually playing against the human hidden in this large box.
So it wasn't artificial intelligence; it was artificial artificial intelligence.
OK. So Amazon repurposed this name simply to refer to
a resource for addressing tasks for which humans really are
the best resource to accomplish these types of tasks -- again, artificial artificial intelligence.
This is how Amazon Mechanical Turk works.
It starts with the requester, in this case we are the requester,
and we create tasks that we want the Mechanical Turk community to do.
And for each task, we specify a qualification test, so people have to qualify before they're allowed to work on our tasks.
We also define how many workers we want to work on each individual task and how much we will pay per task.
So all of these tasks go into the giant queue that Amazon maintains, and Amazon manages the execution of these jobs,
interfacing with the workers, paying the workers, and advertising the tasks. So workers come here.
They get tasks from this queue and they execute them, and for each task, you get multiple workers.
And then it comes back to us, and we work out how to aggregate these non-expert results into something that is meaningful for us.
So I'm going to very quickly describe the instructions that we gave to the Amazon Mechanical Turk workers, which we thought roughly
captured the essence of the instructions given to the expert
annotators used to generate that NCBI corpus.
So first we told them, obviously, "Highlight all diseases and disease abbreviations" and not only Huntington Disease but also HD.
We asked them to highlight the longest span of text that was specific to a disease, so not just diabetes mellitus
but insulin-dependent diabetes mellitus. We wanted them to highlight disease conjunctions --
these are the compound or composite mentions, like familial breast and ovarian cancer.
And we asked them to highlight symptoms which were the physical result of having a disease.
So, for example, dwarfism, learning disability, visual impairment, things like that. So those were the instructions.
Users who wanted to work on our tasks were given this qualification test.
We would just give them three simple blocks of text and ask them to highlight all the disease mentions,
and then we quizzed them with 26 yes-no questions. Should myotonic dystrophy have been highlighted? Should DM be highlighted?
Should protein be highlighted? Should kinase encoding gene be highlighted? And so on and so forth.
And based on their performance on this test, we saw this distribution of results.
So again, we asked 26 yes-no questions based on the three texts shown below.
There were a few people who did very well and lots of people who didn't do very well.
Relatively arbitrarily, we said that our threshold for passing was greater than 21 out of 26 correct answers,
and we called that population our qualified workers. So 33 out of 194 individuals passed -- roughly a 17% pass rate.
So, for those who passed, we essentially gave them large amounts of text, with this interface
where they could highlight the disease mentions in that text.
The interface looks very simply like this: we present a PubMed
title and abstract, and they're simply asked to use their mouse, with a click-and-drag interface, to highlight disease mentions.
So that's how we interfaced with the community. We set up our experiment like so.
We took the 593 abstracts from the test set of the NCBI disease corpus. We paid 6 cents per task,
and in the Amazon Mechanical Turk space, these are called HITs, for Human Intelligence Tasks.
Each HIT was to annotate one abstract from PubMed, and we asked for five workers per abstract, so 30 cents per abstract.
And then we worked on how to aggregate the results.
So suppose we have five people annotate a given piece of text.
Here is that piece of text, shown once per worker.
If we simply take all highlights that were made by any of our workers -- essentially take the union of all annotations --
we get the highest recall, but relatively low precision.
We get good stuff like leukemia and orthotopic leukemia, but we also get bad stuff like growth and including and things like that.
So then we can step up our stringency.
We can say, well, we only want annotations that were given by at least two of the five individuals.
And there we start to drop out some of these false positives, and we can keep going up to things that have been highlighted by three people,
three out of the five, four out of five. Of course, the more agreement you require, the more good stuff you actually lose.
So in this case we lost orthotopic leukemia, we lost AML, Acute Myeloid Leukemia. So there's a tradeoff here.
OK. And this tradeoff is between high recall/low precision at low values of K, and high precision and low recall at high values of K.
And we can show that graphically like this. So again, if we start with low stringency, we have high recall,
and that recall drops off as we increase our demand for replication. And the reverse is true for precision.
We summarize both of those by this F measure, which is simply the harmonic mean of precision and recall --
a common metric used in NLP. And so we see this peak here at K equals 2.
So we have the best tradeoff between precision and recall, an F score of 0.81,
when we require 2 out of our 5 individuals to agree on an annotation.
Just a quick note that we did 593 documents in 7 days using 17 workers at a cost of under $200, so this turns out to be quite efficient.
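For the bioinformaticians in the audience, here is a minimal sketch of that K-of-N vote aggregation and F-measure calculation -- an illustration of the idea with made-up toy data, not the actual analysis pipeline behind these numbers:

```python
from collections import Counter

def aggregate_by_votes(worker_annotations, k):
    """Keep only spans highlighted by at least k of the workers.

    worker_annotations -- list of sets, one per worker, of highlighted spans
                          (here simply the highlighted strings) for one abstract.
    """
    votes = Counter()
    for spans in worker_annotations:
        votes.update(spans)
    return {span for span, count in votes.items() if count >= k}

def precision_recall_f1(predicted, gold):
    """Standard precision/recall and F1 (the harmonic mean of the two)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example with 5 workers: low k favors recall, high k favors precision.
workers = [
    {"leukemia", "orthotopic leukemia", "growth"},
    {"leukemia", "orthotopic leukemia"},
    {"leukemia", "AML", "including"},
    {"leukemia"},
    {"leukemia", "AML"},
]
gold = {"leukemia", "orthotopic leukemia", "AML"}
for k in range(1, 6):
    print(k, precision_recall_f1(aggregate_by_votes(workers, k), gold))
```

As in the talk, sweeping k and picking the peak F measure identifies the best tradeoff between precision and recall.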
So we wanted to test a little bit how our system would behave -- this is one of our first experiments --
as we modulate N here, that is, if we ask for more
or fewer people to do each document.
So if we look at N equals 3: the peak was at K equals 2, so when 2
out of the 3 reviewers agree, our maximum F score was 0.69.
If we try N equals 6, so we want 6 individuals to look at each task, our peak now is at K equals 3 and we get a max F of 0.79.
If we have 9 people look at each document, we get up to 0.82. And interestingly, after we get to about 12, we see essentially diminishing returns;
we don't see great increases in our maximum F score after 12.
And at 12 we say, OK, if 5 out of 12 annotators agree on an annotation, that is essentially the peak balance between precision
and recall, at about 0.85. So what does that actually mean?
How do we rate relative to text-mining algorithms?
So it turns out that because of some of the idiosyncrasies behind disease mentions and things like that,
text mining doesn't do very well on this particular task. Text mining is represented by the first two bars here; the Y-axis is the F score.
We used the NCBO Annotator, which is a very simplistic text-matching algorithm, as well as BANNER, which is a
much more complex method based on conditional random fields.
And you can see these get F scores of around 0.25 and 0.35.
For our Mechanical Turk experiments, I'm showing four of them.
The one I presented is actually the orange bar, which gets up to about 0.8.
We see our Mechanical Turk community performing much better than these text-mining algorithms.
So interestingly, we're doing better than text mining. How are we actually doing relative to the human annotators themselves?
Well, if you go back to the original paper where this corpus was described,
I mentioned that they had two annotators look at each document.
If you look at the F measure that describes the agreement between two human annotators who haven't looked at each other's results,
you only see an F measure of 0.76. So there's truly some ambiguity here in terms of what these standards should be.
So it says, I think, that our community is actually pretty close to a single expert-level curator.
If you look at what they call annotation stage 2,
where annotators now have a chance to look at each other's annotations and to resolve disagreements, they got up to 0.87.
But even with that we're not too far off with our Mechanical Turk-based community.
And so I think the conclusion from this is that, in aggregate, our worker ensemble is a faster
and cheaper resource for disease concept recognition, with comparable accuracy.
OK. Again, this is only the first step in our information extraction process.
We really want to move toward identifying relationships
between concepts, because this is even harder from a machine learning standpoint.
Not only do we want to identify the concepts that are present within these texts, but also the relationships that link those concepts.
So that's something we're very excited about moving toward in the near future.
And to loop it back, our goal is not to use Mechanical Turk long-term.
Our goal is really to bring this to Citizen Science. I think many Citizen Science efforts have shown
that if you give non-scientists the ability to really engage and contribute
to scientific research, the general community is well motivated to help out.
That's essentially where we want to end up. So we have this launching site.
Right now you can only sign up for email updates, at Mark2Cure.org.
And our goal is essentially once we prototype the interface and refine that interface within the Amazon Mechanical Turk community
to really unleash the entirety of the biomedical literature on the massive Citizen Science community.
OK. So I hope I've convinced you again that the Long Tail of Citizen Scientists now can help collaboratively annotate biomedical text
and that wraps up the three vignettes I wanted to cover. I work with a fantastic team of collaborators and colleagues.
I'll just particularly highlight Ben Good, who really spearheaded the Gene Wiki efforts.
And Ben, with Max, really spearheaded the Citizen Science, Mark2Cure effort. Chunlei Wu is all of the brains
and implementation behind BioGPS and MyGene.info.
NIH has been incredibly generous, and we have lots of collaborators. So with that, I'll wrap up and I'm happy to take any questions.
>> Great, let's thank Dr. Su for a very interesting presentation on these valuable crowdsourcing resources.
[ Applause ] We have time for just a few questions.
So if you're in the room here, please use the microphones on the desk in front of you.
If you're on WebEx, indicate with a raised hand on the dashboard and we'll unmute your line. So go ahead.
>> Hi Andrew, thank you very, very much, fantastic talk.
Could you just briefly comment on the types of problems that you think are amenable to this approach?
I mean, the obvious sorts of things are, like, text analysis, image analysis, pattern recognition and that sort of thing.
But are there other ones that maybe aren't so obvious?
>> So, I would say the low-hanging fruit is image analysis, and the vast majority of Citizen Science efforts have been taking advantage of humans'
reasonably well-developed and universal capability at spatial reasoning.
So applications like Foldit, Self Light [phonetic], EyeWire, Phylo -- these are all applications that take advantage of that.
Beyond that visual ability, I think text analysis is actually the frontier.
But it makes sense, because language is also something that is nearly universally human, although
here we're limiting it to the English language. And it's also something that humans do much better than computers by the current state of the art.
Other types of Citizen Science efforts that tend to be very successful are things that take advantage of physical proximity.
So you have crowdsourcing efforts that take advantage of bird watchers to do bird censuses or identification
of invasive species, things like that. Those, I think, are the sweet spots for crowdsourcing in Citizen Science.
>> Thanks Andrew. Those are three very interesting vignettes. With the Mechanical Turk example and your 17% that qualified,
did you do any follow-up to find out what kinds of people you were getting and whether there was any expertise -- you know, experts lurking in that group?
>> The closest we get to experts, I would say, is people from library science. That's the only one I can think of.
I don't think we had anybody who said, oh yeah, I have a Ph.D. in biology and I want to earn 30 cents per abstract.
The most common response that we got in our interactions with individuals was, "You know what, your tasks didn't pay very
well, but they were really interesting, and therefore I liked doing them. Tell me when you have more."
Or we get, "You know what, I found this very valuable because, you know, I had a loved one who, you know, suffered from cancer or
Alzheimer's or some disease, you know, very personal connection to disease.
And, you know, your application made me feel like I was doing something about it."
And that is what really gives us a lot of hope around the Citizen Science aspect.
If we gave people that outlet, you know, in the same way that they give 10 bucks when somebody does a walkathon,
I think we can convince people to give their time. And, you know, I think we have identified a sweet spot
where their efforts really can make a difference in terms of making biomedical knowledge more accessible and more computable.
>> Actually, a sort of follow-up to Ian's question.
This is a fantastic presentation, by the way. So, how do you actually get the experts to come along?
I'm just curious because we have our own Pathway Interaction Database,
a database that I would actually love to email you about -- maybe we can have it linked to some of your Gene Wiki and pathway views.
But for the crowdsourcing, I'm still a little bit concerned about the accuracy. I know that you showed some great graphs there, and
there's a great deal of trust in this. But in your opinion, is there a way to actually get
the experts in the field involved, or are there incentives that we can offer? What are your thoughts on that?
>> I'm sorry, were you referring to the Gene Wiki in particular?
>> Right, or the BioGPS.
>> So BioGPS is actually not a hard sell, because most of the people who are using it are working biologists who have
a need to understand gene function. And so, with that one, we're largely hitting a scientific community.
The Gene Wiki is a very mixed community. We do see lots of domain experts in there, right.
So if you go and look at the gene pages, the ones that are developed, there are people who are
adding real, insightful contributions.
One of my colleagues, Ben, just noted that on a gene page he happened to be looking at two days ago,
George Church had made an edit to that particular page -- a nice finding.
But there is more to be done in terms of encouraging use within the scientific community.
Take Rfam, for instance: Rfam has been very good at linking Wikipedia to their site on RNA families.
So Alex Bateman is a very strong supporter of using Wikipedia in research.
The other thing I'll say is really just aligning incentives, like the initiative we had for marrying the peer-reviewed
review article to a Wikipedia contribution.
That allows people to align the incentives between crowdsourcing and their own professional development.
But those are the initiatives that we have going on. But you hit on the exact point.
And the crux of the matter is how do you incentivize people and encourage people to do it.
And I think there's room for plenty more innovation in this space.
>> Great, thanks a lot.
Unfortunately, we're just about out of time.
I just want to remind you that our next presentation will be on May 28th.
It'll be Dr. Ada Hamosh from the Johns Hopkins Medical School's Institute of Genetic Medicine, presenting "OMIM,
Knowledgebase of Genes and Genetic Disorders." So once again, thanks to everybody who joined today
and a special thanks again to Andrew for a very, very interesting presentation.
Thanks and so long everyone.
>> Thank you.