>> Good morning everyone.
And thank you for joining us today for the NCI CBIIT speaker series.
I'm Tony Kerlavage, the Branch Chief of Informatics Programs here
at the NCI Center for Biomedical Informatics and Information Technology.
The speaker series is intended
to be a knowledge-sharing forum featuring both internal and external speakers
on topics of interest to the biomedical informatics and research communities.
Today, we are very happy to welcome Dr. Jeremy Goecks,
assistant professor of computational biology at George Washington University.
Dr. Goecks is a lead member of the Galaxy project,
a popular web based platform for performing accessible,
reproducible and transparent genomics research.
Dr. Goecks received his Ph.D. from the Georgia Institute of Technology
and his bachelor's from the University of Wisconsin.
The title of his presentation is "Using Galaxy to Understand Cancer Genomes."
I'll also remind you that this presentation will be available on the wiki
for the speaker series as a screencast with a voiceover
and it will also be posted on the NCI YouTube channel.
If you Google NCI speaker series, you'll find that wiki page
and we try to post video material within 10 days of the actual session.
Information about future speakers in this series is also available via Twitter
and on our blog, so please check out those sites for the latest information.
You can just Google NCI blog.
That will be the top result.
And our Twitter handle is @nci_ncip.
And with that, I'll turn over the floor to Jeremy Goecks.
>> Great. Thank you, Tony.
And thank you Julie Klemm for inviting me to talk today.
Let's get started. OK. Oh no.
OK. Fantastic. OK, here we go.
So what I want to talk about today are really three main topics.
[audio problems in first few minutes of recording] I'm going to talk about this high-throughput DNA sequencing era that we're in
and why it matters [inaudible] and what challenges it creates as well
for informaticians and for bioinformaticians.
That will be a lead-in to the Galaxy platform, and I'll talk about some of the things
that Galaxy can do to make it easier to work
in this high-throughput sequencing era [inaudible]
and turn your sequencing data into real, meaningful biological information.
And then I'm going to go on and talk in particular
about how we applied Galaxy more recently to problems in cancer genomics.
OK. So this is probably old hat to some of you.
I know it's new to some others and a good refresher if you've seen it before.
So, as many of you know, there's a revolution going on in biology that's driven
by the high-throughput sequencing technology that's now available.
Here, you're seeing an Illumina HiSeq 2500, a state-of-the-art machine
that can produce gigabases of sequencing data a day.
Not only that, but it's low-cost as well:
the cost per sequence is going down, and the amount of throughput
that you get from this sequencing is going up.
So, a couple of graphs to make this clear.
A lot of people, I'm sure, are familiar with this;
the [inaudible] represents the cost per genome.
So, back in 2001 when we sequenced the human genome,
it cost 100 million dollars.
It progressed along Moore's Law; we're looking at a logarithmic graph here,
so costs are falling steadily.
But starting in 2007, costs began falling incredibly precipitously.
And that leads us to now, where things have kind of leveled out,
though you can argue that there are more advances coming.
But in practice right now, if you go to sequence an exome, for instance
(the small amount of coding DNA that you have, about two percent of our DNA),
it's just under 10,000 dollars.
And if you sequence human genomes at scale, going to Illumina
or Complete Genomics or another large provider, you can get about 10K per genome.
So it's very cheap to sequence human genomes these days.
At the same time that it's cheap,
[inaudible] across the world is continuing to rise.
So you see lots of hotspots around the world,
a lot in the United States, a lot in Europe.
The one at the far right is China; it's home to BGI,
which is a huge sequencing center that, I think, is a private entity
heavily supported by the Chinese government.
So, another graph to drive this home: SRA, which is NIH's Sequence Read Archive,
where people store the raw sequence data
from their published studies, is going up in terms of [inaudible].
So, lots and lots of data are available.
Yeah?
>> Tony Kerlavage: Jeremy.
Sorry for the interruption.
>> Yeah.
>> We are getting a little bit of break up on the,
perhaps if you could dial in on the telephone,
that might work a little bit better.
>> OK. Give me just a couple of seconds to do that.
OK. I'm back.
Can everybody hear me?
>> Much better, thanks.
>> OK. Fantastic.
So should I start over? Is it that bad?
>> No, no just go on from where you are.
>> OK. Fantastic. OK.
So lots of DNA sequencing data is available.
This sounds fantastic from a biologist's perspective.
The challenge is that a couple of years ago,
it started to become clear that computers
and computational technology were not keeping up with the pace
at which we were producing data.
And so the concern here is that computers are becoming the limiting factor,
and that's a really disconcerting notion
because we build computers to make our lives easier, to automate tasks,
and to do tasks that we normally can't do as quickly.
So for computers to be a bottleneck really seems very counterintuitive.
Now, to get at the heart of this, here's a nice graph that I'd like to show,
from 2011, from Mark Gerstein, who argues that,
as you look at the cost of doing a biological experiment
that includes DNA sequencing, you can compare the pre-NGS era,
the era before high-throughput or next-generation sequencing,
on the left, the left bar chart.
Then, around now (2010 to 2015), and into the future,
what you see is that the makeup of the cost of performing an experiment changes:
it used to take a long time to do the sequencing itself,
but now it's going to take much longer to do the data management,
which I feel is drawn a little bit small in these graphs;
data management is not going to be quite this modest.
The downstream analyses as well,
the things where we're trying to draw information out of the sequence,
like differential expression, regulatory networks,
and meaningful regulatory variants, are going to take our time and our effort.
And more to the point, if you consider the wet lab to be the red
and the light purple (the sample collection and the sequencing),
and the dry lab, the computational part, to be the gray and the orange,
we're going to be spending more of our time doing computation in the future
and less time doing the sample collection and sequencing.
And so the way to do more effective genomics, I would argue,
is to focus on the computational infrastructure and make it better.
And so this leads me to the point of why genomic analyses are difficult.
So, on the right you see a typical file that you're going to get
when you sequence some amount of DNA: it's just strings of As, Cs, Gs,
and Ts, along with some identifiers as well as some quality scores.
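The file being described here is in FASTQ format. As an illustration (a generic sketch of the format, not any of Galaxy's own code, with made-up example data), each read is a four-line record, and a few lines of Python can walk through them:

```python
# Sketch of reading the kind of file described above: FASTQ, where each read
# is four lines -- an @identifier, the bases (As, Cs, Gs, Ts), a '+' separator,
# and one quality character per base. The example record is made up.
def read_fastq(lines):
    it = iter(lines)
    for header in it:
        seq, _plus, qual = next(it), next(it), next(it)
        yield header.strip().lstrip("@"), seq.strip(), qual.strip()

example = [
    "@read1",
    "ACGTACGT",
    "+",
    "IIIIHHGF",
]

for name, seq, qual in read_fastq(example):
    # In the common Sanger/Illumina 1.8+ encoding, Phred quality = ASCII - 33
    scores = [ord(c) - 33 for c in qual]
    print(name, seq, scores)
```

A real run produces millions or billions of such records, which is exactly why the speaker says you can't analyze this by hand.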
Now, when you get this, there are a couple of problems that you're going to have
or a couple of challenges that need to be overcome.
One is that many biologists, many research oncologists
and whatnot are unfamiliar with computation, right?
So, the simple fact of the matter is that all this DNA data you're getting,
you're getting, you know, millions, billions of lines of sequencing data.
It requires a computer to analyze; you can't analyze it by hand anymore, right?
You've got to have some form of computation.
But many people do this from a command line with a bunch
of scripts, which is not a very user-friendly way, and it takes a lot of time
for biologists to get up to speed if they have to do that time and again.
Now, even if you're familiar with computation, and a lot of us are,
the process of creating and reproducing this work is hindered
by the complexity that's out there.
There is complexity in different systems.
So, something as simple as: are you using a Windows system or Mac OS?
But this goes deeper.
You have individual scripts that people put together
so they'll write small scripts to do a little bit of data processing.
Oftentimes, those aren't included in publications.
Tools obviously have many parameters, right?
Parameters exist so you can adapt the tools
to the particular needs that you have.
And you need to be able to capture what parameters you use in order
for other people to reproduce it.
So even with the ability to program well,
you still need to build out this broader infrastructure if you want
to both share your analysis with someone else across the country
and enable them to reproduce it.
And then finally, I'd argue that even once you can do this,
if you look at our current mode of publication, PDF documents or online HTML,
these don't really support what you'd want out of a study these days.
You know, if you'd talk to a bioinformatician,
or you talk to a biologist who wants to reproduce somebody else's study
and maybe eventually expand it in their lab,
what are they going to do with that?
Well, they want to first get their hands on the data, be able to run the tool
and produce output that looks like what they [inaudible], right?
And so the text of the paper is certainly very important, right?
It describes what was done.
But it doesn't provide the raw material. The data,
for instance, might be at SRA.
The tools may be hosted on any of a lot of websites. And then the workflows, right?
How do you connect these tools together?
Say I first ran a quality control tool.
And then I ran a tool that trimmed my reads,
so I took away the bases that were of low quality,
and then I mapped those reads, and from those reads
that I mapped, I called variants.
Well, that's four tools,
and how you connect those tools
and which parameters you used are not often captured
and provided in these PDF documents that you see online.
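The four-step chain just described (quality control, trimming, mapping, variant calling) can be sketched abstractly; the point is that both the order of the tools and their parameter values have to be recorded for anyone to reproduce the result. Everything below, including the tool behavior, the quality threshold, and the hg19 reference name, is an illustrative stand-in, not a real pipeline:

```python
# Toy sketch of the four chained steps described above. Each function stands in
# for a real tool; the parameters (min_qual, reference) are exactly the kind
# of detail a paper's PDF often fails to capture.
def quality_control(reads):
    return reads                                   # stand-in for a QC report step

def trim_low_quality(reads, min_qual=20):
    # drop reads containing any base below the threshold (illustrative policy)
    return [(seq, quals) for seq, quals in reads if min(quals) >= min_qual]

def map_reads(reads, reference="hg19"):
    return [{"seq": seq, "ref": reference} for seq, _ in reads]

def call_variants(alignments):
    return [f"variant call against {a['ref']}" for a in alignments]

reads = [("ACGT", [30, 32, 31, 33]), ("GGTA", [10, 12, 30, 31])]
variants = call_variants(map_reads(trim_low_quality(quality_control(reads))))
```

Change `min_qual` or the reference and the output changes, which is why unstated parameters break reproducibility.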
So there are a lot of challenges with these genomic analyses,
whether you're well-versed in computation or not.
And so that's really the lead-in
to argue why Galaxy is needed in this day and age.
OK. So the Galaxy project has three fundamental goals,
and they're directly derived from the difficulties that I argued are abundant
when you try to analyze DNA sequencing data right now,
when genomics, or any biomedical science, becomes dependent
on computational methods.
And I'll interject a little bit in the parenthesis here to say
that there are lots of other areas that are dependent
on computational methods these days too
because you're generating large amounts of data.
The imaging community definitely has a need to be able
to do high throughput computational biology as well.
So this isn't just genomics.
So let's talk about genomics and say that we want to make sure that our tools
and workflows are accessible to scientists in general.
And we want to do so in a way that allows scientists
with limited computational experience to still be able
to analyze their DNA sequence data.
We want to ensure that these analyses are reproducible.
And then finally, we want to enable transfer communication and reuse, right?
So we don't want people to reinvent the wheel.
We want to be able to allow people to build on what others have done.
And that requires, as I talked about before,
the ability to not only access those raw resources, but to put them together quickly
to understand what they need.
So, there are lots of ways you could potentially accomplish these three goals:
accessibility, reproducibility, and shareability.
Galaxy comes at this from an open, web-based platform approach.
And from here I'm going to launch into a Galaxy demo
that I'm hoping will come across well.
So let's bring this up and let's make it full screen.
And so, I'm going to demo Galaxy and then I'll talk
about the different pieces of it.
But it's good for you just to get a sense of what's going on rather
than going into the details.
So as I said, Galaxy is a web-based platform.
I'm starting this movie here.
Everybody can see it?
Yes? OK, so you're seeing a web browser.
This happens to be Galaxy on our local cluster here at Emory.
On the left is a bunch of tools that are available in Galaxy.
These are actually grouped by category right now.
So we have hundreds if not thousands of tools that you can plug into Galaxy
if you're motivated to do so.
In the middle is our analysis pane and that will come
into play in just a minute.
And on the right are data sets.
So this happens to be in the middle of an analysis
that I'm working on for pancreatic cancer.
I've done some analyses here.
I have a bunch of data sets.
My data sets are stacked up in green here.
Now I want to show you a really simple instance of running a tool in Galaxy.
And what we're going to do is take a set of assembled transcripts
that we have. Transcripts, for those who aren't well-versed,
are different forms of a gene.
It turns out that when you read the DNA, sometimes you can skip parts of it.
And when you skip those parts you create different transcripts,
and those make for different proteins [inaudible].
So, one of the things that we care about sometimes,
when we're looking at a cancer sample,
is which transcripts are present and which transcripts are absent.
Now, we've used a particular tool called Cufflinks.
It's very popular.
And we've assembled a bunch of transcripts.
And what we're going to do is we're going to filter this set
of transcripts only to look at chromosome 2.
We may care about chromosome 2 because we want to do an
in-depth analysis or something like that.
What I want to demonstrate is how easy it is to filter this with Galaxy.
Galaxy has a filter tool.
Over here in the left you'll see that we have a filter
and sort collection of tools.
And there's a tool called filter here.
In the middle pane now, we see our dataset.
It's a bunch of text.
The first column has the chromosome name that we are working on.
So what we're going to do is we're going to choose this dataset
of assembled transcripts as our input.
We're going to give it a filter condition saying column 1 is
equal to chromosome 2.
So we click execute.
Now what we're doing here is running the filter tool back on our cluster.
So it's going to start out gray, then it's going to go to yellow in a second
while it's running on the cluster.
And then it's going to finish up and turn green at the end.
What we've just done here
should be relatively impressive, I hope, if you understand it.
So, I'm going to pause it. What we've done is: we have data on our server,
and we were just able to run a filter tool. Because of Galaxy and its connection
to the computing cluster that we have here at Emory,
we were able to run that tool on the cluster.
So through our web browser, we were able to initiate a job
on a high-performance computing cluster,
run it, and see the results in our web browser.
What we've seen here is we filtered our assembled transcripts.
We've successfully identified all the transcripts that are on chromosome 2, OK?
So this is really the canonical way
in which you run an individual tool in Galaxy.
You take a web form in the middle.
You select the tool that you want to use.
You have some inputs.
You've got some parameters.
You click Run.
It goes and it runs on your high-performance computing resources
and it comes back when it's done.
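For flavor, the filter step from the demo can be approximated outside Galaxy. Galaxy's Filter tool takes a small expression over numbered columns (c1, c2, ...) of a tab-separated dataset; the sketch below imitates that convention, but it's an illustration with made-up transcript lines, not Galaxy's implementation:

```python
# Standalone imitation of the Filter step from the demo: keep only rows of a
# tab-separated dataset whose first column (c1) is chromosome 2.
def filter_rows(lines, condition):
    kept = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        names = {f"c{i + 1}": col for i, col in enumerate(cols)}  # c1, c2, ...
        if eval(condition, {}, names):  # evaluate the condition over the row
            kept.append(line)
    return kept

transcripts = [
    "chr1\tCufflinks\ttranscript\t100\t900",
    "chr2\tCufflinks\ttranscript\t200\t800",
    "chr2\tCufflinks\texon\t200\t400",
]
chr2_only = filter_rows(transcripts, "c1 == 'chr2'")
```

The difference in Galaxy is that the same operation runs as a tracked job on the cluster, with the condition string stored so the step can be reproduced later.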
Now, I've talked about this notion
that you oftentimes want to connect tools together.
And so this is what I want to show you now: Galaxy makes it pretty easy
to extract the workflow, or pipeline, of all the steps that I used
in this particular analysis.
So here, Galaxy has the steps that I used in the middle pane.
You can see that I used STAR, which is an RNA-seq aligner,
and Tophat2, which is a different RNA-seq aligner.
And I'm creating a workflow from these steps. I also have
the filter steps that I performed at the bottom, as well
as the Cufflinks steps that I mentioned earlier.
And so I'm going to take all these steps now; the outputs of, say, STAR
and Tophat go into the input of Cufflinks.
And so in order to show that relationship, Galaxy has this workflow editor.
We'll bring up the workflow editor now.
You can start to see we have all our steps that were done in this workflow.
I'm able to pull it apart here and see the connections, right?
So I can see what was done.
I can also edit these connections, add connections,
change parameter values that are stored
and then save this workflow for future analysis.
The idea behind this is that once you have a workflow in place in Galaxy,
you can use it again and again to do perfectly reproducible analyses.
I can take the same steps-- OK, and that video is done.
OK. I can do the same steps with the workflow,
keep the same parameter values, and run it again
and again on multiple samples.
So this is a nice way to achieve reproducibility
with Galaxy in a nice, visual way.
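The reproducibility idea here boils down to a few lines: a workflow is an ordered list of steps with their parameter values stored alongside them, so re-running it applies identical settings to each new sample. The step names and parameters below are illustrative placeholders, not an actual Galaxy workflow definition:

```python
# Minimal sketch of a stored workflow: ordered steps plus frozen parameters,
# re-runnable on any number of samples with the exact same settings.
workflow = [
    ("trim", {"min_qual": 20}),
    ("map", {"aligner": "STAR", "reference": "hg19"}),
    ("assemble", {"tool": "Cufflinks"}),
]

def run_step(step, params, sample):
    # stand-in for submitting a real tool to the cluster
    return f"{step}({sample}, {sorted(params.items())})"

def run_workflow(workflow, sample):
    return [run_step(step, params, sample) for step, params in workflow]

# same steps, same parameter values, applied to multiple samples
runs = {s: run_workflow(workflow, s) for s in ("sample1", "sample2")}
```

Because the parameters travel with the workflow rather than living in someone's shell history, every sample goes through exactly the same analysis.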
Let me switch back to the presentation now.
OK, so to be explicit about this, I talked about accessibility.
[Inaudible] Galaxy makes tools,
computational tools, accessible to everybody.
Every tool looks roughly the same, so you saw the Filter tool in action.
Here, we have the Cufflinks tool, which as I've talked about is a tool
for assembling transcripts from short reads.
And it looks roughly the same.
You have an input at the top.
You have some parameter values that you can set,
such as the maximum intron length for your transcripts,
the minimum isoform fraction, and so on.
And then there's a big blue Execute button at the bottom
that starts that job for you.
There's no command line or programming as of right now.
I'll talk about how you can use Galaxy with a command line later.
But for right now, in terms of accessibility, Galaxy ensures you don't have
to look at a command line and try to figure out what to type.
Instead we're using web [inaudible].
And what is possible to do, and I kind of hinted at this,
or the video hinted at this,
is that you can chain tools together to create larger analyses.
So we're not talking about one tool here.
We're talking about the ability to collect a bunch of tools in Galaxy
and use the tools one after another to build a longer process.
And what Galaxy ensures is that the output from one tool can be turned
into the input to another tool.
In terms of reproducibility, as I argued,
workflows enable this precise reproducibility because everything is stored
in Galaxy, and it's stored automatically.
So that if I choose to run this workflow in Galaxy,
I'll get the exact parameter values that I set up.
And I'll get the exact order of steps, inputs and outputs connected accordingly.
Galaxy also has some options where users can add tags and annotations
to further describe their analyses.
So they can say, I'm running this step to filter bases that are of low quality.
Or I'm running this step because I only care
about transcripts in chromosome 2.
So Galaxy, I'm fond of saying, captures the "what" quite well;
computers in general can tell you exactly what you did in the past.
But the "why" requires somebody saying: here's why I'm doing this step.
And with Galaxy, you can do that as well,
and so you couple the what with the why
and you get a full description of a [inaudible].
And finally communication and reuse.
I didn't demonstrate this in the video, but what Galaxy has is this notion
of Pages, and this is kind of our take
on what a next generation
of computational biology publication looks like.
Here, what you see is a webpage, and you can actually go to it. It was done
by one of our collaborators at Penn State
and relates to a paper they did looking at admixture
and demographic footprints across polar bear and brown bear genomes,
looking for signals of past climate change.
Now, this is a supplement, an interactive supplement, for this paper,
and in it you can see Galaxy datasets and,
at the bottom, a Galaxy workflow.
And so what a reader can do is: they come, they read the paper,
and they too want to perform this analysis.
If they want to do that, they can click on these datasets
and workflows to examine them in real time, and click on the little Import button
shown in the middle screenshot.
You can see that there's an Import Workflow button.
You click on the Import Workflow button,
a copy of that workflow is made, it's copied into your analysis workspace,
and you can immediately dig into it in the workflow editor,
or you can use it right away.
So what we're fond of saying here is that Galaxy enables these readers,
these readers of these supplements to immediately become active data analysts.
Because they can take these embedded Galaxy objects in these Galaxy pages
and begin using them immediately.
We like to use Pages both for interactive paper supplements,
and they're also especially useful for tutorials.
So we have tutorials made for RNA-seq analysis, various other analyses and so on,
that have been released as Pages as well.
OK, so now, I'm going to talk you through a little bit about the features
of Galaxy at a high level and here's our description of Galaxy:
Galaxy is a platform for high throughput genomics.
It has three very important aspects that I would argue capture much
of the scientific process when you're doing a genomics analysis experiment.
First, you can get your own data into Galaxy
and you can integrate it with public data.
So you can download, say, 1000 Genomes data,
or iGenomes, a popular resource that's used with RNA-seq analysis in Galaxy,
and couple it with your private data: you put your private data
and the public data together.
With Galaxy, you can also analyze data and create workflows.
So, this is the core of the analysis being able to run tools,
chain those tools together to create workflows.
And then finally, you can do things that really aid
the process of understanding your data.
You can do visualization in Galaxy.
You can share any object in Galaxy, any history, any workflow, any page,
they all get a static URL that you can share
with colleagues or include in publications.
And then finally, I showed this notion with Galaxy pages which is our take
on this next generation of publication.
From a software perspective,
Galaxy is open-source software that you can download and you can run
on various high-performance computing resources.
So we have a public website at usegalaxy.org
that you're welcome to sign up for and use.
It's connected to a high performance computing cluster down in Texas
as part of the [inaudible] system.
And so, we have really nice resources behind our public instance right now
that works really well.
You can run a local instance.
So for instance, this video I shared with you was taken on our local instance
at Emory and if you're using the Emory high performance computing resources.
And then finally, you can run Galaxy on the cloud.
And so, I'm not going to go deep into the cloud just yet;
there are some slides a little bit later on that talk
about how to use Galaxy on the cloud.
Now, one criticism of Galaxy, one potential criticism of Galaxy, is
that it locks the bioinformatician out of using what they've developed
and invested time and effort in, which is writing scripts.
And so to ensure that bioinformaticians aren't left
out of the Galaxy world, and to ensure
that bioinformaticians can work productively with biologists side by side
in the lab, we're building Galaxy out into a platform.
The core of it is still running tools and workflows
on high-performance computing resources, but we've also provided an API to Galaxy
so the bioinformatician can script against Galaxy.
So it's no longer necessary to go into Galaxy and manually start a workflow.
You can actually tell Galaxy: import this data from this particular file,
take these datasets that I've created from these files
in my local directory, and run a workflow on them.
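As a sketch of what that scripting looks like: Galaxy exposes a REST API under /api (and there is also a Python client library, BioBlend, that wraps it). The snippet below only assembles example requests without sending them; the server URL, API key, placeholder IDs, and payload field names are illustrative assumptions, not a definitive client for any particular Galaxy release:

```python
# Illustrative sketch of scripting against the Galaxy API: build (but do not
# send) requests for creating a history and invoking a stored workflow.
# The URL, API key, and IDs below are placeholders.
import json

GALAXY = "https://galaxy.example.org"   # hypothetical Galaxy server
API_KEY = "0123abcd"                    # per-user key from Galaxy's preferences

def api_request(path, payload):
    """Assemble a JSON request following Galaxy's /api conventions."""
    return {
        "url": f"{GALAXY}/api/{path}",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"key": API_KEY, **payload}),
    }

# 1. create a history to hold the scripted analysis
create_history = api_request("histories", {"name": "scripted run"})

# 2. invoke a stored workflow on a dataset already uploaded to that history
run_workflow = api_request(
    "workflows/WORKFLOW_ID/invocations",          # placeholder workflow id
    {"inputs": {"0": {"src": "hda", "id": "DATASET_ID"}}},
)
```

In practice you would send these with an HTTP client (or let BioBlend do it), which is what makes the "import data, run workflow, no browser needed" scenario above possible.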
And so, the gist of this (and this is a kind of unrefined picture right now,
an unrefined slide, unfortunately) is that Galaxy is available to end users
through a web browser, at the bottom left here.
In the middle though, Galaxy's API makes Galaxy available
to the bioinformatician so they can script against Galaxy.
And then finally, Galaxy has a plug-in architecture as well, plug-ins for tools,
plug-ins for visualization.
And so, Galaxy is definitely meant to be extensible.
It's not an end-all, be-all, but it's a platform that allows you
to put together different tools and different visualizations,
and draw together different areas of expertise,
to let you do genomics analysis experiments.
One thing that's becoming more and more important
(this is my second bullet point up there, minimizing data movement)
is this increasing focus on remote computing:
the notion that the data is too large
to bring down to your laptop or your desktop these days,
and it's often not possible to analyze it using those local resources.
So you have to use these high-performance resources.
Because of that, Galaxy uses this remote computing model and uses the web
as kind of the vehicle for that.
Your data sits on the Galaxy server, which is connected
to your high-performance computing cluster, and your access to the data is
through the web browser or through the Galaxy API,
where you can download snippets of the data.
So we've made it very easy, for instance, for visualizations:
we only bring over the data that you're looking at.
If you're looking at chromosome 1 data,
that's all you're going to see from Galaxy.
You can download the whole dataset if you want,
but the default is to do this remote computing and to minimize data movement
across different platforms and across the network.
OK. So, I want to mention this briefly 'cause I feel
like it's a pretty nice feature of Galaxy.
I talked about how you could run Galaxy in the cloud.
And in particular, on the Amazon Cloud is what we've targeted so far.
So we're looking at our public server
and there's this feature called Cloud Launch,
so there's a cloud tab, which you see in the middle,
where you can create a new cloud cluster.
If we kind of walk through this, you enter your access key ID,
your secret key, and a cluster password.
Once you do that, you click a couple of buttons
and Galaxy magically is allocated onto the cloud with a whole bunch of tools,
a whole bunch of indices, for running advanced high-performance, in particular,
high throughput sequencing tools like mapping tools and whatnot.
And you have this cluster strictly from your web browser.
So, this is a point I'll probably make a couple more times in this presentation:
Galaxy is meant to make it possible
to access all the functionality strictly through a web browser.
And here, I've shown how you can do it pretty easily with Cloud Launch.
If I want to launch a cloud cluster and
take advantage of these cloud resources that are out there on Amazon
(maybe I don't want to wait in the queue on the public server,
or maybe I don't have access to high-performance computing resources
at my institution), I can use the cloud to do this analysis.
Galaxy makes it possible to start up a Galaxy
instance of your very own on the Cloud.
And it looks just like any other Galaxy
instance at the end of the day;
it just says, Welcome to Galaxy on the Cloud, OK.
So why do you want to use the Cloud in particular?
I often use the example of read mapping versus assembly.
So when you map your short reads that you get off
with the DNA sequencing machine, you want lots of [inaudible],
versus when you try to assemble a genome or a transcriptome or some subset
of DNA, you want a lot of memory.
With the Cloud, you can allocate your resources as needed.
So you can say: I'm going to allocate a bunch of nodes with a bunch of CPUs.
Or: I want very few nodes,
but I want them with a lot of memory.
And so the Cloud is nice-- very nice as you can provision these resources
as needed to meet your bioinformatics needs.
Galaxy makes it possible to autoscale your cluster.
So if, for instance, you're at full utilization on your cluster,
you can build out and add more nodes to the cluster
so your analysis gets done faster.
And then finally with Galaxy, you know,
we place a large emphasis within the team on ensuring that everything
within Galaxy is shareable, because shareable means that other people can use it
and build on it, and that's how you help promote reproducibility
and enable this building of blocks:
if you have some initial Galaxy pipelines, somebody builds on those,
and you continue to build out rather than starting from scratch.
On the cloud, this sharing and this focus on reusability
is manifested as something called snapshotting,
where you can take your complete Galaxy instance, including the tools you have
and the analyses you've done, and share a copy of that with somebody else.
And so not only do you get the tools, you get the workflows,
the histories, and all the data,
without having to set any of it up yourself.
And so it makes it very nice to say: OK,
you want to reproduce some analysis?
Here is how you can do it on the cloud using Galaxy.
OK. So, just a bit about visualization with Galaxy.
This is something relatively new; Galaxy has traditionally been a tool
where you run your analysis, you run your pipeline, and then you have
to get your data out of Galaxy and visualize it somewhere else.
That's changing, and we're doing
that by introducing web-based visualization within Galaxy.
And so here's a really simple example of a scatterplot.
And so we're looking at differential expression data, as it turns
out, in this particular scatterplot.
And this is a nice way to show that you can keep your data
in Galaxy using this remote computing model
but visualize it as well in real time.
All these visualizations I show you are interactive as well so we can see
that in this particular screenshot we're mousing over looking at gene names
and we're looking at treatment versus control [inaudible] values.
We also have our own genome browser.
I mentioned this before.
It's driven by this notion that you only retrieve the data
that you're actually looking at.
And so it scales quite well.
It works in a standard web browser, first of all,
so it takes advantage of [inaudible] technology to draw quite quickly.
But it also doesn't tax the network at too high a level
by downloading, say, a whole BAM.
So for instance, the fourth track down, the track that's blue and green,
is a bunch of reads, RNA-seq reads.
There's no doubt that this file is at least, you know,
a couple of gigabytes in size, but when you visualize it in Galaxy
and view it, you're only going to download the small subset you're looking at.
When you zoom somewhere else, you're going to get a different subset.
We have Circos plots as well.
Same principles apply here where we index the data, we can process the data
and then you can view it in real time.
And so here, we're looking at, I think, a bunch of read tracks from some
cancerous tissue.
It's all been converted into small numerical formats,
so we can look at genome-wide expression as well
as these fusions that go across the middle.
Phylogenetic trees as well are available in Galaxy.
So, just to-- everybody always asks, you know,
what does Galaxy look like in terms of its usage.
So here's a little bit of information about that.
Our public website is very popular.
We get about 500 new users every month, and they analyze a great deal of data.
And I think this figure is outdated at this point.
I think last month we ran about 150,000 jobs
because we got additional computing resources from [inaudible].
And so this number-- the number of analysis jobs continues to grow at Galaxy.
One limitation that we do have
on our public web server now is data restrictions.
We implemented these about a year ago,
and I think the way they work right now is
that there's a maximum size per data set.
A maximum size per history, which is--
it's kind of like a folder in Galaxy, a set of analysis jobs that you've run.
So I think the maximum size per data set is 50 gigabytes right now.
The maximum size for history is 200 gigabytes.
And then there is a total maximum size that you can use
across our public instance.
And we do this because we're actually having more trouble right now
managing the large amounts of data that people are providing
to our public website than the compute jobs.
The compute jobs we can set up, and people may have to wait for a couple of days
in the queue, but their jobs run.
On the other hand, the data that they're generating is just so large
that we do have to limit it right now.
Switching topics slightly:
Galaxy has been used and cited in more than a thousand publications.
So not only are people using Galaxy, but they seem to be using Galaxy
to effectively make scientific conclusions
that they can include in publications.
Galaxy is popular in terms of the local installations.
So I talked about how you can install Galaxy on your local computing cluster
as well and Galaxy has been used-- it's being used in lots of U.S. institutions,
European institutions as well as over in Asia.
And then there's the BioTeam.
BioTeam is a bioinformatics consulting firm out of Boston.
And they recently set up a high-performance computing box
where Galaxy is installed by default, along with a bunch of prominent tools
that are made available as well.
And so, this is a nice option if you don't want to try to set
up your own Galaxy and you want some kind of dedicated hardware, you know,
because you have sensitive data that you don't want to ship across the network
or you just want it nearby: you can get a SlipStream appliance
from BioTeam and have all of that set up this way.
OK, what? OK.
So, I have like 10 minutes left.
So I'm going to breeze through this.
There are lots of ways that you could extend Galaxy and there are lots
of reasons why you might choose--
OK, so let me back up and say that with Galaxy, you can't do everything.
We get lots of feature requests
about: can Galaxy do this, and can Galaxy do that?
And how we usually decide about what to do with Galaxy is we try to engage
in a genomics research project and then figure out what are the challenges
that we encounter in this project when we try to use Galaxy for it
and then design the features to mitigate those difficulties that we have.
And when we go through this process, we get out a nice research product
where we've enhanced Galaxy to solve a real biological problem
and at the same time, we've made advances in biology as well.
And it's just kind of-- fun to study biology and do science along
with the computer science that we do so that's an added bonus as well.
So, all that is kind of contextualization,
to say that the current project that I'm involved
with is a personalized oncology project,
trying to use Galaxy to personalize oncology and this is joint work
with the Emory Winship Cancer Institute.
What we started with is we're looking at pancreatic cancer and we happened
to have six patients and we had RNA-seq data,
so whole transcriptome sequencing of their primary tumor.
Here, we're looking at mixed populations of cells.
So there's a great deal of really fascinating cancer work
that's sequencing single cells.
But here, we're looking at mixed population.
So we get the tumor.
We pull some section of it out and we sequence it.
We're actually looking at three patients that have high ERCC expression;
ERCC is a DNA repair gene.
And so the question is if we look at three patients with high expression
and three patients with low expression, how do they differ
and how might their treatment differ?
At the same time, we also have three pancreatic cancer cell lines
that we've been looking at down here,
looking at them for high throughput screening purposes.
So, high throughput screening is this idea that if you have a cancer cell line
and you have the robotics set up, you can apply lots of drugs simultaneously
to the cancer cell lines, figure
out which drugs the cancer cell line is most susceptible to
and then use that information to potentially guide treatment.
So, this is a very small experiment in the grand scheme of things.
But, it's a nice starting point.
The big questions we have here is,
can we determine what drugs the patient is likely to respond
to given his or her genomic profile?
And in the past, a genomic profile has often meant mutational data,
maybe gene expression data pulled from microarray analysis.
But increasingly, and with this experiment,
we tried to view what we said was an integrated genomic profile,
a profile that includes more than just one type of genomic data.
So in this case, we're looking at mutation and gene expression.
And then secondly with cancer, there are lots of fantastic public resources
that are available these days.
Things like TCGA, ICGC, cancer cell line encyclopedias, and whatnot.
And the question is how can we combine private patient data
with the public data that's out there to try to personalize treatment?
And broadly you can imagine two ways to do this.
You can take the genomic profile of patients and match them directly to drugs
if that information is available or you could potentially match a patient
to a cell line that looks similar and then use that cell line either
through high-throughput screening or known drug response information
to again guide treatment for this patient.
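The second strategy, matching a patient to a similar-looking cell line, can be sketched as a nearest-neighbor search over expression profiles. Everything below (gene counts, cell line values, the choice of cosine similarity) is a made-up illustration, not the project's actual matching method:

```python
import math

# Compare a patient's expression profile against a few cell lines and
# pick the most similar one. All values are invented for illustration.
def cosine(a, b):
    """Cosine similarity between two equal-length expression vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

patient = [5.0, 0.1, 3.2]  # expression of three genes (hypothetical)
cell_lines = {
    "MIA PaCa-2": [4.8, 0.2, 3.0],
    "PANC-1":     [0.3, 6.1, 0.4],
}
best_match = max(cell_lines, key=lambda name: cosine(patient, cell_lines[name]))
```

A real matcher would use many more genes and a distance measure chosen with more care, but the shape of the computation is the same.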
OK. So, [inaudible] if we try to use Galaxy for this particular process,
what we're going to do is introduce new tools for Galaxy,
introduce new workflows, and introduce new visual analysis applications.
So, here's a workflow that I have developed for moving directly
from exome target, in this case targeted exome,
but exome sequencing data all the way to potential drug.
And there are a lot of steps here.
Let me see if I can breeze through this and talk briefly about what we're doing.
So, we're starting at the top-left, we're going to map some reads that we get.
We're going to move down.
We're going to do some quality control in terms
of removing duplicates, and compute a pileup.
At the top of the next column,
this VarScan step is calling the variants.
So after we've mapped the reads,
we're going to find the mutations that the patient has.
We're going to do some annotation of these mutations and remove the common ones.
So, we're going to do some processing.
At the top of the third column, we annotate our VCF with information
such as minor allele frequency based on 1000 Genomes data
or the Exome Sequencing Project.
Any kind of common variants we find,
we're going to remove immediately because they're unlikely to be the cause
of the cancer and are unlikely to help guide treatment.
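The common-variant removal just described boils down to a frequency cutoff on annotated records. Below is a minimal sketch with hand-written VCF-style lines and a hypothetical `MAF` INFO key; the real workflow annotates with 1000 Genomes and ESP frequencies and filters inside Galaxy:

```python
# Drop variants that are common in the population (high minor allele
# frequency), since they are unlikely to drive the cancer.
COMMON_MAF = 0.01  # illustrative threshold, not a clinical recommendation

def parse_info(info):
    """Parse a VCF INFO column like 'DP=88;MAF=0.0001' into a dict."""
    return dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)

def is_rare(record):
    info = parse_info(record.split("\t")[7])  # INFO is the 8th VCF column
    maf = float(info.get("MAF", 0.0))         # unseen variants treated as rare
    return maf < COMMON_MAF

records = [
    "chr12\t25398284\t.\tC\tT\t60\tPASS\tDP=88;MAF=0.0001",  # rare: keep
    "chr1\t1158631\t.\tA\tG\t60\tPASS\tDP=70;MAF=0.31",      # common: drop
]
rare_variants = [r for r in records if is_rare(r)]
```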
Then we're going to filter--
finally, the second to the last step in the lower right,
what we're going to do is we're going to annotate
with the drug-gene interaction database.
And finally, we're going to get out a list of potential drugs.
So, I want to highlight a couple of things with this workflow.
Number one is that we're actually drawing on a large amount
of public information to make it possible.
So, in particular, in the Annotate VCF step we're pulling
in 1000 Genomes data, and we're pulling in ESP,
or Exome Sequencing Project, information;
any variants that we've seen before,
and especially common variants, we're going to remove.
And then the Drug Gene Interaction Database is something that's relatively new
from Washington University.
And it's a really nice example of what you can do with--
computing technology when you apply it and try to aggregate information.
So in this case, this Drug Gene Interaction Database aggregates information
for many common drug databases and makes them widely available
through a web API, that we can access.
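Querying a drug-gene interaction API like DGIdb's might look roughly like the following. Both the endpoint URL and the response shape are assumptions modeled on DGIdb's v2 API and may have changed, so check the current API documentation; the sample payload is a stand-in, not real DGIdb output:

```python
import json
from urllib.parse import urlencode

def build_query_url(genes):
    """Build an interactions query for a list of gene symbols."""
    base = "https://dgidb.org/api/v2/interactions.json"  # assumed endpoint
    return base + "?" + urlencode({"genes": ",".join(genes)})

def drugs_from_response(payload):
    """Flatten an interactions payload into a gene -> drug-names mapping."""
    result = {}
    for match in payload.get("matchedTerms", []):
        result[match["geneName"]] = [
            i["drugName"] for i in match.get("interactions", [])
        ]
    return result

url = build_query_url(["KRAS", "TP53"])

# Stand-in response, shaped like what such an API might return:
sample = json.loads("""{
  "matchedTerms": [
    {"geneName": "TP53",
     "interactions": [{"drugName": "DRUG-A"}, {"drugName": "DRUG-B"}]}
  ]
}""")
gene_drugs = drugs_from_response(sample)
```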
I would also make the point that we could black box this, right?
So we could just provide a tool that says: here, give us your [inaudible] data,
and at the end of the day we're going to use drug information as well
as annotated variants to help you guide treatment.
But the advantage of doing this as a workflow in Galaxy
is that it's understandable.
You know, I walked through this very briefly, and I'm sure most scientists,
if not all scientists, who work in research oncology
can understand pretty much what's going on here.
Now, because it's understandable, it's also going to be editable.
So, say for instance you don't like [inaudible] as a variant caller
and you want to use something else: maybe you want to use [inaudible] tool,
maybe you want to use FreeBayes or [inaudible].
Well, you could plug that in and try that instead.
You can also change parameters if you want,
at any one of these individual steps.
So it's quite editable, amenable to being edited both by bioinformaticians
as well as by oncologists and researchers
that don't have computational experience.
And then finally, this workflow is shareable.
So I can take this workflow, and somebody else,
in a completely different Galaxy instance, can import it,
set up the appropriate tools, and just go and run with it.
OK. So how would you validate a workflow like this?
We're going to validate using public data and I'm quickly running
out of time, so I'm going to breeze past some of this.
But for validation: the Cancer Cell Line Encyclopedia is more public data,
a huge encyclopedia of cancer cell lines;
it has mutational information and it also has drug response information.
And so, we can put our cell line data through it--
this particular cell line, MIA PaCa-2, is quite popular--
and we were able to recover all mutations, including the short insertions
and deletions, that are present in the Cancer Cell Line Encyclopedia.
And we found drugs that roughly match what they saw.
I mean, there is an interesting
Nature paper recently that looked at drug response
across different cancer cell line projects
and found that there's a large amount of variance.
But, that notwithstanding,
we had relatively good matches when looking for drugs
that were applicable to the cell line,
because the cell line had responded well to them in the past.
So we found 16 druggable drugs, 98 potential--
or 16 druggable genes and 98 potential drugs.
OK. One last point, cancer drugs are inhibitors in general, right?
And so, gene expression becomes quite important.
So how can we leverage that?
Well, I'm completely out of time,
so I'm just going to say that there are fancy workflows available in Galaxy
for analysis of RNA-seq data, which are going to give you, amongst other things,
differential expression levels.
You can take those differential expression levels, convert them
to gene regions, and then slice down those potentially druggable genes
and potential drugs, looking only for mutations in,
and only for drugs that would impact, highly expressed genes.
And when we do that, we reduce the number of druggable genes from 16 to 6.
We don't reduce the number of drugs quite as much, from 98 to 62,
but we still get a nice reduction there.
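The slicing step described above is essentially an intersection between the druggable-gene list and the set of highly expressed genes. A minimal sketch, with made-up gene-to-drug assignments rather than the real 16-gene list:

```python
# Keep only drug-gene pairs whose target gene is highly expressed in
# this tumor; the gene and drug assignments here are illustrative.
druggable = {
    "ERBB2": ["lapatinib"],
    "KRAS":  ["hypothetical-drug-X"],
    "EGFR":  ["erlotinib", "gefitinib"],
}
highly_expressed = {"EGFR", "ERBB2"}  # e.g. derived from RNA-seq levels

# Intersect: druggable genes that are also highly expressed.
filtered = {g: d for g, d in druggable.items() if g in highly_expressed}
kept_drugs = sorted(drug for drugs in filtered.values() for drug in drugs)
```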
I have no time to talk about matching patients and cancer cell lines.
Suffice it to say that we can do this in Galaxy, and we do it with workflows.
I'm actually going to get out of presentation mode quickly and just go to pen.
And if you're curious about this, send me an email,
I'll happily share my slides with you.
So concluding thoughts.
I've tried to argue that Galaxy is a useful platform
for high-throughput genomics whose foundations are accessible, reproducible,
and collaborative research, and that it can be used publicly on usegalaxy.org
as well as run locally or on the cloud.
And in terms of extending Galaxy for cancer analyses, we've introduced new tools,
workflows, and visualizations that allow for this.
It's currently driven by a personalized oncology project that we're working on.
And we would argue that integration is key here.
You need to be able to integrate not only different types of genomic data
such as mutation and gene expression but also you need to be able
to integrate your private data and your public data.
And I really view this as a
great future opportunity, not only for Galaxy
but for the bioinformatics community in general,
to take all these fantastic tools and resources and data that are being produced
and put them together to make sense of them which is hard.
But I think there are fantastic--
fantastic findings waiting to be discovered there.
The Galaxy team is large and we're spread across Emory University, Penn State,
and George Washington right now.
We are generously supported by NHGRI as well as NSF.
And our principal collaborator at the Winship Cancer Institute is Mike Rossi.
Thank you very much.