FEMALE SPEAKER: All right, folks.
Well, it sounds like it's about time to get started.
So good morning, ladies and gentlemen.
We are pleased to have with us here this morning Dr. Tim
Hubbard, who is the gentleman at the Wellcome Trust Sanger
Institute who is responsible for the group that--
what is the correct term?
TIM HUBBARD: Annotates the human genome.
FEMALE SPEAKER: Thank you very much.
And his group was responsible for annotating one third of the
human genome sequence.
So without further ado, I will turn the mike over to Dr.
Hubbard, who will give us his presentation on keeping up
with the human genome.
TIM HUBBARD: Thanks very much.
So it's nice to be back at Google.
Oh, this is interesting.
My mouse has died.
I was here for the Google Foo Camp in August, and I didn't
really talk about this particular thing.
I talked more about openness things then.
So I thought since I'm in the area for the CASP competition,
which has just finished yesterday, that I'd come and
drop in and talk about this.
So this is about genomes.
Genomes are a very, very recent occurrence, as far as having
the sequence is concerned.
We've had the first one only since 1995, and
since about 2000, we've had the human genome,
which is huge.
And the problem is that lots of people want to look at this
data, use this data, integrate it.
So I'm going to talk about how we do that in Ensembl, which
is one of the big three browsers that provides access.
A bit of background, since I'm in the US, and a company, and
you probably don't know some of this background.
Wellcome Trust Sanger Institute--
we sequenced one third of the human genome as part of the
public international partnership that put it out in
the public domain.
We're a big center by pretty much any standards--
not as big as you, but biggish.
So we have a 1,000-square-meter computer
center, for example, which we're doing rotation farming
on, filling it up with nodes each year, killing off one of
the nodes and replacing it, that sort of approach.
We're funded by the Wellcome Trust. Wellcome Trust used to
be the richest charity in the world until the Gates
Foundation overtook it, which they're kind of upset about,
but anyway.
It's a very big charity and we spend around 15% of their
spend each year.
So we're doing big science, high throughput data.
And all of our data is put out into the public domain, all
the large-scale projects.
And we're based on a campus just outside
Cambridge in the UK.
And we have another institute on that campus, the European
Bioinformatics Institute.
That's the equivalent of NCBI, so you've got Sanger
Informatics and EBI Informatics next door, maybe
400, 500 informaticians, as well as another 400, 500
experimental scientists in the Sanger.
So one of the big projects I'm involved in, I'm one of the
leads on, is Ensembl.
This started out dealing with just the human genome.
It's now got around 30 genomes in it, 30 big genomes.
It's got references going right down to yeast, but most
of what it's got is the large three gigabase genomes.
So a lot of the important stuff about having these
genomes is the sequence of relationships between them.
Some things are extremely distant--
this is a family tree of genomes.
Some things are very related, such as-- you know, right at
the top, you've got chimpanzee.
Then you've got a lot of mammals.
Then you've got more distant things, going down to other
model organisms colored in blue, which have been
completely sequenced.
Most of the other ones are in a fairly rough state.
Now in terms of informatics, the natural way of looking at
these things is a continuous coordinate system.
That's the karyotype of humans.
So you've got these 22 chromosomes and X and Y. They
vary in length from around 250 million letters down to around
60 million letters.
So that's how we'd like to address them, 1 to n, with the
coordinates.
Now in actual fact, the way these things have been
sequenced is as thousands of pieces, or millions of pieces,
depending on the technique.
And so the informatics challenge is organizing that,
handling it.
Quite a lot of the problem is sequence transformations,
being able to handle between these different coordinate
systems. It's not just within a genome, it's between genomes
that are related to each other, and also more exotic
transformations.
You've got the genes which are only
sub-parts of those genomes.
Now, we deliver more than just the sequence.
So here's a piece of sequence.
This is a tiny piece of sequence.
It's less than a millionth of the human genome.
But most of what we do is serve
information on top of that.
So here's one of these views.
It's a small piece of sequence.
The boxes represent things like genes and other evidence
for those genes being there, maybe up to 80 different types
of information layered on top.
And what's the reason for bringing this up here?
Well-- to give you a figure before I get through this--
under 10% of what's in Ensembl is raw sequence.
It's not completely unlike what you do here, putting
annotation on top of maps.
We're putting annotation on top of the sequence.
You're in a 2D coordinate system with Google Maps.
We're in a 1D coordinate system with Ensembl.
So we add gene annotation.
And for gene annotation, all I'm going to say in this talk
is that it's hard, still hard.
We still don't know where all the genes are.
The gold standard right now is manual curation, because the
automatic algorithms produce too much uncertainty.
And although we rely on experimental information from
sequencing the fragments of genes that are actually turned
on in cells, they're very noisy.
So having humans check that is the most reliable thing.
So one thing we do is produce sets of genes which
are used pretty much around the world.
This is why it's hard.
Because in bacteria, you've got a continuous--
just genes made up of one unit.
In higher organisms, it's fragmented, and that makes it
much harder to identify where the genes are.
So I won't say any more about gene building in this talk.
Ask me about it later if you want.
The other thing we do is comparative genomics.
I've already said that we have a pile of
different genomes in there.
And so we can calculate the relationships between those,
look at the rates of evolution between the genes and other
parts of the genome.
We can also add information such as the
variation within a genome.
So there's one human genome, but there's six billion of
you, and you've all got individual
variations within you.
And there are population structures in those variations,
which can be related to disease.
So it's interesting to store that information and make it
available to other researchers.
And then for all this, it's infrastructure.
In some ways, we're just organizing this data in a
way where other people can get access to it.
And some of that we do with the website.
Some of it's via API, via an open-source environment.
So the API just means that--
as I said, you'd like to be able to address the thing 1 to n.
Coding to the API means you can do that.
Although it's made up of lots of fragments, you can address
the whole thing here.
The middle thing is saying, fetch the whole of chromosome
22 as a virtual fragment, which you're going to do
calculations on.
That API is now quite flexible, quite extensive.
It allows you to compute across different organisms.
You can project information for one coordinate system or
one genome onto another one.
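[A minimal sketch of the kind of call being described here, using the public Ensembl Perl API; the database host and adaptor names shown are the standard public ones, but details vary by release.]

    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    # Point the registry at the public Ensembl database server.
    Bio::EnsEMBL::Registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );

    # Fetch the whole of chromosome 22 as one virtual fragment (a
    # "slice"), addressed 1 to n, even though it is stored as many
    # pieces underneath.
    my $slice_adaptor =
        Bio::EnsEMBL::Registry->get_adaptor( 'human', 'core', 'slice' );
    my $chr22 = $slice_adaptor->fetch_by_region( 'chromosome', '22' );
    printf "chr22 is %d bp\n", $chr22->length;

    # Project chromosome coordinates onto another coordinate system
    # (here, the clone level), one of the transformations described.
    foreach my $segment ( @{ $chr22->project('clone') } ) {
        printf "%d-%d maps to %s\n",
            $segment->from_start, $segment->from_end,
            $segment->to_Slice->name;
    }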
You can also layer information.
We just put in a lot of data from different mouse strains.
People do experiments on different mice, and so it's
relevant to be able to look at it.
We haven't got the complete sequence of these different
mice, but we can project between them.
And you can access that information either via the web
displays or programmatically.
So conceptually, behind Ensembl--
complete openness.
The code's available under a BSD license.
We make all the data dumps available.
We code in a single common CVS repository. It's about 30 people--
well, there's another maybe 20, 30 people who put code in
internally, and some external people.
It's based on MySQL with Perl and Java APIs, different
layers of objects, so that you have objects for things like
genes, but the way we handle that
underneath can be changed.
And continuous improvement--
so this site gets updated once every two months.
And there's nearly always a schema change at that point,
because we've added some new data, and we've had to extend
something, or maybe completely refactor the way we store
things, because the quantity of data's changed.
We're split up into little sub-teams--
this is not that interesting to you--
and we have a process of handling that release cycle
within the organization.
We have a healthy paranoia, because in terms
of usage, we're probably still the leading one in the world of
the big three.
NCBI does a lot of things, but its browser is arguably less
competitive.
UC Santa Cruz, just up the road-- they have a simpler
browser engine.
They're very much our direct competitors.
So we're worrying about whether what
we're doing is still relevant.
How can we tell if it's relevant?
We're interested in ways of looking at usage, of our API
accesses and web pages.
And I guess that's pretty similar for all
bioinformatics resources.
It's becoming an important question.
If you're going to be funded--
we're supported by the Wellcome Trust-- we need to
justify our existence.
And in fact, all web resources need to justify theirs.
Now that's what we are right now, but there's lots of
interesting things happening.
And that's connected with data.
So this is sequence growth.
So people may think we sequenced the human genome and then
it all went away.
It's not true at all.
The assembled sequence is following pretty much a
13-month doubling time.
It's been doing that for a long time.
It's kept on doing it.
And in the meantime, we've constructed a new archive for
the raw, unassembled sequence, because there's a lot of
things that come out of sequencing machines which
never get assembled into full genomes.
But it's still useful data to search.
That's doubling every 11 months.
And the database for that is currently 35 terabytes.
It's one of the bigger Oracle databases out there.
So this complete sustained behavior that we're seeing--
there are very few things which really work in an
exponential way.
Here we've got, obviously, Moore's law, information
processing computers.
That's been following exponential
growth for a long time.
This is another technology which is doing the same thing.
And basically, the point is that it's just information.
And at the moment, there's really unbounded amounts of
information.
In terms of sequencing, we've just scratched the surface of
what could be sequenced out there in the
sort of natural world.
And up to now, we've mainly been collecting a
representative sequence from a single individual.
But the next thing is going to be collecting that across many
individuals.
And there's a current revolution in
technology going on.
So new technology is available now.
We have development machines in house.
There are a number of companies using new techniques
for sequencing.
This means that a single machine can produce between
100 and 300 times more data, and the costs have already
gone down by a factor of 10, as a result of this new
technology, and are promised to come down much more.
So you can actually do a first pass across an individual.
It won't get you all the pieces, because you're
collecting it randomly, but maybe for around $30,000.
And there's a target to get that down to $1,000 a genome
and get a higher quality than this.
So we're still maybe a couple of orders of magnitude away, but
you can see that it's within sight now, considering how
much it cost to do the first genome.
So we're going to be collecting a lot
more of this stuff.
And we're going to have to organize that.
And future human health research and development is
going to be increasingly dependent upon this data.
And you can kind of see this in a
sort of overview that I've stolen from somebody--
you see the inputs here.
It's this annotation.
You take the reference.
You layer the variation on.
You try and interpret it, work out where the genes are and
look at the variations in those and
relate those to medicine.
And you end up with this complete sort of
understanding, or at least set of things that you understand.
And then you take an individual sequence, and
that's going to be achievable quite soon.
And then you start relating those to the database with the
individual set.
And you won't be able to interpret all of it.
In fact, there's very little we really understand properly
at the moment.
But the amount that we understand will increase
gradually over time as this collective database increases.
You only have to sequence the person once, and then that
will allow you to start understanding more about the
medical consequences.
And it's not all predicting probability about whether
you're going to die of x or y.
A lot of it is quite practical things, like this person
shouldn't be taking drug x because it's
going to kill them.
The fourth biggest killer in the US is adverse drug
reactions, allegedly.
Of course, that's a sum of all drug reactions.
Of course, it's skewed towards--
you know, you're treated with a lot of drugs when you're
about to die.
But it's still a very significant number.
And so it's very relevant to be working on this.
So we have our database.
We're kind of doing the preliminaries of this.
We're starting to pull in resequencing data, which is being
shown on our website so it's available.
We have to work on how to do this in a compressed way.
But we're already up at this massive 380 gigabytes for our
complete database, and that's changing every two months.
And we provide data mining interfaces to that, which
I haven't really talked about.
And they can be aggregated and integrated with
resources over here.
But that's not really the solution, because we only put
things in the data mining interfaces where we know
people want to ask particular questions.
That's how we denormalize the database.
If you really want complete flexibility, then you
can get at the data using the APIs.
But you have this download problem.
So we now provide access remotely:
we host the databases.
People can download the code and talk to us.
And that's becoming an increasing way of people
integrating data with their own data.
But what about beyond that?
Because that's still not completely neutral in terms of
the sort of democracy of integration.
And this is the kind of way I look at it.
So complete genomes provide this framework that we can
organize stuff around.
But we're in a strong position, because we've got
this big database and resources.
We have a lot of power, then, and in fact, that's not ideal.
We'd actually want to have no monopoly on this, because
you'd like anybody to contribute, no matter how
small they are.
The more organizations provide data, though, the harder it
becomes for anybody to use the results.
So how can you address this problem?
And this kind of fits into other sort of openness issues,
which I'll just mention.
So just the general issue of data sharing.
The human genome, because it's been open, has
been quite a driver for this, the idea that you could
release data immediately and make it available to people.
It was also this kind of idea which came up, of course, at
the Google camp here that maybe you actually require
open models.
Vista was just announced a couple of days ago.
People are saying, well, maybe the difficulties in doing that
suggest that centralized projects are
just not very scalable.
Science has always been highly cooperative, but it's not been
so data-rich as it is now, not in biology.
We have to find better ways of handling this.
And then I'll just put this up.
This was in Berkeley;
I saw it a few years ago.
All these arguments about ownership,
patents, things like that.
I've given other talks about this.
But basically, if it's open, it's better.
It's going to be easier to share things.
So there are various problems to solve.
In the community of scientists, we're having to
get used to the idea of sharing data more and
allocating credit appropriately.
But there's also these practical
issues of data sharing.
So we had this camp here, and it raised kind of similar issues.
How do we increase integration and processing bandwidth?
Now bioinformatics--
so we have these databases.
They're kind of linked together.
But there's a lot of databases.
In fact, there's a danger that we have too many databases and
too many diverse interfaces.
It actually reduces impact because scientists just get
lost in different ones.
So can we find a way of splitting data and
presentation so we get competition between those two
mechanisms and find optimal tools for visualization?
So I'm going to introduce this DAS idea, because we're now
using this very heavily, and it seems to
fit within this paradigm.
So what is DAS?
So this is the idea.
So we're a big data provider, and there's lots of people
viewing our data, our service.
And that's kind of a monopolistic view.
Because somebody else has got some data--
it's hard-- they've got to persuade us to put that data
in our system.
And we don't always know the quality of what somebody
else's annotation is.
Should we include it or not?
So external contributors--
they might set up their own database.
But then of course that's a big overhead, to set up
something competing with us.
And probably it means not many people go and look at their
data, even though it might turn out to be valuable.
So the DAS idea is you just serve the little bit of extra
stuff that you've got.
You use some infrastructure to make sure the coordinate
systems are synchronized, and then you make the viewers
cleverer so they can integrate things on the fly.
And once you've done that once, of course, you can have
as many servers as you like.
And users can control this.
They can turn things off.
So this is the opposite to the links model.
Linking takes you into different websites.
But they've got different interfaces.
It becomes harder to compare the scientific data.
The DAS approach is the opposite--
standardized servers, and viewers which just can
integrate and users can choose which ones to turn on.
So it's kind of like a very simplified version of web services.
Web services are fine, but if they all have their own
protocols, then it's a programming
nightmare to use those.
Here we've got a very small set of standardized ways to
pull these things together.
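[To make that concrete, here's a rough sketch of a DAS 1.x features request over plain HTTP; the server host and source name below are invented for illustration, and real sources are listed in the DAS registry.]

    use strict;
    use warnings;
    use LWP::UserAgent;

    # Every DAS source answers the same few commands, for example:
    #   .../das/<source>/features?segment=<chromosome>:<start>,<stop>
    # The host and source name here are hypothetical.
    my $url = 'http://das.example.org/das/hsa_ncbi36/features'
            . '?segment=22:19000000,19100000';

    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get($url);
    die $response->status_line unless $response->is_success;

    # The reply is a standardized XML document (DASGFF) that a
    # DAS-aware viewer can overlay on its own display on the fly.
    print $response->decoded_content;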
And the kind of ways we're using this-- here's an example
to link all the way from sequence to protein structure.
So there's this viewer for structure--
it's got a protein structure there.
It can pull all the stuff out from existing DAS sources and
integrate on the fly and allow you to see things like
variations in genomes mapped onto protein structure, all
pretty transparently.
So I think what's possible using this approach is you've
got a set of data from different providers.
At the moment, we have a set of services.
The big ones are big integrators like us.
But you have these problems like this one here, a small
group that's serving its own data.
It's not going to get much usage, because it's only got
the resources to serve this.
In a DAS environment, maybe it doesn't need to provide
anything at all, because its data can just be integrated.
And then down here, here's somebody who was just using
their own data.
Now they can pull in other people's data, and maybe they
can be a full competitive client by themselves.
So we have quite a lot of that infrastructure in place, which
we're using.
We have around 200 different servers in this registry,
quite a lot of different coordinate systems. And you
can start thinking about servers that build on servers.
So you can have consensus servers that process other servers and
provide simplified views where there's duplication.
So that's where I think the annotation world is going
right now, at least one sub-strand of it.
How can you pull this stuff in together so that then
scientists really can compare different results and try and
work out which ones to believe?
Because at the moment, if they're sitting on different
sites, it becomes very hard to do that.
Now I want to say just something about prediction,
because this is where I've just come from, this meeting.
So that's my background, as a predictor, trying to compute
biology directly.
And you can look at it in terms of these extremes,
pragmatic versus pure.
So in protein structure prediction, this is trying to
work out how the protein structure folds up.
At one end, you've got comparative modeling--
pragmatic, basically saying, we don't know the structure,
but we know that this sequence is related to a sequence where
we do know the structure.
And they're close enough that we can maybe infer the thing.
It's not real prediction.
It's kind of cheating in a way, but it's quite a
practical solution.
At the other end, you'd like to just do some pure physics,
just take the thing.
We know it folds up by itself.
Simulate it.
But simulation doesn't really work.
A few years ago there was this thing called fold recognition,
which was really comparative modeling at a long distance.
And in fact, what's actually proving practical is
fragment-based assembly, which is not quite pure physics, but
it's much closer to real, proper simulation.
And that's being popularized by the Baker group, who are
again successful at CASP this year.
So in my genome annotation thing, it's
kind of pretty similar.
At one end, we have ab initio gene prediction.
It kind of works but is problematic in vertebrate
genome annotation.
Then we have the evidence-based approaches
which I mentioned before, which can be automated with
lower accuracy.
We hoped that comparing genomes would make things
easier, but in fact it doesn't, because there's a lot
of the genome which is actually similar outside the gene
regions, because it's involved
in regulation.
But now we're beginning to learn that one of the reasons
why we're so bad at this is because there's lots of other
factors influencing the structures of genes, lots of
other binding factors which turn things which look like
fragments of genes off, make them inactive.
And so properly to do this, we have to predict motifs.
We have to predict these little
signals and build genes.
That's the way to do it properly.
But maybe that involves a lot of compute.
So if you look at what happened at CASP this year,
it's kind of interesting, the different approaches.
You've got two groups that came out very well, one of
which is not using very much CPU and is using evolutionary
information a lot, different models, different examples,
and merging them, but limited CPU.
And then you've got the Baker approaches, which are directly
fragment based.
And it's quite costly in terms of CPU.
It's very costly if you want to refine things.
But the refinement accuracy can be very good, getting
quite close to crystallography for at least a small set.
So it's definitely making progress,
but the cost is enormous.
So for one 1.8 angstrom structure, as an example, they quoted a
million CPU days.
I'm not quite sure how precise that is, but it's a
distributed model on 100,000 nodes.
So where are we going to go with the genome?
Because we're collecting this data.
We're integrating all these different bits of evidence.
But ultimately we're going to understand it at the level
where we can interpret individual mutations.
We're probably going to have to do similar sorts of very
CPU-intensive things.
So that's where I think ultimately we'll have to go.
For the moment, the value comes from integrating data.
So I'll just acknowledge a load of people.
These are all the different groups--
the people involved in Ensembl, the different
groups, the annotators, and the people who support that
operation at Sanger.
So I'll stop there.
Thanks very much.
OK.
Yeah?
AUDIENCE: You mentioned the continuum of proteins between
the comparative modeling approach and molecular
simulation.
I think you mentioned this too.
Of the kind of [UNINTELLIGIBLE] strategies
where you can match up--
it's hard to find another protein where the entire
sequence is similar, but you can find other proteins where
you know the structure.
And little pieces of each of them are similar.
And so this is more like an area where you can match up
those pieces, and then suppose each of the [? set ?]
components [UNINTELLIGIBLE]
and then they do an optimization over fewer--
TIM HUBBARD: That's kind of the fragment approach.
Although the fragments could be quite small.
AUDIENCE: Could you repeat that for the other offices?
TIM HUBBARD: OK.
So this was a question about assembling protein structures.
So the question was whether there's an intermediate
approach of taking matching fragments of structure and
putting them together.
And that kind of is the fragment approach that I
mentioned, although the fragment approach goes to very
small fragments--
three residue pieces and nine residue pieces.
Different groups use slightly different
approaches, and some of them--
there are hybrid strategies, yeah.
AUDIENCE: So the DAS model is based on an agreed-upon XML
description?
TIM HUBBARD: Yes.
AUDIENCE: So how did the larger community arrive at
that XML description?
And has that been agreed upon?
Can you talk about some of the--
TIM HUBBARD: So this was proposed by Lincoln Stein.
Yeah, sorry.
So there's a question about the DAS protocol, how it was
agreed upon.
So it was basically proposed by Lincoln Stein, who had the
first grant to set up the first of the client server
libraries for this and specify the protocol.
And it's been basically evolved, this protocol.
We've extended it in various ways to handle other data
types, but it's still quite compact.
There is an effort to make a DAS 2 specification, which
will be more flexible, will handle more metadata properly
and handle things like searching.
Because one of the things you want is distributed searches across
these sources.
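[For flavor, this is roughly the shape of the agreed-upon DASGFF XML returned by a features query, abridged; the feature values and URL here are invented.]

    <?xml version="1.0" standalone="no"?>
    <!DOCTYPE DASGFF SYSTEM "http://www.biodas.org/dtd/dasgff.dtd">
    <DASGFF>
      <GFF version="1.0" href="http://das.example.org/das/hsa_ncbi36/features">
        <SEGMENT id="22" start="19000000" stop="19100000">
          <FEATURE id="exon.1001" label="exon">
            <TYPE id="exon" category="transcription">exon</TYPE>
            <METHOD id="curated">curated</METHOD>
            <START>19000500</START>
            <END>19000800</END>
            <ORIENTATION>+</ORIENTATION>
          </FEATURE>
        </SEGMENT>
      </GFF>
    </DASGFF>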
So most of the models--
things like Google Maps, I think, are based on the idea
that you deposit your annotation in a central place.
Whereas this model is-- of course, you could set it up
like that, but you have servers
scattered around the world.
And each center hosts that server, and it's up to them to
keep it up.
With our central registry, we do monitor whether servers
are still up and warn people if they've gone down,
things like that.
But it's a much more fine-grained, distributed model.
So in terms of the evolution, basically it was imposed.
And it's gradually being adopted.
AUDIENCE: These competing providers that you mentioned
earlier do not use this protocol?
They do not have this approach?
TIM HUBBARD: So NCBI certainly doesn't do this right now.
UC Santa Cruz does have some DAS capabilities now, although
they have another protocol for uploading data directly,
which, in fact, we have as well.
You can upload data and we'll run the DAS server for you.
So that you can--
it's integrated, but it also appears externally as DAS.
AUDIENCE: I have another [UNINTELLIGIBLE PHRASE]
question.
You mentioned earlier that you've refactored your entire
data representation at certain intervals.
TIM HUBBARD: Multiple times.
AUDIENCE: Multiple times.
So how do you do this without interrupting your users' work,
and has the API, from the user's or client's perspective,
remained the same throughout?
TIM HUBBARD: So this is a question about Ensembl and its
development strategy.
So the refactoring--
yes, the refactoring goes on continuously.
But there have been major points.
So we have a number which gets associated--
a release number.
The release number, in fact, always used to be the schema
version number.
And the problem was that it changed so many times that
it's kind of become the version number.
So that's where we are.
We're at version 41 right now.
So there was a big change between version eight and
nine, and between 19 and 20, where we really restructured
some things.
A lot of the other stuff has been relatively minor addition
of columns and things like that.
How do we handle it?
Well, you get into this culture: you have 30
genomes, and we don't update the data in these genomes
every month.
But quite a lot of them change.
And some of the common databases, the comparative
genome databases, have to be recalculated every time.
So in every cycle, either the data gets updated or the
schema gets converted.
So in the distribution, there are SQL files for schema
conversion for every point.
So if you go through our CVS tree, you'll see a whole load
of patch files.
And so people who've downloaded stuff can also
patch their systems, because people externally are also
using the Ensembl framework to store genomic data.
And of course they want to keep migrating and keep up to
date as well.
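[An invented sketch of the kind of thing one of those schema-conversion SQL patch files contains, an ALTER TABLE plus an update of the stored schema version; the table and column names here are made up.]

    -- Add a column introduced by the new release (illustrative only).
    ALTER TABLE gene ADD COLUMN status VARCHAR(20) DEFAULT NULL;

    -- Bump the recorded schema version so the API can check it matches.
    UPDATE meta SET meta_value = '41'
        WHERE meta_key = 'schema_version';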
I think it's a cultural thing.
In terms of supporting these sort of things--
we have 30 engineers working on this.
There's a kind of feeling that a lot of infrastructure isn't
well enough funded.
And it's only got enough funding to just keep going.
We have enough that we can keep going, but we can also be
reengineering things on the side.
And that's what you need to keep things moving.
You have to have enough resources
for those two things.
And we do get pressures.
We used to do a one-month cycle and that just became
unworkable, because everybody was running the cycle rather
than doing any new development work, but two months seems
more sustainable.
AUDIENCE: Is there a standardization effort
underway for these protocols?
TIM HUBBARD: For which ones?
AUDIENCE: [UNINTELLIGIBLE] protocol, or the DAS protocol?
TIM HUBBARD: So the DAS protocol is kind of standard.
But because it's XML, it can be extended.
So we've added extensions and other people can propose
extensions.
There's a proposed extension for interactions between
objects, biological objects, so there are lots of cases where
people publish a list of proteins which interact with
each other.
And that's another thing where you'd like to integrate in
this framework, be able to see all these different opinions
integrated in one client.
So it's an ideal thing to work this way, rather than relying
on people publishing their own integration, which of course
only takes the data that's available at
that particular moment.
AUDIENCE: You were tracking--
I'm sorry to keep asking these questions.
TIM HUBBARD: That's all right.
AUDIENCE: And just to review, you said you were tracking a
lot of user data.
[UNINTELLIGIBLE] access them.
You didn't show any of the--
TIM HUBBARD: OK.
So user data access.
We have our web logs, basically.
And we know that the usage goes up, and
we have user surveys.
And we look at publication records,
where we've been cited.
But that's all, right now, and one question for us is exactly
how much more we should do.
And in a wider infrastructure question--
I'm on various European infrastructure committees--
over here, things are quite well funded, actually.
The structure of funding of NCBI and other
resources is quite stable.
And we don't have this in Europe right now.
It's not as stable.
But government's always worried about funding things
on an ongoing basis, perpetually, without having
some measure of value.
And so working out what the value is of these resources,
and whether they're still valuable, whether they should
be merged or closed down or things like
this, it's a real question.
And so working out not only how to integrate data, but then
who contributed what, is, I think, quite an
important question in terms of the sort of long-term funding.
Yep.
AUDIENCE: So if I had protein structures, can I use DAS to
annotate genomes with the structures?
TIM HUBBARD: So you could set up-- so if your structure--
I was talking with some people at CASP about this.
So there's quite a few databases of models, for
example, protein structure, around.
And most of them can be related to UniProt sequences,
which are standard protein sequences.
And so there's all kinds of annotation out there against
UniProt sequences.
So if you make the connection between the two, yes, you can
use DAS to display the annotation on any of these
models that have been constructed using protein
structure prediction.
AUDIENCE: [INAUDIBLE]
TIM HUBBARD: No, but they were talking quite
enthusiastically about--
some of the people were talking about doing that.
So yeah.
I think within Europe, there's this thing called BioSapiens,
which is an EU-funded project to link a load of different
bioinformatic people mainly working on protein sequences.
And they've adopted DAS as their mechanism
for exchanging data.
And that seems to be quite successful and is
actually providing an interoperability layer between
all those groups.
So a lot of the sources are coming from those groups.
Thank you.