All right. Thank you very much. So when GenBank came to NCBI in 1992, NCBI had already
existed since 1988. So this was actually a merger of two different entities into a single enterprise.
And throughout this talk, I'm going to be talking a little bit about this sort of dual nature of
GenBank and NCBI, which sometimes come together into a sublime unity of two things, and at other times
can seem a little bizarre in its dual nature. Another aspect of this duality is that even if you consider
GenBank and NCBI as one thing, it's still really part of a partnership that includes EMBL and DDBJ
and is governed by its own rules and its own history as you can see here. And once again,
this can result in some bizarre situations as well as overall a very sublime unity. So going back
a little before GenBank came to NCBI - this was 1988 - these fine individuals here - this is
myself on the right, David Lipman in the back and Dennis Benson in the foreground.
NCBI was a very unusual place. Before I personally came to NCBI - say, in 1982, when GenBank was created -
I was a 32-year-old graduate student at Harvard University in the biolabs who was busily setting about
destroying any chance of getting an academic job in biology by getting into computers.
Since I was unable to get an academic job, I had to support my company and sell a product. In
the course of that, of course, I was never able to get an NIH grant to do what I was doing,
so I sold the commercial product to people, and they bought it off their NIH grants. I was sitting
in a study section next to David Lipman one day, and he sort of leaned over and whispered in my ear
and said would you ever think about working for the government. And I thought, well, no, I never
really thought about doing that before but, you know, okay. And he said we should talk.
And we went out to walk around NIH, and he talked about this place NCBI that was going to be created
where actually you were supposed to work with computers in biology, and it was okay.
So that was pretty amazing, and I was delighted to come here and, in fact, it was totally sublime
to be here, and we were unifying computers and biology. We actually set out, after some
discussion, with the understanding that we weren't actually building any databases.
We were just supposed to be using them. And so like many things at NCBI, David Lipman made
a simple suggestion. And the simple suggestion was really based on this little model here,
which is this: this protein sequence is the human colon cancer sequence. That is, damage to this protein
seems to predispose human beings to get colon cancer, and yet when you search this sequence
computationally, you hit these other two sequences as being very similar, from yeast and E. coli,
and the thing that tells you is that everything that you know about this sequence doesn't apply.
These things are not human. They don't have colons, and they don't get cancer. So in a sense
all the information about it doesn't help you. What you do discover is what you know about
the other two sequences is they're DNA repair enzymes, and so immediately what you know about them
sort of comes together in a surprising way through computational biology through comparing the sequences.
And, in fact, that's the power of molecular biology and the power of sequencing. So David's
simple suggestion was, well, can't we develop something that would allow us to recreate this path.
We find something about the sequence. We go through computational relationships, and we
connect it to something else we know about the sequences it's related to. All right, definitely a sublime idea.
So in response, I tried to come up with a simple solution which was, well, we could take - say what
we knew about things is basically Medline. It's the published literature. There's literature citations
in the sequence databases, so we could get all the nucleotide sequence databases together, hook them up
to Medline, and of course nucleotides code for proteins, everybody knows that from Biology 101,
so we should be able to get over to the proteins, and then we can store sequence alignments and find
other protein sequences which are like this one, and we can recreate the whole process. We can look up
human colon cancer. We can find the sequences about it. We can find their proteins. We can get
related proteins, and then we can go back and read about DNA repair enzymes in yeast and E. coli.
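The chain of links being described here - literature to nucleotide, nucleotide to protein, protein to similar proteins, and back to literature - can be sketched as a tiny graph walk. All of the record names and links below are invented for illustration; this is just the shape of the idea, not real Entrez data.

```python
# A toy sketch of the Entrez linking idea: records in different databases
# (literature, nucleotide, protein) joined by explicit links, so you can
# walk from a disease paper out to related literature in other organisms.
# All records and links are invented for illustration.
links = {
    ("medline", "colon-cancer-paper"): [("nucleotide", "human-msh2-mrna")],
    ("nucleotide", "human-msh2-mrna"): [("protein", "human-MSH2")],
    ("protein", "human-MSH2"):         [("protein", "yeast-MSH2"),
                                        ("protein", "ecoli-MutS")],
    ("protein", "yeast-MSH2"):         [("medline", "dna-repair-paper")],
    ("protein", "ecoli-MutS"):         [("medline", "dna-repair-paper")],
}

def walk(start, max_depth=4):
    """Breadth-first walk over the link graph, collecting every reachable record."""
    seen, frontier = {start}, [start]
    for _ in range(max_depth):
        frontier = [nbr for node in frontier
                    for nbr in links.get(node, []) if nbr not in seen]
        seen.update(frontier)
    return seen

# Starting from a human colon cancer paper, four hops reach the DNA repair
# literature for yeast and E. coli.
reachable = walk(("medline", "colon-cancer-paper"))
```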
A simple idea and sublime. So we went to the GenBank flat file. This is the GenBank flat file from 1989.
And of course, I had spent several years working with the GenBank flat file and had learned to
hate it deeply. And the reason is, for example, that if you look at this there is no protein even though
this is the source, the genetic source for proteins. It has a coding region, which is an annotation
about a protein, but no protein sequence. Okay, so that's a little bizarre. But if we go to the protein database,
here is the protein sequence, and here's the same citation but there's no DNA sequence,
and there's no link back to the DNA sequence, so once again we're lost - also bizarre. So what do we do?
Well, we're in the situation really which is a bit like the allegory of Plato's cave. For those of you
not classically educated, it's an allegory which is actually fairly complicated. I'll give the simple version here
which is that people are sort of observers sitting here, and in kind of an artificial world of their own creation.
There's a fire in the back here, and there are sort of cutouts that represent real things here, but all
the people can see are these shadows on the wall of the cave, and so everything that they're
really sort of seeing is this faint shadow of reality. And, of course, we keep that up until today
and cling on to that, because it makes us feel more comfortable if we live in a simple world.
What was going on in the sequence databases was there was a natural model which is there were
organisms and they had genomes and with chromosomes and genes. These are transcribed and
transcripts are spliced. They're translated to pre-proteins. These are cleaved up into final proteins
and fold up into structures and these have functions and malfunctions cause disease. But the problem
was that in GenBank, we had - the GenBank model was sort of like this except that with tunnel vision.
So we were going to stare very, very hard at a tiny little section of the universe here. And, in fact,
what happens is we just look at one nucleic acid sequence all by itself, and everything else is
kind of shadows on the wall that we see, and we're missing out on this greater reality here.
So we were thinking, back in 1990 that we could do a model that actually represented all of this,
that in fact we could represent organisms and chromosomes and genomes, all the way up to here,
and mature peptides, and then with the help of Steve Bryant, we actually got up into structures.
We didn't really know how to represent function in disease, and so that's text, that's basically
the published literature where you can express concepts in a fuzzy way, and we just need to hook
them back up through this hierarchy, and if we represented this explicitly we could move around in it,
and we could do new and interesting things. So in this model, we modeled lots of different kinds
of DNA sequences. The usual kinds with A, G, C's and T's, but also for example things that had
a bit of sequence and a gap, another bit of sequence representing a higher structure. They are
now known as scaffolds. Things which represent overlapping pieces of sequence put together
which are now known as assemblies or CONtigs; and maps. Things which are annotations on sequences
could actually be more interesting than that. They could specifically refer to a location like
the coding region of a gene, and they could refer to a destination like an actual protein that was made.
We're actually hooking up molecules here. We're not just describing them. And we could have
specific structures representing how you do that, how you translate this thing from DNA to protein.
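That hookup from DNA to protein - a location on the nucleic acid plus an explicit translation to a product - can be sketched in a few lines. The codon table below is deliberately truncated to just the codons in the toy sequence; a real table has all 64.

```python
# A minimal sketch of what a coding-region annotation has to encode: a
# location on the DNA plus the explicit translation to the protein product.
# The codon table is truncated to just the codons used in this example.
CODON_TABLE = {"ATG": "M", "GAA": "E", "AAA": "K", "TGG": "W", "TAA": "*"}

def translate(dna, start, end):
    """Translate the coding region dna[start:end] (0-based, half-open)."""
    protein = []
    for i in range(start, end, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":          # a stop codon ends the product
            break
        protein.append(aa)
    return "".join(protein)

dna = "CCATGGAAAAATGGTAACC"   # toy record with a coding region at 2..17
print(translate(dna, 2, 17))  # the protein the annotation points to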
To record our alignments, we realized that while most alignments sort of look like this,
you don't want to reproduce the sequence. You want to actually point to the sequences and
just describe in numbers where all the pieces went together, because the same sequence could
participate in lots of different alignments, and you didn't want to lose it by embedding it in this.
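That point-don't-copy design can be sketched directly: each alignment row names a sequence identifier and lists coordinate segments, with a marker for gaps, so the sequence itself lives in exactly one place. The identifiers here are invented.

```python
# Sketch of storing an alignment by reference: instead of embedding copies
# of the sequences, each row points to a sequence ID and lists coordinate
# segments (start, length), with start=None marking a gap. IDs are invented.
from dataclasses import dataclass

@dataclass
class AlignRow:
    seq_id: str
    segments: list  # (start, length) on that sequence, or (None, length) for a gap

alignment = [
    AlignRow("U00001.1", [(0, 40), (None, 5), (40, 60)]),  # 5-column gap after base 40
    AlignRow("U00002.1", [(10, 45), (55, 60)]),
]

def aligned_length(row):
    """Total number of alignment columns a row spans, gaps included."""
    return sum(length for _, length in row.segments)
```

Because every row sums to the same column count, the same sequence record can participate in any number of alignments without ever being duplicated.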
Finally, you could take all these bits and pieces - if you encapsulated them correctly, you could
get transcription initiation sites from one place. You could get genes from another - coding regions
from somewhere else, assemblies from somewhere else, and put the whole thing together and make a
composite which represented a large organism. And these are the sort of things that you now see
as tracks on genome browsers and things like that. Now we had a question, which is how do you say
all of this, because that's an awful lot to say. That's going through lots of dimensions.
Should we use the GenBank flat file? Well, we looked at this. This is really in a way for humans
to read. It wasn't really built for computers to read, and it was really looking at a small part of
the universe so, nah, we're not going to use that - bad - instead we're going to represent it in a
structured language - in this case called ASN.1, which in 1990 was an ISO standard, and still is an
ISO standard. It was specifically designed for communicating data structures over the internet
and defining them very explicitly - in fact, in very complicated ways which let you have the full
representational complexity but present it in lots of different ways to human beings and computational tools.
So in my mind this was sublime. This was the way to do it. So what we would do is we would
take all the existing databases. We'd put them all into this form. It would be parsable. We'd make it
available, and then other software developers could develop all sorts of tools to use it and browsers,
and everything would be unified at the base, and our role as a government agency would be to
just put this together from the existing databases and make it available. So we had a conference of
software developers and database providers where we presented all of this, and it was not very well received.
In fact, the guy who ran the biggest commercial sequence software company at the time came up to me
afterwards and said, wait a minute, I think you're expecting us to get serious here, and we don't
want to do that. So we thought, all right. So that isn't going to happen. People aren't going to do it,
but we really think this is kind of a good idea, so I guess we'll go ahead, and we'll build it ourselves
as a demonstration product to prove that we're just not making this up, you can really do this.
And Entrez was born, which I'll get to. But to stay with this for a minute, I still kept flogging this idea
of this data model, and maybe people could finally use it, so in 2000 XML became popular.
We said, great, we can represent exactly the same thing in XML. It's six times bigger because
it's wordier, but it's the same idea. Maybe everyone would use this. No, they didn't. It's still
too complicated. It's still too hard to use. Okay, well, as we became part of the International
Sequence Collaboration, we agreed internationally on a representation in XML of the contents
of the GenBank flat file. Now, of course, the problem with that is it doesn't have any richer representation
than what you see in the flat file anyway, and the other thing we discovered is still almost
nobody uses it. A few places use it, but very few. People really stick to the flat file. So we went ahead
and made our simple solution here. We went back to trying to build this, and we thought, okay,
so we know we have trouble hooking this up. We're going to have to work on that, and develop
protein sequences and stuff, but an easy thing to do would be to hook this up to Medline,
because there's literature citations in here, so we'll take our sublime idea, and we go back to the flat file
and we're going to love the flat file now. Okay? It's okay, we can work with it. Now we're going to just
start taking this. We love the flat file. It's really going to work out - no, no, no, we love it. Go away.
And we're going to look at the reference here, and we're going to match it to the other references
in the database, and then match them to Medline, so going to PIR, which is also a flat file, which
we're happy with, and we can use, and it's okay. No, it's really okay. No, we love it.
Actually, it turns out it's not okay, because if you compared the two references they actually don't match.
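What "don't match" means in practice: the same paper cited in two flat files can differ in author punctuation, field order, and journal abbreviation, so matching requires normalizing each citation to a canonical key first. A minimal sketch, with an invented abbreviation table and made-up citations:

```python
# Sketch of the citation-matching problem: the "same" reference from two flat
# files differs in punctuation and journal abbreviation, so we reduce each
# citation to a normalized key before comparing. The abbreviation table and
# the example citations are invented.
JOURNAL_ABBREV = {"proc. natl. acad. sci. u.s.a.": "pnas",
                  "proc natl acad sci usa": "pnas"}

def citation_key(authors, journal, year, first_page):
    j = JOURNAL_ABBREV.get(journal.lower().strip(), journal.lower().strip())
    first_author = authors[0].split(",")[0].strip().lower()
    return (first_author, j, year, first_page)

# The same paper as cited in two different flat files:
genbank_ref = citation_key(["Smith,J.", "Jones,K."],
                           "Proc. Natl. Acad. Sci. U.S.A.", 1989, 1234)
pir_ref     = citation_key(["Smith, J.", "Jones, K."],
                           "Proc Natl Acad Sci USA", 1989, 1234)
```

Under this normalization the two keys come out equal, which is the whole trick; scaling it to every citation in every database is what turned into the much bigger project described next.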
They come in a different order. They use different abbreviations. It's not easy to even hook up
this little bit, so we undertook a much bigger project than we ever imagined, which is matching
up all these citations and correcting them to Medline, and then trying to get them back into the
sequence databases again, so we can hook up all these pieces. Yes, it was bizarre, but we managed
to build the first version of Entrez, and this was a huge success. People really loved this. It first
came out on a few CD-ROMs, and there were a number of new pieces to it, which were sort of
the forerunners of other things to come. For one thing, this became PubMed. This, looking for
articles like other articles became related articles that you still see today in PubMed, and from
our viewpoint, this over here was actually GenBank. GenBank was a projection through this
connected space that contained Medline, nucleotide sequences, and connected protein sequences
that we had to construct out of the nucleic acid database - and this was the sort of situation
we were in in 1991, just before GenBank moved to NCBI in 1992. And so we thought, okay, it's
getting better. We'll like the flat file. We can - no, no go away; we like the flat file because
the coding region while still not really a protein at least had a translation now. So there was
kind of a protein in there but you still couldn't talk about it, because this is a nucleic acid database.
It doesn't have proteins, and so you had to say it was sort of like, you know, the artist formerly
known as Prince. You had to say it's the protein that's the first coding region in this record or it's
the third coding - and of course if it's updated or you re-annotate it, it's the second one, it's the
fourth one, so it's very hard to talk explicitly about this, but it was progress. So as we came to
the GenBank and the international collaboration, we brought our own what we felt were sublime
ideas. The realization that GenBank is actually the biggest protein database in the world if you just
would see the proteins instead of hiding them. Furthermore, it was clear that they'd made a lot
of progress, but you still couldn't talk about sequences explicitly. That's because that accession
number talks about the record. It doesn't talk about the sequence. And so you could make what's
called a minor correction to the sequence where I might add ten base pairs to the beginning,
and the accession number would stay the same, but if I had computed and stored anything about
that sequence like an alignment, like where a promoter was or something, it's now off by ten
base pairs and I don't know it. So having put together the Entrez system and stored alignments
and things, we knew it was critical that you'd be able to refer to the sequence itself. So we needed
a stable base pair or amino acid coordinate system, with both of them linked.
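The versioning problem can be sketched concretely: an annotation stored against a bare accession silently goes stale when the sequence is revised, while one stored against accession.version can at least detect the mismatch. The accession and coordinates below are invented.

```python
# Sketch of why versioned sequence identifiers matter: a stored annotation
# records which exact sequence its coordinates refer to, so a revision
# (which bumps the version) is detectable. Accession and coordinates invented.
def parse_versioned(ident):
    """Split 'ACC.N' into (accession, version); version is None if absent."""
    accession, _, version = ident.partition(".")
    return accession, int(version) if version else None

# A promoter location computed and stored against version 1 of the record.
stored_promoter = {"id": "U00001.1", "start": 100, "end": 120}

def annotation_is_current(stored, current_id):
    acc, ver = parse_versioned(stored["id"])
    cur_acc, cur_ver = parse_versioned(current_id)
    return acc == cur_acc and ver == cur_ver

# Ten bases added to the front of the sequence bumps it to U00001.2; the
# stored coordinates are now off by ten, and the version mismatch tells us so.
print(annotation_is_current(stored_promoter, "U00001.2"))
```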
Entrez had both, and we thought we should support this in GenBank. So we proposed it.
An initial response of course of our collaborators was this is a nucleic acid database.
We're talking about feature identifiers. We're not talking about proteins, so we came up with
our own solution, which is we started putting these GI numbers, which are integer identifiers
into the comments. So this is the identifier for this DNA sequence, and here in this, in the comment,
we put in the identifier for this sequence, so we could talk about them, and they came out of
Entrez. We thought that would be helpful. It was helpful to people using Entrez, but it wasn't that
helpful to the collaboration. So it was kind of bizarre. After further discussion, though, and the
involvement of Swiss-Prot - Amos Bairoch made the point that a lot of his data came
from the nucleic acid databases, and he would like it if there were protein identifiers - gradually there
was an agreement across the collaboration, and we started moving on to NIDs, or nucleic acid IDs,
which are really these numbers with a G in front if it came from GenBank, a D in front
if it came from DDBJ, and an E in front if it came from EMBL, and they went on proteins as well.
So protein identifiers began to appear, but these were very confusing and not very informative
to people, because they really didn't track with the entry very well. Okay, so that was a little bizarre, too.
So we moved along and EBI came up with the suggestion, which is well since we have an accession
we could put a version on it instead of this separate little number down here, and we'll just
increment the version each time. Well, we didn't listen to them for a while and argued about it,
because we'd already done all this with these numbers down here, but in fact it was a better idea.
So eventually, we relented, and we switched it now to support a version here, and in fact we
supported an accession and a version on the protein, except - a little bizarre - it's not called
accession on the protein and that's because it's not a protein database, and we're not
accessioning proteins, so instead - it looks like an accession and it behaves like an accession,
but it's a protein ID. So, okay, it's still a little bizarre. We've still got all this stuff in here, all
these extra numbers, so the last round what you see today when you look in a GenBank flat file,
is we still stuck the GI up here in GenBank. You don't see that in EMBL and DDBJ, but the
accession and version are there, same thing for the protein ID and we still keep the GI in the
GenBank version for people who use our resources, but it's just as a separate cross-reference
from another database. And in fact, that works really well. That's been a huge improvement
and it joined sort of some of the notions that we came with and really good feedback from the
community and from our collaborators. Other things have been added here over time to the flat file.
For example, references now actually do get corrected back to Medline or to PubMed, and in fact
the PubMed idea appears in the record and is being exchanged among the collaborators, so that we
actually have a way to correct these and keep them up to date, and we don't pretend that we do it,
we let the librarians do it. Very cool. Another thing which has been added is for organism information.
The organism name, the taxonomy is now linked in fact to a key. There's an identifier in the
taxonomy database. This is another big success and we're using this common taxonomy throughout
the three-way collaboration, and it's also used by every sequence database in the world, including
PDB and Swiss-Prot and Human Prot now as well, so we can at least talk about the same organism.
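The value of that shared key is easy to sketch: records in different databases carry the same integer taxid, so "the same organism" is unambiguous even when the free-text names differ. The record IDs below are invented; the taxids are used purely as example keys.

```python
# Sketch of the shared-taxonomy idea: records in different databases carry
# the same integer taxid as a key, so joining on organism is exact even when
# the free-text organism names differ. Record IDs are invented.
taxonomy = {4932: "Saccharomyces cerevisiae", 562: "Escherichia coli"}

records = [
    {"db": "GenBank",    "id": "U00003.1", "organism": "baker's yeast",            "taxid": 4932},
    {"db": "Swiss-Prot", "id": "P00004",   "organism": "Saccharomyces cerevisiae", "taxid": 4932},
    {"db": "PDB",        "id": "1ABC",     "organism": "E. coli K-12",             "taxid": 562},
]

def records_for_taxid(taxid):
    """All records, from any database, for one organism key."""
    return [r["id"] for r in records if r["taxid"] == taxid]
```

Joining on the name strings here would find nothing; joining on the taxid finds both yeast records at once.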
Also very cool. The number of taxa has been growing tremendously. Since we first began
the taxonomy project at NCBI to the present, there's been a huge increase in the number of organisms
and that's because people are sequencing out further and further and further into the living world,
and it means that not only do the taxonomists have to keep up with all those names and come up with
meaningful names, they have to place them in a taxonomic hierarchy, which at this point I think
what we're saying is as far as we know it doesn't fly in the face of what we know of evolution.
We're not claiming it represents evolution. This has also been very successful. It's done by these guys.
This is the taxonomy group, and of course they're very cool. Okay, to shift gears a little bit,
to sort of advance over the last couple of years. The first microbial genome appeared in 1995.
Haemophilus influenzae, and to quote Dr. Evil it contained more than 1,000,000 base pairs,
which seems like not too much today, but it was a big deal at the time. It contained 1,700 genes.
That was definitely very cool. However, the problem was that most people were using desktop
software that they bought from various vendors, including my former company, and these things
couldn't handle a million base pairs, so right away, as soon as we got to 1.8 million base pairs,
we agreed that no GenBank record would be longer than 350 kb. So instantly that we had assembled
an entire genome, we took it apart, and we distributed it in 350 kb chunks, and of course since
the GenBank collaboration had no way to represent a scaffold or a CONtig in a GenBank record
there was no way to put it back together again, so each of our different sites did it different ways.
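The mechanics of that split are simple to sketch: chop the assembled genome into record-sized chunks, and keep a separate join list saying where each chunk came from - which is exactly the part the flat file of the day could not express. Sizes and ID scheme here are illustrative only.

```python
# Sketch of the 350 kb split: a complete genome gets chopped into record-sized
# chunks, and a separate CON-style join list records how to put them back
# together (the part the flat file could not express). IDs are invented.
CHUNK = 350_000

def split_genome(genome, acc_prefix="CHUNK"):
    chunks, joins = {}, []
    for n, start in enumerate(range(0, len(genome), CHUNK), 1):
        acc = f"{acc_prefix}{n:04d}"
        chunks[acc] = genome[start:start + CHUNK]
        joins.append((acc, start, min(start + CHUNK, len(genome))))
    return chunks, joins

def reassemble(chunks, joins):
    """Rebuild the genome from the chunks using the join list."""
    return "".join(chunks[acc] for acc, _, _ in joins)

genome = "ACGT" * 450_000            # a 1.8 Mb toy genome
chunks, joins = split_genome(genome)  # six records of at most 350 kb each
```

Without the join list, each site had to reinvent its own bookkeeping, which is exactly the situation described above.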
At NCBI, we created Entrez Genomes, a new genome division where we were piecing all the pieces
of complete genomes back together so you could see them as a whole even though we were busily
taking them apart on the way in through GenBank. So out of sort of a lot of this churning came another
simple suggestion from David Lipman, which one should always be wary of, which is there's lots
of different copies of things and bits of things and stuff in GenBank, let's make a list of the best,
most complete sequences in GenBank, so you can use those as kind of your best exemplars
and your best references, and we're going to call that GenBank Select, selected bits of GenBank.
Okay, so that seemed reasonable. So we thought, well, what we'll do is we'll start with complete
genomes, because it's easier to tell if something is complete when it's the whole genome. And there have
been lots of viral genomes sequenced, so those should be the ones that we'll go for first. *** is a major
research target for NIH. This was actually at the height of the interest in ***. It's in the news,
so that's an easy one, should definitely get it. No. There was no complete *** genome in GenBank.
It had been resequenced and completed many, many times in many different labs, and nobody
wanted to submit the complete one, because it wasn't news anymore. There was no reason to publish it.
So we thought, all right, so we understand why, but we should at least have a complete *** genome
in GenBank, so we'll collaborate with experts that have done this re-sequencing, and we'll get them
to deposit a new and complete record with good annotation, and we'll work with them to get a
really good record in. So we worked with the people writing the retroviruses book out of Cold Spring Harbor
to put in about a dozen retroviruses including ***, in conjunction with publishing this book.
And in fact in the supplement of this book are references to these sequences that went in,
and finally an *** genome appeared in the GenBank database submitted by Colombe Chappey
from NCBI in collaboration with the people who did the sequencing here. Okay, that's good.
So now we've got a bunch of retroviruses in. We're on our way to GenBank Select.
A little ways after that, the yeast genome was completed. It was actually completed over many years,
a chromosome at a time was published, each one a big event, but having the whole complete yeast genome
that was definitely cool. That was a good thing. That was a big step. However, each of the chromosomes
was separately owned in the database. That is in the GenBank database the primary submitter is
the owner of the data, and they choose to update it or not update it as they see fit. About half
the yeast genome was transferred to the yeast database, SGD, from the centers in the United States
that had done the sequencing, but the other half was not. So we found ourselves in a strange situation
again, where we worked out with SGD that they would regularly update, from the SGD database,
the half of the yeast genome they owned, and those were kept nicely up to date and annotated,
but the other half wasn't. So we found ourselves in the situation where in GenBank what we
had hoped would be GenBank Select actually for yeast looked like that, where half of it was in
GenBank and half of it you had to go somewhere else to get, and actually it wasn't just yeast.
Yeast was sort of the tipping point, but this was happening with nematode and with various
other things as well, so we realized that we have to do a different model here. And instead we
created the RefSeq database, which is derived largely from GenBank but allows input from other
sources as well and is not GenBank. That is, it is not the primary archive. It's sort of an
invited review article by NCBI where we invite reviews where they're available and where
they're not, we do it ourselves, and we point back to our original source. So that was cool.
We finally started to have complete genomes in a single place. However, it was confusing
because when people sort of lost track of all these connections, the fact that this heavily overlapped
that led to confusion. And people really are still confused to this day about the
difference between things being in GenBank and things being in RefSeq, especially when people
are doing updates, and we're trying to encourage them if they're the primary sequencer to submit
their update to GenBank, not to RefSeq, because it will go to RefSeq, but if you submit to
RefSeq first, it doesn't go back to GenBank, because this is the primary archive. So we still have
this balance, and I think we're still having a bit of difficulty convincing people of the sublime duality
and unity of this. All right, so I'm going to now switch to a couple little historical bits here.
This is a graph of the growth from the base pairs of GenBank in 1992. And so when NCBI took
responsibility for this everyone was looking at the shape of this curve and saying,
danger. This is obviously a problem, because everything is just shooting off the end here.
This is the same curve in 2008, and there's that danger point. In fact, you can't even really see
it here. It's kind of down there. And now we're looking at this saying, danger. So back here
we didn't actually have a good answer for how we were going to deal with this. We were just
cocky and young and naive. Up here, frankly, I am worried about this, but I'm an old guy now
and maybe we should follow Rich Roberts' suggestion here and stop funding the old guys like Rich,
and get some young people in here that are still sort of cocky and naive enough to not be concerned
about this and just plow ahead, and that's certainly what we're trying to do in NCBI is bring in
young people to keep us going. But there's another aspect to this, which is it's not just the base pairs.
If you look at the growth in the number of entries, that's also going up really fast. So it isn't like
we're just getting huge, long sequences here. We're also getting lots and lots of the smaller type
of sequences, because sequencing is in every lab there is. It's a way of answering all kinds of questions.
It's being applied in all sorts of ways, so it's not like dealing with these very big sequences
meant we didn't have to deal with all the other kinds we were already dealing with. In fact,
we have to deal with them both. This obviously is a concern because when you start talking about
these smaller records, these aren't big machine-generated type of things. They still have to be
handled by individuals. They're still coming in from individual scientists that you have to talk to,
and you have to deal with. In 1998, these are the people that dealt with all of that. These are the
GenBank indexers, a smiling crew. There's your host from earlier this morning as well as a
number of people that are still here. Obviously a very cool group. If we look at the GenBank indexers
in 2008, also clearly a cool group - but the thing that's striking about this is this is about
the same number of people. We don't have a huge number of additional people compared to
the growth of the number of entries or base pairs here. In fact, if you look at this over time
here's the growth of sort of the individual hand-held class of sequences and here's the growth
in the number of GenBank indexers, up to about here, when David Lipman said we can't keep
hiring more and more indexers. We've got to do something else. And in fact we found we could
do something else, and this was a combination of a lot of hard work and a lot of thinking about
what is the process we can do. What is realistic for us to do in terms of curation to sort of make
sure that good quality data keeps flowing into the database versus what's too much and what's too
sort of theoretical for us to spend time doing? And in fact we were able to bring the number
back down but maintain the quality of the database, and this has now held remarkably steady
through the years and through this growth, and in fact through this later period a number
of the people listed here have been routed off onto other tasks handling some of these big input flows
from other sources like bacterial genomes or large-scale sequencing projects. Part of what made
that possible is the GenBank software and support group seen here, and of course if you see
a bunch of geeks like this, you know they must be cool, and one of the things that this group
dealt with, in addition to all the others we've seen, is this phenomenon, which I've alluded to
a little bit, but over the course from 1992 - actually, this only goes up to 2007, but it still applies -
the content of GenBank changes. What this shows is the percent of the database by the division that
it falls into where these are the sort of the organismal divisions, bacteria, invertebrates,
and then up here this is ESTs, and you can see we didn't have any in the beginning. We got a lot.
It's still a significant portion of the database, but a declining portion. You can see here, for example,
this is the high throughput genome sequences, so this is when the Human Genome Project
went to using draft sequence. This was a whole new class of human genome sequence coming in
in pieces, sort of developing within the database. The collaboration had to introduce the concept
of CONtigs and scaffolds and things to handle this. This wasn't handled just by NCBI.
This was handled by GenBank, EMBL, and DDBJ. We had to find ways to exchange all this data,
to accession it, to understand what it was, to set up rules for how to handle it. Here you can see
the high throughput draft sequence going down, and whole genome shotgun coming up. That's this green area.
Whole genome shotgun now makes up half the base pairs in the database. Environmental samples,
you can't really see, but they're starting to come up here. We've already sequenced the Sargasso Sea.
There's lots more coming and, of course, huge population studies on their way, thousand human genomes,
high throughput sequence. And really in a sense it's not just the massive data, it's this changing landscape
that we have to deal with both within GenBank and across the collaboration. So over the years
from '92 to 2008 what seemed to be GenBank has broken out into Entrez, other related resources
like dbEST, the taxonomy system, submission tools, and new databases for genome survey sequences,
Web-based submission, high throughput genome sequences, the Trace archive which is now
coupled to GenBank through shared IDs, proteins being mapped into this system, the RefSeq project,
whole genome shotgun assemblies, high throughput cDNAs, CONtig division for putting scaffolds together,
whole genome sequences, environmental samples, way of grouping whole genome projects together
that contain multiple chromosomes, third party annotation databases - way for people to put
their own annotations on the database or provide supplementary experimental information,
assembly - how to represent hierarchies of assembly of pieces, and that has been an amazing process,
coming together into a unified whole of these many, many pieces, more than two. And one more thing -
not quite the end - we've dealt with this huge increase in base pairs, this changing landscape
of different types of sequences, much more sophisticated tools for examining the data, a much
broader underlying basis for the data, as well as the meaning that comes from putting the pieces together,
bringing us in this unity of pieces to 2008 where now at this point through all of this process
when you ask someone, when you think of GenBank, what do you see? They see the flat file. [Laughter]
Sublime or bizarre, you be the judge. Thank you. [Applause]