Graham Cameron Speaking at the GenBank 25th Anniversary

Thank you. Well, it's a pleasure to be here and talk about the 25 years of GenBank. I've just done - I'll talk about my 25 years at EMBL. The EBI, the European Bioinformatics Institute, is what has grown out of the EMBL Data Library. It's part of EMBL, and in a sense it's the nearest thing to a European analogue to NCBI, so it's interesting to be here. I started back in '82 on the EMBL Data Library Project, and this all started from discussions in the scientific journals that I saw from 1980, where basically in Nature there was an editorial saying basically that there was a major problem with DNA sequence data, because we'd already got an overwhelming amount of data, which was 100 kilobases, and how was anyone going to deal with this flood of data, which was going to be difficult to deal with? And it was pointed out that sequencing as a way of doing research was a very risky business, that a graduate student might stumble on an iconoclastic stretch of DNA or it might take a bevy of experts several years to reveal nothing more than a megabase of interesting sequence, so there were some concerns there. And this editorial pointed out that there didn't exist any central databank in which all sequences could be deposited and made available. So EMBL - the European Molecular Biology Laboratory - decided to act, and the two key players in the EMBL at that time were Ken Murray, now Sir Ken Murray, and Greg Hamm, and they gathered together a group of experts both from the US and Europe and the rest of the world, indeed, to talk about the possibility of setting up a data library or a databank, and they decided that EMBL would do this. Actually, the minutes of the first meeting about it said "and it could require a full-time staff member to run this." And after a further meeting in 1981, they began to iron out the details, and Greg Hamm took on the leadership of the EMBL Project, and this is Greg Hamm here, and Christian Burks, who you'll hear from later today.
Who was - Greg - this is them collaborating on a riverbank somewhere. When I joined the EMBL, we had an office, a telephone, a computer terminal, and 1,500 letters from people asking us to send them the data. We didn't have a database or really any data. EMBL had seen fit to announce this to the world without actually putting any resources in to do the job, so there was a little bit of pressure on us. And Greg got to work on designing the first EMBL data library format, and actually at that stage what we were talking about was formats, not really databases. And if you look at this format that Greg was scratching out, and this I got out of his archives, and compare it to what we see today in the EMBL database, you'll see that it's not profoundly different, which means that either we were very wise and got it right or else we poisoned things forever. I'm not sure which. And, of course, Europe being pretty snotty about this - we boasted about it, and there was a publication in Nature saying Europe leads on sequences, and we presented a nice little entry in the database here, and of course, we always do acknowledge where the thing had been published, and we'd started the way we managed to go on and we'd actually succeeded in screwing up even the entry that we presented in the first publication about the database, and there was a quick rebuttal in Nature saying that we'd actually cited the wrong paper and if we looked at the Georgetown database, we'd find the right paper for that entry that we presented in the journal article, so experience I suppose enables us to - we've made so many kinds of screw-ups that now we probably have more knowledge of what to avoid than most people who've ever done this job. Not everyone was so rude to us.
In fact, in 1982 when we sent the first release of the database out, we got a nice thank you letter from somebody called Takashi Gojobori, who was at the University of Texas at Houston, and you'll hear from him later on in the program, so this was in 1982. Thank you, Takashi. Within a few months of establishing this project, of course, as we've heard, the GenBank Project was established, and Ken and Greg both took the view immediately that it was best to cooperate on this, and Greg set off to the USA to talk to people about how we could work together on the production of this information resource, and this was the beginning of the EMBL/GenBank collaboration, which is now the EMBL/GenBank/DDBJ collaboration, of course. In the GenBank Call it was also worth noting that not only were we worried about the amount of data, the 100,000 kilobases or so - with 100,000 bases or so, but we actually were also worried about the number of users. We felt we might have as many as 300 users out there in the world, so it was something we needed to take care about. Right in the beginning when we sent out the first release of the database, and after discussion, we had this sentence: "This manual and the database it accompanies may be copied and redistributed freely, without advance permission, provided that this statement is reproduced with each copy." And that was an important statement, and actually if you go to the documentation of the latest releases of the EMBL database, you'll see exactly the same statement still appears. And I think that this collaboration was important in maintaining the right of way to these data. So I think that was one of the great achievements of the collaboration. Of course, in these early days we were much criticized. There were two databases actually. There was EMBL and there was GenBank. What was not desperately clear to the community - actually they both were seeded by collections that already existed.
The EMBL had adopted a collection produced by Kurt Stuber, in the University of Cologne, and GenBank, of course, was seeded by the work of Walter Goad, and the two were different. We had extensive discussions and exchanges. We actually had our first what has now become routine annual collaborative meeting in 1985, and actually there was no disagreement between the two groups at all. We were actually quite hurt by statements that said like they should be locked up in a room until they agree. We were actually struggling and cooperating to try and find ways of presenting this information that avoided the weaknesses of both our ways of doing it. There was no disagreement whatsoever. So they didn't actually lock us in a room. They did something much worse. They sent us to a meeting in Tyson's Corner and locked us up in Tyson's Corner for a couple of weeks to actually try and resolve the differences between the two databases and in fact that was a very productive meeting despite the fact that we were in a hotel that didn't even have a bar. And the new feature table was designed, which actually combined - which drew the most difficult part of the two databases together, and it's actually still in use today after many rounds of revision. Also at that time we were faced by the white-hot technology that was coming along, so we were being told that we had to respond to this new thing called CD-ROM and produce the data on CD-ROM so that takes you back a little bit. But in fact, if you looked at the data submissions that we were getting, they weren't always that desperately high tech, so this is an example of a data submission from the Chambon in 1987 that we were working from to include the data in the database, so it was quite challenging and quite frustrating sometimes. 
In 1987, NIH and the EMBL arranged a joint meeting at the EMBL in Heidelberg, and I as a young researcher then had the responsibility for the Heidelberg organization of this meeting, and basically it stressed the importance of this information infrastructure. I think the most interesting thing from an organizer's perspective in the meeting was that I was very nervous about it, and I arranged all the details down to telling people how to get from the meeting venue back to the guesthouse where we were staying, and then the day before the meeting, the foresters felled about 20 trees across the path that I had told them to use, and then in the night it snowed, and despite that when the bar closed at the end of the first day's session, Fred Blattner managed to carry a case of beer all the way back from the bar to the guesthouse over the trees and through the snow, so - we had a number of documents that were prepared as input to that meeting, and Jim Fickett, who I don't think has been mentioned so far, was working at Los Alamos, and he prepared this paper about the rate of sequencing. It was quite interesting. He said correctly at that time we were getting about 10,000 nucleotides a day going into the databases, and he suggested that in 15 to 30 years, we could be seeing as much as 1,000,000 bases a day. Well, Jim, close but not really spot-on. Now 20 years later, we're actually seeing about 100,000,000 base pairs a day going into the databases, so it's two orders of magnitude more than we were predicting. Christian Burks, who will be speaking later, had started to notice the importance of the proliferation of information resources, and provided a little database called LiMB, which was a list of databases, and there were 20 data banks included in that. Rich Roberts has already mentioned there are actually now about 1,000 or more databases in molecular biology, which we have to make sense of.
The concrete thing that this meeting did was it established an international committee to guide the collaboration on DNA sequence databases and probably resulted in that well-known phrase that "He who pays the piper, establishes a committee to call the tune." At least that's the way it goes through funding agencies. Actually, this committee turned out to be extremely useful, and it's still in existence today. But as has already been pointed out, it didn't seem desperately sane to have people typing A's, C's, T's, and G's into the computers when, in fact, the scientists had their data on the computer. They put it on the pages of journals, and then we read the journals and put them back in the computer, and so we decided that it was time to do something more sane and approach the journals to try and get some movement on direct submission of machine-readable data. And, in fact, at this time Rich Roberts - who was actually instrumental in this with his work in NAR that he's talked about - had written to Jim Cassatt, who's here today - Jim was the project officer for the first GenBank Project, and Rich had written to him saying, you know, we need to have a scheme where the author submits the data to the database, back in 1987. And we were doing this work on the European side. In fact, Patricia Kahn here was tireless in her communications with the journal editors to get the data submitted directly. This daughter of mine has just started university, so this was a while ago, as you can see. And so Patricia wrote to dozens of journal editors with a scheme whereby accession numbers would be made available to authors who submitted the data. And annoyingly, the most resistant journal at this stage was Nature, who regarded themselves as the record of science, and these database guys were a bunch of - as far as they could see - secretaries and computer geeks, and why should they be fiddling in the record of science.
Though they did say there was some precedent in crystallography, which wasn't a very good precedent actually. So they said that their journals couldn't act as watchdog, and Nature's line was that they shouldn't put obstacles between the author and publications, and we had a long backwards-and-forwards with them in 1989. They said that they were not willing to exact unenforceable promises to submit data, which actually has proven to be complete nonsense. So we wrote back robustly and kept going on about it, and they never actually climbed down, but eventually they came around, though I don't think they ever published a formal statement on the issue. Of course, one of the things you have to realize about these days was that when biologists said database, they didn't mean database; they meant some kind of collection of information in some vaguely unified way in a computer, and it was in the mid-'80s we started to realize that actually we needed to use proper database technology to get our act in gear, and the EMBL group started to work with Oracle. GenBank used Sybase, and we still use Oracle today, and I think GenBank uses Oracle as well, if I remember correctly. David Hazeldine, sadly dead now, was the guy who did all the database work at the EMBL at that time. Also, in the early days we simply used to throw the data at people and tell them to get on with it. We gave them no help in searching the data whatsoever, and our own little tool that we started to produce at that time was a search system for our CD-ROM data, and at the same time NCBI was producing Entrez, and the tools and the software both became necessary to the task. We also realized that actually a lot of the useful annotation of data should come along after the fact.
So basically one of the problems with DNA sequence data is that what one tends to capture is just the knowledge about the sequence when it's deposited, but not corrections or updates or new annotation, and Rainer Fuchs and I proposed a model for this which we never managed to really get implemented, which is effectively equivalent to the modern distributed annotation system, which is in use in the genome world. Of course, throughout all this I was being rather ambitious and saying that we needed more than the EMBL data library; we actually needed an institution whose mandate was to deal with broad biomolecular information, and I started writing proposals for something that became the EBI back in 1987. Actually, I wrote the first draft in Pasadena after visiting Lee Hood and realizing that if people were thinking big this side of the Atlantic, we should be doing the same on the other side. So we started proposing the EBI, and eventually after battling with our council, we got an agreement in 1993 to start up a European Bioinformatics Institute with 69 staff, and we had costed them all. I have no idea why this had to be 69 not 70, but for some reason it had to be 69. And so now, we'd calculated a budget for these 69 staff. I'm glad to see the first 347 of these were already in post in 2007, and if you scale the budget appropriately, actually we were within 2.4 percent of the overhead cost. So I don't know what anyone is complaining about. If that's true, we've got a few more staff than we said we would have. And the decision was then taken by the EMBL council to locate this European Bioinformatics Institute in the UK on the Wellcome Trust Genome Campus alongside the Sanger Institute. The initial assumption had actually been that it would be in Heidelberg beside the parent Laboratory, but there was an open competition, which included bids from Germany, the UK, and Sweden.
At the launch of the EBI, we saw our mission as at the core really, the service activities we'd always been involved in, but also having a research component and, of course, the industrial relevance of all of this had burgeoned in the meantime, and we had a special program to support industry. We had three research leaders, Chris Sander, Liisa Holm, and Christos Ouzounis, and the director was Paolo Zanella with Michael Ashburner and I, working with him. Actually, the first EBI was an office somewhere in this horrible little building here. I was employee number one. And this was the Wellcome Trust Genome Campus, the Sanger Institute was here. It had been a material science research laboratory where they did things like shoot aircraft wings until they fell to pieces. So it was rather weird to see people doing sequencing in a laboratory that had a 30-ton beam crane for moving equipment around. These temporary buildings, then being this box here became the EBI provided by the MRC, the Medical Research Council, in the UK in the interim period. Meanwhile, they gift wrapped Hinxton Hall and did some work on it, and started to build the EBI very rapidly, actually, over a period of a year and a half, and eventually we took delivery of this rather nice building on the Genome Campus. I think any of you who have been there will know that it's a very pleasant location. The first director was Paolo Zanella, as I said. This is him with the then director general of EMBL, Fotis Kafatos. We've moved out of our temporary building. This is Peter Stoehr, who is one of the key players in EBI. Moving us off this building into our new building. We look a scruffy lot, I admit. And, eventually, there was a formal opening of the building. I don't know whether anyone recognizes this person here. For those of you who don't, it's Princess Anne who came and opened the new building for us.
And at this stage the EBI had increased its mandate beyond the first three prongs of research, service, and industry to include a heavy training component. The core of the EBI at that stage, of course, was what we'd, you know, we'd come to call EMBL-Bank, which is the department which collaborates with GenBank, but also in Heidelberg, we'd already started a collaboration with Swiss-Prot, now at the Swiss Institute of Bioinformatics. It was at the University of Geneva, initially. When we started up the EBI, we got a very strong steer from our guidance committee to get involved in the three-dimensional structures, and we're now the European partner in the worldwide PDB database of macromolecular structures. Other projects that have subsequently become important are the InterPro database of protein families, Ensembl on genomes, and the ArrayExpress repository for transcriptomics data. And in the meantime, Swiss-Prot has actually attracted US funding and become UniProt, and is, we think, the world's definitive collection of protein sequence data. There were, of course, many other things that we were doing at EBI, a whole range of other information resources, and most challengingly, all the connections between them, which we still don't do a good job of representing in our services really, to tell the truth. And, of course, like anyone else we run our services over our website, and we see several million hits a day from people coming in and taking advantage of this information from all over the world. We've since grown out of our old building and we've moved into a new building, which has been built by the Wellcome Trust, MRC, and other UK funding agencies, and it's got a wonderful training facility, which reflects our expanded mission. And this is me writing the next proposal for the next round of funding for the EBI, which always requires a little bit of lubrication.
So all of this has been possible because of the help and the work and the dedication of a huge number of people, but basically actually the project has been supported through four directors general of EMBL. It was initiated by Sir John Kendrew. Lennart Philipson actually made it all happen. I was mostly working under Lennart, Fotis Kafatos, and now the current director general, Iain Mattaj. Ken Murray and Greg Hamm effectively started the project of EBI, and Patricia Kahn was really the scientific brains of the project in the early days, then joined by Rainer Fuchs and Peter Stoehr, David Hazeldine on the database side. The three directors of the EBI, Paolo Zanella, Michael Ashburner, and now Janet Thornton have all been great supporters, and John Sulston was actually one of the great supporters in making the transition to the UK. We've had support always, of course, from EMBL but also from the Wellcome Trust, the Medical Research Council, and other UK research councils, very substantial support from the European Commission. Now we have quite a lot of support from our industry partners, and, of course, all the staff of the EBI. Thank you. Applause