Thank you. Well, it's a pleasure to be here and talk about the 25 years of GenBank. I've just done -
I'll talk about my 25 years at EMBL. The EBI, the European Bioinformatics Institute is what has
grown out of the EMBL data library. It's part of EMBL, and in a sense it's the nearest thing to a
European analogue to NCBI, so it's interesting to be here. I started back in '82 on the EMBL
Data Library Project, and this all started from discussions in the scientific journals that I saw
from 1980, where basically in Nature there was an editorial saying basically that there was a
major problem with DNA sequence data, because we'd already got an overwhelming amount of data
which was a 100 kilobases, and how was anyone going to deal with this flood of data, which was
going to be difficult to deal with? And it was pointed out that sequencing as a way of doing research
was a very risky business that a graduate student might stumble on an iconoclastic stretch of DNA
or it might take a bevy of experts several years to reveal nothing more than a megabase of
interesting sequence, so there were some concerns there. And this editorial pointed out that there
didn't exist any central databank in which all sequences could be deposited and made available.
So EMBL - the European Molecular Biology Laboratory - decided to act, and the two key players
in the EMBL at that time were Ken Murray, now Sir Ken Murray, and Greg Hamm, and they
gathered together a group of experts both from the US and Europe and the rest of world, indeed,
to talk about the possibility of setting up a data library or a databank, and they decided that EMBL
would do this. Actually, the minutes of the first meeting about it said "and it could require a full-time
staff member to run this." And after a further meeting in 1981, they began to iron out the details,
and Greg Hamm took on the leadership of the EMBL Project, and this is Greg Hamm here, and
Christian Burks, who you'll hear from later today. Who was - Greg - this is them collaborating
on a riverbank somewhere. When I joined the EMBL, we had an office, a telephone, a computer
terminal, and 1,500 letters from people asking us to send them the data. We didn't have a database
or really any data. EMBL had seen fit to announce this to the world without actually putting any
resources in to do the job, so there was a little bit of pressure on us. And Greg got to work on
designing the first EMBL data library format, and actually at that stage what we were talking
about was formats, not really databases. And if you look at this format that Greg was scratching
out, and this I got out of his archives, and compare it to what we see today in the EMBL database,
you'll see that it's not profoundly different, which means that either we were very wise and got it right
or else we poisoned things forever. I'm not sure which. And, of course, Europe being pretty snotty
about this - we boasted about it, and there was a publication in Nature saying Europe leads on
sequences, and we presented a nice little entry in the database here, and of course, we always
do acknowledge where the thing had been published, and we'd started the way we managed
to go on and we'd actually succeeded in screwing up even the entry that we presented in the
first publication about the database, and there was a quick rebuttal in Nature saying that we'd actually
cited the wrong paper and if we looked at the Georgetown database, we'd find the right paper
for that entry that we presented in the journal article, so experience I suppose enables us to -
we've made so many kinds of screw-ups that now we probably have more knowledge of what
to avoid than most people who've ever done this job. Not everyone was so rude to us. In fact,
in 1982 when we sent the first release of the database out, we got a nice thank you letter from
somebody called Takashi Gojobori, who was at the University of Texas at Houston, and you'll
hear from him later on in the program, so this was in 1982. Thank you, Takashi. Within a few
months of establishing this project, of course, we heard that the GenBank Project was
established, and Ken and Greg both took the view immediately that it was best to cooperate on this,
and Greg set off to the USA to talk to people about how we could work together on the production
of this information resource, and this was the beginning of the EMBL/GenBank collaboration,
which is now the EMBL/GenBank/DDBJ collaboration, of course. In the GenBank Call it was
also worth noting that not only were we worried about the amount of data, the 100,000 kilobases
or so - with 100,000 bases or so, but we actually were also worried about the number of users.
We felt we might have as many as 300 users out there in the world, so it was something we needed
to take care about. Right in the beginning when we sent out the first release of the database, and after
discussion, we had this sentence: "This manual and the database it accompanies may be copied
and redistributed freely without advance permission, provided that this statement is reproduced
with each copy." And that was an important statement, and actually if you go to the documentation
of the latest releases of the EMBL database, you'll see exactly the same statement still appears.
And I think that this collaboration was important in maintaining the right of way to these data.
So I think that was one of the great achievements of the collaboration. Of course, in these early days
we were much criticized. There were two databases actually. There was EMBL and there was
GenBank. What was not desperately clear to the community - actually they both were seeded by
collections that already existed. The EMBL had adopted a collection produced by Kurt Stuber,
in the University of Cologne, and GenBank, of course, was seeded by the work of Walter Goad,
and the two were different. We had extensive discussions and exchanges. We actually had our first
what has now become routine annual collaborative meeting in 1985, and actually there was no
disagreement between the two groups at all. We were actually quite hurt by statements that said things
like "they should be locked up in a room until they agree." We were actually struggling and cooperating
to try and find ways of presenting this information that avoided the weaknesses of both our ways
of doing it. There was no disagreement whatsoever. So they didn't actually lock us in a room.
They did something much worse. They sent us to a meeting in Tyson's Corner and locked us up
in Tyson's Corner for a couple of weeks to actually try and resolve the differences between
the two databases and in fact that was a very productive meeting despite the fact that we were in
a hotel that didn't even have a bar. And the new feature table was designed, which actually
combined - which drew the most difficult part of the two databases together, and it's actually
still in use today after many rounds of revision. Also at that time we were faced by the white-hot
technology that was coming along, so we were being told that we had to respond to this new thing
called CD-ROM and produce the data on CD-ROM so that takes you back a little bit. But in fact,
if you looked at the data submissions that we were getting, they weren't always that desperately
high tech, so this is an example of a data submission from the Chambon lab in 1987 that we were working
from to include the data in the database, so it was quite challenging and quite frustrating sometimes.
In 1987, NIH and the EMBL arranged a joint meeting at the EMBL in Heidelberg, and I as a young
researcher then had the responsibility for the Heidelberg organization of this meeting, and basically
it stressed the importance of this information infrastructure. I think the most interesting thing
from an organizer's perspective in the meeting was that I was very nervous about it, and I
arranged all the details down to telling people how to get from the meeting venue back to the
guesthouse where we were staying, and then the day before the meeting, the foresters felled
about 20 trees across the path that I had told them to use, and then in the night it snowed, and
despite that when the bar closed at the end of the first day's session, Fred Blattner managed
to carry a case of beer all the way back from the bar to the guesthouse over the trees and through
the snow, so - we had a number of documents that were prepared as input to that meeting, and
Jim Fickett, who I don't think has been mentioned so far, was working at Los Alamos, and he
prepared this paper about the rate of sequencing. It was quite interesting. He said correctly at
that time we were getting about 10,000 nucleotides a day going into the databases, and he
suggested that in 15 to 30 years, we could be seeing as much as 1,000,000 bases a day.
Well, Jim, close but not really spot-on. Now 20 years later, we're actually seeing about
100,000,000 base pairs a day going into the databases, so it's two orders of magnitude more
than we were predicting. Christian Burks, who will be speaking later, had started to notice the
importance of the proliferation of information resources, and produced a little database called LiMB,
which was a list of databases, and there were 20 data banks included in that. Rich Roberts has
already mentioned there are actually now about 1,000 or more databases in molecular biology,
which we have to make sense of. The concrete thing that this meeting did was it established an
international committee to guide the collaboration on DNA sequence databases and probably
resulted in that well-known phrase that "He who pays the piper, establishes a committee to call the tune."
At least that's the way it goes through funding agencies. Actually, this committee turned out to be
extremely useful, and it's still in existence today. But as has already been pointed out, it didn't
seem desperately sane to have people typing A's, C's, T's, and G's into the computers when,
in fact, the scientists had their data on the computer. They put it on the pages of journals, and
then we read the journals and put them back in the computer, and so we decided that it was time
to do something more sane and approach the journals to try and get some movement on direct
submission of machine-readable data. And, in fact, at this time Rich Roberts - who was actually
instrumental in this with his work at NAR that he's talked about - had written to Jim Cassatt,
who's here today - Jim was the project officer for the first GenBank Project, and Rich had
written to him saying, you know, we need to have a scheme where the author submits the
data to the database back in 1987. And we were doing this work on the European side. In fact,
Patricia Kahn here was tireless in her communications with the journal editors to get the data
submitted directly. This daughter of mine has just started university, so this was a while ago,
as you can see. And so Patricia wrote to dozens of journal editors with a scheme whereby
accession numbers would be made available to authors who submitted the data. And annoyingly,
the most resistant journal at this stage was Nature who regarded themselves as the record of
science, and these database guys were a bunch of - as far as they could see - secretaries and
computer geeks, and why should they be fiddling in the record of science. Though they did say
there was some precedent in crystallography, which wasn't a very good precedent actually.
So they said that their journals couldn't act as watchdog, and Nature's line was that they shouldn't
put obstacles between the author and publications, and we had a long backwards-and-forwards
with them in 1989. They said that they were not willing to exact unenforceable promises to submit
data, which actually has proven to be complete nonsense. So we wrote back robustly and kept
going on about it, and they never actually climbed down, but eventually they came around,
though I don't think they ever published a formal statement on the issue. Of course, one of the
things you have to realize about these days was that when biologists said database, they didn't
mean database; they meant some kind of collection of information in some vaguely unified way
in a computer, and it was in the mid-'80s we started to realize that actually we needed to use
proper database technology to get our act in gear, and the EMBL group started to work with
Oracle. GenBank used Sybase, and we still use Oracle today, and I think GenBank uses Oracle
as well, if I remember correctly. David Hazeldine, sadly dead now, was the guy who did all the
database work at the EMBL at that time. Also, in the early days we simply used to throw the data
at people and tell them to get on with it. We gave them no help in searching the data whatsoever,
and our own little tool that we started to produce at that time was a search system for our
CD-ROM data, and at the same time NCBI was producing Entrez and the tools and the software
both became necessary to the task. We also realized that actually a lot of the useful annotation of
data should come along after the fact. So basically one of the problems with DNA sequence data
is that what one tends to capture is just the knowledge about the sequence when it's deposited,
but not corrections or updates or new annotation, and Rainer Fuchs and I proposed a model for this
which we never managed to really get implemented, which is effectively equivalent to the
modern distributed annotation system, which is in use in the genome world. Of course,
throughout all this I was being rather ambitious and saying that we needed more than the EMBL
data library; we actually needed an institution whose mandate was to deal with broad biomolecular
information, and I started writing proposals for something that became the EBI back in 1987.
Actually, I wrote the first draft in Pasadena after visiting Lee Hood and realizing that if people
were thinking big on this side of the Atlantic, we should be doing the same on the other side.
So we started proposing the EBI, and eventually after battling with our council, we got an agreement
in 1993 to start up a European Bioinformatics Institute with 69 staff, and we had costed them all.
I have no idea why this had to be 69 not 70, but for some reason it had to be 69. And so now,
we'd calculated a budget for these 69 staff. I'm glad to see the first 347 of these were already
in post in 2007, and if you scale the budget appropriately, actually we were within 2.4 percent
of the overhead cost. So I don't know what anyone is complaining about. If that's true, we've got
a few more staff than we said we would have. And the decision was then taken by the EMBL council
to locate this European Bioinformatics Institute in the UK on the Wellcome Trust Genome Campus
alongside the Sanger Institute. The initial assumption had actually been that it would be in
Heidelberg beside the parent Laboratory, but there was an open competition, which included bids
from Germany, the UK, and Sweden. At the launch of the EBI, we saw our mission as at the core
really, the service activities we'd always been involved in, but also having a research component
and, of course, the industrial relevance of all of this had burgeoned in the meantime, and we had
a special program to support industry. We had three research leaders, Chris Sander, Liisa Holm, and
Christos Ouzounis, and the director was Paolo Zanella with Michael Ashburner and I, working with him.
Actually, the first EBI was an office somewhere in this horrible little building here. I was employee number one.
And this was the Wellcome Trust Genome Campus; the Sanger Institute was here. It had been a
material science research laboratory where they did things like shoot aircraft wings until they fell to pieces.
So it was rather weird to see people doing sequencing in a laboratory that had a 30-ton beam crane
for moving equipment around. These temporary buildings, this box here, then became the EBI,
provided by the MRC, the Medical Research Council, in the UK in the interim period. Meanwhile,
they gift-wrapped Hinxton Hall and did some work on it, and started to build the EBI very rapidly, actually,
over a period of a year and a half, and eventually we took delivery of this rather nice building
on the Genome Campus. I think any of you who have been there will know that it's a very pleasant
location. The first director was Paolo Zanella, as I said. This is him with the then director general
of EMBL Fotis Kafatos. We've moved out of our temporary building. This is Peter Stoehr,
who is one of the key players in EBI. Moving us off this building into our new building. We look
a scruffy lot, I admit. And, eventually, there was a formal opening of the building. I don't know
whether anyone recognizes this person here. For those of you who don't, it's Princess Anne
who came and opened the new building for us. And at this stage the EBI had increased its
mandate beyond the first three prongs of research, service, and industry to include a heavy
training component. The core of the EBI at that stage, of course, was what we'd, you know,
we'd come to call EMBL-Bank, which is the department which collaborates with GenBank, but
also in Heidelberg, we'd already started a collaboration with Swiss-Prot, now at the Swiss
Institute of Bioinformatics. It was at the University of Geneva, initially. When we started up the
EBI, we got a very strong steer from our guidance committee to get involved in the three-dimensional
structures, and we're now the European partner in the worldwide PDB database of macromolecular structures.
Other projects that have subsequently become important are the InterPro database of protein families,
Ensembl on genomes, and the ArrayExpress repository for transcriptomics data. And in the meantime,
Swiss-Prot has actually attracted US funding and become UniProt, and is, we think, the world's
definitive collection of protein sequence data. There were, of course, many other things that we
were doing at EBI, a whole range of other information resources, and most challengingly, all the
connections between them, which we still don't do a good job of representing in our services really,
to tell the truth. And, of course, like anyone else we run our services over our website, and we
see several million hits a day from people coming in and taking advantage of this information
from all over the world. We've since grown out of our old building and we've moved into a new
building, which has been built by the Wellcome Trust, MRC, and other UK funding agencies, and
it's got a wonderful training facility, which reflects our expanded mission. And this is me writing
the next proposal for the next round of funding for the EBI, which always requires a little bit of
lubrication. So all of this has been possible because of the help and the work and the dedication
of a huge number of people, but basically actually the project has been supported through four
directors general of EMBL. It was initiated by Sir John Kendrew. Lennart Philipson actually
made it all happen (I was mostly working under Lennart), then Fotis Kafatos, and now the current
director general, Iain Mattaj. Ken Murray and Greg Hamm effectively started the project of EBI,
and Patricia Kahn was really the scientific brains of the project in the early days, then joined by
Rainer Fuchs and Peter Stoehr, and David Hazeldine on the database side. The three directors of the
EBI, Paolo Zanella, Michael Ashburner, and now Janet Thornton have all been great supporters,
and John Sulston was actually one of the great supporters in making the transition to the UK.
We've had support always, of course, from EMBL but also from the Wellcome Trust,
the Medical Research Council, and other UK research councils, very substantial support from the
European Commission. Now we have quite a lot of support from our industry partners, and, of course,
all the staff of the EBI. Thank you. (Applause)