FEMALE SPEAKER: All right, folks.
Well, it sounds like it's about time to get started.
So good morning, ladies and gentlemen.
We are pleased to have with us here this morning Dr. Tim
Hubbard, who is the gentleman at the Wellcome Trust Sanger
Institute who is responsible for the group that--
what is the correct term?
TIM HUBBARD: Annotates the human genome.
FEMALE SPEAKER: Thank you very much.
And his group was responsible for annotating one third of the
human genome sequence.
So without further ado, I will turn the mike over to Dr.
Hubbard, who will give us his presentation on keeping up
with the human genome.
TIM HUBBARD: Thanks very much.
So it's nice to be back at Google.
Oh, this is interesting.
My mouse has died.
I was here for the Google Foo Camp in August, and I didn't
really talk about this particular thing.
I talked more about openness things then.
So I thought since I'm in the area for the CASP competition,
which has just finished yesterday, that I'd come and
drop in and talk about this.
So this is about genomes.
Genomes are a very, very recent occurrence, as far as having
the sequence is concerned.
We've had the first one only since 1995, and
since about 2000, we've had the human genome,
which is huge.
And the problem is that lots of people want to look at this
data, use this data, integrate it.
So I'm going to talk about how we do that in Ensembl, which
is one of the big three browsers that provides access.
A bit of background, since I'm in the US, and a company, and
you probably don't know some of this background.
Wellcome Trust Sanger Institute--
we sequenced one third of the human genome as part of the
public international partnership that put it out in
the public domain.
We're a big center by pretty much any standards--
not as big as you, but biggish.
So we have a 1,000-square-meter computer
center, for example, which we're doing rotation farming
on, filling it up with nodes each year, killing off one of
the nodes and replacing it, that sort of approach.
We're funded by the Wellcome Trust. Wellcome Trust used to
be the richest charity in the world until the Gates
Foundation overtook it, which they're kind of upset about,
but anyway.
It's a very big charity and we spend around 15% of their
spend each year.
So we're doing big science, high throughput data.
And all of our data is put out into the public domain, all
the large-scale projects.
And we're based on a campus just outside
Cambridge in the UK.
And we have another institute on that campus, the European
Bioinformatics Institute.
That's the equivalent of NCBI, so you've got Sanger
Informatics and EBI Informatics next door, maybe
400, 500 informaticians, as well as another 400, 500
experimental scientists in the Sanger.
So one of the big projects I'm involved in, I'm one of the
leads on, is Ensembl.
This started out dealing with just the human genome.
It's now got around 30 genomes in it, 30 big genomes.
It's got references going right down to yeast, but most
of what it's got is the large three gigabase genomes.
So a lot of the important stuff about having these
genomes is the sequence of relationships between them.
Some things are extremely distant--
this is a family tree of genomes.
Some things are very related, such as-- you know, right at
the top, you've got chimpanzee.
Then you've got a lot of mammals.
Then you've got more distant things, going down to other
model organisms colored in blue, which have been
completely sequenced.
Most of the other ones are in a fairly rough state.
Now in terms of informatics, the natural way of looking at
these things is a continuous coordinate system.
That's the karyotype of humans.
So you've got these 22 chromosomes and X and Y. They
vary in length from around 250 million letters down to around
60 million letters.
So that's how we'd like to address them, 1 to n, with the
coordinates.
Now in actual fact, the way these things have been
sequenced is as thousands of pieces, or millions of pieces,
depending on the technique.
And so the informatics challenge is organizing that,
handling it.
Quite a lot of the problem is sequence transformations,
being able to handle between these different coordinate
systems. It's not just within a genome, it's between genomes
that are related to each other, and also more exotic
transformations.
You've got the genes which are only
sub-parts of those genomes.
Now, we deliver more than just the sequence.
So here's a piece of sequence.
This is a tiny piece of sequence.
It's less than a millionth of the human genome.
But most of what we do is serve
information on top of that.
So here's one of these views.
It's a small piece of sequence.
The boxes represent things like genes and other evidence
for those genes being there, maybe up to 80 different types
of information layered on top.
And what's the reason for bringing this up here?
Well-- to give you a figure before I get through this--
under 10% of what's in Ensembl is raw sequence.
It's not completely unlike what you do here, putting
annotation on top of maps.
We're putting annotation on top of the sequence.
You're in a 2D coordinate system with Google Maps.
We're in a 1D coordinate system with Ensembl.
So we add gene annotation.
And for gene annotation, all I'm going to say in this talk
is that it's hard, still hard.
We still don't know where all the genes are.
The gold standard right now is manual curation, because the
automatic algorithms produce too much uncertainty.
And although we rely on experimental information from
sequencing the fragments of genes that are actually turned
on in cells, they're very noisy.
So having humans check that is the most reliable thing.
So one thing we do is produce sets of genes which
are used pretty much around the world.
This is why it's hard.
Because in bacteria, you've got a continuous--
just genes made up of one unit.
In higher organisms, it's fragmented, and that makes it
much harder to identify where the genes are.
So I won't say any more about gene building in this talk.
Ask me about it later if you want.
The other thing we do is comparative genomics.
I've already said that we have a pile of
different genomes in there.
And so we can calculate the relationships between those,
look at the rates of evolution between the genes and other
parts of the genome.
We can also add information such as the
variation within a genome.
So there's one human genome, but there's six billion of
you, and you've all got individual
variations within you.
And there are population structures in those variations,
which can be related to disease.
So it's interesting to store that information and make it
available to other researchers.
And then for all this, it's infrastructure.
In some ways, we're just organizing this data in a
way where other people can get access to it.
And some of that we do with the website.
Some of it's via API, via an open-source environment.
So the API just means that--
as I said, you'd like to be able to address the thing 1 to n.
Coding to the API means you can do that.
Although it's made up of lots of fragments, you can address
the whole thing here.
The middle thing is saying, fetch the whole of chromosome
22 as a virtual fragment, which you're going to do
calculations on.
That API is now quite flexible, quite extensive.
It allows you to compute across different organisms.
You can project information for one coordinate system or
one genome onto another one.
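[A minimal sketch of the kind of call being described here, using the public Ensembl Perl API; the database host and adaptor names shown are the standard public ones, but details vary by release.]

    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    # Point the registry at the public Ensembl database server.
    Bio::EnsEMBL::Registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );

    # Fetch the whole of chromosome 22 as one virtual fragment (a
    # "slice"), addressed 1 to n, even though it is stored as many
    # pieces underneath.
    my $slice_adaptor =
        Bio::EnsEMBL::Registry->get_adaptor( 'human', 'core', 'slice' );
    my $chr22 = $slice_adaptor->fetch_by_region( 'chromosome', '22' );
    printf "chr22 is %d bp\n", $chr22->length;

    # Project chromosome coordinates onto another coordinate system
    # (here, the clone level), one of the transformations described.
    foreach my $segment ( @{ $chr22->project('clone') } ) {
        printf "%d-%d maps to %s\n",
            $segment->from_start, $segment->from_end,
            $segment->to_Slice->name;
    }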
You can also layer information.
We just put in a lot of data from different mouse strains.
People do experiments on different mice, and so it's
relevant to be able to look at it.
We haven't got the complete sequence of these different
mice, but we can project between them.
And you can access that information either via the web
displays or programmatically.
So conceptually, behind Ensembl--
complete openness.
The code's available under a BSD license.
We make all the data dumps available.
We code in a single common CVS repository. It's about 30 people--
well, there's another maybe 20, 30 people who put code in
internally, and some external people.
It's based on MySQL with Perl and Java APIs, different
layers of objects, so that you have objects for things like
genes, but the way we handle that
underneath can be changed.
And continuous improvement--
so this site gets updated once every two months.
And there's nearly always a schema change at that point,
because we've added some new data, and we've had to extend
something, or maybe completely refactor the way we store
things, because the quantity of data's changed.
We're split up into little sub-teams--
this is not that interesting to you--
and we have a process of handling that release cycle
within the organization.
We have a healthy paranoia, because in terms
of usage, we're probably still the leading one in the world of
the big three.
NCBI does a lot of things, but its browser is arguably less
competitive.
UC Santa Cruz, just up the road-- they have a simpler
browser engine.
They're very much our direct competitors.
So we're worrying about whether what
we're doing is still relevant.
How can we tell if it's relevant?
We're interested in ways of looking at usage, of our API
accesses and web pages.
And I guess that's pretty similar for all
bioinformatics resources.
It's becoming an important question.
If you're going to be funded--
we're supported by the Wellcome Trust-- we need to
justify our existence.
And in fact, all web resources need to justify theirs.
Now that's what we are right now, but there's lots of
interesting things happening.
And that's connected with data.
So this is sequence growth.
So people may think we sequenced the human genome and then
it all went away.
It's not true at all.
The assembled sequence is following pretty much a
13-month doubling time.
It's been doing that for a long time.
It's kept on doing it.
And in the meantime, we've constructed a new archive for
the raw, unassembled sequence, because there's a lot of
things that come out of sequencing machines which
never get assembled into full genomes.
But it's still useful data to search.
That's doubling every 11 months.
And the database for that is currently 35 terabytes.
It's one of the bigger Oracle databases out there.
So this complete sustained behavior that we're seeing--
there are very few things which really work in an
exponential way.
Here we've got, obviously, Moore's law, information
processing computers.
That's been following exponential
growth for a long time.
This is another technology which is doing the same thing.
And basically, the point is that it's just information.
And at the moment, there's really unbounded amounts of
information.
In terms of sequencing, we've just scratched the surface of
what could be sequenced out there in the
sort of natural world.
And up to now, we've mainly been collecting a
representative sequence from a single individual.
But the next thing is going to be collecting that across many
individuals.
And there's a current revolution in
technology going on.
So new technology is available now.
We have development machines in house.
There are a number of companies using new techniques
for sequencing.
This means that a single machine can produce between
100 and 300 times more data, and the costs have already
gone down by a factor of 10, as a result of this new
technology, and are promised to come down much more.
So you can actually do a first pass across an individual.
It won't get you all the pieces, because you're
collecting it randomly, but maybe for around $30,000.
And there's a target to get that down to $1,000 a genome
and get a higher quality than this.
So we're still maybe a couple of orders of magnitude away, but
you can see that it's within sight now, considering how
much it cost to do the first genome.
So we're going to be collecting a lot
more of this stuff.
And we're going to have to organize that.
And future human health research and development is
going to be increasingly dependent upon this data.
And you can kind of see this in a
sort of overview that I've stolen from somebody--
you see the inputs here.
It's this annotation.
You take the reference.
You layer the variation on.
You try and interpret it, work out where the genes are and
look at the variations in those and
relate those to medicine.
And you end up with this complete sort of
understanding, or at least set of things that you understand.
And then you take an individual sequence, and
that's going to be achievable quite soon.
And then you start relating those to the database with the
individual set.
And you won't be able to interpret all of it.
In fact, there's very little we really understand properly
at the moment.
But the amount that we understand will increase
gradually over time as this collective database increases.
You only have to sequence the person once, and then that
will allow you to start understanding more about the
medical consequences.
And it's not all predicting probability about whether
you're going to die of x or y.
A lot of it is quite practical things, like this person
shouldn't be taking drug x because it's
going to kill them.
The fourth biggest killer in the US is adverse drug
reactions, allegedly.
Of course, that's a sum of all drug reactions.
Of course, it's skewed towards--
you know, you're treated with a lot of drugs when you're
about to die.
But it's still a very significant number.
And so it's very relevant to be working on this.
So we have our database.
We're kind of doing the preliminaries of this.
We're starting to pull in resequencing data, which is being
shown on our website so it's available.
We have to work on how to do this in a compressed way.
But we're already up at this massive 380 gigabytes for our
complete database, and that's changing every two months.
And we provide data mining interfaces to that, which
I haven't really talked about.
And they can be aggregated and integrated with
resources over here.
But that's not really the solution, because we only put
things in the data mining interfaces where we know
people want to ask particular questions.
That's how we denormalize the database.
If you really want complete flexibility, then you
can get at the data using the APIs.
But you have this download problem.
So we now provide access remotely:
we host the databases.
People can download the code and talk to us.
And that's becoming an increasing way of people
integrating data with their own data.
But what about beyond that?
Because that's still not completely neutral in terms of
the sort of democracy of integration.
And this is the kind of way I look at it.
So complete genomes provide this framework that we can
organize stuff around.
But we're in a strong position, because we've got
this big database and resources.
We have a lot of power, then, and in fact, that's not ideal.
We'd actually want to have no monopoly on this, because
you'd like anybody to contribute, no matter how
small they are.
The more organizations provide data, though, the harder it
becomes for anybody to use the results.
So how can you address this problem?
And this kind of fits into other sort of openness issues,
which I'll just mention.
So just the general issue of data sharing.
The human genome, because it's been open, has
been quite a driver for this, the idea that you could
release data immediately and make it available to people.
It was also this kind of idea which came up, of course, at
the Google camp here that maybe you actually require
open models.
Vista was just announced a couple of days ago.
People are saying, well, maybe the difficulties in doing that
suggest that centralized projects are
just not very scalable.
Science has always been highly cooperative, but it's not been
so data-rich as it is now, not in biology.
We have to find better ways of handling this.
And then I'll just put this up.
This was in Berkeley;
I saw it a few years ago.
All these arguments about ownership,
patents, things like that.
I've given other talks about this.
But basically, if it's open, it's better.
It's going to be easier to share things.
So there are various problems to solve.
In the community of scientists, we're having to
get used to the idea of sharing data more and
allocating credit appropriately.
But there's also these practical
issues of data sharing.
So we had this camp here, and it raised kind of similar issues.
How do we increase integration and processing bandwidth?
Now bioinformatics--
so we have these databases.
They're kind of linked together.
But there's a lot of databases.
In fact, there's a danger that we have too many databases and
too many diverse interfaces.
It actually reduces impact because scientists just get
lost in different ones.
So can we find a way of splitting data and
presentation so we get competition between those two
mechanisms and find optimal tools for visualization?
So I'm going to introduce this DAS idea, because we're now
using this very heavily, and it seems to
fit within this paradigm.
So what is DAS?
So this is the idea.
So we're a big data provider, and there's lots of people
viewing our data, our service.
And that's kind of a monopolistic view.
Because somebody else has got some data--
it's hard-- they've got to persuade us to put that data
in our system.
And we don't always know the quality of what somebody
else's annotation is.
Should we include it or not?
So external contributors--
they might set up their own database.
But then of course that's a big overhead, to set up
something competing with us.
And probably it means not many people go and look at their
data, even though it might turn out to be valuable.
So the DAS idea is you just serve the little bit of extra
stuff that you've got.
You use some infrastructure to make sure the coordinate
systems are synchronized, and then you make the viewers
cleverer so they can integrate things on the fly.
And once you've done that once, of course, you can have
as many servers as you like.
And users can control this.
They can turn things off.
So this is the opposite to the links model.
Linking takes you into different websites.
But they've got different interfaces.
It becomes harder to compare the scientific data.
The DAS approach is the opposite--
standardized servers, and viewers which just can
integrate and users can choose which ones to turn on.
So it's kind of like a very simplified version of web services.
Web services are fine, but if they all have their own
protocols, then it's a programming
nightmare to use those.
Here we've got a very small set of standardized ways to
pull these things together.
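[To make that concrete, here's a rough sketch of a DAS 1.x features request over plain HTTP; the server host and source name below are invented for illustration, and real sources are listed in the DAS registry.]

    use strict;
    use warnings;
    use LWP::UserAgent;

    # Every DAS source answers the same few commands, for example:
    #   .../das/<source>/features?segment=<chromosome>:<start>,<stop>
    # The host and source name here are hypothetical.
    my $url = 'http://das.example.org/das/hsa_ncbi36/features'
            . '?segment=22:19000000,19100000';

    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get($url);
    die $response->status_line unless $response->is_success;

    # The reply is a standardized XML document (DASGFF) that a
    # DAS-aware viewer can overlay on its own display on the fly.
    print $response->decoded_content;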
And the kind of ways we're using this-- here's an example
to link all the way from sequence to protein structure.
So there's this viewer for structure--
it's got a protein structure there.
It can pull all the stuff out from existing DAS sources and
integrate on the fly and allow you to see things like
variations in genomes mapped onto protein structure, all
pretty transparently.
So I think what's possible using this approach is you've
got a set of data from different providers.
At the moment, we have a set of services.
The big ones are big integrators like us.
But you have these problems like this one here, a small
group that's serving its own data.
It's not going to get much usage, because it's only got
the resources to serve this.
In a DAS environment, maybe it doesn't need to provide
anything at all, because its data can just be integrated.
And then down here, here's somebody who was just using
their own data.
Now they can pull in other people's data, and maybe they
can be a full competitive client by themselves.
So we have quite a lot of that infrastructure in place, which
we're using.
We have around 200 different servers in this registry,
quite a lot of different coordinate systems. And you
can start thinking about servers that build on servers.
So you can have consensus servers that process other servers and
provide simplified views where there's duplication.
So that's where I think the annotation world is going
right now, at least one sub-strand of it.
How can you pull this stuff in together so that then
scientists really can compare different results and try and
work out which ones to believe?
Because at the moment, if they're sitting on different
sites, it becomes very hard to do that.
Now I want to say just something about prediction,
because this is where I've just come from, this meeting.
So that's my background, as a predictor, trying to compute
biology directly.
And you can look at it in terms of these extremes,
pragmatic versus pure.
So in protein structure prediction, this is trying to
work out how the protein structure folds up.
At one end, you've got comparative modeling--
pragmatic, basically saying, we don't know the structure,
but we know that this sequence is related to a sequence where
we do know the structure.
And they're close enough that we can maybe infer the thing.
It's not real prediction.
It's kind of cheating in a way, but it's quite a
practical solution.
At the other end, you'd like to just do some pure physics,
just take the thing.
We know it folds up by itself.
Simulate it.
But simulation doesn't really work.
A few years ago there was this thing called fold recognition,
which was really comparative modeling at a long distance.
And in fact, what's actually proving practical is
fragment-based assembly, which is not quite pure physics, but
it's much closer to real, proper simulation.
And that's being popularized by the Baker group, who are
again successful at CASP this year.
So in my genome annotation thing, it's
kind of pretty similar.
At one end, we have ab initio gene prediction.
It kind of works but is problematic in vertebrate
genome annotation.
Then we have the evidence-based approaches
which I mentioned before, which can be automated with
lower accuracy.
We hoped that comparing genomes would make things
easier, but in fact it doesn't, because there's a lot
of the genome which is actually similar outside the gene
regions, because it's involved
in regulation.
But now we're beginning to learn that one of the reasons
why we're so bad at this is because there's lots of other
factors influencing the structures of genes, lots of
other binding factors which turn things which look like
fragments of genes off, make them inactive.
And so properly to do this, we have to predict motifs.
We have to predict these little
signals and build genes.
That's the way to do it properly.
But maybe that involves a lot of compute.
So if you look at what happened at CASP this year,
it's kind of interesting, the different approaches.
You've got two groups that came out very well, one of
which is not using very much CPU and is using evolutionary
information a lot, different models, different examples,
and merging them, but limited CPU.
And then you've got the Baker approaches, which are directly
fragment based.
And it's quite costly in terms of CPU.
It's very costly if you want to refine things.
But the refinement accuracy can be very good, getting
quite close to crystallography for at least a small set.
So it's definitely making progress,
but the cost is enormous.
So for one 1.8 angstrom structure, as an example, they quoted a
million CPU days.
I'm not quite sure how precise that is, but it's a
distributed model on 100,000 nodes.
So where are we going to go with the genome?
Because we're collecting this data.
We're integrating all these different bits of evidence.
But ultimately we're going to understand it at the level
where we can interpret individual mutations.
We're probably going to have to do similar sorts of very
CPU-intensive things.
So that's where I think ultimately we'll have to go.
For the moment, the value comes from integrating data.
So I'll just acknowledge a load of people.
These are all the different groups--
the people involved in Ensembl, the different
groups, the annotators, and the people who support that
operation at Sanger.
So I'll stop there.
Thanks very much.
OK.
Yeah?
AUDIENCE: You mentioned the continuum of proteins between
the comparative modeling approach and molecular
simulation.
I think you mentioned this too.
Of the kind of [UNINTELLIGIBLE] strategies
where you can match up--
it's hard to find another protein where the entire
sequence is similar, but you can find other proteins where
you know the structure.
And little pieces of each of them are similar.
And so this is more like an area where you can match up
those pieces, and then suppose each of the [? set ?]
components [UNINTELLIGIBLE]
and then they do an optimization over fewer--
TIM HUBBARD: That's kind of the fragment approach.
Although the fragments could be quite small.
AUDIENCE: Could you repeat that for the other offices?
TIM HUBBARD: OK.
So this was a question about assembling protein structures.
So the question was whether there's an intermediate
approach of taking matching fragments of structure and
putting them together.
And that kind of is the fragment approach that I
mentioned, although the fragment approach goes to very
small fragments--
three residue pieces and nine residue pieces.
Different groups use slightly different
approaches, and some of them--
there are hybrid strategies, yeah.
AUDIENCE: So the DAS model is based on an agreed-upon XML
description?
TIM HUBBARD: Yes.
AUDIENCE: So how did the larger community arrive at
that XML description?
And has that been agreed upon?
Can you talk about some of the--
TIM HUBBARD: So this was proposed by Lincoln Stein.
Yeah, sorry.
So there's a question about the DAS protocol, how it was
agreed upon.
So it was basically proposed by Lincoln Stein, who had the
first grant to set up the first of the client server
libraries for this and specify the protocol.
And it's been basically evolved, this protocol.
We've extended it in various ways to handle other data
types, but it's still quite compact.
There is an effort to make a DAS 2 specification, which
will be more flexible, will handle more metadata properly
and handle things like searching.
Because one of the things you want is distributed searches across
these sources.
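[For flavor, this is roughly the shape of the agreed-upon DASGFF XML returned by a features query, abridged; the feature values and URL here are invented.]

    <?xml version="1.0" standalone="no"?>
    <!DOCTYPE DASGFF SYSTEM "http://www.biodas.org/dtd/dasgff.dtd">
    <DASGFF>
      <GFF version="1.0" href="http://das.example.org/das/hsa_ncbi36/features">
        <SEGMENT id="22" start="19000000" stop="19100000">
          <FEATURE id="exon.1001" label="exon">
            <TYPE id="exon" category="transcription">exon</TYPE>
            <METHOD id="curated">curated</METHOD>
            <START>19000500</START>
            <END>19000800</END>
            <ORIENTATION>+</ORIENTATION>
          </FEATURE>
        </SEGMENT>
      </GFF>
    </DASGFF>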
So most of the models--
things like Google Maps, I think, are based on the idea
that you deposit your annotation in a central place.
Whereas this model is-- of course, you could set it up
like that, but you have servers
scattered around the world.
And each center hosts that server, and it's up to them to
keep it up.
With our central registry, we do monitor whether servers
are still up and warn people if they've gone down,
things like that.
But it's a much more fine-grained, distributed model.
So in terms of the evolution, basically it was imposed.
And it's gradually being adopted.
AUDIENCE: These competing providers that you mentioned
earlier do not use this protocol?
They do not have this approach?
TIM HUBBARD: So NCBI certainly doesn't do this right now.
UC Santa Cruz does have some DAS capabilities now, although
they have another protocol for uploading data directly,
which, in fact, we have as well.
You can upload data and we'll run the DAS server for you.
So that you can--
it's integrated, but it also appears externally as DAS.
AUDIENCE: I have another [UNINTELLIGIBLE PHRASE]
question.
You mentioned earlier that you've refactored your entire
data representation at certain intervals.
TIM HUBBARD: Multiple times.
AUDIENCE: Multiple times.
So how do you do this without interrupting your users' work,
and has the API, from the user's or client's perspective,
remained the same throughout?
TIM HUBBARD: So this is a question about Ensembl and its
development strategy.
So the refactoring--
yes, the refactoring goes on continuously.
But there have been major points.
So we have a number which gets associated--
a release number.
The release number, in fact, always used to be the schema
version number.
And the problem was that it changed so many times that
it's kind of become the version number.
So that's where we are.
We're at version 41 right now.
So there was a big change between version eight and
nine, and between 19 and 20, where we really restructured
some things.
A lot of the other stuff has been relatively minor addition
of columns and things like that.
How do we handle it?
Well, you get into this culture: you have 30
genomes, and we don't update the data in these genomes
every month.
But quite a lot of them change.
And some of the common databases, the comparative
genome databases, have to be recalculated every time.
So in every cycle, either the data gets updated or the
schema gets converted.
So in the distribution, there are SQL files for schema
conversion for every point.
So if you go through our CVS tree, you'll see a whole load
of patch files.
And so people who've downloaded stuff can also
patch their systems, because people externally are also
using the Ensembl framework to store genomic data.
And of course they want to keep migrating and keep up to
date as well.
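[An invented sketch of the kind of thing one of those schema-conversion SQL patch files contains, an ALTER TABLE plus an update of the stored schema version; the table and column names here are made up.]

    -- Add a column introduced by the new release (illustrative only).
    ALTER TABLE gene ADD COLUMN status VARCHAR(20) DEFAULT NULL;

    -- Bump the recorded schema version so the API can check it matches.
    UPDATE meta SET meta_value = '41'
        WHERE meta_key = 'schema_version';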
I think it's a cultural thing.
In terms of supporting these sort of things--
we have 30 engineers working on this.
There's a kind of feeling that a lot of infrastructure isn't
well enough funded.
And it's only got enough funding to just keep going.
We have enough that we can keep going, but we can also be
reengineering things on the side.
And that's what you need to keep things moving.
You have to have enough resources
for those two things.
And we do get pressures.
We used to do a one-month cycle and that just became
unworkable, because everybody was running the cycle rather
than doing any new development work, but two months seems
more sustainable.
AUDIENCE: Is there a standardization effort
underway for these protocols?
TIM HUBBARD: For which ones?
AUDIENCE: [UNINTELLIGIBLE] protocol, or the DAS protocol?
TIM HUBBARD: So the DAS protocol is kind of standard.
But because it's XML, it can be extended.
So we've added extensions and other people can propose
extensions.
There's a proposed extension for interactions between
objects, biological objects, so there are lots of cases where
people publish a list of proteins which interact with
each other.
And that's another thing where you'd like to integrate in
this framework, be able to see all these different opinions
integrated in one client.
So it's an ideal thing to work this way, rather than relying
on people publishing their own integration, which of course
only takes the data that's available at
that particular moment.
AUDIENCE: You were tracking--
I'm sorry to keep asking these questions.
TIM HUBBARD: That's all right.
AUDIENCE: And just to review, you said you were tracking a
lot of user data.
[UNINTELLIGIBLE] access them.
You didn't show any of the--
TIM HUBBARD: OK.
So user data access.
We have our web logs, basically.
And we know that the usage goes up, and
we have user surveys.
And we look at publication records,
where we've been cited.
But that's all, right now, and one question for us is exactly
how much more we should do.
And in a wider infrastructure question--
I'm on various European infrastructure committees--
over here, things are quite well funded, actually.
The structure of funding of NCBI and other
resources is quite stable.
And we don't have this in Europe right now.
It's not as stable.
But government's always worried about funding things
on an ongoing basis, perpetually, without having
some measure of value.
And so working out what the value is of these resources,
and whether they're still valuable, whether they should
be merged or closed down or things like
this, it's a real question.
And so working out not only how to integrate data, but then
who contributed what, is, I think, quite an
important question in terms of the sort of long-term funding.
Yep.
AUDIENCE: So if I had protein structures, can I use DAS to
annotate genomes with the structures?
TIM HUBBARD: So you could set up-- so if your structure--
I was talking with some people at CASP about this.
So there's quite a few databases of models, for
example, protein structure, around.
And most of them can be related to UniProt sequences,
which are standard protein sequences.
And so there's all kinds of annotation out there against
UniProt sequences.
So if you make the connection between the two, yes, you can
use DAS to display the annotation on any of these
models that have been constructed using protein
structure prediction.
AUDIENCE: [INAUDIBLE]
TIM HUBBARD: No, but they were talking quite
enthusiastically about--
some of the people were talking about doing that.
So yeah.
I think within Europe, there's this thing called BioSapiens,
which is an EU-funded project to link a load of different
bioinformatic people mainly working on protein sequences.
And they've adopted DAS as their mechanism
for exchanging data.
And that seems to be quite successful and is
actually providing an interoperability layer between
all those groups.
So a lot of the sources are coming from those groups.
Thank you.