[Vince] Okay, so now we're going to switch to the first of our keynote talks. It's my absolute pleasure to welcome Donald Hobern. Many of you will be familiar with Donald; he's been involved with GBIF for a long time. He recently took on the role of its executive secretary, so he essentially runs the entirety of the GBIF program, and he is, and has been, hugely influential in the development of the whole field of biodiversity informatics.
We gave him this rather challenging title of "GBIF and the Biodiversity Informatics Landscape", so I'm looking forward to hearing what Donald has to say. Donald Hobern. [claps]
[Donald] Thanks very much indeed, Vince. It's a great pleasure to be here. Certainly the Natural History Museum, and all it represents, is at the heart of what GBIF has always been and what it's always aspired to be, although I believe we've got decades of our own to go to get to where we would really like to be.
I've been given this grand title and I've got about 20 to 25 minutes to try and cover things, so I will be brief and focus on just a few things. First of all... does this work? Okay, good.
Since the title includes GBIF, and I'm here representing GBIF: I assume most of you are aware of the Global Biodiversity Information Facility. It's been in existence since 2001. It represents a collaboration between a significant number of governments and international organizations with the mission of mobilizing biodiversity information.
Now I believe that really this vision, that biodiversity information should be freely and universally available for science, society and a sustainable future, should relate to everything the human race knows about the world's biodiversity, and that clearly goes a long way beyond the areas that the GBIF work programme covers.
I think it's fair to say that, apart from certain roles which it plays in supporting the mobilization of species checklists, in collaboration with activities such as the Species 2000 Catalogue of Life, the vast majority of what GBIF has done has been to focus on one specific problem, which is the organization of the evidence we have for the recorded occurrence of any species at any time and place. It's a little bit loose, in that our paleo data is almost non-existent and we've not taken that seriously, but that is effectively the goal we've focused on.
In particular, going back to the origins of GBIF, the focus has always been on natural history collections as being the repository for vouchered and therefore, in principle, high-quality data. That includes not only contemporary observations of where species can be found but also, very importantly for many of the questions that have been indicated in his slides, the kind of information that helps us to look back into the past and understand a little bit more about the distributions and patterns for species.
So right now this is the distribution of GBIF participants. Ignore the colors; it's all to do with governments. However, as you can see, we have very broad participation across western Europe and the Americas; Australia, New Zealand and South Africa are very strong; there is a scattering of other countries, but still some very significant gaps.
And that's reflected, after 12 years, in the actual coverage of data. Talking about this as coverage is immediately glossing over many things, because even though we're here now talking about more than 400 million records of some species having been recorded at some time and place, with some kind of evidence for that assertion, for many species it's just one or two records. In many cases there may be records here which have never been identified down even to the genus level, let alone to species. But it gives an indication of the spread that we're now seeing.
And the patterns become more obvious as you move in close: you see shipping lanes, you see grid-based survey activities, you see centers of population, roads and national parks, et cetera, as very obvious centers of interest.
I think the situation that GBIF finds itself in after twelve years is that it is now a proven infrastructure. It has a global set of practices and a global network that is increasingly effective, and recognized as effective, in managing and organizing this type of data from many different sources, supporting very different communities and different language groups in contributing to a globally integrated view.
However, there is a great deal more to do, and I'm going to spend most of the rest of this presentation talking about some of those aspects. Issues around quality, around provenance, around understanding just how much evidence there is, understanding who the authority is for the identification for a particular record, the ability to know whether or not the coordinates are reliable or whether they're based on somebody guessing, potentially, what a locality name represents: all of these are critical issues.
So is understanding the meaning of the statistical significance of data brought together from so many different sources, some of it very ad hoc, as I would argue much collections-based data really is, and some of it much more systematically organized through field activities such as those we're seeing in long-term monitoring projects. All of these have different strengths and weaknesses, and I don't think yet we've solved all the problems as to how we deliver that information robustly to users in a way that helps them to make the best possible judgments about these things.
However, on a positive note, later this year, in October, GBIF, similarly to some of the things that have been described here for the museum, will be revamping its own portal and its web infrastructure. These are snapshots of the current (it's somewhere between alpha and beta at the moment, I think) website that we've been developing. It offers basically similar sorts of functions to those that you find right now on the GBIF website, although with much more scalability in zooming on the maps and, probably more interesting for most researchers, much more flexibility in being able to download data for various kinds of analyses and purposes.
Up to now the GBIF network has really ended up, de facto, placing quite significant restrictions on how easy it is, in particular, to get larger datasets out of the system, and a lot of these problems can be solved.
From the standpoint of those who are publishing data, there will also be a much more rapid response on the GBIF side for seeing those changes reflected in the data index and being discoverable. The goal is that if somebody registers a dataset in the GBIF network, it will immediately start being indexed, and within a matter of minutes, sorry, or hours, it'll be possible for the publisher to see whether the data have been interpreted in the way they intended, and potentially to fix things and upload again.
So, a lot of rather technical improvements to the project, which I think will make a big difference for many users, but I think it still only scratches the surface of some of the real problems.
So, the problems. Let's take this example here. Imagine a weevil in a collection, collected somewhere, and a curator, as part of that twenty million specimens, entering the label information for the specimen into an institutional system and sharing that through something like the GBIF network.
A distribution modeler may find this record through GBIF or some other route and decide to use it as part of some view that they're building, perhaps trying to understand whether or not this species is likely to spread or be impacted under climate change scenarios.
Somebody else may recognize that in fact those coordinates are nowhere near Germany, and so may just reject the record. They probably won't notify, or may have difficulty in notifying, the source institution of the fact that they found a problem.
Somebody else may recognize that in fact it's just that the coordinates have been transposed, fix it for their own purposes and go on happily, but still with no benefit, no improvement in the source data record.
And then of course if a taxonomist comes in and recognizes that this beetle was misidentified all along, everybody's been wasting their time, if they've really been that narrowly interested in a specific species. And none of this, again, typically in today's world, at the very least, makes a quick change to the view we have of the data.
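The checks that the second and third users make by hand in this example (the point is not in the stated country; the point would be in the country if latitude and longitude were swapped) can be sketched in a few lines. This is a minimal Python sketch, not anything GBIF is described as running: the `Occurrence` class, the function names and the sample record are hypothetical, and the bounding box for Germany is an approximate, hand-entered assumption where a real check would use proper country polygons.

```python
from dataclasses import dataclass

# Approximate bounding box for Germany. This is an illustrative assumption;
# a real check would test against actual country polygons.
GERMANY_BBOX = {"min_lat": 47.2, "max_lat": 55.1, "min_lon": 5.8, "max_lon": 15.1}


@dataclass
class Occurrence:
    """A hypothetical, heavily simplified occurrence record."""
    species: str
    country: str
    latitude: float
    longitude: float


def in_bbox(lat, lon, bbox):
    """True if the point lies inside the bounding box."""
    return (bbox["min_lat"] <= lat <= bbox["max_lat"]
            and bbox["min_lon"] <= lon <= bbox["max_lon"])


def check_coordinates(record, bbox):
    """Classify a record as 'ok', 'transposed' or 'outside_country'."""
    if in_bbox(record.latitude, record.longitude, bbox):
        return "ok"
    # Swapping latitude and longitude puts the point in the stated country,
    # which suggests the two values were transposed at data entry.
    if in_bbox(record.longitude, record.latitude, bbox):
        return "transposed"
    return "outside_country"


# Hypothetical record: labelled Germany, but with latitude and longitude swapped.
record = Occurrence("Curculionidae sp.", "Germany", latitude=10.0, longitude=51.5)
print(check_coordinates(record, GERMANY_BBOX))
```

A check like this only detects the problem; as the example makes clear, the harder part is getting the finding back to the source record.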
And I believe that this reflects a world which I would characterize, from Vince's past presentations, as a paper-based knowledge base: a world in which what we're doing is taking some of the analog information and putting it up in the window for people to see, but not yet being serious about a full-scale shift from that analog world to a truly digital one.
What I'd like to suggest is that, in my own view, over the next few decades our focus should be on enabling a globally connected digital knowledge base for all of that analog information. That is clearly an enormous and expensive challenge, but I think we have many of the tools in our hands right now for going down that path.
If we think more broadly of the world in which we live, we think of financial markets, medical systems, obviously climate and weather. We live in a world where we expect up-to-date predictions and models, based on historical and contemporary data and trends, in order to inform all kinds of aspects of our lives. Our governments expect these things; industry and corporations do as well. We live in a world which is increasingly driven by models that are based on digital data.
As has been said, for all the challenges and the questions that I think may rightly be asked about what is realistically going to be feasible for building full-scale models of entire ecosystems and of life itself, there are some first-order approximations we can make, or at least some rudimentary starts we can make, towards being able to represent some aspects of life in a way that could inform decisions by our governments for planning, could inform intergovernmental activities such as the CBD, and indeed help to drive whole new questions at the level of ecology and taxonomy itself.
So if we turn it around and think of a different world: in this world, yet again, the curator somehow places a digital record out for the world to see, based on the specimen in the collection. Here the only difference you can see is that I've given a version number to this digital record (in this particular case it's not pretty enough) and a little dotted space which indicates that somehow everyone sees this and everyone can work on this record.
So the first thing is that somebody uses this as it is; they perhaps go ahead and build a model which may have some erroneous basis because of it.
The second user does detect that there was a problem and flags the issue. They may not yet realize exactly what the problem is, but they know that this doesn't fall within the specified country. It could be an automated tool that gets us to this point. And so we have a version of the record which says: this is what the record says, but an issue has been reported.
The third user may come in, and this is highly simplified (it could well be that there were workflows, a confirmation, "yes, you're right, the coordinates were mixed up"), but we're moving towards an improved view of this data record.
And then when the taxonomist reidentifies it, or when subsequent work takes place in the institution, say this specimen gets tissue sequenced, all of this is somehow seen as connected, accessible, and immediately improving the digital representation of that analog world of the collection. What is suggested here is that nothing is lost: somebody finds out something, somebody records an item of information, and somehow that needs to be visible and accessible to everyone else.
I believe that there are presentations later today on Linked Open Data, which would certainly be one of the paths we could look at for achieving some of these things without necessarily going down some super-GenBank type of path of having to put all information in some primary form in one place.
And the beauty of this is that if we get this right, and if we can reference everything, then all of the secondary products, the models, the monographs, the subsequent interpretations of the data that follow on from what's been recorded, can themselves be connected, discoverable, traceable, and people can follow these paths back. In this case the modeler remodels, hopefully this time for the right species, and ends up connected.
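The versioned record described above, where every flag, correction and re-identification becomes a new version and nothing is lost, can be illustrated with an append-only history. This is a minimal Python sketch under stated assumptions: the class and method names are hypothetical, the field names are borrowed from Darwin Core for familiarity, and the species names and identifiers are placeholders. It does not represent any existing GBIF service.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Version:
    """One immutable version of a record, with who changed it and why."""
    number: int
    fields: Dict[str, object]
    author: str
    note: str
    issues: Tuple[str, ...] = ()


class DigitalRecord:
    """Append-only record: earlier versions are never modified or removed."""

    def __init__(self, record_id, fields, author):
        self.record_id = record_id
        self._versions = [Version(1, dict(fields), author, "initial publication")]

    @property
    def current(self):
        return self._versions[-1]

    def history(self):
        return list(self._versions)

    def flag_issue(self, author, issue):
        """Record that a problem was noticed, without changing any field."""
        cur = self.current
        self._append(cur.fields, author, f"issue reported: {issue}", cur.issues + (issue,))

    def propose_change(self, author, changes, note, resolves=None):
        """Add a version with updated fields, optionally clearing an open issue."""
        cur = self.current
        issues = tuple(i for i in cur.issues if i != resolves)
        self._append({**cur.fields, **changes}, author, note, issues)

    def _append(self, fields, author, note, issues):
        number = len(self._versions) + 1
        self._versions.append(Version(number, dict(fields), author, note, issues))


# Hypothetical walk-through of the example from the talk.
rec = DigitalRecord(
    "specimen-0001",
    {"scientificName": "Weevil species A", "country": "Germany",
     "decimalLatitude": 10.0, "decimalLongitude": 51.5},
    author="curator",
)
rec.flag_issue("user-2", "coordinates outside stated country")
rec.propose_change("user-3", {"decimalLatitude": 51.5, "decimalLongitude": 10.0},
                   "coordinates were transposed",
                   resolves="coordinates outside stated country")
rec.propose_change("taxonomist", {"scientificName": "Weevil species B"}, "re-identification")

for v in rec.history():
    print(v.number, v.author, v.note, v.issues)
```

Because each version keeps its author and note, a model built on version 1 can still be traced to exactly what it used, while later users see the corrected view.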
What I wanted to think about in the remaining minutes I've got are some of the most foundational things we need to get right in order to get to this point, because the technologies are not necessarily that hard. There are many challenges that each institution, each organization faces itself around how it overcomes the obstacles, or the degree to which it is ready, to go down the path of participating fully in a digital, and particularly a free and open, digital world, which I will be arguing is particularly important. But there are certain things that we need at a global level to start thinking about buying into, and I'm going to talk about some of those.
I'm not going to talk about data standards, because they're even more boring than some of the other things. Many of you have sat in data standards development workshops. It's important, and there's a lot more work still to do in that area, but it's something that our community as a whole has actually been pretty good at working on over the years. When I look at the state of data standardization even in some fields that were totally born digital, I'm always amazed at how much the taxonomic and collections community actually has got right.
But I think about some of these other things. First of all, persistent storage.
This is about the most worrying thing, I think: of the 405 million records in the GBIF network today, there is a significant proportion, and it's in at least the eight digits, that are no longer accessible on the web except in the GBIF index. These are datasets that were put online and made publicly available through the GBIF data sharing agreement.
The intention at the time that GBIF first started was that GBIF would really just give federated search across many datasets, and certainly we still are not in a position where GBIF itself regards itself as a long-term archival repository for anything. But realistically, if you want to find those records, the place you can get them from is our data index, with all the loss of detail that that may imply in comparison with the original datasets from which they came.
And that, I believe, is the most fundamental challenge we face. Whether it's in a Scratchpad, whether it's digitizing a natural history collection, whether it's taking a photograph, a sequence, whatever else: if we have some digital artifact, some object that represents a set of facts or assertions, measurements, observations about biodiversity, we want to be able to refer to it in a reliable way into the future, we want to be able to retrieve it, and we want to be able to curate and improve it in any ways we can.
And GBIF has struggled with this over the last better part of a decade, thinking about how to establish a workable approach for persistent identifiers for data records and datasets and other artifacts. But in fact, while we can do a lot to put some mechanics in place, the real issue remains the question of whether or not the digital objects themselves have been placed in some kind of institutional or national or, I suppose, collaborative environment in which there is some real confidence that they will still be retrievable in a complete form in 10 years, twenty years, perhaps even many more decades.
This is an issue which perhaps our science peculiarly faces, in that we care about the history of very specific objects from decades and centuries past. But somehow, if we're interested in these data, if we believe these data have relevance, we've got to get more serious about finding ways to get real guarantees for that long-term archival. Once we've done that, then we can start referring reliably to the objects, curating them, and treating, as in my slide earlier, every annotation, every correction somehow as a proposed new version for the same data record.
A second thing, which probably overlaps with things that Vince was talking about, is the culture around reuse, when expressed in terms of data licenses. Data licenses, I think, are really important. The world is increasingly familiar with Creative Commons licenses; there are also Open Data Commons licenses that are more specifically targeted towards digital assets and databases.
The thing that is important here is that unless we have the ability, for any arbitrary object, each data record we find on the web, to know how liberal the publisher intended to be in making that object available... I've lost my negatives in this, but we have to understand what we are entitled to do and what our obligations are in making use of this object. Some things are effectively ceded to the public domain; others are under licenses that indicate that at the very least attribution is required, and that probably reflects good scientific practice anyway. There's a piece this month in Science on that subject.
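The point about knowing, for any object, what you are entitled to do and what your obligations are becomes concrete when each dataset carries a machine-readable licence identifier. The following is a minimal Python sketch, assuming SPDX-style identifiers and a deliberately simplified table of terms; the dataset names, publishers and the `summarize_obligations` helper are hypothetical, and the table is an illustration rather than a statement of what each licence legally requires.

```python
# Simplified terms for a few well-known licence identifiers (SPDX style).
# This table is an illustrative assumption, not a legal interpretation.
LICENSE_TERMS = {
    "CC0-1.0": {"attribution": False},      # public domain dedication
    "PDDL-1.0": {"attribution": False},     # Open Data Commons public domain
    "CC-BY-3.0": {"attribution": True},     # Creative Commons Attribution
    "ODC-By-1.0": {"attribution": True},    # Open Data Commons Attribution
}


def summarize_obligations(datasets):
    """Collect who must be credited and which datasets have unclear terms."""
    summary = {"attribute_to": [], "unknown": []}
    for dataset in datasets:
        terms = LICENSE_TERMS.get(dataset["license"])
        if terms is None:
            # No recognised identifier: the user cannot tell what is allowed.
            summary["unknown"].append(dataset["name"])
        elif terms["attribution"]:
            summary["attribute_to"].append(dataset["publisher"])
    return summary


# Hypothetical datasets aggregated for one analysis.
datasets = [
    {"name": "Dataset A", "publisher": "Museum A", "license": "CC0-1.0"},
    {"name": "Dataset B", "publisher": "Museum B", "license": "CC-BY-3.0"},
    {"name": "Dataset C", "publisher": "Society C", "license": "see our website"},
]
summary = summarize_obligations(datasets)
print(summary)
```

The third dataset shows the failure mode: without a recognised identifier, the only honest answer a tool can give is "unknown".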
At the same time, though, we find that many digital objects are published with free-text restrictions as well as those that are encoded in digital licenses. When users seek to aggregate information from dozens, perhaps hundreds, of different sources, realistically understanding whether a particular institution would think that what they were about to do with the data was somehow for a commercial purpose, or having to hunt down some additional documentation on the institutional website, is an enormous challenge at the very least.
Commercial versus non-commercial is probably the biggest pitfall we face, because it is exceedingly difficult to define, and much that goes on in our world that is effectively, in the eyes of most people, not-for-profit could probably, from the standpoint of a lawyer, still be considered commercial. So there are a lot of issues we've got to solve, but we really need to be in a good position to understand how we can use all of the objects safely and reliably, in line with the intentions of the publishers.
Now the third thing I just wanted to emphasize. I'm going to use a couple of citizen science examples here, not because I couldn't have done something similar by going and picking Scratchpads as examples, but because I think this suggests a few things of real importance. If we are going to do this right, if we are going to care for, to curate, the world's knowledge of biodiversity, we need to be able to mobilize all the expertise there is, wherever it is, and to be able to make good use of that expertise according to just how expert it really is.
There are now thousands of amateurs in the UK who could probably reliably identify ninety percent of the macro moths in this country from a photograph. There are the others which are difficult, but ninety percent probably without too much difficulty. That expertise is replicated also in many other northern European countries.
At the same time there is typically a greater expertise that is represented in what is known by the taxonomists on a specific group, by many field ecologists and others. Somehow what we really need is a mechanism to allow everyone to stand around the data and make their contributions and corrections in a way that isn't a whole new task upon them but somehow fits in with their day-to-day activities and actually improves their own work, but which ties back to who made this assertion, who's agreed with that assertion, and therefore whether or not, in a moderately automated preliminary way at least, we can trust those kinds of things.
The iSpot site, I'm sure many of you are familiar with it; there are certainly quite a few museum staff who are involved in some of the groups on this site. It has established something of a self-tuning mechanism for understanding something of the confidence around different contributors' expertise, specifically in the area of identification.
I think it's interesting for us to think about what would be the mechanisms that we would need to have in place in a decade (I don't see this as coming quickly) so that a taxonomist in Russia, a field naturalist in the UK and an ecologist in some other part of the world could actually work together on curating the data on a particular group of species, for example. How would that work, and how would they be able together to build a more and more trusted version of the truth?
Just to indicate that, even though citizen science projects by and large relate to recording records of occurrence of species, certainly looking at the iNaturalist site, which is another primarily US-based site (some of these tools are completely hidden away; a lot of work has gone into some of these things), they've already started to implement tools for at least those splits that are geographically easily addressed within the taxonomy of a species.
So the grey fantails in Australasia are recently, according to Christidis and Boles, two species across the Tasman. And so they've gone in and added a mapping tool that said "this species occurs here", and, as you probably can't see, the observer is then being prompted to say whether or not they confirm that this automated change of their records to the new taxonomy is accepted.
These are small items, and Scratchpads in various ways have also been working on solving many of these kinds of problems, and you could imagine many of these things being plaited together in interesting ways. But somehow the ability for somebody to record the details of a taxonomic split, whether or not we've accepted it, and for the implications on the data then to be sifted by the community, needs to occur.
Now I've almost finished, but I'm going to show a rather
dull but, I think, important diagram. Vince mentioned the report which, I apologize to anybody with an interest in this, has taken a year since last July's conference to produce (that's why this slide says 2012). In some ways it's not a report; it's more of a framework document for biodiversity informatics, but it will indeed be released in the next few weeks.
We brought together around a hundred experts in various aspects of biodiversity science, informatics and policy, and spent three days looking at the kind of problems that we wished biodiversity informatics to solve, including the high-end things like modeling the biosphere, as well as the Aichi targets. We tried to think about what would be required to take the bits and pieces we have today in projects like GBIF, Catalogue of Life, Encyclopedia of Life and institutional activities, and put them together into something that might actually start to address some of those questions.
In doing so we've really come up with four broad areas of focus. Each of these has multiple building blocks in it, and I'm not going into the detail right now. But the things that I've been talking about just now, the licensing, the persistent storage, the culture of being able to work together in a collaborative fashion to curate an open, shared pool of biodiversity data and knowledge, is what's here referred to as infrastructure. It's this big box because it involves everything else. We need this to be the culture in which we live for everything.
The second big area that we need to focus on, and the 20 million specimens, which excites me very greatly, is a part of this, is taking the analog assets we have today and turning them into digital forms. It's not necessarily worrying about some perfect ontology of everything into which we map things; it's starting the path of getting into the digital and accessible, with some interpretation of the fields that can be interpreted, but in a form that makes it easy for taxonomists, or for those managing the biota of a particular area, to continue to curate and improve that stuff. This is about the primary data in whatever form it is made available.
But some of the things that GBIF is involved in today, and I would argue that this applies equally to the Catalogue of Life, the Encyclopedia of Life and many others, are about trying to provide a particular lens into all of that formerly analog information, a particular view for a particular purpose.
In GBIF's case, like I said, it's trying to organize the evidence that we have for what species have been recorded at what time and what place, and what the basis of evidence is. Is there some material that could be revisited? Is there a DNA sequence? Is there a photo? Or is it just an observation, which may have been recorded by somebody who really knew the group, but, you know, what's the scale of evidence? And it's organizing this to provide the best possible view of the available evidence.
Taking the climatologists' analogy, which was on that page from Drew Purves's paper, we're still not at that stage. This is really about the cleaning, the organizing and the managing of the primary data in a form that makes it more accessible and more readily available.
If all that primary data is made available under liberal licenses that allow other people to mash it up and reuse it in various ways, then it doesn't just have to be GBIF that finds the good way to provide a lens like this that helps us with questions; anybody could join things together in whatever ways they wish.
But when we can have good views, good organized views that link back to the source of the evidence, then I believe we can start thinking about how we build models, and that's really where the climatologists have gone. They spent ages cleaning the weather data, but now that's a training dataset for all kinds of models.
In the same way, we can move on from knowing that, in my example earlier, the weevil [species name unclear] has been recorded in these Western European countries, while we've got next to no data from some others, including the UK, bizarrely, for that particular species. We can use that to build the best possible model of where we believe the species is actually found, and continuously correct and improve that; do the same for all other species; and start to understand, based on phylogeny and other data, what are likely to be the interactions between those species and what are likely to be the effects of new invasives coming into those environments. I think we can start going down some of those paths, but it's a tiered activity.
Just the last clicks. When we get up to that level of models is the point when I think we start really seriously intersecting with other domains: environment, climate, sociology, et cetera. What comes out the top of a system like this is really the set of answers and responses to the likes of the CBD and IPBES and others looking for assessments and indicators that can help them to make judgments. So I'll stop there. Thank you very much.
[Vince] Donald, fantastic. So you've really covered a huge amount of ground there. So, any questions?
[Audience question, inaudible]
[Donald] I'm certainly trusting that there will be. I see IPBES, the Intergovernmental (interim or intergovernmental still) Science-Policy Platform on Biodiversity and Ecosystem Services, which now has a developing secretariat in Bonn, as somehow a new stage of maturity in these kinds of things. Therefore, in relation to GBIF, I certainly see it as a forum for us to make some serious changes in the degree to which assessments are based genuinely on revisitable data, and in the ability therefore for others to assess and respond to those assessments.
I mean, I'm certainly waiting to see how the work programme pans out. From my standpoint, I look at it from the bottom up; as I said, I'm sitting in some of the bottom two tiers of that structured block. I think it's a great opportunity, but it is challenging, because you get to a level with such a large group of interests trying to work together that actually getting concrete actions is difficult.
In some ways I still see GEO BON as a more plausible place for us to start demonstrating some of the things that I wish IPBES would pick up, but even that is a real uphill battle to get to concrete activities there. So I'm quietly optimistic, but waiting to see.
[Vince] Any more questions?
[Audience] That was a very interesting talk. I was just thinking about [unclear] the user base, and it's more of a philosophical point. Science is often, even though it may sound like a big question, "is climate change affecting [unclear]?", made up of very specific questions, such as: is a particular species of phytoplankton shifting in the North Atlantic, or, for example, are cod fisheries recovering in the North Sea? These are very specific things that need specific pieces of information, and all the talk so far has been focused on "we draw this data together, we join up data". I'd like to see more specific examples. Is some of this approach actually driven by those sorts of specific examples? I assume that would help us understand the user base. I'm trying to understand, sort of philosophically, where this fits in the philosophy of science.
[Donald] I think that's a fair point. If I hadn't already gone five minutes over my time I might have said more on this. If you go to the GBIF website, I think it's under the news link, or you just do a Google search for GBIF and "GBits": GBits, the GBIF newsletter, does tend to cherry-pick some of the published scientific examples, which tend very often to be much more at that kind of level, and shows how they've connected back to the data.
The new GBIF annual report also has a science supplement tucked in the back which summarizes, for 2012, somewhere over 200 papers in various categories, you know, food and farming, or whatever you're going to call it, but those kinds of categories of some of the applied uses.
It is clearly important for us to understand how the data are being used, but also to go beyond that and start understanding what it is about the existing mobilized content that limits its usefulness in those areas. That's not necessarily so that we just immediately rush off and digitize exactly those things, but at least it then gives us a more rational basis for looking at those needs across all those areas and making some decisions.
It doesn't make sense just to pour money into digitizing the stuff from A to Z just because we can. We may all have a deep-seated desire just to see that happen because it would be so great, but we've got to be intelligent and understand what the real priorities are and what the things are which make a real difference.
[Vince] Okay. Well, maybe one more question, then we must break for coffee.
[Audience question, inaudible]
[Donald] It's one of my big frustrations that I feel I've been talking about these issues for seven or eight years at the very least, probably the better part of a decade, and I don't feel that right now we're that much closer.
However, when you look at the number of major government-funded activities around the world that are getting serious about multi-decade data curation, and when you start seeing, within the context of GBIF, the number of national GBIF nodes that really now are in a position that they could establish repositories with guarantees for at least (this doesn't sound very exciting) a five to ten year period, I believe that with a little better coordination within our network we could establish a peer-to-peer model. Effectively institutions may say: we can guarantee 10 terabytes of storage for the next decade; we don't know what the situation may be beyond that, but in the meantime we focus on using that as part of a big, distributed, redundant store.
That's where I'd like GBIF to be going next. It's not necessarily the GBIF secretariat; I think it's more appropriate for national interests to step forward and do that. But a model like that could actually allow all kinds of contributors, even individual researchers, effectively to be hosting replicated copies of all this data as a redundancy tactic, while having over the top a layer that allowed us to refer persistently to these things. BitTorrent and things like that should have taught us a lot of what we need.
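The peer-to-peer idea in this answer, where hosts pledge a fixed amount of storage and every dataset is held as several replicated copies across them, can be sketched as a simple placement routine. This is a minimal Python sketch under stated assumptions: the greedy strategy, the host names, the sizes (in terabytes) and the choice of three copies are all hypothetical illustrations, not a description of any system GBIF operates or plans.

```python
def place_replicas(datasets, pledges, copies=3):
    """Greedily assign each dataset to `copies` distinct hosts.

    datasets: mapping of dataset name to size (TB)
    pledges:  mapping of host name to pledged storage (TB)
    Largest datasets are placed first, each on the hosts with most free space.
    """
    free = dict(pledges)
    placement = {}
    for name, size in sorted(datasets.items(), key=lambda kv: -kv[1]):
        # Hosts that can still hold this dataset, fullest-capacity first.
        candidates = sorted(
            (host for host in free if free[host] >= size),
            key=lambda host: (-free[host], host),
        )
        if len(candidates) < copies:
            raise ValueError(f"not enough pledged storage for {name}")
        chosen = candidates[:copies]
        for host in chosen:
            free[host] -= size
        placement[name] = chosen
    return placement


# Hypothetical pledges from national nodes and an individual researcher.
pledges = {"node-a": 10.0, "node-b": 10.0, "node-c": 6.0, "researcher-d": 2.0}
datasets = {"collection-x": 4.0, "survey-y": 1.5, "checklist-z": 0.5}

placement = place_replicas(datasets, pledges)
for name, hosts in placement.items():
    print(name, "->", hosts)
```

A persistent identifier layer, as mentioned in the answer, would then resolve each dataset name to whichever of its hosts is currently reachable.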
[Vince] And on that point, I think that's a really good opportunity to break for coffee, so thanks very much, Donald, that's fantastic.