Good afternoon everybody.
My name is Lori Goetsch. I'm the Dean of Libraries.
Most of you know that, but we have a few guests in the room who may not.
Welcome to
our Open Access Week activities. I'm very pleased to have the honor of
introducing our speaker this afternoon and I will tell you a little bit about
him.
Our speaker is Dr. Philip E. Bourne.
Dr. Bourne is a Professor of Pharmacology at the Skaggs School of Pharmacy and
Pharmaceutical Sciences at UC San Diego.
He is also the associate vice chancellor for Innovation and Industry Alliances,
associate director of the Protein Data Bank, and cofounder and founding editor
in chief
of the open access journal PLOS Computational Biology.
He is a past president of the International Society for Computational
Biology,
an elected Fellow of the American Association for the Advancement of
Science,
the International Society for Computational Biology,
and the American Medical Informatics Association.
Dr. Bourne is an advocate for open access,
and his research and professional interests focus on biological and
educational outcomes derived from computation
and scholarly communication. His areas of interest include algorithms, text mining,
text mining I think this is supposed to say, machine learning,
meta languages, biological databases, and visualization
applied to a multiplicity of problems including
drug discovery, evolution and cell signaling.
As an open-access advocate, Dr. Bourne is committed to furthering the free
dissemination of science
through new models of publishing and better integration and subsequent dissemination
of data and results.
Please join me in welcoming Dr. Philip Bourne to Kansas State University.
Well thank you
very much for having me here. It's a great pleasure to be here.
It's interesting, we were just having a discussion, in fact, looking at the gender
balance in the room.
It's not actually a balance, which is completely opposite to when I gave a talk
to the editors of the American Chemical Society, where
the balance was the other way around, and they treated me a
lot worse than I know you will.
I can see that you're a friendly, definitely a friendly, audience already.
So maybe I'll just try and tell you a few things that
I've sort of been thinking about for some time, as many other people have.
I was asked to put more of an emphasis on
data, and open data, and what the implications are.
I'm primarily just a faculty member who's
interested in pushing this, so
that's my perspective. My thing's not working anymore; it was before.
Well, I can't seem to change the slides at all.
I could just talk about this one slide
for....
(Speaking to self) Sorry, let me just restart this.
(Speaking to self) That is very strange.
(Speaking to self) Well, there are other ways of doing this, which don't seem...
(Speaking to self) Whoops, we don't want to look at my email, sorry.
(Speaking to self) You have enough of your own, you don't need any of mine.
(Speaking to audience) So let me just tell you a little about my perspective.
So as was already made clear, I work in the
Biosciences and it's really that perspective I bring to this.
I run a couple of meta databases and we distribute
a lot of information. We estimate a quarter
of the National Library of Science,
yes the National Library of Science, to
investigators every month. Based on that,
I have a certain perspective on data.
Now, with the new role I have, which is
looking for innovation within an institution, I've started to think more
about
what the institution itself,
each of our own institutions, should be doing around data.
(Inaudible mumbling)
So, in all of this,
and I'm a big advocate for open access, I always use
a caveat associated with that, and that is the notion that there must be a business
model.
You can't have sustainability
without a business model and, as you know, there were a lot of questions about open
access,
and whether a business model would be sustained,
when things started off, but clearly it is sustainable, no question
about that.
Organizations like the Public Library of Science, with which I'm involved, have
already proven that.
I also acknowledge that every discipline is different and, in fact, just because
something works well, say in the biosciences, it doesn't necessarily mean
it's going to work
in other places. What is clear,
in my mind, at least, is that my general opinion of
open access is not a question of if we can have open access.
We already have it. It's a question of, you know, when is it going to be,
and how is it going to be, the predominant form
of scholarship, and those are still
open questions, but it's clear if you talk to
leading editors of the major closed access publications,
I think they will each agree that, at some point, there will be models
that provide at least an open access option, if they don't already.
So we're getting there. The momentum is
increasing. I think where
we need to have much more effort and where things will really accelerate
is when we use the open access content
in ways that actually increase our understanding and increase our knowledge
base.
I don't think we're doing that at the moment. Most of what we do with open
access content
is to just have access to it
across a broader spectrum and for a larger group of people. But we also have,
with that content, the ability to use that content. I don't just necessarily
mean in text mining,
although that's part of it, but really knowledge discovery
from the corpus, and developing
ways of doing that is still fairly undeveloped. But we are, as
politicians keep telling us, in a,
you know, knowledge-based economy now. As that tends to grow,
I think more and more attention is going to go to extracting data
from this available corpus, and that's
sort of what my theme is going to be. So let me cast that
theme. I learnt at lunch that you're working on an open access policy. The
University of California system
is also working on an open access policy, and just
to put that debate in its current perspective,
let me, you know, indicate where we are.
What we're moving towards
in our case, is from an opt-in policy to an opt-out policy.
So right now, you choose as a faculty member or as a scholar
to opt into open access publishing.
What many places, Harvard, Stanford and so on, have done is
essentially move to an opt-out policy where
you need to make your work open access, but
you can opt out if you so choose.
Already I've heard, and I've just sort of summarized,
the arguments that are familiar to anyone who's involved with open access
because they keep recurring, and those, of course, are things like
the cost. Look at the negatives: the cost for
some disciplines. If you don't have grant money, you have to pay to publish. Where's
that money going to come from?
It's a legitimate concern. Therefore, each discipline needs to approach this
differently. The impact on societies,
many of which depend a lot on the revenues
from journals: what happens to that? So there are well-trodden arguments that
are important and we are currently debating all of this.
The notion of journal quality: there is this perception that if you publish
in an
open access journal the quality is not there. Well, that,
you know, to some extent has been true because,
in the biosciences, with journals like Science and Nature,
there hasn't been an open access equivalent to that. That's now changing. eLife is about
to start publishing
and that intends to be on a par with Science and Nature,
as an example. And then there is this notion of there being a
"big brother," and being told what to do, and so forth.
On the "for" side, of course, I think most people would agree
that in a public university,
you know, whatever you do, in a sense, should be made public,
simple as that. You can cast it in many different ways.
What I've been trying to push now to help this
movement move forward is the idea that there is an
institutional perspective and value in this that
we haven't even begun to address yet. I'll give you an example or two of
that at the end, but really we haven't identified that if we make
our research data, as an example, and our research knowledge
open, we can do things with it
as an institution, as well as in a more global perspective, that
haven't been otherwise possible, and that, I think, is an example of where
open access could really be made to take off.
(Inaudible muttering)
(Inaudible muttering)
So why is all this so important to me? Well,
here's an example that's just taken from our own work.
This is from this database that we maintain called the Protein Data Bank.
Now, in the top right hand corner is a graph that shows the number of
H1N1 cases in the United States over time, as maintained by the CDC.
You can see that it grew quite quickly, then it flattened off.
Correspondingly, this shows access to items of data, it doesn't really matter what they are,
in a public repository in response to this crisis,
and there's a correlation between how this data was accessed
and the crisis itself. So what it says
is that in a time of crisis, this data becomes very important
and is accessed correspondingly. It's very important that that data be
available.
At the same time, the Public Library of Science created
a portal, PLOS Currents: Influenza.
That was a place to publish information about the epidemic very
quickly
with minimal review so the information could be put out there-
not just papers, but also data and other things.
There was a huge spike in activity around this portal
during the epidemic. When, of course, that died off,
usage of it dropped back to very low levels.
Partly that has to do with the reward system.
There was no reward for putting things into that kind of archive
when you can publish it, even though it might take a year before it appears.
But the crisis shows that people were prepared to do other things.
These are just some examples of
signs that we have the potential, at least, to change.
It proves, too, that open science just accelerates the process.
There's no question there was an acceleration of discovery by virtue
of things like the PLOS Currents portal
and access to the open data.
In the mode we're in, with presidential debates and everything, what
happens almost invariably is you start off with a general point,
and then you go into "Well, I met someone
when I was having coffee at so-and-so and they told me...,"
and of course that's to drive home the point. So why should I be any different?
I'm just gonna tell you a couple of stories about people I met that sort of
drive home the whole value of open
access and open data in particular.
This is someone called Josh Sommer. Just out of interest, how many people know
this story?
(Inaudible mumbling)
He was a freshman at Duke in engineering and he started getting these serious headaches.
He went and had a full workup and he was told that he had
chordoma. He's lying there in recovery
after all these tests, using the wireless in the hospital.
He's googling this and he's looking at abstracts in PubMed,
and he realizes this is a very serious
situation, but he can't actually get into the details of
the studies to find out more about his condition.
This is just a common situation.
What he does discover is the prognosis is not good.
In fact, he had, at best, seven years to live.
You know, had it been me, I probably would have started partying heavily
and been gone. What he did was something quite remarkable.
With his mother, first of all, he went into the lab;
fortuitously, the only NIH-funded
chordoma research in the US at the time was also at Duke University.
He stopped doing engineering, and he went and volunteered in this person's lab.
That's a picture of him in the lab. At the same time, he and his mother
formed the Chordoma Foundation.
This is one of the earlier meetings that they had.
They got a lot of sponsorship and a lot of support for this.
What he noticed in the lab was actually quite appalling.
(Inaudible mumbling)
(Inaudible mumbling)
What he found was that
people even in one lab didn't communicate and share information
quite in the way they should and sharing between labs
was even more problematic, both in terms of data
and in terms of knowledge. It was all about getting those publications,
which takes a long time. If you only have seven years to live,
a year to publication before someone else can even start looking at
what your findings are is pretty devastating.
He became part of a movement which is called "Sage
Bionetworks,"
where the idea is to take this kind of model
and turn it into this kind of model using the so-called
"Power of the Commons." It's actually creating a collection of information
around disease modeling that's accessible to everybody
all the time and accessible more or less immediately.
It begins to change the way,
potentially the way, science works. It's almost like
a commons becomes the equivalent, in physics, of the Hadron Collider-
where you have all these scientists aggregating around, in that case
a large piece of hardware. Here, you have a lot of scientists aggregating around
what's actually a virtual space that's full of data and tools
and people's commentary about a treatment of a disease.
These Commons are starting to really take off.
That's important because what Josh shows, year
after year, is a picture of the meeting of the Chordoma Foundation,
and then he shows, in the circles, the people who have died since the
meeting was held the previous year. You know, we're talking about
a disease that particularly affects young children.
You know, the point is we need to accelerate the process
for people like Josh. In the same vein, I'm now moving to a different coffee shop and
I'm going to tell another story I heard,
and this is the story of Meredith.
This is another example of why, frankly, I'm standing here.
I actually was editor in chief of this journal.
About once a week, someone would send me a paper,
not to the journal but directly to me by email, thinking they're going to get
an easy ride through the review process. Normally, I send it directly to the
journal office.
This time, for reasons to become apparent,
I decided to look at it myself.
It was in pandemic modeling. I'm not an expert in pandemic modeling,
but I could see that there was something special about this, both in the way the
work was being done,
by a single author, and
in the outcome. I sent it to Simon Levin, who is at Princeton and a Kyoto
Prize winner.
He looked at this and he said, "Now this is really a special piece of
work." So I actually advised her to send it to Science
to be reviewed. It was reviewed and, in the end, they didn't publish it, but
it will appear somewhere like PNAS very shortly. It's a
very good piece of work. What's so special about this is
I met with her because she lived in San Diego,
and after I met with her, I invited her to come give a lecture at UCSD, which she did.
I'm sitting there, listening to her fend these questions
from high-powered academics, and I'm just marveling at it.
Why am I marveling at it? Because she's 15 years old.
She's a senior at La Jolla High School. This started
as a project,
a science fair project. She then went
and started looking at this seriously. She wrote to Wolfram and got a copy of
their package,
a package for analysis.
She wrote to the original authors because the data wasn't online anywhere. She had
to get
data from the authors. She wrote to the supercomputer center in San Diego
and she got thousands of hours of computation time.
And she did this thing all on her own.
It's just a remarkable story. She's now at Stanford.
She is 16 now. She can actually drive.
When she first came, her father had to bring her to the lab because she had no way of
getting there.
So what does all this tell us? It tells us that
openness is, obviously she's at an extreme end of
the spectrum, I have a daughter the same age
and my daughter is so sick of hearing the story.
(Inaudible mumbling)
It's an odd case, but clearly,
what I see in the students that I
see every year, who are primarily graduates, and some undergraduates now,
is that some of them are phenomenal. They are
part of this Wikipedia economy, whatever you want to say,
your YouTube generation.
They know no bounds and I really believe that these kinds are,
I say, under-exploited. I mean, I have two undergraduates in my lab right now
who are doing work as good as some of my graduate students. One of them,
I'd say, is at the post-doc level.
I think they're doing it because they have all this access.
It's another example of how
we can exploit these assets. We need to tell the stories
and really get this to be more common.
Let's explore the notion of what we might expand on this
with an emphasis on data and where we are going with data.
One of the problems, and I'm referring mainly to the biosciences,
is that we have these things coalesced into silos,
so the majority of the knowledge is
effectively in literature, and the majority of the data from which that
knowledge is derived
is effectively in databases,
at least the digital parts of it.
But they're starting to coalesce in some ways.
You know, they are not fully coalesced,
and the reasons are partly technical and partly cultural,
in my opinion. The supplemental information in papers has exploded over the
last years, mainly because of all this extra data in there.
Data journals themselves are emerging.
I'm involved with an effort, for example, with Faculty of 1000
where they're actually putting together data gems.
It's actually publishing data sets. Why would you publish data sets,
because that's not where you get the credit? You know, we are stuck in this system
where you only get credit for publishing papers.
So what do you do? You write a paper about a dataset, which of course, has no
value in and of itself.
It just gives you a metric in the system.
I have a paper about the PDB database, which I think is the third or fourth
most cited paper in all of
biology. What's ironic about it is no one has ever read it.
There is absolutely no reason to read it because it's just
a reference to a database everybody uses. But the only way we get credit for doing
it
is to write a paper, a useless paper, about the database.
The system is completely wacky, and when I tell, we have,
I'm sure you have the same system here, but maybe with a different name,
the "Committee on Academic Promotions," we use "CAP,"
when I try to talk to CAP about this, it's
just,
breaking that system is very hard. But it has to start with the people being
evaluated
telling the people who are evaluating them how they should be evaluated.
That's pretty hard if you don't have tenure. I'll say anything. The
chances of me getting kicked out are pretty small. Well, they were until
today, but who knows.
I'm digressing. So there is this issue of reward.
What we are seeing is a level of coalescence
and the software becoming available. At the same time, the databases are becoming more
like journals.
I call them "knowledge bases"; they try and aggregate more information
about that data.
You can do science on the fly in these resources. You can run
applications and programs in these resources, and,
you know, there are people who dedicate their careers to making these resources
very valuable,
so-called, in this field, biocurators.
To me, they are some of the unsung heroes
of the kind of science I do, anyway. But all of this leads me to say
that things really need to change. Where does this take us?
Really the whole notion, in principle, not in reward, but in principle, is that
a paper is an artifact of a previous era.
It's not a logical end product of e-science.
Work is omitted. If you've ever tried to use
supplemental information in a paper, you know you can do it
if you look at it by hand. But what you really want is a
computer program to go and pull this information in from multiple different
papers,
and make some aggregated conclusions. Good luck.
The visualization that you can do on things in a paper is pretty limited.
There is nothing wrong with the paper, don't get me wrong, but
as you will see in a second, it's really just one way of looking at it.
You can't interact with a paper,
and that's really how we get to the next step in knowledge: by interacting with
what we have already.
There are lots of other aspects like the use of rich media,
which you know we haven't used very effectively at this time.
That's really where we stand on the paper side. What about the data side?
We now have data sharing policies.
Many of you know about these because you're either subject to them,
or you're responsible for telling faculty and you haven't got the faintest idea what
they
need to do with these. We all write these grants now with data
sharing policies
which conform to these wonderful notions,
but I almost guarantee, myself included, that anyone who's writing one of these things
doesn't even conform to their own policy. I try to,
but for reasons, some technical, some cultural,
it's very hard to do. But don't worry, no one's really checking.
The NSF doesn't know how to deal with it either, or the NIH for that
matter.
It does beg the question, but it does take us one step forward
to at least some level of awareness now of the importance of this data.
The awareness is growing
because big data is taking off.
There was the Office of Science and Technology Policy.
The White House put $200 million into Big Data
projects.
That's now trickled down and we're seeing a variety of projects in the NSF
and NIH around the notion of Big Data.
There's more attention being paid to this. I really like this and
put in there.
Even the private foundations, I've been working with the Gordon and Betty Moore
Foundation,
where they've been trying to figure out how they can promote
the idea of value of data to science
in ways that it's not currently being done.
We had a think tank meeting at their
offices in Palo Alto and they brought in this sort of conceptual cartoonist,
which I really like. While you're debating around the room and discussing this,
this is actually being captured on a single wall,
all the outcomes associated with that.
They then use this final summary,
in this case it was asking "Where do we need to be with
data in
the year 2021?", as
the basis for their funding in this area in the coming years.
I think what you're going to see is this idea of trying to
break this reward system, to reward those people
who manage and produce data and use data,
not from the publication point of view, but with money to continue to do and
develop what they are doing.
I think you're going to see some interesting developments
from all agencies. You've got to add all these other things into
the mix
when you start worrying about data. It's not just big data. You've got all these
issues associated with reproducibility,
maintainability, usability, and reward we've already
touched upon. Reproducibility
is, undoubtedly, a myth. Call it a pillar of science- it's not that something is not
reproducible in my opinion,
it's just the effort that needs to go into that reproducibility
is just huge, even when you do it yourself.
I took a paper that we published about a year ago
with four authors in my lab. Two of the authors had
already left the lab. What we used was a workflow system
to try and recreate that work so the next generation of students coming in
could actually repeat those experiments and build new experiments
in a way that was quite efficient.
Well, guess what? We couldn't actually do it.
I couldn't do it myself, anyway, because I wasn't the one doing the work, but even the one person left in
the lab
who knew, you know, the ins and outs of the work, all the bits and
pieces,
realized that without the help of the other two authors who left the lab, we couldn't
reproduce it.
I'm being honest about this.
I would argue in many cases
that is the norm, not the exception.
Maintainability is something we are not even coming to grips with.
DNA data alone, forget about everything else,
is doubling on the order of every five months. It is way ahead of the Moore's
law
curve, which says, you know, you can continue to,
at the same cost, maintain an increasing amount of data.
It's just not gonna happen. No one seems to be facing up to the idea
that we are really going to have to decide what data to throw away,
and how we go about throwing it away,
and how we choose what to throw away. Usability I'll come back to in a minute,
but
using data in different forms is not straightforward.
There is no tenure for just publishing data.
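To put rough numbers on the maintainability point above, here is a minimal sketch in Python. The five-month doubling time is the figure quoted in the talk; the 24-month doubling period used to stand in for Moore's law and the five-year horizon are assumptions chosen only for illustration.

```python
def growth_factor(months, doubling_months):
    """Return how many times larger a quantity is after `months` of steady doubling."""
    return 2 ** (months / doubling_months)


HORIZON_MONTHS = 60            # assumed five-year planning horizon
DATA_DOUBLING_MONTHS = 5       # DNA data doubling time quoted in the talk
CAPACITY_DOUBLING_MONTHS = 24  # assumed Moore's-law-style doubling period

data_growth = growth_factor(HORIZON_MONTHS, DATA_DOUBLING_MONTHS)
capacity_growth = growth_factor(HORIZON_MONTHS, CAPACITY_DOUBLING_MONTHS)

print(f"Data grows about {data_growth:,.0f}x over {HORIZON_MONTHS} months")
print(f"Same-cost capacity grows about {capacity_growth:.1f}x")
print(f"Shortfall factor: about {data_growth / capacity_growth:,.0f}x")
```

Under these assumptions the data outgrows same-cost capacity by a factor of several hundred within five years, which is the sense in which some data would have to be thrown away.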
Dreams do emerge, and here's my dream,
all encapsulated in one slide.
I've shown it so many times I think it's going to be encapsulated on my tombstone.
I just want to use it as an illustration
for the kind of things I'm trying to say, and it kind of puts it together in one
slide.
That's this notion that here's a paper,
this is just one view of the knowledge, it's not the only view.
I can actually generate a series of other views, but let's start with the view that's
familiar.
It has a lot of attraction, it has a lot of value.
I actually understand it. It has an interface I can use.
It's not like an interface in an institutional repository, sorry,
that basically is unusable. I can use this. I can
go from one journal to another and it's pretty recognizable.
This is the same thing. You shouldn't throw something like this away.
We probably, we definitely need, different forms.
Then I do what I do now. I click on a little thumbnail
in that online Journal article. Up pops a bigger version
and that's kind of where it stops right now. That's useful.
In this case, it doesn't really matter what this is, but just
for reference, this is an enzyme, a structure of an enzyme.
There's a lot of
amazing information and author-based knowledge in the way that this is
rendered and represented.
But it's a static entity. What I really want to do
is I want to start playing with that and understanding that better.
So what I want to do, but I can't do now, is click on this
and retrieve the data that was used to make that figure,
which is actually the raw data plus a whole series of
meta-data that manipulates that raw data to give exactly that image.
In this particular field, there's no reason why that can't happen
already.
The people doing this kind of work use a very small number of programs
and it's kind of all online anyway. The data that supports this is all in
a database that I happen to represent.
Once I do that, I can now do things I couldn't do before. I can render this.
Then what are the things I want to do? When I start mousing over this,
I want to see
what the commentary is. I want the social
networking aspect of this brought to the fore.
I want to know what other people understand about this particular figure
when it's rendered in a particular way, so I might be looking at something over
here,
and then some aspect of this intrigues me. What have other people had to say about
it? You know, right now, if it was in PLOS, for example, you can
comment on a paper, but
people don't do that.
They don't do that because there is no reward for it. On the other hand,
what's emerged, which gives me some hope, is
a very active blogosphere out there when there's something wrong with the paper,
but not when there is something right with it. But we are starting to get to that point.
There are now people who build reputation on their blogging, and that's
quite an amazing turn of events. They are revered at these conferences and things.
It's getting to the point where it's more than just papers.
The idea that there's effective
commentary on here is coming. Based on that commentary, I see something
interesting, and I click on it.
That takes me to a mash-up of information
on a particular loop in this structure in which I'm interested.
I get this mash-up of information about other papers, other
data sets that relate to that.
That drives me to another paper, and so the loop continues.
What I've done is I've changed a few things here.
I've got this seamless interaction between
the data and the paper, and the knowledge associated with that data,
which I didn't have before.
Technically, this is already quite doable.
It's starting in a small way to happen.
So that's,
once people see this, I think that becomes, at least to my way of thinking,
more of a motivator. But it's just sort of the beginning.
Here's an example of something that's actually not looking at one paper.
This is something that was actually done, and it's just trolling through the literature.
What it is is effectively pulling together,
each of these nodes in this network could be
any piece of information, let's say it's a piece of biological
information, it's a gene. These genes are in this, and if
two genes are mentioned in the same paper,
you draw a line between them. So you've got this network and you've got this
other network
over here; in animation mode, this would have all sprung to life.
You're getting the whole picture in one shot.
The thicker the lines,
the stronger the relationship: the more times that these two genes occur in the
same paper.
Well guess what? That alone tells you something important.
This network has topology, and when you overlay on that
a very simple thing, namely the type of literature
in which this occurs, you see you've got these two distinct networks which are
connected by one gene.
It turns out that the immunology community,
one community, does not really understand what the other one knows
about that particular gene.
Just by doing these things in a fully automated way,
you can start to learn things just by trolling the literature. This is a trivial
example,
but it's kind of an expression where we're probably going.
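The co-occurrence network described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method behind the slide: the gene names and the list of papers are invented, and a "connecting" gene is approximated as one whose removal splits the network into more pieces.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: the set of genes mentioned in each paper.
papers = [
    ["GENE_A", "GENE_B", "GENE_C"],
    ["GENE_A", "GENE_B"],
    ["GENE_C", "GENE_D", "GENE_E"],
    ["GENE_D", "GENE_E"],
]


def cooccurrence_edges(papers):
    """Count, for every pair of genes, how many papers mention both."""
    edges = Counter()
    for genes in papers:
        for a, b in combinations(sorted(set(genes)), 2):
            edges[(a, b)] += 1
    return edges


def neighbors(edges):
    """Turn the weighted edge list into an adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def component_count(graph, removed=None):
    """Count connected components, optionally ignoring one node."""
    seen = {removed} if removed is not None else set()
    count = 0
    for start in graph:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return count


def bridging_genes(graph):
    """Genes whose removal disconnects the network."""
    base = component_count(graph)
    return sorted(g for g in graph if component_count(graph, removed=g) > base)


edges = cooccurrence_edges(papers)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}: mentioned together in {weight} paper(s)")
print("Genes connecting otherwise separate groups:", bridging_genes(neighbors(edges)))
```

The edge weight plays the role of line thickness on the slide, and the bridging gene is the single node joining two otherwise separate clusters.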
That's the notion
of the field of discovery informatics,
which I think is just beginning. I went to one of the most exciting workshops
earlier
this year on discovery informatics, where there were a bunch of computer scientists, a
bunch
of domain scientists, a bunch of social scientists, and some librarians,
all sitting in a room figuring out how we're going to deal with this future.
What's clear is Google is incredibly useful,
but it is broad and shallow. When you want to get into the nuances of data and you want to
look into a subject deeply,
you might start with a Google search, but that's not where you end up.
You probably don't get to where you want to be.
Science is cross-disciplinary.
You really want to get to a point where the discoveries
you want to make certainly surpass
an individual's tools, and you need these intelligent
tools to mine and troll through all this information.
Then you need to increase these connections between
knowledge and data in the way
I just described to you. What's clear is you need to
combine a whole group of
different types of people to address this kind of problem.
That's sort of what discovery informatics is all about.
Here is a scenario which just expresses this to
some degree. How many people use Evernote?
A few. Okay, Evernote is a very trivial
and simple tool for keeping notes
on everything from shopping lists to your major developments in the lab.
The reason it's, in my mind, successful
over other programs, which are much more sophisticated
lab management programs, is just
because it is very simple to use for a variety of tasks.
When you get old and brain dead like me, it's valuable to be able to recall information.
It's all in the cloud, so I can be standing here and, if I
forget to tell you something, I can look at my notes
on my smartphone. Beware, you'll be here for hours.
The point is that there is a recording mechanism.
It doesn't matter what tool everyone's using, but
the idea is that there are certain commonalities
in what we do in the lab
at any given time, and those can be discovered.
With two people in the lab, it gets to this
issue of trying to deal with the Josh situation and
bring all this closer together. It's the idea that
you can discover some commonalities in what's going on in the
lab by trawling through what people wrote that day.
What's even more important is that you can take that
and you can go out and search the web. The dream, and this is
just a dream, a kind of scenario that came out of this workshop,
is the idea that you can take those particular
threads and themes that are running through the lab and go and search
the Semantic Web, which we'll get to in a second,
for that kind of information. You can pull it back
and, based on all sorts of criteria
relating to authority and other things, you could potentially rank that
information. When you get up the next morning and have your coffee,
you can see what the rest of the world has done overnight that relates
to your particular interest.
This is not that far, at least at some level, from being
reality.
It brings into play a whole notion
that's been around for a long time and hasn't gone anywhere,
but technically this kind of thing is already doable.
It's doable in the context of the so-called Semantic Web.
The Semantic Web is being built; it doesn't matter how it's built,
you know, in RDF and these other things. Over here, down at the
bottom in pink,
is a whole series of biological resources.
There are connections between data elements within these resources.
NCBI and the National Library of Medicine have been
an example of putting these things together for years, so there is nothing
particularly new about that.
What's interesting is that the semantics, if done right,
carry you into completely different domains,
which have some value, perhaps. Here's
the trivial example I show. Over here is
media, and over here is the BBC database.
I maintain this protein database, and
I'm always looking for things to help educate students.
If this semantic connection was all in place,
I should be able to immediately find what the BBC has ever shown on a particular protein structure
just by virtue of these kinds of connections.
In principle, the technology
exists to do this today. You know, we're getting to
that point. But let's
get real, okay? Let's be
real about what we can actually do. I'm probably going to offend people,
and I apologize;
these are just my own perspectives from trying to do these things.
The realities of today
are a lot different. This idea of
a fully functional Semantic Web and knowledge discovery,
where I get up in the morning with a cup of coffee and my paper's already written,
I mean, we're a long way from that.
I'd say these are just three realities
from my own point of view that I sort of popped out to think about.
From my point of view, data repositories, institutional repositories,
are just not working. They are not working for me,
even though I'm getting to the point now where I'm
going to be required to use them
if UC passes this open access policy where I have to opt out rather than
opting in. I have to deposit
a paper into my institutional repository.
I mean, I tried that the other day and I have to tell you, the whole faculty
would be in an absolute uproar.
I mean, I spent 15 minutes trying to do this,
and, you know, I got a message that someone would get back to me in two days.
When I'm publishing a hot piece of science, I'm not going to wait
two days for an institutional repository to tell me
what's wrong with it. So these kinds of things.
I'll mention the High Noon effect in a minute. I think there is
no question that what
NCBI has been able to do over the years
is an example for all fields of science.
I'd say that the reason, from my point of view, why there is trouble with these
repositories
is that the idea of building it and expecting people to use it
has not worked very well. I mean, I think the usage of our repositories
is just very small.
The idea that they are institutionalized, of course, is a problem
in itself. They should be global.
You can certainly have different pieces of information with different
access, but overall, the concept should be global.
Then, NCBI kind of works because, first, it is funded well.
It has strong leadership; it has a monopoly on some of these things;
and it thought through the IT aspects a long time ago.
You know, I think there are models in there. So, as I was saying, it takes
resources and
strong leadership, and it needs institutional support behind the repository,
really behind it, to make a difference.
What do I mean? What else is wrong?
I'd say there is this "High Noon Effect." Some of you,
well, quite a few of you in this room, remember the days of the VCR,
where whenever you'd walk into someone's house,
nine times out of ten, you'd look under the TV set and there was a VCR and it was
flashing twelve. That's what I call the "High Noon Effect," because no one
ever went to
the trouble of trying to program it. You could set the clock, but
trying to record a program was too difficult.
The barrier to entry was just too high. DVRs
have just changed that. You pull up the program on the screen, you click
a button,
it's done. We need to move
from the VCR era to the DVR era
with respect to institutional repositories and
other kinds of tools. Without that,
they just aren't going to get used much. It's just not going to happen.
Publishers have sort of got one end of the
situation sorted out. I mentioned the idea that a paper is fairly generic.
Well, not when you go to publish a paper. When you go to submit a
paper,
there are not that many, but there is a significant number of, different
journal management systems that you have to fight through.
Data repositories need to create something that is more uniform.
What we are seeing, of course, is this merger, in that data and journals are sort of
coming together. Now there are data journals.
But they, to me, are not really going to get us anywhere.
They are a bit closer to where we want to be, which I imagine as
total integration. Journal publishers
don't know how to handle this. Most publishers do not know how to handle significant data sets.
As a result of that, they sort of pawn the problem off to some
third party. More publishers, including PLOS, for that
matter, are using resources like Dryad
where the data set has to be deposited. That's a great step forward,
and it gets a DOI (Digital Object Identifier) so, at some level, it is retrievable.
Then it goes off into this thing called Dryad, or one of these other
repositories.
It's only the beginning. There is not consistent metadata. There is not
consistent
information about that dataset. We are starting to see
data journals themselves where this is being addressed.
One of the beauties of the whole
open access movement is it brings forth
new ways of thinking about problems. New kinds of solutions emerge.
It's sort of a slight sidetrack, but the idea is
how can you move to a new system
by just being a little innovative.
Here's an example that's not directly related to data,
but it could be applied to data. That is the idea of
what PLOS started with respect to these topic pages.
If you look at what, you know, we talked about Meredith, and
the notion of getting knowledge from sources like Wikipedia,
the problem is, most scientists don't stick stuff there because there's no
reward for it.
You don't get tenure for running a Wikipedia page.
How do you solve that problem? Well, you essentially say,
alright, let's
solve that problem by
giving the credit. Let's write this page
as a mini-review. When we do that,
we actually publish it in a journal, it gets a PubMed ID,
it gets the credit and all the rest of it. That becomes the copy of record.
At the same time, we actually take a version and put it in Wikipedia,
and that becomes the living version.
The authors get the reward and we've seen it in Wikipedia.
There is no restriction,
well, there are some restrictions, but not very much, on doing that from a copyright point
of view.
It sort of solves a kind of problem.
It builds more knowledge into Wikipedia and people get credit for it.
I'm just going to close now with a sort of final message
about the notion that, even though open access is a global
issue,
you know, what can we do locally? We've been thinking about this problem.
I want to give you an example of what we're at least thinking of doing
in my place.
First of all, we need to try making an institutional repository
that's useful, with common standards. It's
got to be vetted by the community; the community has got to be part of the
development process.
It needs to be fully open and searchable, and it needs to reward
people for putting things in there.
You need to be able to leverage the asset. That is,
to me, the key. So what does all that mean?
Let me give you a specific scenario in
labs. I fessed up that I cannot reproduce my own work.
One of the reasons
is that we talk about big data projects.
If you take just the average researcher here,
and you take them from all around the institution, that's going to be much
bigger data
than taking a few of the faculty that have big data.
If you want to have an impact, let's not worry so much about them.
The big data producers somehow take care of themselves.
Let's worry about the little data producers, the people who have
DVDs and drives sitting on their shelves and
thumb drives sitting around in their drawers.
Let's try and do something with those people
to create a better situation, to make their work more reproducible.
That way, we can actually comply with the data plans that we're
talking about.
So it's really dealing with the long tail folks. How do we deal with the long tail
folks?
Well, a very simple solution is just the idea of having an institutional Dropbox.
How many people here use Dropbox?
A lot of people, right? It's just a great, simple thing.
Problem is, I can't move particularly large files into it.
But on an institutional network, I potentially could because I have more
bandwidth
than I have when I'm passing over the broader Internet.
When I drag something into Dropbox,
there is no particular access control,
although I can define that. You want that kind of access control,
perhaps a little different: you might want it for individuals, you might want it for a lab,
you might want it for institutions.
If you want to collaborate, you might want it globally.
It's the same kind of thing.
As you drag, you want to actually capture a small piece of
metadata.
This is not too annoying, but it actually
means that I might be able to do something with that data set
two years from now because I captured a piece of information about it
that I would not otherwise have had. All of that
is just trivial stuff. That's just creating a Dropbox.
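As a rough illustration of the drag-and-capture idea, here is a minimal Python sketch. Everything in it is hypothetical: the `Access` scopes simply mirror the levels named in the talk (individual, lab, institution, global), and `deposit` and `can_read` are made-up helper names, not part of Dropbox or any real repository API.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class Access(Enum):
    # Hypothetical access scopes, taken from the levels named in the talk.
    INDIVIDUAL = 1
    LAB = 2
    INSTITUTION = 3
    GLOBAL = 4


@dataclass
class Deposit:
    filename: str
    owner: str
    description: str  # the "small piece of metadata" captured on drop
    access: Access = Access.INDIVIDUAL
    deposited_on: date = field(default_factory=date.today)


def deposit(filename, owner, description, access=Access.INDIVIDUAL):
    """Refuse the drop unless a short description comes with it."""
    if not description.strip():
        raise ValueError("a one-line description is required")
    return Deposit(filename, owner, description.strip(), access)


def can_read(item, user, same_lab, same_institution):
    """Check whether `user` may read `item` under its access scope."""
    if user == item.owner or item.access is Access.GLOBAL:
        return True
    if item.access is Access.INSTITUTION:
        return same_institution
    if item.access is Access.LAB:
        return same_lab
    return False
```

The only friction added at drop time is one required line of description, which is the point being made: a small cost now so the data set is still usable two years from now.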
But what can you do with that Dropbox?
You can
employ it and develop it within the context of the university culture.
It becomes a rich campus resource.
How can you use it? You can put it into the campus culture by
solving the data management problem. For everyone who submits a grant,
the people who control the institutional Dropbox are the ones that, in effect,
write your data management plan.
What they get in return is that when you submit that grant and it's successful,
we hope, the line item you put in
for that data management plan gets immediately taken off to support the Dropbox and
the other things.
You're sustaining it through
the funding. The real
value of it comes
in how you begin to develop around that corpus
in ways you can't even imagine. A simple example:
I review institutional grants on campus,
and I'd sit there and read this grant
from someone else on campus for some money,
and I'd think, "Gosh, I wish I knew that person. They are doing what I'm doing."
...
So think of this kind of automated discovery of people.
If my research data was
in an institutional Dropbox and I allowed it
to be scoured for these sorts of purposes, I would discover that
I'm working on a certain protein domain,
but guess what? Someone in the medical school who I didn't even know has also been
looking at this.
I know that by virtue of their data. It has occurred in the data they've
been putting in the Dropbox
several times. Therefore, it's probably important to them.
Let's make a contact between the two people. I've done that automatically.
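The matching step described here could be sketched roughly as follows. This is only an assumption about how it might work: `corpus` is a hypothetical stand-in for terms already extracted from each opted-in researcher's deposits, and "important to them" is approximated as a term showing up in at least `min_count` separate deposits.

```python
from collections import Counter
from itertools import combinations


def frequent_terms(deposits, min_count=2):
    """Terms a researcher has deposited data about at least `min_count` times."""
    counts = Counter(term for terms in deposits for term in set(terms))
    return {term for term, n in counts.items() if n >= min_count}


def suggest_contacts(corpus, min_count=2):
    """Pair up researchers whose frequent terms overlap.

    `corpus` maps a researcher to a list of deposits, each a list of terms
    (a hypothetical stand-in for whatever gets extracted from real data).
    """
    frequent = {who: frequent_terms(d, min_count) for who, d in corpus.items()}
    suggestions = []
    for a, b in combinations(sorted(frequent), 2):
        shared = frequent[a] & frequent[b]
        if shared:
            suggestions.append((a, b, sorted(shared)))
    return suggestions
```

Requiring repeated occurrences before suggesting a contact is one simple way to keep the suggestions from becoming an annoyance; only researchers who have allowed their data to be scanned should appear in `corpus` at all.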
What does that give you? Maybe nothing. It may be an annoyance.
It has to be done very carefully, but if it's done well
then potentially what I have is a new collaboration, which will then generate
additional grant money which the Deans and everyone else likes, right?
That's just one kind.
And then there is this institutional data associated with what happens to alumni,
you know, what happens to students when they become alumni, and a million things that
can also be fit into this kind of process.
That's what I'm actually working towards
on my own campus. This is just a trivial idea,
but something along these lines. How they're going to mine it remains to be seen.
These things are already getting out there.
Just to finish off,
I want to be able to answer questions. Right now, I can't do that,
particularly. I can retrieve data. I can't go to resources and answer
questions quite in the way I want. This goes beyond the institutional
repository. This is really more general.
I've now gone from local to global.
I want to know all there is to know about a piece of biological data.
I can't even find it now. I can find instances of it, but I can't find
all the instances of it. I want to do things
in a way that is simpler, more productive, more reproducible.
Here's a couple of ideas quickly about how to get there. I need to have a registry. We
don't actually have a registry for data.
What we have is a Google index.
If there was a data registry, suppose I generate a data set with a whole set of
data items in it, let's just continue the biological theme,
with genes in it. Those
are identifiable, so I should be able to register the fact that I have this
dataset
and register the fact that this gene exists in this dataset in a central registry.
Essentially all that it is is the name of that gene, perhaps linked to
where it resides. But people can go in and use it.
I'm required to do that by the federal funding I get. I have
to make it known to the registry.
People who come along see, "Oh, here's the gene referenced in these
several different resources." I go and I try them, and one's good, one's bad.
I can actually
add a comment on that, I can star it. I can do other things.
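To make the registry idea concrete, here is a toy sketch in Python. It is an illustration under stated assumptions, not a description of any existing service: the `DataRegistry` class, its method names, and the 1-to-5 star scale are all invented for the example.

```python
from collections import defaultdict


class DataRegistry:
    """Toy central registry: which datasets mention which identifiers."""

    def __init__(self):
        self._locations = defaultdict(dict)  # gene -> {dataset: url}
        self._stars = defaultdict(list)      # (gene, dataset) -> ratings
        self._comments = defaultdict(list)   # (gene, dataset) -> comments

    def register(self, gene, dataset, url):
        """Record that `gene` appears in `dataset`, linked to where it resides."""
        self._locations[gene][dataset] = url

    def lookup(self, gene):
        """Return {dataset: url} for every registered occurrence of `gene`."""
        return dict(self._locations.get(gene, {}))

    def rate(self, gene, dataset, stars, comment=None):
        """Attach a 1-5 star rating and optional comment to one occurrence."""
        if dataset not in self._locations.get(gene, {}):
            raise KeyError(f"{gene} is not registered in {dataset}")
        if not 1 <= stars <= 5:
            raise ValueError("stars must be between 1 and 5")
        self._stars[(gene, dataset)].append(stars)
        if comment:
            self._comments[(gene, dataset)].append(comment)

    def ranked(self, gene):
        """Datasets containing `gene`, best average rating first; unrated last."""
        def score(dataset):
            ratings = self._stars.get((gene, dataset), [])
            return sum(ratings) / len(ratings) if ratings else 0.0
        return sorted(self.lookup(gene), key=score, reverse=True)
```

The registry stores only the identifier and a pointer to where the data lives, not the data itself, so registering stays cheap while the ratings accumulate from the people who actually tried each resource.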
I can crowdsource the value of that data set in the registry
by virtue of the fact that it is in a registry. That's
essentially the idea. Another
aspect of it is the whole idea of
how I operate on that data. What's really crazy to me is that we struggle
to use programs in the lab that people have developed, and
then immediately we get on our smartphones and we download an app and we start
running something which we understand in 30 seconds.
We need more of that notion of the "app model"
in science and scientific software.
I've gone on way too long, so I won't elaborate on what I mean by that.
I think it's
an intuitive direction in which to go. I think in many cases,
it could be done. The nice thing about an app is there is the App Store.
The App Store gives credit
for things. You know when you look in there, if it's been downloaded
a bunch of times and it's got a five-star rating, it probably is
pretty good. We don't have anything like that in science.
You go to a paper, you read it. "Oh, this paper was published in the Journal of Molecular Biology.
It must be a good piece of software."
You get it, and it's not. The review is never looked at.
It's all meaningless.
In summary,
we have at hand a way to increase the rate of discovery.
We need to put more value on the data and individuals that produce it
and institutions that maintain it. We're all stakeholders in this
endeavor. I'm just gonna give you one example of how you can get involved.
There is an organization called FORCE 11,
which has come out of a group of people who feel that they can
really make a difference in scholarship
by getting together and working together, in this notion of the commons,
effectively,
to improve the way scholarship is
maintained and disseminated. I encourage you to take a look, sign up
for the mailing list, and get involved.
There is actually a manifesto that came out of
the FORCE 11 work that you can look at. There is also The Fourth Paradigm, which
really focuses on the importance of data.
Both of these are obviously open access.
I encourage you to take a look at those. I apologize for going on a bit
long,
and I apologize for the fact that I can't show you the wonderful animations.
Thank you very much.
I'm willing to be yelled down, I'm willing to be abused, I'm willing for
whatever. It's whatever you want to do.
Goetch: Questions for Dr. Bourne? Bourne: Disagreements, anything.
Audience: I'm Marty Courtois. I work with the
repository here at K-State. I had a question.
I think I'm gonna need to look at my computer again.
Do you think repositories would be more useful if we made them
easier to use? For example, in libraries,
we're very good at identifying all
the publications from faculty on campus. We have the tools to
identify that.
Suppose we devise a system where faculty
don't have to take the time, like you mentioned, to sit down
and deposit an article in the repository, but
that would happen automatically, and we let you know that "Professor A"
published
this paper and it goes into the repository
automatically. Would that help to make repositories
more useful from your perspective, or are we still...
Bourne: I think these are small steps that make a huge difference.
Just taking the idea of having a
student putting a thesis in a repository.
I'm sure you do too- we have
specific format requirements for what
a thesis should actually look like.
Probably right now, that's a PDA
embedded in a PDF, which might be difficult to get at, but technically, it
doesn't have to be that way.
There could be a way, it could be a different
cover sheet or whatever else is attached to the thing,
that really automatically provides all the metadata that is needed
for that
thesis as it's dragged and dropped. When the person drags it into the
repository, finished,
they are about to go out and party, they've finished their thesis, they drag it into the
repository.
The repository says "this is a thesis, here is all the information," you say "yes that's
correct,
I just want the reviewers to look at it right now."
and click. The reviewer is already there, click. They get an email, they can drag it
back, but
no one else can look at it. Then, after the partying is over and the person has
passed their thesis,
it becomes a part of the public record.
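The thesis scenario is essentially a three-stage workflow with visibility rules, which can be sketched as a small state machine. The stage names, the `ThesisRecord` class, and its methods are assumptions made for illustration; they do not describe any existing repository software.

```python
from enum import Enum


class Stage(Enum):
    DEPOSITED = "deposited"
    UNDER_REVIEW = "under review"
    PUBLIC = "public"


class ThesisRecord:
    def __init__(self, author, metadata, reviewers):
        self.author = author
        self.metadata = metadata  # e.g. read from a cover sheet
        self.reviewers = set(reviewers)
        self.stage = Stage.DEPOSITED

    def confirm_and_send_to_review(self):
        """Author confirms the extracted metadata; reviewers get access."""
        if self.stage is not Stage.DEPOSITED:
            raise RuntimeError("already sent for review")
        self.stage = Stage.UNDER_REVIEW

    def mark_passed(self):
        """After the thesis is passed, it joins the public record."""
        if self.stage is not Stage.UNDER_REVIEW:
            raise RuntimeError("thesis is not under review")
        self.stage = Stage.PUBLIC

    def can_view(self, user):
        """Author always; reviewers during review; everyone once public."""
        if self.stage is Stage.PUBLIC or user == self.author:
            return True
        return self.stage is Stage.UNDER_REVIEW and user in self.reviewers
```

Each transition is a single call, which matches the "click" in the scenario: the author confirms once, the reviewers see it, and nobody else does until the thesis has passed.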
There is nothing technical to stop any of that right now,
as far as I can tell. It's just
the resources and the will to make it happen.
Resources are an issue.
It's really "where is the value coming?"
I think what's going to happen, and perhaps I didn't state this as well as I'd wanted,
in this knowledge economy, it is clear that more and more
there is going to be a constant reminder
that mining information brings forth new things that have
some kind of value; it could be economic, it could be other.
I was in a meeting where
they brought forward two editors of major British publications,
newspapers: Murdoch's "The Times" and "The Guardian."
What they were talking about was that what Murdoch did was
put up a paywall. They lost 95 percent of
all of "The Times" subscribers to the online version in a month.
Obviously it was a bad pricing model,
but you know, there were too many other places to get news.
On the other hand, "The Guardian's" model was "okay, we're gonna make this content
free.
We're going to create an API around it so people can write applications that
use that content." So one group wrote an application
where they can actually predict voting patterns just by
looking at large amounts of the corpus, and the information that was in the corpus,
the news corpus, as to what voting patterns were in particular counties in
Great Britain.
That has value and so, you know,
they got a piece of that action back. That's just one
trite example of how that corpus could be used.
It's a different business; it's an interesting model.
It will be interesting to see if these things work out, but I think
the nice thing is that with openness
comes innovation. That's the message, and there are lots of different
kinds of
messages like that. Drug companies are realizing this now, that
they can gain much more
by making at least some of what they do open.
The financial industry is another one. World Bank is
essentially making huge amounts of information open access.
...
Goetch: "Other questions?"
Audience: "Hi Dr. Bourne, I'm Shar Simser. I'm the coordinator for electronic publishing
here at Kansas State Libraries. We run an open-access press.
My question doesn't have to necessarily
deal with the data, but the UK
government is now funding,
what 10 million pounds
to publishers to support
immediate and open-access to articles published by
their researchers. What do you think about the report that this was
based on and
the adoption by the UK? Do you see that happening here?
Bourne: "No. Not yet."
I must say, I hadn't read the report. I mean, I've looked at it, and
I know
people's opinions and discussions about it.
There is somewhat of a different culture, I think.
But notwithstanding, it's a very interesting step.
I have enormous
thanks, in this country, for the likes of the NIH for taking the
stance they have with respect to open access.
It's a huge step and
whether or not you take a very large step or you take a smaller step,
a step is good.
It will be interesting to see how all of that pans out.
Audience: "Dr. Bourne, you mentioned several times
that in order for data to be more accessible, faculty have to
be rewarded in the tenure and promotion
process. Do you see that happening
anywhere yet? Or is it happening on your campus?"
Bourne: We just had a report from
a group charged with looking
at the way promotion was done.
From my point of view, it was a little disappointing
because I don't think innovation was rewarded as much as I would have liked
to have seen.
What was clear
is the idea that we have to get away from this notion
of rewarding people in a singular sense. That's ultimately what we have to
do.
Somehow we have to reward them
for the fact that science itself is becoming much more collaborative,
and the whole idea of
"what has value" as part of that collaboration.
That needs to be
assessed. Again, I'm a total radical, but to me, the
university structure itself
is totally broken. I mean, the fact that we have
people and money effectively siloed
into departments that don't necessarily make any sense anymore
in the way that people do research. We solve that problem
in the UC system with organized research units, which bring
people together around particular areas of
interest, but they still get appointed to departments. Students get graduated
through those departments.
The money flows through those departments and not necessarily
all of it going into what is producing the results.
It's kind of a step, but not perhaps as far as we want to go.
There's a lot of issues at stake here.
Changing systems like that is very hard.
I wrote an editorial a while back that says
"How to get promoted as a computational biologist in academia."
It got huge downloads on the PLOS site.
There are a lot of people who are interested.
Essentially, the message was, as I said
in the talk, you have to educate the committee that's reviewing it.
I'll be quite frank, I sit on tons of these committees,
and you have six people making a decision,
with really, probably, two people who know that person's work, one perhaps pretty well.
For the rest,
it goes by where that person has published if they
don't know
the work well enough. Now you might actually use Google Scholar
or ISI Web of Science to actually see
what kind of impact it's had. For a lot of reviewers,
that's as far as it goes. If I produce software,
or I produce these datasets, it doesn't count for anything.
Yet, you know, ask yourself, "What's more valuable? A paper that has only been cited by
the people who wrote it, or a dataset
that has been downloaded hundreds of times and has generated a whole lot of new papers
and science?"
It's kind of a no-brainer to me. We had this great idea
when we were discussing this at the Gordon and Betty Moore Foundation.
The way to get attention in institutions is with money.
The idea was we would actually create chairs within these institutions.
I guarantee there are probably people that you know
that probably don't get the credit in this institution that they should
because they do things that are not quite traditional,
yet they are absolutely an integral and an important part of the fabric
of research and education. The idea was to identify some of those people
and give them chairs, like an actual named chair
and to elevate them into a position
where they would be recognized by their peers.
I think this is particularly true
in data in the digital realm because
these people underlie what is absolutely critical. Some of them are doing absolutely
amazing work, but they're also maintaining resources for others and that
sort of thing.
They don't get quite the credit they deserve. Another
long-winded diatribe, sorry. Goetch: Any other questions?
We still have a little time. Before we end, I just wanted to remind you we have
another program tomorrow
called "Open access in your publications: what's copyright got to do with it?"
It's in Hale room 407 beginning at 1:30.
This is a webinar with copyright authority
Kenneth Crews, who will be talking about how you can facilitate
access to your materials by learning to be proactive.
Please join me again in thanking Dr. Bourne.
Thank you for having me. It's wonderful.