>> Tony Kerlavage: Good morning everyone, and thank you for joining us today
for the NCI CBIIT Speaker Series.
My name is Tony Kerlavage; I'm the Branch Chief
of Informatics Programs at NCI CBIIT.
This speaker series is a knowledge-sharing forum featuring internal
and external speakers on topics of interest
for the biomedical informatics and research communities.
I'll remind you this presentation will be available on the wiki
for the speaker series as a screen cast with a voice over,
and also posted on the NCI YouTube Channel.
If you just Google the NCI Speaker Series, you'll find that Wiki page.
Also information about future speakers is available via Twitter and on our blog,
so check out those sites for the latest information.
Find the blog, Google NCI blog, and it will be the first search result,
and our Twitter handle is @nci_ncip.
Today, we're happy to welcome Dr. Cameron Neylon,
who is the advocacy director of PLOS.
Dr. Neylon previously was Senior Scientist in Biomolecular Sciences
at the Science and Technology Facilities Council in the U.K.,
and the title of his presentation is "Network Ready Research,
the Role of Open Source and Open Thinking," and with that,
I'll turn over the event to Dr. Neylon.
>> Dr. Cameron Neylon: Thank you very much for the kind invitation.
Just check that you can hear me okay and the network seems
to be telling me that I'm on audio, so that's good.
>> Tony Kerlavage: Yes, we can hear you just fine.
>> Dr. Cameron Neylon: And the presentation is up and clear?
>> Tony Kerlavage: Yes.
>> Dr. Cameron Neylon: I always like to check that before I start, rather than sort of heading off into the distance, as it were.
But, thank you very much for the opportunity to speak.
What I wanted to do was talk about a framing of how I think about why things
like open access and open data and open source
are particularly important in a world where we have the web,
and particularly important in a world where the kind of technology
that we're using today can connect us across, in this case, the Atlantic Ocean,
but obviously globally, and with a wide range of different people.
It's also nice to be here because you may recall the original scheduling
for this talk was actually somewhat disturbed
by the government shutdown last year,
so it's good to be back, and to not be shut down.
I work for an open access publisher.
I just need to get my slides to advance.
I work for an open access publisher, and you can see on the front of this slide,
on the title slide, that there's a license given.
[Speaker had brief problem with slides here.]
Right, we're back. Hopefully, that's working. So, the point I wanted to make is that
I'm not just going to talk about open access and open data in the abstract.
I hopefully will make this a demonstration of why I believe personally
that making the work that I do available makes sense,
not just out of a public good agenda, but also out of my own self-interest.
If you help do the work of transmitting my ideas, then by being open,
I can do that more effectively, and that's going to be a good thing.
So, I put the word Networks in the title of the talk,
and there are lots of different kinds of networks.
On this slide, I'm showing two different networks that both happen to come from open access papers.
But, the point being that there are things that we know about networks
that transcend the specific fact that the one on the left is to do
with gene regulation in E. coli, and the one on the right is actually to do
with social interactions in ants.
The point being that networks, as a model of systems,
can be particularly powerful, and that we can tell things about systems
that are networked beyond what we can tell about any one specific system.
It's more or less reflexive, when one talks about networks on the internet,
to put up a diagram like this one. This is the ARPANET in early 1977,
one of the early predecessors of the Internet,
when you could still map out the full set of computers that were on it.
The Internet and the web that builds on it, have changed commerce
and communication, and politics, and economics through the provision
of this new network infrastructure.
The system that's available to us to use.
But, this isn't a story that's specific to the internet.
There are earlier networks. These are telegraph lines in the mid 19th century,
and the telegraph also changed economics and politics, and systems of business
and systems of communication.
The fact that you could assume that from many places
in the world you could contact people in other parts of the world
within 24 hours was a radically new capability,
and it changed the way that things could be done.
Of course, we can go back further than this.
This isn't actually a map of rivers in 18th century France.
It's a map of stagecoach routes, and again, the fact that people and mail
and objects and commercial objects could be passed along these new network
infrastructures changed the way that European and U.S. politics
and connections worked, because things could move freely and reliably in a way
that hadn't really been the case prior to that.
So, anything that I say about networks,
anything that we draw out in the abstract sense of networks,
we also have to consider this isn't necessarily a new thing.
It's something that, we've been through several of these revolutions, created
by the communication networks over the past several hundred years.
Indeed, you could argue going all the way back to the invention of writing,
and that that's a similar kind of network technology.
Here's a more recent kind of network.
This is the network that emerges if you mine the metadata from my Gmail account,
using an interesting service run out of the Media Lab at MIT.
What I can say with some confidence, which is a good thing,
is that the large circle in the top left-hand corner is, in fact, my wife,
though you may argue, of course, that the fact we're communicating back
and forth by e-mail rather than face to face might be an issue,
but it's interesting that this network connects a whole series
of different groups of people that I connect with in my life.
There's a group of PLOS people.
There's a group of online research communication people, open data people,
and then some interesting people who connect all of these things up,
and then some people on the outside who aren't connected to my work life,
but are connected in other ways.
Here's another network on another communications system.
This is just people who were interacting with me or the subjects I was talking
about on Twitter on a particular day,
and this can tell you about what I was saying at a particular time,
who I was interacting with, what they were saying,
what the subjects of interest were at that particular time.
One of the interesting things about the graphs that I've drawn so far is
that they are mostly of a tree form;
they mostly fork in their architecture.
There are other networks, obviously, where we have what I'm going to come back to
and talk about: diverging and re-converging on a network.
This is the commit graph for the IPython Project,
which some of you may well use, and some of you may well contribute to.
But, it's an online open source community where there's been a process
of people forking a code base over time, and this is a particular period
where they're moving towards a major release.
Those various forks are being merged and pulled back together,
and to create the final product, which is the release,
which is somewhere off on the right-hand side of the screen.
So, we need to think about networks that branch,
as well as networks that re-converge.
So, there are lots of different networks and there's lots of research work
that tells us about the mathematics, the sociology,
the systems which underlie these networks, and the question for me,
as we start to focus down onto the questions
of research communication, is this:
what can we learn from all of these different networks that we can apply
in the context of effective research communication and research management?
And obviously, what can we do to exploit the capability that the web brings us?
Which is, to my mind, a qualitatively different network infrastructure because of its scale.
It is, in a sense, the largest, most densely connected,
most frictionless network infrastructure,
communications infrastructure that we've ever had,
and my argument will be that this changes our capacity to do things,
and we need to figure out where and how those opportunities lie.
So, here's another network, and I'm not sure how well this is going to come
over the wire, because it's a video,
but the point is in the end state at some level anyway.
What I'm doing here is simulating a very simple percolation network.
In fact, I borrowed the code, which was open source off the web.
I did a search for percolation network in Python,
and adapted it to my purpose, and what I'm doing is just simulating.
I'm creating a square grid, and then within each position on that grid,
I put a random number between 0 and 1, and I raise a threshold and say,
if the number in that grid position is smaller than the threshold,
then it's connected to the four positions around it.
So, this is a very simple model
of a plain two-dimensional square lattice percolation network,
and I've randomized each time,
and I'm rerunning this with different resolutions, different numbers of nodes,
and the point I want to make is twofold.
This is a random network, so I can't predict in any detail what's going
to happen at any particular point on this network,
but if you look at the graphs that I hope are appearing
over on the right-hand side of the screen, which represent, on the bottom,
the size of the largest cluster that's in that network, and at the top,
the number of clusters, then what you can see is actually the overall behavior
of the system is very predictable, and so we have a system which is random,
which we can't predict, in the sense in which we think about research.
We know we can't predict outcomes, but with the right way of thinking about and abstracting the system,
we can understand what the overall behavior looks like.
So, so far, that's just what science is, you know, to be honest.
One thing I want to draw your attention to is in that bottom graph
on the right-hand side, which is the size of the largest cluster,
so you might think of this as analogous to the ability of a piece of data
or a piece of information, or an idea to flow through this percolation network.
And, as we increase the potential where any given position is connected
to the positions around it, actually, not a lot happens for quite a while,
until we raise the probability to some sort of threshold level,
and then suddenly the size of that largest cluster grows very significantly,
and so there's a threshold behavior.
There's actually, for those of you from the physical sciences, a disorder-to-order phase transition in the system,
which changes the characteristics of the system,
changes the way in which things can flow through this percolation network.
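The simulation described here can be sketched roughly as follows; this is my own reconstruction in Python of a simple site-percolation model on a square lattice, not the speaker's actual borrowed code:

```python
import random
from collections import deque

def largest_cluster(n, p, seed=0):
    """Site percolation on an n x n square lattice.

    Each site holds a random number in [0, 1) and counts as 'open'
    (connected to its four neighbours) when that number is below the
    threshold p. Returns the size of the largest connected cluster.
    """
    rng = random.Random(seed)
    open_site = [[rng.random() < p for _ in range(n)] for _ in range(n)]
    seen = [[False] * n for _ in range(n)]
    best = 0
    for i in range(n):
        for j in range(n):
            if open_site[i][j] and not seen[i][j]:
                # Breadth-first search to measure this cluster.
                seen[i][j] = True
                size, queue = 0, deque([(i, j)])
                while queue:
                    x, y = queue.popleft()
                    size += 1
                    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < n and 0 <= ny < n
                                and open_site[nx][ny] and not seen[nx][ny]):
                            seen[nx][ny] = True
                            queue.append((nx, ny))
                best = max(best, size)
    return best

# Sweeping the threshold shows the transition: below roughly 0.59 (the
# known site-percolation threshold for this lattice) the largest
# cluster stays small; above it, a giant spanning cluster appears.
for p in (0.3, 0.5, 0.7):
    print(p, largest_cluster(60, p))
```

The sudden jump in the largest cluster size as p crosses the critical value is the disorder-to-order transition the talk describes.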
What I want to argue is that these same kinds of discontinuity
and these same kinds of transitions are available to us
in a much more complex network of our research communication spaces.
Now, we don't do science on a square lattice.
We don't have simple probabilities to change the degree
with which we're connected to people.
Obviously, the situation is much more complicated,
and I'm not going to make the strong case that this is a model
rather than an analogy, at least in the context of this talk.
But, I think there are ideas that we can draw from this,
parallels that look similar to some of the things that have happened
through internet communication.
So, here's an example.
Tim Gowers is a mathematician, and some of you may know him as one
of the world's leading mathematicians, and a holder of the Fields Medal.
He is also a person who is interested in collaboration in mathematics,
and so he was interested in this question,
mathematics being a discipline traditionally dominated by the single person
with pencil and paper in their office,
of whether a large scale collaboration could speed
up the process of mathematics.
And, being of an empirical mindset, despite being a mathematician, he proposed an experiment;
he proposed a particular mathematical problem, a proof of a conjecture,
which he thought was interesting.
And, said, okay so I'm going to suggest a framework online,
where people can collaboratively contribute to the solution of this problem
by critiquing small pieces of proof.
As with any good research, the key at this point is
to make sure that the proposal is set up
so that even if the experiment fails, it's still a success.
The aim is not actually to solve this problem and prove the conjecture,
but to test the means of creating a proof
that he's proposed, and this makes a lot of sense.
It's a very sensible way to approach it, because in a mathematical proof,
each link in the chain of reasoning has to hold up,
so people attacking each of those links can work on them
and look for issues in a parallel way.
It's also worth noting that one of the things he said is
that he thought this task would take him personally, again,
remember that this is one of the world's leading mathematicians, about 6 to 18 months to do himself,
to test whether this approach would work.
What's interesting is that six weeks later, he declares the problem solved.
So about 150 mathematicians, some of them Fields Medalists
and world-leading experts, some of them school teachers,
have combined their expertise and their interests
in the comments section of a blog.
This is actually quite an old example now, and it actually solved the problem.
In fact, they solved and proved a more general case of the problem
in a way different to that which Tim had proposed.
So, the point I want to make is that this has gone from a problem
which would have taken a world-leading mathematician 6-18 months to show
that his original ideas weren't going to work, to taking a collaborative group
of people who have networked themselves together
to actually solve a more general case of the problem.
This feels like a step change in the capacity to be able
to solve this kind of mathematical problem.
Second example, just quickly, Galaxy Zoo.
I'm sure many of you have heard this story, in terms
of different citizen science experiments engaging people in the conduct of science.
There are other good examples of this.
But the point I want to pick out about this is that this was again a problem
of dealing with scale, a problem of dealing with the fact
that there were a million images of galaxies available.
The data was available to be used, but there was no computational system
that could classify those images into different galactic shapes,
and this was an important part of solving problems,
in understanding and modeling the evolution of the universe.
What the Galaxy Zoo Team did was take a network infrastructure, the web,
and the ability to be able to push images to people in a browser,
and the fact that people were good at solving that particular image problem,
and that they could push the data back.
And in doing that, they changed the scale at which this kind
of research could be done, and they changed the expectations of what it takes
to get published in this space at some level
by substantially raising the statistical quality of data acquired
to make a claim in this space.
And, I want to come back to the comment Tim Gowers made
talking about the Polymath Project.
He said that this felt as though it was to normal research as driving is
to pushing a car, that there's a real sense of being able
to do something different to what you were able to do before.
And, my point is not that we can do this across all disciplines and science
and across all the problems we have. My point is,
that if we can do it in some places, if there are opportunities available to us
to radically increase our ability to solve research problems,
then we need to understand where those opportunities are.
And, I want to also argue that both of these projects worked in large part
because they were open for anyone to contribute to,
and for anyone to see that people were working on it.
So my argument is, in part, that successful open projects find, sometimes through luck,
sometimes through design, these step changes, these order-disorder transitions,
and discover the place where the opportunity is there
to do something qualitatively different at a different scale. Or to put it a different way,
if we can figure out where those potential network transitions are,
we can exploit them by adopting the right kind of open approach.
So, I've used the word "open," and open is a fairly contested term.
So, I want to talk a little bit about what I mean by it,
and why I mean that, and where I'm coming from with that.
And, we start, often when we talk about open access or open data,
by thinking about it as opening up access to things that already exist.
So, things that we already have, papers, datasets, and making them available.
But then, what do we mean by "available"?
There's certainly a question of what we mean by open.
We're talking about sharing what we already create,
and we can pick out some of the various definitions of openness out there. The open source definition
talks about allowing anyone to do anything in a specific field of endeavor.
The open definition looks at not discriminating
against any person or group of persons.
The one that I continue to use as my source
of inspiration is the Budapest Open Access Initiative,
which to me has the most detailed and some of the most compelling language,
and I have just picked out a piece of it here.
By open access, we mean permitting any users to use,
in this case, peer-reviewed research papers, for any lawful purpose.
So, you are talking about allowing anyone to use anything for any purpose,
and of course, you're also saying in most cases,
you need to be able to say they can continue to do this.
You can't just take that right away from them.
But, this way of defining open tends to come down to questions
of licensing: are you using the GPL versus the Apache license?
Are you using Creative Commons attribution licenses, [inaudible]?
And this seems to get us bogged down in a very legalistic understanding
of what we mean by it, so I want to propose instead a way of thinking
about being open, which is the more general thing to focus on.
So, what do I mean by being open? And I go back to those examples, Galaxy Zoo
and the Polymath Project, and open access and open data efforts often start
from the notion that my work can help someone in a way
that I haven't thought of, or can reach a person that I haven't thought of,
whether that person is a patient, an educator,
a policy person, or whatever they might be.
The notion that just making what I've always created anyway available
can help someone that I wouldn't have otherwise gotten in touch with.
And this equation is probably a dangerous thing to put in this context,
because people will perfectly reasonably critique
that the equation is not very well set up,
but we'll talk about what it probably should look like later perhaps.
But there's a set of terms that interact here in a way I think is helpful
to think about, and that is, we might maximize the probability of helping someone,
and that's a function of some sort of notion of the number of people out there
who could actually use this work, at least in the form that it's in at the moment,
multiplied by the number of people you actually reach.
It doesn't matter how many people could use your work if you leave it
in the desk drawer and no one is ever going to see it.
And, also the usability; the degree to which someone can actually make use
of it, and that can be in legal terms.
It can be in access terms.
If someone can't access it, then they can't use it.
If someone is not legally allowed to use it, then that's a friction to them.
And, if they're not technically able to use it because you haven't marked it
up properly or there's insufficient metadata or the format's not documented,
then those things are going to make it much harder for someone else to use.
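The rough relationship being sketched might be written, very loosely, like this; the function and all of its names are my own illustrative notation, not the talk's actual equation:

```python
def expected_helped(potential_users, reach_fraction, usability):
    """Toy version of the talk's 'dangerous equation': the number of
    people who could use the work in its current form, times the
    fraction of them you actually reach, times a 0-to-1 usability
    factor folding together legal, access, and technical friction.
    All names here are illustrative assumptions, not from the talk."""
    return potential_users * reach_fraction * usability

# Openness mostly acts on the reach and usability terms: work left in
# a desk drawer reaches almost no one, however useful it could be.
in_drawer = expected_helped(10_000, 0.001, 0.2)
on_the_web = expected_helped(10_000, 0.10, 0.8)
print(in_drawer, on_the_web)
```

The point of the toy form is that the terms multiply: a large potential audience counts for nothing if either reach or usability is near zero.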
The flip side of this, and I think it's equally important,
and I think the open source community is much further
down this road in this kind of thinking,
is not just that we can throw things over the fence
and other people can use them,
but someone can help us by contributing back to what we are doing.
And again, there's a similar kind of set of things we can think about.
How many people are out there who could in principle help us?
How easy do you make it to contribute, and what is the number of people
who are aware that you have an issue that they might be able to help with?
And given that the number of people who can help
or the number of people who could be helped is in some sense fixed,
I guess we could probably argue about that as well,
but to a first approximation, let's assume that's a fixed quantity,
then our objective is to reduce the friction as far as possible,
and to reach as many people as possible.
And on that friction term, definitions of openness
have focused very much on the legal friction.
And, services like SourceForge and GitHub and Git and other version control systems
in the open source space have focused on the technical issues,
but in the research literature, we've done very little,
and in the research data space as well, to be honest, I think,
we've not done anywhere near as good a job of thinking
about the technical friction to use, as well as the legal friction.
So, for me, being open and thinking about this as an opportunity
to exploit the network infrastructure we have is essentially acting
to reduce the friction to use or the friction to contribution as much
as possible, and to maximize the number of people we reach.
And one way to do that, at least at a crude level is to make sure
that everything goes online, that the licensing is appropriate,
and that you use the standard format, just to use a very simple example.
And, I think it's important that we think beyond just talking
about giving stuff away that we've already created.
But, if we're really going to take the opportunities that networks bring us,
then we also need to think very carefully about how we reduce friction
in an appropriate way for incoming information,
as well as for the outgoing information.
So, again to return to the open source analogies,
we need to both enable forking of systems to someone taking it away
and using it in their own context.
But, I would argue we could do a lot more I think, if someone has gone
and taken a research article, and written a lay summary of it,
what can we do to bring that lay summary back into the context
of the research article so that it's available to the person coming
from out of the discipline?
If someone's taking data and putting it somewhere else or correcting it
or doing additional quality control on it,
how can they contribute that data back into the original system?
These are the questions that I think are really worth asking.
So what can we do?
What tools are available to us?
I want to focus here on analogies with the open source space.
And, so we have code repositories as one example, the place where,
if nothing else, you can put code up to make it available, and indeed data,
and people use some interesting stuff in those spaces.
Documentation, particularly in the open source space, is critically important.
If you don't document the code properly, then no one is going to use it,
because they won't necessarily be able to understand it.
That documentation might be inline, or it might be separate,
or it might be in an appropriate structured form associated with that language.
The documentation and being clear about how something works is important,
both for us and for other people.
Virtual machines, both as an analogy and as a reality, I think are a very interesting way
of making things more easily usable by other people,
but I'll come back to questions about that.
Continuous integration is really a fascinating space where the ability
to keep checking whether a whole fleet
of things works together is really quite exciting.
It's exciting if you're building up a code base and you want people to contribute to it,
and you want to be able to relatively easily tell whether a piece
of code that's someone has contributed is still going to work.
But I think it's also really interesting to think about the analogy with the research literature.
What would it mean if whenever anyone published a new paper,
the whole literature was tested against that new paper to ask the question
of whether it was compatible or incompatible?
What questions does it raise?
Could we do that, and if not, why not?
What are the challenges involved in doing that,
either on a small scale or on a large scale?
As for unit tests: in my background in experimental biology, we call these controls.
They're essentially the same thing.
They improve usability and reduce the friction of use.
If I can download a code base, and check that it works,
that all the tests pass when I've done that,
I'm going to be happy to go and use it.
And so if I take a technique from a paper, and I run all the controls,
and all the controls pan out the way they're supposed to,
then I'm going to be happy to use it.
Those things are ways of reducing the friction to use,
and indeed, to contribute back.
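The "controls" idea might look like this in practice: a hypothetical analysis function shipped with a paper's code, alongside the checks a downstream user can run before trusting it. All the names here are illustrative, not from any real project:

```python
def normalize(values):
    """Scale a list of non-negative measurements so they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize an all-zero series")
    return [v / total for v in values]

# The 'controls': quick checks a downstream user runs before relying
# on the technique, just as a biologist would rerun a paper's controls.
def control_sums_to_one():
    assert abs(sum(normalize([2.0, 3.0, 5.0])) - 1.0) < 1e-9

def control_rejects_all_zero():
    try:
        normalize([0.0, 0.0])
    except ValueError:
        pass
    else:
        raise AssertionError("expected a ValueError")

control_sums_to_one()
control_rejects_all_zero()
print("all controls pass")
```

If the bundled checks pass on my machine, the friction to adopting the technique drops, which is exactly the role controls play in an experimental protocol.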
And issue trackers, again, are interesting: the idea of raising bug reports
against the literature is something that lots of people have talked about.
We've not really figured out a good way to manage this in the long term.
It obviously interacts in an interesting way with issues of prestige
and credibility, and with perfectly natural concerns
about whether it would be done in a fair and reasonable way.
And then there's the fact that we can push our stuff onto the network, the cloud,
whatever term you feel is appropriate; that notion,
that whole concept that there is an infrastructure that lets you spin
up an instance of something.
That lets you check whether this code generates that data from that data,
that lets you check whether this is useful to you or not.
Or lets you check whether you've done something
which is useful to contribute back.
That notion that there is an infrastructure that supports this,
that sits outside of us and allows us to scale things to a high level,
I think is really interesting.
It's something we can adapt, you know, in a wider research communication space.
But all this raises a question,
because basically I've suggested all these things
that you can do, all these different things that you could do
to make things a bit better.
But, we have a limited amount of time.
We have a limited amount of money, and so we have to make assessments
about where to put the resources into making things of our own.
I want to come back to these questions at the end,
because there's an interesting debate going on.
I just picked out two blog posts here
that I think illustrate this.
One is slightly out of order, but I'll start with the one at the back.
The Recomputation Manifesto by Ian Gent was an interesting piece of work
that suggested that whenever there was a computational result published,
that the minimum level of work that should be done to make it reproducible is
to provide a VM that contains the data and software that allows you
to recreate the original claimed result.
So, that's a kind of baseline level of reproducibility
that could reasonably be expected, again, maybe there's some debate there,
for specific kinds of computational result.
I'm not talking about global circulation models obviously,
but for a lot of what's done at smaller scale in bioinformatics, this would be perfectly feasible.
Titus Brown, who again, I'm sure many of you know, had a different view.
I think it's a very interesting one, and his view was that the creation
of a black box virtual machine that reproduces,
recomputes a result doesn't really do very much to make those tools reusable
for other people to repurpose them in different settings.
Now, both of these things take work.
Taking your code base and making it modular, and having it all properly tested,
and making sure the parts are independent,
and making it well-documented so people can take the pieces
and use them out of context is hard work.
Building up a VM that regenerates the result that you probably got
in a somewhat haphazard fashion on those systems that you run
on your own computer is also going to be hard work.
And, sometimes we have to make choices about which one
of these is more important.
The point is that they're both right.
It would be great to do both of these things,
because different audiences for this work need different kinds of supports,
different kinds of supports to be able to use it.
And of course, that's before we start to talk
about supporting the unexpected use,
where we don't even know what it looks like.
How do we choose the most effective way to share, to reduce friction,
in a world where we have limited resources to play with,
and we're not necessarily sure what the downstream potential use of our work is?
And that's the real challenge.
There are some things we can do easily.
There are some things that are harder,
and there are some things that are really quite expensive,
either in time or in money.
And those are the choices that we face in terms of choosing how to try
and enable others to use and to spread, and to communicate and reuse,
and reapply the work that we do.
Which returns us to that same question.
We have limited time, limited money, so what are the choices that we can make,
and how would we think through that process
of deciding what are the right choices to make?
I'm going to make some suggestions, and they're only suggestions.
One: if there is something to do which is near zero cost, and it reduces friction
or increases the number of people who can access something, or use it,
or are aware of it, then that's a no-brainer; do it.
This is the "just stick it on the web" argument.
It may not be very helpful, but it takes very little effort,
and at least someone might find it via Google.
This is the argument, at some level, for open access: there's plenty of money
in what we pay for subscriptions at the moment
to completely cover the cost of open access.
We're not short of money.
We should just be doing that, because the money is there to make it available,
and to make it reusable, at least in the narrow sense
that the research pays for itself.
But that's the easy case.
What happens if it actually costs money or costs time and effort?
So, say it does cost effort, but it helps a known audience,
part of your target group for this work, likely to be your colleagues,
potential downstream users, either in industry
or the health sector perhaps, so you're helping them.
This is part of your core business anyway, but it also reduces friction
or makes things more available for everyone,
so that's a good thing to do.
That's something where it makes sense to build the cost into projects, and again,
things like good thinking around data provision and metadata provision
directly help the people for whom you're creating or curating data,
but can also help a lot of other people.
Anything of that nature falls fairly far along that line, and makes sense to do.
The other thing that makes sense to do, though it's harder,
is simply to make sharing easier in the first place:
to move the difficult parts of the work upstream into the tools themselves,
so they have these open approaches baked into them from the start,
so that it's very easy to share.
So that you capture the context in which you created a dataset
or used a dataset, in a way that makes it easy to put it online,
and then people can work through from there.
I think there's some interesting work being done with computational notebooks,
and with the kinds of things deriving from literate programming,
that can at least lead us in that direction.
If we can figure out ways to create shared platforms that, again,
spread the cost of the infrastructure,
that's a really valuable way to reduce those costs.
And again, if the open approach is baked into the platform,
then you get those benefits for free.
And if we can share the load of curation,
whether that's crowdsourcing curation or combining datasets together,
whatever it might be, sharing that load
of management and curation is a crucial part of it.
So that's fine, but there's something,
if we're going to create this new infrastructure, this new system,
there's a bit of a problem.
We have something missing here, which is to do with incentives.
It's to do with the fact that building these infrastructures
and these platforms, is generally not the way to get ahead as a researcher,
at least as an academic in a university setting.
Although it's a bit harsh, no one ever really got a Nobel Prize
for building an infrastructure, because the prize goes to the people who used it
to do the science, and we can argue about whether that's the right division of credit.
So, we have a problem.
We have a problem of incentives.
So, how do we create the incentives that help us to create the systems,
the platforms, the infrastructures,
and the behaviors that make it worth people's while to invest time
and effort in reducing friction and reducing work and reaching more people?
And that's a hard question, because we've got things that are pulling
from two different directions here.
There are several different places along the chain
where we can address these issues of incentives.
One way is to take our existing incentive system and hack it,
basically trying to use the system itself to change the incentives.
Some of you may have been involved in this; I was part of a project
that tried to do exactly this, in the world of open source
and high-quality software production, with a journal
that we tried to set up called Open Research Computation.
The idea was a very cynical one, frankly.
The idea was that we would put together a journal where the requirements
for publication were extremely stringent
when it came to documentation and testing.
So, we started off with the requirement that you basically had to explain
to us why your test coverage wasn't 100%.
You had to provide detailed worked examples of all the data that you present
in the paper using the software that you're describing.
Software had to be open source, it had to be properly documented, and the idea,
as I say, was quite cynical.
It was that if you publish a journal with those kinds of requirements,
then the people who put that kind of work into their software are,
in significant proportion, the people building software
that's used by a lot of people, and who will therefore cite the papers
that describe that software a lot, meaning
you'd end up with a journal that had a very high impact factor.
To be blunt about it, a number which a software person could take
to their head of department in biology or chemistry or wherever it might be,
and say, I'm published in this journal, and get on with doing the science.
It's not too difficult to figure out that
if you're publishing 50 papers a year, and 10 of those have 200 citations each,
then you've got a pretty high impact factor.
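[Editor's note: the back-of-envelope arithmetic the speaker describes can be sketched as follows. The numbers are the illustrative ones from the talk; a real Journal Impact Factor is computed over a two-year citation window, which this toy calculation ignores.]

```python
# Back-of-envelope impact-factor estimate using the talk's illustrative numbers.
# A real impact factor counts citations to the previous two years of papers.
papers_per_year = 50
highly_cited = 10        # papers with ~200 citations each
citations_each = 200
other_citations = 0      # pessimistically assume the rest are uncited

total_citations = highly_cited * citations_each + other_citations
impact_factor = total_citations / papers_per_year
print(impact_factor)  # 40.0
```

Even with 40 of the 50 papers uncited, the handful of heavily cited software papers alone would give the journal a very high average.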
Now, as I'm sure several of you are aware,
the problem was rather at the other end.
There simply isn't that much software that's built to those requirements.
I think this is worth trying again.
I think there are ways of pursuing this kind of approach
to hacking the incentive system, creating journal-like things,
or forms of validation or celebration that give people badges or prizes showing
that they are doing really good work, but it's hard
to get the balance right.
We could continue, as I'm sure several of you are,
working on policy and slowly building up the tools, but you know,
I've been involved in this for 10 years now, trying to improve the systems.
And, I'm sure many of you've been involved in many of these efforts
for a lot longer, and this gargoyle has been sitting on the side
of Notre Dame Cathedral in Paris for about a thousand years,
and it's beginning to feel a bit like if we just expect things
to eventually fix themselves, we're not going to get there very fast.
So, my argument is that what we, and organizations like yours
and your networks, can do is build these community architectures.
We can try to build infrastructures where the funding exists,
where openness is built into the foundations,
where the systems are interoperable, with the architectural principles
that you all know make it easy to use and reuse the work:
reducing the friction in the system, and supporting the sharing of all the objects
that we create, ideally with the context in which we created them,
captured as we make them in the first instance.
But it's not just about pushing things out: how can we build systems
that help us accept contributions, and help us tell
which contributions are going to be useful to us?
I go back to that thing of continuous integration.
We don't want to create systems where anyone can come
and make suggestions which then take up more time than we gain
from the small percentage of those that are actually directly useful.
But systems like continuous integration,
where you can prefilter incoming pull requests to tell
which ones are compatible with the overall system,
offer a model that I think is potentially very powerful to think about.
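[Editor's note: the prefiltering idea the speaker gestures at can be sketched as a small pipeline. All names here are hypothetical; the point is only that automated checks run first, so humans review only contributions that already pass.]

```python
# Minimal sketch of CI-style prefiltering of incoming contributions:
# run each contribution's automated checks, and only surface the ones
# that pass for (scarce, expensive) human review.

def run_checks(contribution):
    """Run all of a contribution's automated checks; True means 'compatible'."""
    return all(check() for check in contribution["checks"])

def prefilter(incoming):
    """Split contributions into those worth a human's time and the rest."""
    passed = [c for c in incoming if run_checks(c)]
    rejected = [c for c in incoming if not run_checks(c)]
    return passed, rejected

incoming = [
    {"id": "patch-1", "checks": [lambda: True, lambda: True]},
    {"id": "patch-2", "checks": [lambda: False]},   # fails its automated tests
    {"id": "patch-3", "checks": [lambda: True]},
]

passed, rejected = prefilter(incoming)
print([c["id"] for c in passed])  # ['patch-1', 'patch-3']
```

The design choice is the one the speaker highlights: the cost of triage is paid by machines, so opening the door to many contributors does not consume more curator time than the useful contributions return.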
We can't do it at the moment.
The architectures aren't there, but how might we think about creating them.
So, the question I come back to, if you accept the model I proposed
at the beginning, is this: there are potential transitions in the state
of a network if we reduce friction past some threshold.
And our network is not a square lattice.
It's a multidimensional space where the connectivity is very complex.
So, the question we should be asking ourselves is: where do those opportunities lie?
Where are the points of nucleation that are close to a phase change,
where the resources that we could put in, or have available to us today,
can create these changes in capacity?
Where are we close to these phase transitions?
And how can we tell, how can we measure,
how can we figure out where to put the resources that we do have,
particularly in the world we all work in at the moment,
where those resources are limited?
And what I want to suggest is that that is a problem for people
who study networks: perhaps people who study gene networks or ecological networks,
or control networks, or susceptibility to disease, the kinds of things that you
and the people you know and work with are working on.
So, what I'd encourage you to do is think about how the tools and the approaches
that you use successfully in the space of bioinformatics
can be applied to the bigger problem of understanding where the opportunities
are for us to change our research communication systems,
our research communication networks,
for the purpose of finding those opportunities
and changing our capacity to actually do research.
And, I'll finish there, and if anyone's got any questions,
I'm very happy to take them.