K-State Libraries - Open Access Week 2012 - Dr. Philip E. Bourne

Good afternoon, everybody. My name is Lori Goetsch. I'm the Dean of Libraries. Most of you know that, but we have a few guests in the room who may not. Welcome to our Open Access Week activities. I'm very pleased to have the honor of introducing our speaker this afternoon, and I will tell you a little bit about him. Our speaker is Dr. Philip E. Bourne. Dr. Bourne is a Professor of Pharmacology at the Skaggs School of Pharmacy and Pharmaceutical Sciences at UC San Diego. He is also the Associate Vice Chancellor for Innovation and Industry Alliances, Associate Director of the Protein Data Bank, and co-founder and founding editor-in-chief of the open access journal PLOS Computational Biology. He is a past president of the International Society for Computational Biology, and an elected Fellow of the American Association for the Advancement of Science, the International Society for Computational Biology, and the American Medical Informatics Association. Dr. Bourne is an advocate for open access, and his research and professional interests focus on biological and educational outcomes derived from computation and scholarly communication. His areas of interest include algorithms, text mining (text mining, I think this is supposed to say), machine learning, metalanguages, biological databases, and visualization applied to a multiplicity of problems including drug discovery, evolution, and cell signaling. As an open access advocate, Dr. Bourne is committed to furthering the free dissemination of science through new models of publishing and better integration and subsequent dissemination of data and results. Please join me in welcoming Dr. Philip Bourne to Kansas State University.
Well, thank you very much for having me here. It's a great pleasure to be here.
It's interesting, we were just having a discussion, in fact, about the gender balance in the room. It's not actually a balance, which is completely opposite to when I gave a talk to the editors of the American Chemical Society, where the balance was the other way around, and they treated me a lot worse than I know you will. I can see that you're a friendly, definitely a friendly, audience already. So maybe I'll just try and tell you a few things that I've been thinking about for some time, as many other people have. I was asked to put more of an emphasis on data, and open data, and what the impact and implications are. I'm primarily just a faculty member who's interested in pushing this, so that's my perspective. My thing's not working anymore; it was before. Well, I can't seem to change the slides at all. I could just talk about this one slide for.... (Speaking to self) Sorry, let me just restart this. (Speaking to self) That is very strange. (Speaking to self) Well, there are other ways of doing this, which don't seem... (Speaking to self) Whoops, we don't want to look at my email, sorry. (Speaking to self) You have enough of your own, you don't need any of mine. (Speaking to audience) So let me just tell you a little about my perspective. As was already made clear, I work in the biosciences, and it's really that perspective I bring to this. I run a couple of major databases and we distribute a lot of information. We estimate it at a quarter of the National Library of Science, yes, the National Library of Science, going to investigators every month. Based on that, I have a certain perspective on data. Now the new role I have is looking for innovation within an institution. I've started to think more about what the institution itself, each of our own institutions, should be doing around data.
(Inaudible mumbling) So, in all of this, and I'm a big advocate for open access, I always use a caveat associated with that, and that is the notion that there must be a business model. You can't have sustainability without a business model. As you know, there were a lot of questions about open access, and whether a business model could be sustained, when things started off, but it clearly is sustainable, no question about that. Organizations like the Public Library of Science, which I'm involved with, have already proven that. I also acknowledge that every discipline is different, and, in fact, just because something works well, say in the biosciences, it doesn't necessarily mean it's going to work in other places. What is clear, in my mind at least, is that it's not a question of if we can have open access. We already have it. It's a question of when and how it is going to be the predominant form of scholarship, and those are still open questions. But it's clear, if you talk to leading editors of the major closed-access publications, I think they will each agree that, at some point, there will be models that provide at least an open access option, if they don't already. So we're getting there. The momentum is increasing. I think where we need to have much more effort, and where things will really accelerate, is when we use the open access content in ways that actually increase our understanding and increase our knowledge base. I don't think we're doing that at the moment. Most of what we do with open access content is just to have access to it across a broader spectrum and for a larger group of people. But we also have, with that content, the ability to use that content. I don't just mean text mining, although that's part of it, but really knowledge discovery from the corpus, and ways of doing that are still fairly undeveloped.
But we are, as politicians keep telling us, in a knowledge-based economy now. As that tends to grow, I think more and more attention is going to go to extracting data from this available corpus, and that's sort of what my theme is going to be. So let me cast that theme. I learnt at lunch that you're working on an open access policy. The University of California system is also working on an open access policy, so, just to put that debate in its current perspective, let me indicate where we are. What we're moving towards, in our case, is from an opt-in policy to an opt-out policy. So right now, you choose as a faculty member or as a scholar to opt into open access publishing. What many places, Harvard, Stanford and so on, have done is essentially move to an opt-out policy, where you need to make your work open access, but you can opt out if you so choose. I've just sort of summarized the arguments I've already heard, which are familiar to anyone who's involved with open access because they keep recurring. Look at the negatives: the cost for some disciplines. If you don't have grant money, you have to pay to publish. Where's that money going to come from? It's a legitimate concern. Therefore, each discipline needs to approach this differently. The impact on societies, many of which depend a lot on the revenues from journals: what happens to that? So there are well-trodden arguments that are important, and we are currently debating all of this. The notion of journal quality: there is this perception that if you publish in an open access journal the quality is not there. Well, that has, to some extent, been true because, in the biosciences, with the main journals like Science and Nature, there hasn't been an open access equivalent of that. That's now changing. eLife is about to start publishing, and that intends to be on a par with Science and Nature, as an example.
And then there is this notion of there being a "big brother," of being told what to do, and so forth. On the "for" side, of course, I think most people would agree that in a public university, whatever you do, in a sense, should be made public, simple as that. You can cast it in many different ways. What I've been trying to push now, to help this movement move forward, is the idea that there is an institutional perspective and value in this that we haven't even begun to address yet. I'll give you an example or two of that at the end, but really we haven't identified that if we make our research data, as an example, and our research knowledge open, we can do things with it as an institution, as well as in a more global perspective, that haven't otherwise been possible. That, I think, is an example of where open access could really be made to take off. (Inaudible muttering) So why is all this so important to me? Well, here's an example that's just taken from our own work. This is from this database that we maintain called the Protein Data Bank. Now, in the top right-hand corner is a graph that shows the number of H1N1 cases in the United States over time, maintained by the CDC. You can see that it grew quite quickly, then it flattened off. Correspondingly, this shows access to items of data, it doesn't really matter what they are, in a public repository in response to this crisis, and there's a correlation between how this data was accessed and the crisis itself. So what it says is that in a time of crisis, this data becomes very important and is accessed correspondingly. It's very important that data be available. At the same time, the Public Library of Science created a portal, PLOS Currents: Influenza. That was a place to publish information about the epidemic very quickly, with minimal review, so the information could be put out there: not just papers, but also data and other things.
There was a huge spike in activity around this portal during the epidemic. When, of course, that died off, usage dropped back to very low levels. Partly that has to do with the reward system. There was no reward for putting things into that kind of archive when you can publish them, even though it might take a year before a paper appears. But the crisis shows that people were prepared to do other things. These are just some examples of signs that we have the potential, at least, to change. It proves, too, that open science just accelerates the process. There's no question there was an acceleration of discovery by virtue of things like the PLOS Currents portal and access to the open data. In the mode we're in, with presidential debates and everything, what happens almost invariably is you start off with a general point, and then you go into "Well, I met so-and-so when I was having coffee at such-and-such, and they told me...", of course, to drive home the point. So why should I be any different? I'm just going to tell you a couple of stories about people I met that sort of drive home the whole value of open access, and open data in particular. This is someone called Josh Sommer. Just out of interest, how many people know this story? (Inaudible mumbling) He was a freshman at Duke in engineering and he started getting these serious headaches. He went and had a full workup and he was told that he had chordoma. He's lying there in recovery after all these tests, using the wireless in the hospital. He's googling this and he's looking at abstracts in PubMed, and he realizes this is a very serious situation, but he can't actually get into the details of the studies to find out more about his condition. This is just a common situation. What he does discover is that the prognosis is not good. In fact, he had, at best, seven years to live. You know, had it been me, I probably would have started partying heavily and been gone.
What he did was to do something quite remarkable. First of all, he went into the lab: fortuitously, the only NIH-funded chordoma research in the US at the time was also at Duke University. He stopped doing engineering, and he went and volunteered in this person's lab. That's a picture of him in the lab. At the same time, he and his mother formed the Chordoma Foundation. This is one of the earlier meetings that they had. They got a lot of sponsorship and a lot of support for this. What he noticed in the lab was actually quite appalling. (Inaudible mumbling) What he found was that people, even in one lab, didn't communicate and share information quite in the way they should, and sharing between labs was even more problematic, both in terms of data and in terms of knowledge. It was all about getting those publications, which takes a long time. If you only have seven years to live, a year to publication before someone else can even start looking at what your findings are is pretty devastating. He became part of a movement called "Sage Bionetworks," where the idea is to take this kind of model and turn it into this kind of model, using the so-called "Power of the Commons." It's actually creating a collection of information around disease modeling that's accessible to everybody all the time, and accessible more or less immediately. It begins to change, potentially, the way science works. It's almost like a commons becomes the equivalent, in physics, of the Hadron Collider, where you have all these scientists aggregating around, in that case, a large piece of hardware. Here, you have a lot of scientists aggregating around what's actually a virtual space that's full of data and tools and people's commentary about the treatment of a disease. These commons are starting to really take off.
That's important because what he shows, year after year, is a picture of the meeting of the Chordoma Foundation, and then he shows, in circles, the people who have died since the meeting was held the year before. You know, we're talking about a disease that particularly affects young children. The point is we need to accelerate the process for people like Josh. In the same vein, I'm now moving to a different coffee shop and I'm going to tell another story I heard, and this is the story of Meredith. This is another example of why, frankly, I'm standing here. I was editor-in-chief of this journal. About once a week, someone would send a paper, not to the journal, but directly to me by email, thinking they were going to get an easy ride through the review process. Normally, I send it directly to the journal office. This time, for reasons that will become apparent, I decided to look at it myself. It was in pandemic modeling. I'm not an expert in pandemic modeling, but I could see that there was something special about this, both in the way the work was done by a single author, and in the outcome. I sent it to Simon Levin, who is at Princeton and a Kyoto Prize winner. He looked at this and he said, "Now this is really a special piece of work." So I actually advised her to send it to Science to be reviewed. It was reviewed and, in the end, they didn't publish it, but it will appear somewhere like PNAS very shortly. It's a very good piece of work. What's so special about this is that I met with her, because she lived in San Diego, and after I met with her, I invited her to come and give a lecture at UCSD, which she did. I'm sitting there, listening to her field these questions from high-powered academics, and I'm just marveling at it. Why am I marveling at it? Because she's 15 years old. She's a senior at La Jolla High School. This started as a science fair project. She then went and started looking at this seriously.
She wrote to Wolfram and got a copy of their software, a package for analysis. She wrote to the original authors because the data wasn't online anywhere. She had to get the data from the authors. She wrote to the supercomputer center in San Diego and she got thousands of hours of computation time. And she did this thing all on her own. It's just a remarkable story. She's now at Stanford. She is 16 now. She can actually drive. When she first came, her father had to bring her to the lab because she had no way of getting there. So what does all this tell us? It tells us that openness is... obviously she's at the extreme end of the spectrum. I have a daughter the same age, and my daughter is so sick of hearing the story. (Inaudible mumbling) It's an odd case, but clearly, what I see in the students that I see every year, who are primarily graduate students, with some undergraduates, is that some of them are phenomenal. They are part of this Wikipedia economy, the YouTube generation, whatever you want to call it. They know no bounds, and I really believe that these kids are, I'd say, underexploited. I mean, I have two undergraduates in my lab right now who are doing work as good as some of my graduate students. One of them, I'd say, is at the post-doc level. I think they're doing it because they have all this access. It's another example of how we can exploit these assets. We need to tell these stories and really get this to be more common. Let's explore how we might expand on this, with an emphasis on data and where we are going with data. One of the problems, and I'm referring mainly to the biosciences, is that we have these things in silos: the majority of the knowledge is effectively in the literature, and the majority of the data from which that knowledge is derived is effectively in databases, at least the digital parts of it. But they're starting to coalesce in some ways.
You know, they are not fully coalesced, and the reasons are partly technical and partly cultural, in my opinion. The supplemental information in papers has exploded over the last year, mainly because of all this extra data in there. Data journals themselves are emerging. I'm involved with an effort, for example, with Faculty of 1000, where they're actually putting together data gems. It's actually publishing data sets. Why would you publish data sets this way? Because the data itself is not where you get the credit. You know, we are stuck in this system where you only get credit for publishing papers. So what do you do? You write a paper about a dataset, which, of course, has no value in and of itself. It just gives you a metric in the system. I have a paper about the PDB database, which I think is the third or fourth most cited paper in all of biology. What's ironic about it is that no one has ever read it. There is absolutely no reason to read it because it's just a reference to a database everybody uses. But the only way we get credit for doing it is to write a paper, a useless paper, about the database. The system is completely wacky. I'm sure you have the same system here, but maybe with a different name; ours is the "Committee on Academic Promotions," and we use "CAP." I've tried to talk to CAP about this, and breaking that system is very hard. But it has to start with the people being evaluated telling the people who are evaluating them how they should be evaluated. That's pretty hard if you don't have tenure. I'll say anything. The chances of me getting kicked out are pretty small. Well, they were until today, but who knows. I'm digressing. So there is this issue of reward. What we are seeing is a level of coalescence, and the software becoming available. At the same time, the databases are becoming more like journals. They call themselves "knowledge bases," and they try to aggregate more information about that data. You can do science on the fly in these resources.
You can run applications and programs in these resources, and there are people who dedicate their careers to making these resources very valuable: so-called, in this field, biocurators. To me, they are some of the unsung heroes of the kind of science I do, anyway. But all of this leads me to say that things really need to change. Where does this take us? Really, the whole notion, in principle, not in reward, but in principle, is that a paper is an artifact of a previous era. It's not a logical end product of e-science. Work is omitted. If you've ever tried to use supplemental information in a paper, you know you can do it if you look at it by hand. But what you really want is a computer program to go and pull this information in from multiple different papers and make some aggregated conclusions. Good luck. The visualization that you can do on things in a paper is pretty limited. There is nothing wrong with the paper, don't get me wrong, but as you will see in a second, it's really just one way of looking at it. You can't interact with a paper, and that's really how we get to the next step in knowledge: by interacting with what we have already. There are lots of other aspects, like the use of rich media, which we haven't used very effectively so far. That's really where we stand on the paper side. What about the data side? We now have data sharing policies. Many of you know about these because you're either subject to them, or you're responsible for telling faculty and you haven't got the faintest idea what they need to do with these. We all write these grants now with data sharing policies which conform to these wonderful notions, but I almost guarantee, myself included, that anyone who's writing one of these things doesn't even conform to their own policy. I try to, but for reasons, some technical, some cultural, it's very hard to do. But don't worry, no one's really checking.
The NSF doesn't know how to deal with it either, or the NIH for that matter. It does beg the question, but it does take us one step forward, to at least some level of awareness now of the importance of this data. The awareness is growing because big data is taking off. The Office of Science and Technology Policy in the White House put $200 million into Big Data projects. That has now trickled down, and we're seeing a variety of projects at the NSF and NIH around the notion of Big Data. There's more attention being paid to this. I really like this, so I put it in there. Even the private foundations are involved: I've been working with the Gordon and Betty Moore Foundation, where they've been trying to figure out how they can promote the idea of the value of data to science in ways that are not currently being done. We had a think tank meeting at their offices in Palo Alto and they brought in this sort of conceptual cartoonist, which I really like. While you're debating around the room and discussing this, it's actually being captured on a single wall, with all the outcomes associated with that. They then use this final summary, which in this case was asking "Where do we need to be with data in the year 2021?", as the basis for their funding in this area in the coming years. I think what you're going to see is this idea of trying to break this reward system, to reward those people who manage and produce data and use data, not from the publication point of view, but with money to continue to do and develop what they are doing. I think you're going to see some interesting developments from all agencies. You've got to add all these other things into the mix when you start worrying about data. It's not just big data. You've got all these issues associated with reproducibility, maintainability, usability, and the reward we've already touched upon. Reproducibility is, undoubtedly, a myth.
Call it a pillar of science; it's not that something is not reproducible, in my opinion, it's just that the effort that needs to go into that reproducibility is huge, even when you do it yourself. I took a paper that we published about a year ago with four authors in my lab. Two of the authors had already left the lab. What we did was use a workflow system to try and recreate that work, so the next generation of students coming in could actually repeat those experiments and build new experiments in a way that was quite efficient. Well, guess what? We couldn't actually do it. I couldn't do it myself, anyway, because I wasn't the one doing the work, but even the one person left in the lab who knew the ins and outs of the work, all the bits and pieces, realized that without the help of the other two authors who had left the lab, we couldn't reproduce it. I'm being honest about this. I would argue that in many cases that is the norm, not the exception. Maintainability is something we might not even be coming to grips with. DNA data alone, forget about everything else, is doubling on the order of every five months. It is way ahead of the Moore's law curve, which says you can continue, at the same cost, to maintain an increasing amount of data. It's just not going to happen. No one seems to be facing up to the idea that we're really going to have to decide what data to throw away, how we go about throwing it away, and how we choose what to throw away. Usability I'll come back to in a minute, but using data in different forms is not straightforward. There is no tenure for just publishing data. Dreams do emerge, and here's my dream, all encapsulated in one slide. I've shown it so many times I think it's going to be encapsulated on my tombstone. I just want to use it as an illustration of the kind of things I'm trying to say, and it kind of puts it together in one slide. That's this notion that here's a paper, and this is just one view of the knowledge; it's not the only view.
I can actually generate a series of other views, but let's start with the view that's familiar. It has a lot of attraction, it has a lot of value. I actually understand it. It has identifiers I can use. It's not like an interface in an institutional repository, sorry, that basically is unusable. I can use this. I can go from one journal to another and it's pretty recognizable. It's the same thing. You shouldn't throw something like this away. We probably, we definitely, need different forms. Then I do what I do now. I click on a little thumbnail in that online journal article. Up pops a bigger version, and that's kind of where it stops right now. That's useful. In this case, it doesn't really matter what this is, but just for reference, this is the structure of an enzyme. There's a lot of amazing information and author-based knowledge in the way that this is rendered and represented. But it's a static entity. What I really want to do is start playing with that and understanding it better. So what I want to do, but can't do now, is click on this and retrieve the data that was used to make that figure, which is actually the raw data plus a whole series of metadata that manipulates that raw data to give exactly that image. In this particular field, there's no reason why that can't happen already. The people doing this kind of work use a very small number of programs and it's kind of all online anyway. The data that supports this is all in a database that I happen to represent. Once I do that, I can now do things I couldn't do before. I can render this. Then what are the things I want to do?
I want to start mousing over this, and I want to see what the commentary is. I want the social networking aspect of this brought to the fore. I want to know what other people understand about this particular figure when it's rendered in a particular way. So I might be looking at something over here, and then some aspect of this intrigues me. What have other people had to say about it? You know, right now, if it was in PLOS, for example, you can comment on a paper, but people don't do that. They don't do that because there is no reward for it. On the other hand, what's emerged, which gives me some hope, is a very active blogosphere out there when there's something wrong with a paper, though not when there is something right with it. But we are starting to get to that point. There are now people who build a reputation on their blogging, and that's quite an amazing turn of events. They are revered at these conferences and things. It's getting to the point where it's more than just papers. The idea that there's effective commentary on here is coming. Based on that commentary, I see something interesting, and I click on it. That takes me to a mash-up of information about a particular loop in this structure which I'm interested in. I get this mash-up of information about other papers and other data sets that relate to that. That drives me to another paper, and so the loop continues. What I've done is I've changed a few things here. I've got this seamless interaction between the data and the paper, and the knowledge associated with that data, which I didn't have before. Technically, this is already quite doable. It's starting, in a small way, to happen. Once people see this, I think that becomes, at least to my way of thinking, more of a motivator. But it's just the beginning. Here's an example of something that's actually not looking at one paper. This is something that was actually done, and it's just trawling through the literature.
What it is effectively doing is pulling things together. Each of these nodes in this network could be any piece of information; let's say it's a piece of biological information, a gene. If two genes are mentioned in the same paper, you draw a line between them. So you've got this connected network, and you've got this other network over here. In animation mode, this would have all sprung to life. You're getting the whole picture in one shot. The thicker the line, the stronger the relationship: the more times these two genes occur in the same paper. Well, guess what? That alone tells you something important. This network has topology, and when you overlay on that a very simple thing, namely the type of literature in which this occurs, you see you've got these two distinct networks which are connected by one gene. It turns out that one of them is the immunology community, and one community does not really understand what the other one knows about that particular gene. Just by doing these things in a fully automated way, you can start to learn things simply by trawling the literature. This is a trivial example, but it's kind of an expression of where we're probably going. That's the notion of the field of discovery informatics, which I think is just beginning. I went to one of the most exciting workshops earlier this year on discovery informatics, where a bunch of computer scientists, a bunch of domain scientists, a bunch of social scientists, and some librarians were all sitting in a room figuring out how we're going to deal with this future. What's clear is Google is incredibly useful, but it is broad and shallow. When you want to get into the nuances of data and you want to look into a subject deeply, you might start with a Google search, but that's not where you end up. You probably don't get to where you want to be. Science is cross-disciplinary.
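The gene co-occurrence network described in the talk can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method behind the slide: the gene names, the field labels, and the toy list of papers are all made up, and the two function names are hypothetical. It counts how often two genes are mentioned in the same paper (the line thickness) and then looks for genes that turn up in more than one type of literature (the single gene connecting the two networks).

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_edges(papers):
    """Count how many papers mention each pair of genes together.

    `papers` is a list of (field, genes) tuples. The returned Counter is
    keyed by alphabetically ordered gene pairs; the count corresponds to
    the thickness of the line between two genes.
    """
    edges = Counter()
    for _field, genes in papers:
        for pair in combinations(sorted(set(genes)), 2):
            edges[pair] += 1
    return edges


def genes_spanning_fields(papers):
    """Return genes mentioned in papers from more than one field."""
    fields_by_gene = defaultdict(set)
    for field, genes in papers:
        for gene in genes:
            fields_by_gene[gene].add(field)
    return sorted(g for g, fields in fields_by_gene.items() if len(fields) > 1)


# Toy data: every name here is invented for illustration only.
toy_papers = [
    ("immunology", {"GENE_A", "GENE_B"}),
    ("immunology", {"GENE_A", "GENE_B", "GENE_C"}),
    ("immunology", {"GENE_B", "GENE_C"}),
    ("other field", {"GENE_C", "GENE_D"}),
    ("other field", {"GENE_D", "GENE_E"}),
    ("other field", {"GENE_D", "GENE_E"}),
]

for pair, weight in sorted(cooccurrence_edges(toy_papers).items()):
    print(pair, weight)
print("genes bridging fields:", genes_spanning_fields(toy_papers))
```

In this toy data the two fields share only one gene, which mirrors the situation in the slide. Real literature mining would also need a way to recognize gene mentions in text, which this sketch leaves out entirely.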
You really want to get to a point where the discoveries you want to make certainly surpass an individual's tools, and you need these intelligent tools to mine and trawl through all this information. Then you need to increase these connections between knowledge and data in the way I just described to you. What's clear is you need to combine a whole group of different types of people to address this kind of problem. That's sort of what discovery informatics is all about. Here is a scenario which expresses this to some degree. How many people use Evernote? A few. Okay, Evernote is a very simple tool for keeping notes on everything from shopping lists to your major developments in the lab. The reason it's, in my mind, successful over other, much more sophisticated lab management programs is just that it is very simple to use for a variety of tasks. When you get old and brain dead like me, it's valuable to be able to recall information. It's all in the cloud, so I can be standing here and, if I forget to tell you something, I can look down at my notes on my smartphone. Beware, you'll be here for hours. The point is that there is a recording mechanism. It doesn't matter what tool everyone's using, but the idea is that there are certain commonalities in what we do in the lab at any given time. That can be discovered. With two people in the lab, it gets to this issue of trying to deal with the Josh situation, bringing all this closer together. It's the idea that you can discover some commonalities in what's going on in the lab by trawling through what people wrote that day. What's even more important is that you can take that and you can go out and search the web.
The dream, and this is just a dream, this is kind of a scenario that came out of this workshop, is the idea that for those particular threads and themes that are running through the lab, you can go and search the Semantic Web, which we'll get to in a second, for that kind of information. You can pull it back and, based on all sorts of criteria relating to authority and other things, you could potentially rank information. When you get up the next morning and you're having your coffee, you can see what the rest of the world has done overnight that relates to your particular interest. This is not that far, at least at some level, from being reality. It brings into play a whole notion that's been around for a long time and hasn't gone anywhere, but technically this kind of thing is already doable. It's doable in the context of the so-called Semantic Web. The Semantic Web is being built. It doesn't matter how it's built, you know, in RDF and these other things, it doesn't matter. Over here, down at the bottom in pink, is a whole series of biological resources. There are connections between data elements within these resources. NCBI and the National Library of Medicine have been an example of putting these things together for years, so there is nothing particularly new about that. What's interesting is that the semantics, if done right, carry you into completely different domains, which have some value, perhaps. Here's the trivial example I show. Over here, this is media, and over here is the BBC database. I maintain this protein database, and I'm always looking for things to help educate students. If this semantic connection was all in place, I should be able to immediately find what the BBC has ever shown on a particular protein structure, just by virtue of these kinds of connections. In principle, the technology exists to do this today. You know, we're getting to that point.
But let's get real, okay? Let's be real about what we can actually do. I'm probably going to offend people, and I apologize; these are just my own perspectives from trying to do these things. The realities of today are a lot different, and this idea of a fully functional Semantic Web and knowledge discovery, where I get up in the morning, have a cup of coffee, and my paper's already written, I mean, we're a long way from that. I'd say these are just three realities from my own point of view that I popped out to think about. From my point of view, data repositories, institutional repositories, are just not working. They are not working for me, even though I'm getting to the point now where I'm going to be required to use them if UC passes this open access policy where I have to opt out rather than opt in. I have to deposit a paper into my institutional repository. I tried that the other day and, I have to tell you, the whole faculty would be in an absolute uproar. I spent 15 minutes trying to do this, and I got a message that someone would get back to me in two days. When I'm publishing a hot piece of science, I'm not going to wait two days for an institutional repository to tell me what's wrong with it. So these kinds of things. I'll mention the High Noon Effect in a minute. I think there is no question that what NCBI has been able to do over the years is an example for all fields of science. I'd say that the reason, from my point of view, why there is trouble with these repositories is that the idea of building it and expecting people to use it has not worked very well. I think the usage of our repositories is just very small. The idea that they are institutionalized, of course, is a problem in itself. They should be global. You can certainly have different pieces of information with different access, but overall, the concept should be global.
Then, NCBI kind of works because, first, it is funded well. It has strong leadership; it has a monopoly on some of these things; and it thought through the IT aspects a long time ago. I think there are models in there. So, as I was saying, it takes resources and strong leadership, and it needs institutional support behind the repository, really behind it, to really make a difference. What do I mean? What else is wrong? I'd say there is this "High Noon Effect." Some of you, well, quite a few of you in this room, remember the days of the VCR, where whenever you'd walk into someone's house, nine times out of ten, you'd look under the TV set and there was a VCR and it was flashing twelve. That's what I call the "High Noon Effect," because no one ever went to the trouble of trying to program it. You could set the clock, but trying to record a program was too difficult. The barrier to entry was just too high. DVRs have just changed that. You pull up the program on the screen, you click a button, it's done. We need to move from the VCR era to the DVR era with respect to institutional repositories and other kinds of tools. Without that, they just aren't going to get used much. It's just not going to happen. Publishers have sort of got one end of the situation sorted out. I mentioned the idea that a paper is fairly generic. Well, not when you go to publish a paper. When you go to submit a paper, there are not that many, but there is a significant number of different journal management systems that you have to fight through. Data repositories need to create something that is more uniform. What we are seeing, of course, is this merger: data and journals are, sort of, coming together. Now there are data journals. But they, to me, are not really going to get us anywhere. They are a bit closer to where we want to be, which I imagine is total integration. Journal publishers don't know how to handle this.
Most publishers do not know how to handle significant data sets. As a result of that, they sort of pawn the problem off to some third party. Some publishers, including PLOS, for that matter, are using resources like Dryad, where the data set has to be deposited. That's a great step forward, and it gets a DOI (Digital Object Identifier) so, at some level, it is retrievable. Then it goes off into this thing called Dryad, or one of these other repositories. It's only the beginning. There's not consistent metadata. There is not consistent information about that dataset. We are starting to see data journals themselves where this is being addressed. One of the beauties of the whole open access movement is that it brings forth new ways of thinking about problems. New kinds of solutions emerge. It's sort of a slight sidetrack, but the idea is how you can move to a new system by just being a little innovative. Here's an example that's not directly related to data, but it could be applied to data. That is the idea of what PLOS started with respect to these topic pages. If you look at what, you know, we talked about Meredith, and the notion of getting knowledge from sources like Wikipedia. The problem is, most scientists don't stick stuff there because there's no reward for it. You don't get tenure for writing a Wikipedia page. How do you solve that problem? Well, you essentially say, alright, let's solve that problem by giving the credit. Let's write this page as a mini-review. When we do that, it's actually published in a journal, it gets a PubMed ID, it gets the credit and all the rest of it. That becomes the copy of record. At the same time, we actually take a version and put it in Wikipedia, and that becomes the living version. The authors get the reward, and we've seen it in Wikipedia. There are some restrictions, but not very many, on doing that from a copyright point of view. It sort of solves a kind of problem.
It builds more knowledge into Wikipedia and people get credit for it. I'm just going to close now with a sort of final message about the notion of trying, even though open access is a global issue, to ask what we can do locally. We've been thinking about this problem. I want to give you an example of what we're at least thinking of doing in my place. First of all, we need to try making an institutional repository that's useful, with common standards. It's got to be vetted by the community; the community has got to be part of the development process. It needs to be fully open and searchable, and it needs to reward people for putting things in there. You need to be able to leverage the asset. That is, to me, the key. So what does all that mean? Let me give you a specific scenario in labs. I've fessed up that I cannot reproduce my own work. One of the reasons is that we talk about Big Data projects. If you take just the average researcher here, and you take them from all around the institution, that's going to be much bigger data than taking a few of the faculty that have big data. If you want to have an impact, let's not worry so much about them. The big data producers somehow take care of themselves. Let's worry about the little data producers: the people who have DVDs and drives sitting on their shelves, thumb drives sitting around in their drawers. Let's try and do something with those people to create a better situation, to make their work more reproducible. That way, we can actually comply with the data plans that we are talking about. So it's really dealing with the long tail folks. How do we deal with the long tail folks? Well, a very simple solution is just the idea of having an institutional Dropbox. How many people here use Dropbox? A lot of people, right? It's just a great, simple thing. Problem is, I can't move particularly large files into it.
But on an institutional network, I potentially could, because I have more bandwidth than I have when I'm passing over the broader Internet. When I drag something into Dropbox, there is no particular access control, although I can define that. You want that kind of access control at perhaps slightly different levels: you might want it for individuals, you might want it for a lab, you might want it for institutions. You might want to collaborate, you might want it globally. It's the same kind of thing. As you drag, you want to actually capture a small piece of metadata. This is not too annoying, but it actually means that I might be able to do something with that data set two years from now, because I captured a piece of information about it that I would otherwise not have had. All of that is just trivial stuff. That's just creating a Dropbox. But what can you do with that Dropbox? You can employ it and develop it within the context of the university culture. It becomes a rich campus resource. How can you use it? You can put it into the campus culture by solving the data management problem. For everyone who submits a grant, the people who control the institutional Dropbox are the ones that, in effect, write your data management plan. What they get in return is that when you submit that grant and it's successful, we hope, the line item you put in for that data management plan gets immediately taken off to support the Dropbox and the other things. You're sustaining it through the funding. The real value of it comes in how you begin to develop around that corpus in ways you can't even imagine. A simple example: I review institutional grants on campus, and I'd sit there and read this grant from someone else on campus asking for some money. And I thought, "Gosh, I wish I knew that person who was doing it. They are doing what I'm doing." ...
So this is a kind of automated discovery of people. If my research data was in an institutional Dropbox and I allowed it to be scoured for these sorts of purposes, I would discover that I'm working on a certain protein domain but, guess what? Someone in the medical school who I didn't even know has also been looking at this. I know that by virtue of their data. It has occurred several times in the data they've been putting in the Dropbox. Therefore, it's probably important to them. Let's make a contact between the two people. I've done that automatically. What does that give you? Maybe nothing. It may be an annoyance. It has to be done very carefully, but if it's done well, then potentially what I have is a new collaboration, which will then generate additional grant money, which the Deans and everyone else like, right? That's just one kind. And then there is this institutional data associated with what happens to alumni, you know, what happens to students when they become alumni, and a million things that can also be fit into this kind of process. That's what I'm sort of actually working towards on my own campus. This is just a trivial idea, but something along these lines. How they're going to mine it remains to be seen. These things are already getting out there. Just to finish off, I want to be able to answer questions. Right now, I can't do that, particularly. I can retrieve data. I can't go to resources and answer questions quite in the way I want. This goes beyond the institutional repository. This is really more general. I've now gone from local to global. I want to know all there is to know about a piece of biological data. I can't even find it now. I can find instances of it, but I can't find all the instances of it. I want to do things in a way that is simpler, more productive, more reproducible. Here are a couple of ideas, quickly, about how to get there. I need to have a registry. We don't actually have a registry for data. What we have is a Google index.
If there was a data registry, suppose I generate a data set with a whole set of data items in it; let's just continue the biological theme, with genes in it. Those are identifiable, so I should be able to register the fact that I have this dataset, and register the fact that this gene exists in this dataset, in a central registry. Essentially all that it is, is the name of that gene, perhaps linked to where it resides. But people can go in and use it. I'm required to do that by the federal funding I get. I have to make it known to the registry. People who come along see, "Oh, here's the gene referenced in these several different resources." I go and I try them, and one's good, one's bad. I can actually add a comment on that, I can star it. I can do other things. I can crowdsource the value of that data set in the registry by virtue of the fact that it is in a registry. That's essentially the idea. Another aspect of it is the whole idea of how I operate on that data. What's really crazy to me is that we struggle to use programs in the lab that people have developed, and then immediately we get on our smartphones and we download an app and we start running something which we understand in 30 seconds. We need more of that notion of the "app model" in science and scientific software. I've gone on way too long, so I won't elaborate on what I mean by that. I think it's an intuitive direction in which to go. I think in many cases, it could be done. The nice thing about an app is there is the App Store. The App Store gives credit for things. You know when you look in there, if it's been downloaded a bunch of times and it's got a five-star rating, it probably is pretty good. We don't have anything like that in science. You go to a paper, you read it. "Oh, this paper was published in the Journal of Molecular Biology. It must be a good piece of software." You get it, and it's not. The review is never looked at. It's all meaningless.
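The data registry idea described above can be sketched as a small data structure. This is a minimal illustration under assumptions, not an existing service: the class `DataRegistry`, its methods, the dataset names, and the example.org locations are all hypothetical. A dataset is registered together with the items (for example genes) it contains, anyone can look up which datasets mention an item, and users can star and comment on a dataset so that its value is crowdsourced.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Entry:
    """One registered dataset, with crowdsourced stars and comments."""
    dataset: str
    location: str
    stars: list = field(default_factory=list)
    comments: list = field(default_factory=list)

    def rating(self):
        """Average star rating, or None if nobody has rated it yet."""
        return sum(self.stars) / len(self.stars) if self.stars else None

class DataRegistry:
    def __init__(self):
        self._datasets = {}             # dataset name -> Entry
        self._index = defaultdict(set)  # item (e.g. gene) -> dataset names

    def register(self, dataset, location, items):
        """Record that `dataset` lives at `location` and contains `items`."""
        self._datasets[dataset] = Entry(dataset, location)
        for item in items:
            self._index[item].add(dataset)

    def review(self, dataset, stars, comment=""):
        """Add a star rating and an optional comment to a dataset."""
        entry = self._datasets[dataset]
        entry.stars.append(stars)
        if comment:
            entry.comments.append(comment)

    def lookup(self, item):
        """Datasets that contain `item`, best rated first (unrated last)."""
        entries = [self._datasets[name] for name in self._index.get(item, ())]
        return sorted(entries, key=lambda e: (-(e.rating() or 0), e.dataset))

# Made-up example: two labs register datasets that share one gene.
registry = DataRegistry()
registry.register("lab_a_screen", "https://example.org/lab-a", ["GENE_X", "GENE_A"])
registry.register("lab_b_assay", "https://example.org/lab-b", ["GENE_X"])
registry.review("lab_a_screen", 2)
registry.review("lab_b_assay", 5, "clean and well documented")
ranked = [entry.dataset for entry in registry.lookup("GENE_X")]
```

In this run, looking up `GENE_X` returns both datasets with the five-star one first, which is the same signal the App Store comparison in the talk relies on. A real registry would also need persistent identifiers and some notion of who is doing the rating.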
In summary, we have at hand a way to increase the rate of discovery. We need to put more value on the data, on the individuals that produce it, and on the institutions that maintain it. We're all stakeholders in this endeavor. I'm just going to give you one example of how you can get involved. There is an organization called FORCE 11, which has come out of a group of people who feel that they can really make a difference in scholarship by getting together and working together, in this notion of the Commons, effectively, to improve the way scholarship is maintained and disseminated. I encourage you to take a look, sign up for the mailing list, and get involved. There is actually a manifesto that came out of the FORCE 11 work, which you can look at. There is also "The Fourth Paradigm," which really focuses on the importance of data. Both of these are obviously open access. I encourage you to take a look at those. I apologize for going on a bit long, and I apologize for the fact that I can't show you these wonderful animations. Thank you very much. I'm willing to be yelled down, I'm willing to be abused, I'm willing for whatever. It's whatever you want to do.
Goetch: Questions for Dr. Bourne?
Bourne: Disagreements, anything.
Audience: I'm Marty Courtois. I work with the repository here at K-State. I had a question. I think I'm gonna need to look at my computer again. Do you think repositories would be more useful if we made them easier to use? For example, in libraries, we're very good at identifying all the publications from faculty on campus. We have the tools to identify that. Suppose we devise a system where faculty don't have to take the time, like you mentioned, to sit down and deposit an article in the repository, but where that would happen automatically, and we let you know that "Professor A" published this paper and it goes into the repository automatically. Would that help to make repositories more useful from your perspective, or are we still...
Bourne: I think these are small steps that make a huge difference. Just take the idea of having a student put a thesis in a repository. I'm sure you do too: we have specific format requirements for what a thesis should actually look like. Probably right now, that's a PDA embedded in a PDF, which might be difficult to get at, but technically, it doesn't have to be that way. There could be a way, it could be a different cover sheet or whatever else is attached to the thing, that really automatically provides all the metadata that is needed for that thesis as it's dragged and dropped. When the person has finished their thesis and is about to go out and party, they drag it into the repository. The repository says, "This is a thesis, here is all the information." You say, "Yes, that's correct, I just want the reviewers to look at it right now," and click. The reviewer is already there, click. They get an email, they can drag it back, but no one else can look at it. Then, after the partying is over and the person has passed their thesis, it becomes a part of the public record. There is nothing technical to stop any of that right now, as far as I can tell. It's just the resources and the will to make it happen. Resources are an issue. It's really, "Where is the value coming from?" I think what's going to happen, and perhaps I didn't state this as well as I'd wanted, is that in this knowledge economy, it is clear that more and more there is going to be a constant reminder that mining information brings forth new things that have some kind of value; it could be economic, it could be other. I was in a meeting where they brought forward the editors of two major British publications, newspapers: Murdoch's "The Times" and "The Guardian." What they were talking about was that Murdoch's people decided to put up a paywall. They lost 95 percent of all of "The Times" subscribers to the online version in a month.
Obviously it was a bad pricing model, but, you know, there were too many other places to get news. On the other hand, the "Guardian's" model was, "Okay, we're going to make this content free. We're going to create an API around it so people can write applications that use that content." So one group wrote an application where they can actually predict voting patterns, just by looking at large amounts of the corpus and the information that was in the news corpus, as to what voting patterns were in particular counties in Great Britain. That has value, and so, you know, they got a piece of that action back. That's just one trite example of how that corpus could be used. It's a different business; it's an interesting model. It will be interesting to see if these things work out, but I think the nice thing is that with openness comes innovation. That's the message, and there are lots of different kinds of messages like that. Drug companies are realizing now that they can gain much more by making at least some of what they do open. The financial industry is another one. The World Bank is essentially making huge amounts of information open access. ...
Goetch: Other questions?
Audience: Hi Dr. Bourne, I'm Shar Simser. I'm the coordinator for electronic publishing here at Kansas State Libraries. We run an open-access press. My question doesn't necessarily have to deal with data, but the UK government is now providing, what, 10 million pounds to publishers to support immediate open access to articles published by their researchers. What do you think about the report that this was based on, and its adoption by the UK? Do you see that happening here?
Bourne: No. Not yet. I must say, I haven't read the report. I mean, I've looked at it, and I know people's opinions and discussions about it. There is somewhat of a different culture, I think. But notwithstanding, it's a very interesting step.
I give enormous thanks, in this country, to the likes of the NIH for taking the stance they have with respect to open access. It's a huge step, and whether you take a very large step or a smaller step, a step is good. It will be interesting to see how all of that pans out.
Audience: Dr. Bourne, you mentioned several times that in order for data to be more accessible, faculty have to be rewarded in the tenure and promotion process. Do you see that happening anywhere yet? Or is it happening on your campus?
Bourne: We just had a report from a group charged with looking at the way promotion was done. From my point of view, it was a little disappointing, because I don't think innovation was rewarded as much as I would have liked to have seen. What was clear is the idea that we have to get away from this notion of rewarding people in a singular sense. That's ultimately what we have to do. Somehow we have to reward them for the fact that science itself is becoming much more collaborative, and for the whole idea of "what has value" as part of that collaboration. That needs to be assessed. Again, I'm a total radical, but to me, the university structure itself is totally broken. I mean the fact that we have people, and money, effectively siloed into departments that don't necessarily make any sense anymore in the way that people do research. We solve that problem in the UC system with organized research units, which bring people together around particular areas of interest, but people still get appointed to departments. Students get graduated through those departments. The money flows through those departments, and not necessarily all of it goes into what is producing the results. It's kind of a step, but not perhaps as far as we want to go. There are a lot of issues at stake here. Changing systems like that is very hard. I wrote an editorial a while back called "How to get promoted as a computational biologist in academia."
It got huge downloads on the PLOS site. There are a lot of people who are interested. Essentially, the message was, as I said in the talk, you have to educate the committee that's reviewing it. I'll be quite frank: I sit on tons of these committees, and you have six people making a decision with really, probably, two people who know that person's work, one perhaps pretty well. The rest go by where that person has published, if they don't know the work well enough. Now, you might actually use Google Scholar or ISI Web of Science to see what kind of impact it's had. For a lot of reviewers, that's as far as it goes. If I produce software, or I produce these datasets, it doesn't count for anything. Yet, you know, ask yourself, "What's more valuable? A paper that has only been cited by the people who wrote it, or a dataset that's been downloaded hundreds of times and has generated a whole lot of new papers and science?" It's kind of a no-brainer to me. We had this great idea when we were discussing this at the Gordon and Betty Moore Foundation. The way to get attention in institutions is with money. The idea was we would actually create chairs within these institutions. I guarantee there are probably people that you know who don't get the credit in this institution that they should, because they do things that are not quite traditional, yet they are absolutely an integral and important part of the fabric of research and education. The idea was to identify some of those people and give them chairs, like an actual named chair, and to elevate them into a position where they would be recognized by their peers. I think this is particularly true of data in the digital realm, because these people underpin what is absolutely critical. Some of them are doing absolutely amazing work, but they're also maintaining resources for others and that sort of thing. They don't get quite the credit they deserve. Another long-winded diatribe, sorry.
Goetch: Any other questions? We still have a little time. Before we end, I just wanted to remind you we have another program tomorrow called "Open access in your publications: what's copyright got to do with it?" It's in Hale room 407, beginning at 1:30. This is a webinar with copyright authority Kenneth Crews, who will be talking about how you can facilitate access to your materials by learning to be proactive. Please join me again in thanking Dr. Bourne.
Bourne: Thank you for having me. It's wonderful.