[applause]
>> Paul Light: So nice to see so many of you who are clinging to hope that the answer is
here, and it might be! We've got three really talented people. We are all three—or all
four quite aware that we are what stands between you and freedom, and your rides home. So we
are committed to pith—pithiness? Is that correct? It's a really great panel, we've
heard from all of these characters before. I've got Diego May right here; he's the CEO
of Junar, which is a cloud-based—
>> Diego May: open data platform.
>> Paul Light: open data, I'm going to ask him a tough question about that. We have to
his right Raphael—no, Raphael's at the end, Majma, who's senior advisor to the CIO.
>> Raphael Majma: T-O.
>> Paul Light: CTO—Chief technology—did we create that by statute or is it by order?
>> Raphael Majma: I can go look it up if you like; I don't know off the top of my head.
[laughter]
>> [woman off camera]: We used executive order.
>> Paul Light: Executive order. Because we have CFOs, CIOs, we do a lot of this, but
this is one that has some, some strength to it by executive order, so it's nice having
you here, and you can say a little bit more about what you do. And Jason Payne, from Pinatar.
>> Jason Payne: Palantir.
[laughter]
>> Paul Light: Palantir. [whispering] Pinatar, Palantir. [speaking] It must be a variation,
some word play. Does a wonderful amount of work on the philanthropic, providing software
and support, which is not tax-deductible, so this is good to the heart. This is all
from the heart. Not that giving things that are tax-deductible would be anything but from
the heart. So I'd like each one of you to say a little bit more about what you're doing,
and then I'll ask a question or two, and then I'll open it up and we'll continue and bring
this all to a close with some remarks from Buzz, I believe. Anyway, you want to start?
>> Diego May: Why not? So we started Junar with my co-founder five years ago, and
five years ago we saw that a lot of people were talking about—actually, very few people
were talking about how to publish data. The problem that existed back then was that if
you wanted to use data, if you're a user like you and you wanted to use data, it was very
difficult to find data. Once you found data if you were lucky, then you would find that
data in PDF format or in big Excel files. Then you would find that you cannot use that
data. And on the other hand, if you were an owner of very valuable data and you wanted
to open up that data, you didn't have a clue on how. By the way, if I ask now, how do you
go to start creating an account to publish 140 characters of content, you know where
to go of course. If you want to do the same for a video, you know. But if I tell you,
go out there and find something that provides you an account to open up data, probably you
don't know where to go.
So five years ago we decided we're going to create the easiest-to-use open data platform.
It has to be brutally easy to use, and it has to be amazingly cheap. We didn't get yet
to the brutally and amazingly, but we're on our way.
[laughter]
But we have today about 45 paying clients, most of them governments, usually city governments
that have the mandate to open data. They want to do that because that's the right thing
to do. And they're using our platform to be able to accomplish that and get a lot of benefits
out of that. And that's it.
>> Jason Payne: So my name is Jason Payne; I have the pleasure of leading the philanthropic
engineering team at a software startup in Palo Alto, just down the road, by the name
of Palantir Technologies. So as a company we build data analysis capabilities for enterprise
organizations and work in the public, private and commercial sectors. I have the pleasure
of leading our work in the social sector where we donate our software to data-driven organizations
and help them empirically address the problem sets they're chartered to solve.
A very good tactical example of this in the context of open data is an organization that
we support out of Santa Barbara by the name of Direct Relief, formerly Direct Relief International,
and they're using our software to improve their ability to donate medicines and medical
supplies to those that need them the most. So they are a trusted agent to which big pharma companies
will give in-kind donations of medicines, medical supplies, etc., and then they're using
our software to donate those optimally. So a very concrete example of that would be,
if you have 20 million insulin injections, how do you most efficiently distribute those
in America? What data would you want to use to make those decisions? How are you going
to pull in hurricane track data and flood plain data and diabetes rates data, and social
vulnerability data and census data and that sort of thing, to build a cohesive model and
a cohesive picture to say, these are the ways, these are the counties, these are the federally-funded
community clinics where we can donate these supplies and have the most impact. And so
we're working in, the focus areas that we have today are anti-trafficking efforts, disaster
relief efforts, which we define as both natural as well as political disasters, supporting
veterans, and global health.
>> Raphael Majma: HELLO!
[laughter]
Sorry. I saw faces down so I thought I'd start a little louder. So I'm Raphael, I work at
the Office of Science and Technology Policy over at the White House. Now I'm a senior
advisor to the Chief Technology Officer Todd Park, but before that I was a Presidential
Innovation Fellow. And I'll take a minute just to say what that is, and if you heard
me say it before I apologize for repeating the joke, but it's kind of like Peace Corps
for geeks, but instead of going to like a developing world country, you go into a place
like the Department of Education, USDA, and you go—
>> [woman off camera]: It's a lot like being in the Peace Corps, let me tell you.
[laughter]
>> Raphael Majma: [laughs] And it gives you an opportunity to tackle a discrete problem.
My problem was nonprofit data, and there are a lot of different ways that government currently
collects data about nonprofits and then makes that data available. Not in the most machine-readable
ways possible, but we're trying to work on that. Since staying on board I've continued
to work on our open data executive order, which mandates that all future
information collections are open by default. So that means that in the future, [knocks
on wood] all of our open data is going to be just, it's going to be the presumed default
and agencies will have to actually say why it shouldn't be open, rather than say why
it should be open.
>> Paul Light: Do the classifiers, are they on board with that yet? Can you, the default
position being, the people who have the authority to classify it as national security, there
are about 300,000 of them.
>> Raphael Majma: So I can't speak to the individual...efforts.
>> Paul Light: Agencies?
>> Raphael Majma: But I can tell you from our perspective the executive order actually
lays out the reasoning why that should be the case, and it creates the mandate for doing
so in the future.
>> Paul Light: Good. Good. Well, we'll ask a couple of questions here which is, really,
whether there's any data that you think should never be released?
>> Diego May: I'd start with something... one thing that perhaps the three of us, and
I'm not sure if it's general to this room here, when we think about open data, at least
when I think about open data, I'm thinking spreadsheets. I'm thinking spreadsheets. I'm
thinking tabular data. I'm thinking dashboards with interesting maybe charts, but usually
a lot of tabular data. And as we were discussing before with some people here and there, sometimes
this may be unstructured data first, a lot of data coming from different documents and
then somehow, with a lot of intelligence and a lot of tools, that becomes structured data,
and then that structured data can be opened up. Sometimes it was already structured data
that existed in a system, in a file that someone is working on, or in a database.
With that said, the question of, is there any data that shouldn't be open, more or less?
If you ask me, I'm a fan of all data being open. Of course there are a lot of privacy
issues when we deal with governments. City of Palo Alto, city of Sacramento—they always
ask the same question: "I have a lot of data. I want to open a lot of this data, but there
are some pieces of data that if I open it up, then I'm going to be crushed," and any
open data system has to take that into account, because that's a key thing. Whenever we're
touching privacy aspects, whenever we're talking about who is earning what in the case of
government, and you know your world better, there is some data that doesn't have to be open.
>> Raphael Majma: So in government, we have open data, which is kind of aggregated information;
we collect census information, things like that. And then we have what we call "my data",
which is letting a citizen access their own information. This is really popularized in
areas like health, student education records, and your energy consumption. One of
the things that we really make a hallmark of "My Data" is rigorous privacy protection,
meaning you can choose to download that, and you can choose to send it to a trusted third
party; that is your control; that is your right. We and other institutions, as governed
by laws like HIPAA and FERPA, don't have the right to make that data open and available.
>> Jason Payne: I think when you look at the philanthropic ecosphere, by and large, we're
trying to help those that are most vulnerable, and that same population is the most vulnerable
to exploitation. And granular data that identifies those individuals used in illicit purposes
can radically harm those individuals, so the utmost care must be taken to ensure that those who
have access to that information have a need to know to use that data. I think something that's
very important, but under-discussed, is the need to audit immutably what people are doing
with information. I see the ecosphere as a whole making a huge rush toward the siren
of Big Data solving all of the world's problems, but that data, illicitly used, can open up
significantly more problems. So anytime that something is granularly personally identifiable,
or even possible to be backwards-computed to be personally identifiable, the utmost
care must be taken in how that's shared, who has access to it, and ensuring that it's used
correctly.
>> Paul Light: How do you get organizations that are in the same social impact line of
work to share data so they're not overlapping. Let's say in your particular case, you've
got a number of organizations that want to deliver vaccines and medical supplies. So
that you don't have pileups and big overlaps, how do you share data across those organizations?
Has it been your experience that they're sharing data to avoid the waste?
>> Jason Payne: I've seen good, I've seen bad and I've seen ugly.
[laughter]
I've literally had an NGO say, what's in it for me? I have 80% of the solution. Three
or four other organizations might have the remaining 20%, but I need it to be worth
my while. So I've seen selfishness. I've also seen incredible selflessness.
One thing that's really interesting is the notion of data exhaust: there are
all these organizations that are collecting information to do their day job. They're helping
farmers, they're helping families, etc. For other organizations, like the World Health Organization
or folks trying to stamp out malaria or other tropical diseases, that data about microloans
may actually be a very good data source to make a decision on how to allocate other resources.
So finding organizations that can put data out to the collective and incentivizing that—at
the end of the day, it's actually expensive to do that. To host it on a server; to hire
the right engineers to build those application programming interfaces, etc., cost an organization
money, and I think that funders and foundations inspiring that and frankly [aligning? 13:22]
a relatively small amount, but having things like, we discussed earlier today Kiva making
an effort to make all their data publicly available, that sort of very small investment
can actually have massive returns, looking at this as an ecosphere and not a marketplace.
>> Paul Light: So data exhaust in this particular usage would be the release of data by the
organizations or...?
>> Jason Payne: Yes.
>> Paul Light: OK. as opposed to data exhaustion.
>> Jason Payne: Yes.
[laughter]
>> Raphael Majma: Hopefully after today we're not in data exhaustion. But, I think, it seems
to me that you can't create inherent demand for data. However, there is a demand for it,
so there is an ecosystem chomping at the bit in a couple different areas. And by creating
a business case for why organizations should pool resources, or should create a baseline
of information, you allow easier access for, possibly, competitors, but usually
just people who want to compete in a different space by using that information.
I'll give you an example. iTriage uses GPS data, but they use it in a way that no one
else does, or in a way that fewer others do. They use it to track where you are, and track
where local healthcare providers are that meet your symptoms. They have no way to compete
with TomTom, who also tracks that same GPS data. So by creating that baseline, I think
there's a real business value, not just for the participants who are using the data, but
also for the funders and the nonprofits who actually aggregate it themselves.
>> Diego May: To comment on that: the World Bank usually talks about opening up data
on what we know, what we do, and then on the operations, the impact. And I think that is
interesting for almost anyone that is doing interesting things, learning a lot and exposing
that. And I'm also in the line of, if I have data and if I'm leading organizations, if
I'm one of those that is scared of everything, then I'm not going to share anything. But
if I'm a leader in what I'm doing and I really want to do good things, I should be very open
to share data that can really then benefit others and create a larger value.
In our case, whenever we talk to these organizations that have these reactions that you mentioned,
fine, I know that there are going to be a lot of organizations that don't want to share.
Let's go to the others that want to share.
And I've seen a lot of NGOs, an example of one in Panama that has a lot of entrepreneurship
data, and they create a lot of valuable reports, and then they open up these reports, and then
they do hack-a-thons and they invite people to collaborate, and you see, and you've probably
been to a lot of these hack-a-thons... has anybody been to a hack-a-thon before? Any
of those governments? Amazing. This is an awesome crowd. But the energy that you feel
there, when you see all this data put there, government officials asking questions, I mean,
they need to solve problems, and then you have citizens, journalists, and other people
trying to solve those problems. It's high-energy, but it can only happen if you really open
up and you're ready to share.
>> Paul Light: And allow the data to be manipulated, right?
>> Diego May: Yes. Yep.
>> Paul Light: So, as opposed to releasing it as a dashboard or as a summative kind of
piece, it's available for analysis.
>> Diego May: Yep.
>> Paul Light: And how much do you think is out there to that degree, where we can actually
manipulate it? What have you been seeing?
>> Raphael Majma: Oh boy, well that is a great question. So, we know what we know, but we
have no idea what we don't know. And I think, at least in the nonprofit ecosystem, there's
a lot of organizations who probably collect data and don't even realize it's data—or
rather, collect information, and don't realize it can become data. And I make that distinction
on purpose. As far as government goes, we have tens of thousands of data sets available
to varying usage, so I think that, when we talk about—I couldn't begin to put a number
on it.
>> Paul Light: Does somebody need to say along the way, independently, the quality of the
data and whether we really have data? Have you thought about that question, about sort
of telling the people who might manipulate or use it, well, we've got X confidence rate,
or it has Y validity, or so forth?
>> Diego May: I think one initial piece of the answer to that is, I think in the early
days of the web, of the content web, and I like dividing the web into the content web
and the data web, the content web, initially there was a lot of—and I cannot use the
word; we're in front of the cameras—bad content.
[laughter]
Right? And then the web itself started curating itself and now there are better ways of finding
the right content. We're at the very very very early days of the data web, so I think
that whenever we're asking about quality of data, we need to have data out first. And
once we have a lot of data out that can be used, then there's going to be a lot of ways
of finding the right data, to be able to find the curators of the data, to be able to find
organizations that can define the quality of the data. But today we're at only the tip
of the iceberg of the data web, I think.
>> Raphael Majma: And I mean, it's happening now, to some degree. I can't tell you how
many people in my twitter stream talk about standards data, and the efficacy of it, the
accuracy of it, and places like Stack Overflow have dedicated data communities. So I'd rather
not hear from the data owner about the quality of the data, I'd rather hear from someone
that's been using the data for a while about how great it is. Being able to go to places
like Stack Overflow and Stack Exchange, who kind of, that's where geeks sort of live,
that's where they reside, that's where they talk about these things, I'd trust them much
more than I'd trust some other places.
>> Jason Payne: So, a couple points. The first is, I'm a strong believer that, if you build
open interfaces to access data, the right solutions will materialize from technological
folks using it, rather than, let's get 8,000 folks in a room to build an information exchange
model that will rule all information exchange models. And just release it in a granular,
transparent, easy-to-use fashion, and who knows what will come out of it? New York
City released taxi cab data on pickups and drop-offs, and you don't need an information
exchange model that's ubiquitous across 80 data sources but some NGO can use that to
make New York City a better place.
The second point: I think it was Robert Kirkpatrick at the
UN Global Pulse who introduced this concept, at least when I first got it: the notion of data
philanthropy. Right now philanthropy really can be bucketized into giving time or giving
money. Let's add a third leg to that stool, of giving data. And obviously with correct
data access and auditing and that sort of thing, there are so many data sources out
there that could be very helpful. An example that I find interesting is that if I had Coca-Cola's
data, in an Indonesian tsunami or an earthquake in the developing world, if I looked at sales
rebounding of affected areas, I would basically just put aid where sales had fallen the most,
blindly, and that actually would be a really good metric to decide where to distribute
aid, over anything else. Finding a way that Coca-cola could do that without risk of that
data being used for competitive reasons—if that incentive can be done, that could be
a potentially massively valuable thing.
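As a rough illustration of the sales-rebound heuristic described above (the function, the area names, and all figures are invented for the sketch, not Coca-Cola's or anyone's real data):

```python
# Sketch of the sales-rebound heuristic: rank affected areas by how far
# sales have fallen from a pre-disaster baseline, and send aid to the
# areas with the largest relative drop. All numbers are made up.

def rank_by_sales_drop(baseline, current):
    """Return area names sorted by relative sales drop, largest first."""
    drops = {
        area: (baseline[area] - current.get(area, 0)) / baseline[area]
        for area in baseline
        if baseline[area] > 0
    }
    return sorted(drops, key=drops.get, reverse=True)

baseline = {"Aceh": 1000, "Medan": 800, "Padang": 600}   # pre-disaster sales
current = {"Aceh": 150, "Medan": 700, "Padang": 540}     # post-disaster sales

print(rank_by_sales_drop(baseline, current))  # Aceh's sales fell the most
```

The output ordering is exactly the "put aid where sales fell the most" ranking Jason describes.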
>> Paul Light: That's what you call a bellwether.
>> Diego May: Can I say something?
>> Paul Light: Sure!
>> Diego May: I think that we agree that data has to be out, and I love the concept of the
data philanthropy. One thing that Jason mentioned, and I think it's super important: defining
the quality of the data is difficult, and it's going to happen over time. But for defining how data
can be exposed, there are clear standards: it has to be searchable, so data has to become
SEO-friendly, friendly for the search engines. It has to be usable once you find it. It has
to have an API that allows people to use that data systemically, and I know that not everybody's
technical here, but that's crucial whenever you are defining how you are going to open
up data. Then the quality of the data itself, that's going to progress over time.
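To make the "systemic access via an API" point concrete: a client typically composes HTTP queries against a portal's endpoint. The endpoint URL and parameter names below are invented for illustration; real portals (Socrata- or CKAN-based, for instance) each define their own query interface.

```python
# Sketch of composing a query URL against a hypothetical open data API.
from urllib.parse import urlencode

def build_query(base_url, dataset, filters=None, limit=100):
    """Compose a URL requesting up to `limit` rows of `dataset`
    that match the given filters."""
    params = {"dataset": dataset, "limit": limit}
    if filters:
        params.update(filters)
    return f"{base_url}?{urlencode(params)}"

url = build_query("https://data.example.gov/api/v1/query",
                  "building-permits", {"year": 2014})
print(url)
```

Because dict insertion order is preserved in modern Python, the resulting query string is deterministic, which helps when caching or logging requests.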
>> Paul Light: I think we're going to open up for questions. Dana, did you have one?
>> Dana Pancrazi: I did, I do. I struggle with the idea, because we're interested in
service of the public good, we're also interested in commercial enterprise and maximizing financial
return. Talk to me a little bit, and we spoke about this a bit, what are the bounds of,
where does open data and monetization of data meet? And is that regulated currently on agreement?
Is there a structural barrier? Is there a barrier? I'm unclear. When I say open data
I think I typically mean available to many, but I have personally no idea where the bounds
are on that and where monetization begins. Or doesn't.
[laughter]
>> Raphael Majma: I think it depends on the source of where you're getting the data from.
If it's from the government, it's 99.9% free. It's the taxpayer data: you already paid for
this; you already paid for its collection. It's your data to do with what you wish. If
it comes from other places there might be licensing issues that are related to that,
or contracting issues. How you monetize that might be dictated by the terms of that agreement.
But as far as government data goes, there are many many companies that have been started
on the backbone of government open data, and we love seeing that. That is the best thing
ever, as far as we're concerned.
>> Jason Payne: There are some cases where there's data that's generated as a by-product,
and the open release of it's just a positive externality. There are other cases where a
freemium model is needed, freemium meaning that in some cases the data
can be freely available, or subsets of it can be freely available, and in other cases,
more verbose explanations of the data, or better capabilities to analyze that data would
cost money. Or there are ways, a sort of public-private partnership, where governmental users of the
capability, or corporate users of the capability, might basically fund the public use of that
data or capability.
Unfortunately there's far too much need in the data world for humans in the loop, and
when you're releasing PDFs and when you're releasing scanned paper documents, etc.,
at the end of the day, if it's an hour apiece and you've got 10,000
data points, that's five man-years. Even in a philanthropic context it's hundreds of thousands
of dollars that an organization would have to invest before they could open up that data,
and everyone's like, oh yeah, you have that data, you should open it! But no no no, they
have to get paid for it, otherwise they're insolvent.
Furthering the concept of data philanthropy, I would encourage funders to look at that
as an impact, and as an area of investment: capitalizing those sorts of
efforts for that public good. Because if that can improve an ecosphere, that small investment
is going to pay far more dividends than on-the-ground funding of specific things.
>> Diego May: I think that Mayor Bloomberg would agree that data can be monetized, and
by the way, 80% of the data that Bloomberg provides is open data, right?
>> Dana Pancrazi: Quite clearly!
>> Diego May: Then they curate and package and do great things with that. And then there's
people that pay $1500 a month to use, to have access to that, Capital IQ made a lot of money
out of that as well, and Reuters and some others. I think that there's a lot of, we
were talking with James [Lum, of Guidestar] these days, the business case, the business
models around open data, and they are there. You are putting a lot of value on top of the
data that you are generating, and you have a lot of, in the case of James, a lot of cases
where people are paying to have access to that data. You're subject matter experts.
So once you have data, I agree, a freemium model. Some is going to be for free, it makes
sense, and then for that data where you're putting a lot of effort, or for that data
that is being accessed systemically via an API in a broad way all the time, to always get
the latest values, there are going to be revenue models. And there are even cities already
thinking about how to monetize this data that's going to be accessed all the time on a very
frequent basis.
>> Paul Light: Bloomberg has been buying up a lot of capacity to understand government
data, and they have a very significant investment in monitoring government. Probably the largest
now—larger than the Washington Post and so forth. So how do you feel about the fact
that they're buying the capacity to analyze the data, and that that capacity is very very
expensive to create?
>> Raphael Majma: So I'm not actually familiar with what Bloomberg has done, but the capacity
to analyze data is a good thing; we should be able to analyze data more robustly and
better. I don't really know what they're analyzing for, but I don't see a problem with that.
>> Paul Light: Ok, well, we'll talk.
>> Raphael Majma: Ok!
[laughter]
>> Paul Light: Well, there's something—I mean, you know, weather.com uses NOAA data,
but they're not the only one that does it, but there are some highly specialized data
sets that are open, but they're so expensive to analyze that they can only be analyzed
by somebody with some deep pockets, I guess. Other questions? Yes.
>> James Lum: I had a question about personally identifiable information. So, we're more involved
with organizational data, and obviously open data would put more data out there, but as
long as there's a silo, a box around it, a piece of data, that's open, it's more useful,
but the real value comes when you can tie this data set with that data set, and you
have a common field in between, which for all of us is pretty much the EIN, despite
its limitations.
So you guys were talking at depth for a while about personally identifiable data, when it's
the individual, and to Mauricio's point, if you're measuring impact at the individual
level, how does the industry mash personally identifiable data from one source versus another?
Jason, you said you can, you know, machine compute who this person is, if you're in a
silo, but how do you bring data from three different sources that have some sort of—is
it social security numbers? I have no practical experience. I thought maybe you could shed
some light.
>> Jason Payne: One of the things that I'll say when talking about any data is, a common
set of unique keys is easy with EINs for this ecosphere, but the more unique IDs that
you can have, the easier it is to build these collaborative efforts across data sets. There
are sophisticated algorithms where you can deterministically look at property matches
that define what would be a known match, and so that can be organization name, that can
be address, that can be phone numbers, a plethora of things that would be semi-unique, but putting
them together would be unique. So there are a couple of ways to combine data in that capacity.
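The deterministic property matching described here can be sketched as follows; the normalization rule, the field list, and the two-field threshold are illustrative assumptions, not any particular product's algorithm:

```python
# Sketch of deterministic record matching: declare two records a match
# when enough normalized semi-unique fields agree. The fields and the
# threshold of 2 are illustrative assumptions.
import re

def normalize(value):
    """Lowercase and strip punctuation/whitespace so near-identical
    spellings compare equal."""
    return re.sub(r"[^a-z0-9]", "", str(value).lower())

def is_match(rec_a, rec_b, fields=("name", "address", "phone"), required=2):
    """Match if at least `required` of the given fields agree after
    normalization (missing fields never count as agreement)."""
    agreeing = sum(
        1 for f in fields
        if rec_a.get(f) and rec_b.get(f)
        and normalize(rec_a[f]) == normalize(rec_b[f])
    )
    return agreeing >= required

a = {"name": "Acme Relief, Intl.", "address": "123 Main St.",
     "phone": "(805) 555-0100"}
b = {"name": "Acme Relief Intl", "address": "123 Main Street",
     "phone": "805-555-0100"}
print(is_match(a, b))  # name and phone agree; address does not: a match
```

Each field is only semi-unique, but requiring two or more to agree makes the combination effectively unique, which is exactly the "putting them together would be unique" observation.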
On the PII question, there's an interesting trade-off of what's a known clear value about
that sort of data coming together across organizations, juxtaposed with what are the risks that happen?
And unfortunately, you really do have to say, what's the worst-case scenario? And plan against
that. There have been some well-documented cases where, even at an aggregated level,
someone releasing *** documents about disease outbreaks in certain villages ended up outing
the person that was the patient zero in that village, and resulted in ostracization
and problems there.
One of our disaster relief organizations actually, for lack of a better word to describe it,
had gotten hacked, and had people misrepresent them and misuse the data that came out.
Collecting data from folks to say, what help do you need? Where are you? and geographically
coding it is great if you have great intentions; it's also great if you want to exploit those
people. And so to be honest, I'm terrified, as I see this huge gold rush toward Big Data,
that these questions need to be asked, because as an ecosphere we don't get too many "oopsies"
before this really gets shut down. So I want to be very conservative when working with
those types of data.
>> Paul Light: Can I just interject a question here in terms of when the data should be released?
For example, you get an academic paper that argues that austerity programs are essential
for growth, recovery and so forth. The paper is released; we build an entire regime of
policy to cut budgets, and then the data eventually come out, and they don't prove the case at
all. So is there a standard there about release at time of conclusion drawn, that you all
might embrace, or not?
>> Diego May: You know my answer, probably; it's as soon as possible! I guess, as soon
as someone is ready to publish a paper, all the data behind the paper should be released.
That's my initial reaction. And again, thinking naively of a very constructive world where
you open up data, you expect comments, you expect also to find somebody that says, with
your data, I arrived at a different conclusion, so you're wrong, and then you're happy that
someone found you are wrong.
Not everybody works in that way,
[laughter]
—but my feeling is, whenever you have valuable data, open it up, allow people to use it and
have fun with it.
>> Paul Light: I'm releasing a book and I'm thinking of releasing the data, so you'd say
I should do that, right? But I'm terrified.
[laughter]
Because I know, I just, I think you said that, somebody said that, I'm sure there's a mistake
in there. There's got to be. Because I am close to perfect, but not quite.
[laughter]
So, it's a scary prospect. But you've got to expose yourself to that, right?
>> [man off camera]: You could publish a second update!
[general chatter]
>> Paul Light: We should have the second editions, which say "The I Made a Mistake Edition",
right?
>> Raphael Majma: "What I really meant" version.
>> Paul Light: Yeah, "What I thought I should have found".
>> Jason Payne: You know, what I think is actually far more important than when the
data should be released, is when the data should be forgotten.
>> Paul Light: Interesting.
>> Jason Payne: Again, on this notion of vulnerability. It's valuable to have data about health information,
maybe arrest data, all sorts of data points. But, looking at the
graph we had yesterday, once we move into this green area instead of that green area,
that being an artifact that follows that individual around for decades I feel is inappropriate.
We should also look at when do we delete, destroy or otherwise let data be forgotten?
And that's something that should be a policy, or something that organizations think about
as well.
>> Paul Light: That's interesting. Can it ever be forgotten? Just, exists out there
someplace?
>> Jason Payne: I would say yes. Open data, maybe not, but when it's talking about some
of the data systems, with the keynote, with FII, some of that data could have data retention
policies where it is actually removed from the system. Once it's out on the internet,
no, but in an internal system, yes.
>> Paul Light: Other questions?
>> Brad Langhorst: I wonder what technical things could be applied to things like the
Netflix case, things where people thought they had anonymized data? Was it effective?
Or the case you mentioned with the disease, patient-zero situation. I'm not aware of sophisticated
technology that says, ok, this is sufficiently anonymized, or sufficiently aggregated, or
enough noise has been added that you can't back out individual information. Are you guys
aware of technology that can do that kind of thing and give you some confidence about,
I'm releasing this; it's permanent; you never get it back. How can you decide it's okay?
>> Raphael Majma: I don't think technology alone solves that problem. I think there are
processes around making sure the data set you're about to release is covered. There's
a human component and there's a machine component to it as well, so technology alone probably
shouldn't be the solution to this. In the case of the federal government, there are
processes around making sure that you're doing the best possible job, and making sure personally
identifiable information doesn't get out, or that it isn't included in the first place.
In some cases we also consider the mosaic effect, where releasing a lot of data sets could
lead back to a patient zero, for example. But it isn't just a tech fix, and I don't know
if it will be a tech fix anytime soon.
>> Diego May: I agree 100%.
[laughter]
>> Catherine Havasi: I think that it also becomes very difficult because, as in the
Netflix case, there's always interaction between data sets that ends up revealing things. And
so even if when you release your data set, it doesn't reveal anything, someone may release
something down the road that will cause your data set to reveal a lot of things. And that
becomes something that technology doesn't solve.
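[Editor's note: the interaction between data sets that Catherine describes, often called a linkage or "mosaic" attack, can be sketched in a few lines. Every name and value below is invented; this is a toy illustration, not any real data set discussed here.]

```python
# Two releases that each look harmless on their own can re-identify
# people when joined on shared quasi-identifier columns.

medical = [  # released "anonymized": no names, but quasi-identifiers remain
    {"zip": "02139", "birth_year": 1975, "sex": "F", "diagnosis": "flu"},
]
voter_roll = [  # a separate public release that does include names
    {"name": "Jane Doe", "zip": "02139", "birth_year": 1975, "sex": "F"},
]

QUASI_IDS = ("zip", "birth_year", "sex")

def link(released, public, keys=QUASI_IDS):
    """Join two data sets on their shared quasi-identifier columns."""
    index = {tuple(row[k] for k in keys): row for row in public}
    return [
        {**row, **index[tuple(row[k] for k in keys)]}
        for row in released
        if tuple(row[k] for k in keys) in index
    ]

reidentified = link(medical, voter_roll)  # the diagnosis now has a name
```

The point of the sketch is Catherine's: the medical release reveals nothing by itself, and only becomes identifying once someone else publishes the second table.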
>> Paul Light: Response? ...Agreed.
>> Raphael Majma: That's correct.
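[Editor's note: on Brad's question about whether "enough noise has been added": one technique sometimes used for aggregate releases is differentially private noise. A minimal sketch for a simple counting query follows; it is illustrative only, not a production mechanism, and does not contradict the panel's point that technology alone is insufficient.]

```python
import random

def noisy_count(true_count, epsilon=1.0):
    """Return a count perturbed with Laplace noise of scale 1/epsilon.

    A counting query changes by at most 1 when one person is added or
    removed (sensitivity 1), so the standard Laplace mechanism uses
    scale 1/epsilon. A Laplace draw is the difference of two
    independent exponential draws with rate epsilon.
    """
    return true_count + random.expovariate(epsilon) - random.expovariate(epsilon)
```

Smaller `epsilon` means more noise and stronger privacy; individual answers are deliberately wrong, but averages over many queries stay close to the truth.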
>> Paul Light: Other questions? Is there any indicator that you would use—I know you
know I've asked this question before, but if we're looking at technology or open data,
anything in particular that you've seen that would be a bellwether of a bad system? Or
an indicator that you ought to take a deeper look at whether there's some gaming going
on with the data?
>> Jason Payne: That actually brings up an interesting point: call it Big Data if you
want to use the buzzword of the year, call it data mining; it's very dangerous and problematic
in this ecosphere. Doing Big Data on a Wal-Mart, Amazon, GM, or Ford type data set that's
cleansed, manicured, uniform, complete, you can actually run deterministic algorithms;
you can run predictive algorithms. Doing it on humanitarian development data especially
is a very, very dangerous proposition.
I've looked at certain data in Afghanistan where the data says that violence stops somewhere,
and it turns out that's because folks got pulled out of their humanitarian work in that
area, because it got too violent. Running a machine algorithm on that data set would say,
hey, there's no problem there. So there's sort of a troika required
to be successful in my opinion, and that's data, technology and domain expertise, and
in this space it's incredibly important to have that domain expertise. As an outsider
looking at the data I'm going to make terrible decisions on how to solve a problem, without
that knowledge of that specific problem set.
>> Raphael Majma: I mean, you just described the mix of a good hack-a-thon. It takes people
who understand technology, data experts, and then doers. Without one of those you have
a situation where you're not going to create a minimum viable product. And to your question,
I want to shy away from the sys—well, shy towards a different type of system question,
and I see a lot of people that talk about open data do not understand the systems that
create data, or that structure data. There are systems that create basically proprietary
data sets, meaning, unless you use that same vendor, unless you use that same content management
system, you can't access the data, you can't make sense of it. So focusing on open formats
and focusing on making things machine-readable across various platforms is, I think, where
I'd love to see the nonprofit sector or ecosystem go.
>> Paul Light: Do you think that people who are generating the data have the training,
the kind of ethical grounding you're laying out here, Jason, to know what they're doing?
Do they need some help? And the consumers of it, the policy makers and so forth, do they
have enough technical information to use this wisely?
>> Jason Payne: Hmm.
>> Paul Light: There's that wry smile where you're like, hey, I think I have an answer
to this one!
>> Jason Payne: I'll say, ethically yes, there are certainly bad actors in the ecosphere,
but they're basically trying to rip people off; they're not actually collecting data
and going out and unethically using it. That being said, frankly, the road to hell can
be paved with good intentions.
By and large, I think that there's not enough technical talent in the ecosphere, and when
I say technical talent I don't mean people that ten, twenty years ago wrote some code.
I mean people that know how to use modern APIs, modern data analysis, et cetera. Standing
on a soapbox, I really think that the funders and the foundations have the ability to instill
change, and to instill as a requirement for grants certain processes and procedures.
Take the comment about the Italian restaurant with four employees not being able to do
data: if, as a condition of the grant to that sort of organization, it came with a data
system that actually saved the organization time and made it more efficient and effective,
you'd get two hands up if that was part of it. Here's where I absolutely love what the
SalesForce Foundation has done: they've got 14,000 clients using their software, very
open, very easy-to-use software, and an entire ecosystem around it where, anytime I come
across an organization that is using SalesForce, it makes our data analysis technology applied
to their data so much less risky, because of the quality of that platform. So I think
that foundations donating that as part of a grant could actually significantly help those
small organizations.
>> Paul Light: That would be a form of operating support, which is prohibited in the foundation
world.
[laughter]
Oh! I didn't say that!
[laughter]
I didn't mean that. Are you saying that my COBOL training is no longer valuable? Because—
[laughter]
>> Diego May: Co-co-what?
>> Paul Light: I was very popular at Y2K.
[laughter]
I was very much in demand. Other questions? Yes.
>> Diego May: Can I--? Sorry, super quick on that. I think one message that should be
very clear is that we shouldn't be concerned or scared about open data, or about opening
data either. There are some processes, some common-sense guidelines about what data to
open, but after that, it is much simpler. And if it's not, it's our fault. NGOs, associations,
governments should be able to very easily open up data. It should be as easy as putting
up a blog post or a web page, with the same type of processes, right? But somehow whenever
we talk about data we get so concerned, and then it's, oh, before doing anything wrong,
I prefer to do nothing. And I think that opening data is no big deal. You have data, you
ensure that you're not putting out private or individuals' information, but after that,
it's pretty simple, and it's pretty common sense. It can be done.
>> Renee Westmoreland: You set me up perfectly here, because I think that there's been this
sort of thing going on the last couple of days that's like, "Oh, taking all the data
and running it through something, that's just the super simple part." So, I'm not entirely
sure that most organizations out there have that experience. So I was wondering if you
might be able to give an example of a case of taking an unstructured data set and turning
it into structured data, and how you come to the conclusion that that's pretty simple.
>> Raphael Majma: Actually, can I just jump in? It's going to answer part of your question
I think. When people talk about opening data, usually, especially in the nonprofit ecosphere
I think they say, here's the reports you have: Open that data! And that's difficult, that's
actually pretty difficult, because that means possibly going through an OCR process on the
PDF, or whatever type of format it is, or actually going into paper and transposing
that into some form of excel spreadsheet.
What good open data policy really is, is information management. I'm cribbing from our
new executive order here, but it's looking at the information lifecycle from beginning
to end, and bifurcating that process. So you start with information
collection: how do you manage it for internal use? How do you manage it for that eventual
external publication? And then actually getting it ready to be used internally, and then to
be used publicly.
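[Editor's note: the "transposing into a spreadsheet" step Raphael describes can be sketched for the easy case where report lines follow a regular pattern. The pattern and figures below are invented, and real OCR output is rarely this clean.]

```python
import csv
import io
import re

# Invented example: lines pulled out of a narrative report (or OCR output).
report_text = """\
Grants awarded 2012: 14 totaling $120,000
Grants awarded 2013: 9 totaling $87,500
"""

# A pattern for this specific phrasing; real reports rarely line up this neatly.
LINE = re.compile(r"Grants awarded (\d{4}): (\d+) totaling \$([\d,]+)")

rows = []
for match in LINE.finditer(report_text):
    year, count, total = match.groups()
    rows.append({"year": int(year), "grants": int(count),
                 "total_usd": int(total.replace(",", ""))})

# Emit machine-readable CSV, the external-publication step of the lifecycle.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["year", "grants", "total_usd"])
writer.writeheader()
writer.writerows(rows)
```

This is the happy path; the hard part Raphael flags is that most report data never had this much structure to begin with, which is why managing information for publication from the start beats retrofitting it.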
I've had enough conversations to know that this is not an overnight process for the nonprofit
ecosphere. There's no light switch that will turn this on. But you have to think that
there is a way to trickle that down, or to encourage the release of high-value data. There
are ways to do this surgically, frankly: be really precise about what data you're after,
what organizations can provide that data, and what's a good environment for creating this
culture of better information management, to create that open data ecosystem.
>> Paul Light: Now we can go to Hope, or...? Yeah. Go, go, go, yeah!
>> Diego May: Super quick.
>> Hope Neighbor: I completely agree with Raph that the difficult part is this unstructured
data, but I think what's magical in an example like the World Bank is that the bank has
a very strong internal project development and management process, so every single piece
of project-related data that they publish externally is a piece of information that they
already needed for their internal process. So nothing is re-created; they're simply releasing
it. And I think there's tremendous potential on the foundation side, because there are
standard proposals and reports, to release that data. From there, I think the really tricky
part is how you then think about releasing all the underlying data sets. But if we were
to release just that first strategy, I think we would leap forward in terms of the information
we have access to.
>> Raphael Majma: To your point, there are certain things that all nonprofits do. They
manage information; they manage financial information. There are certain tools that could
be created to help them not just manage that better internally, but aggregate it for eventual
public dissemination. And it's one of those things where I can see there's a vacuum for
what those tools could be. Whether it's open source or proprietary, hopefully open source,
there's a way to share that across nonprofits and across verticals.
>> Paul Light: Diego.
>> Diego May: A quick comment, and I don't want to sound sales-y with this, but I want
to ensure that this is really easy. Either with our open source solution or the SaaS-based
solution, we've seen clients go from zero to having a complete open data portal with a
lot of valuable data in one day, and I'm not trying to make it "wow!" I'm not the only
one; there are other companies that do that. It's really simple. If you have the data
sets that you want to release, setting up an account to have an open data site, and then
having it ready and machine-readable and everything, can be done in one day. It's creating
an account and setting all that up.
>> Paul Light: Elizabeth, right?
>> Elizabeth Dreicer: So, just a question, because a point that you made a moment ago,
Diego, is that people are scared, so instead of doing anything, we'll do nothing. But
I want to articulate that, at least in healthcare, the point that was made, I'm sorry,
by Catherine? Was that while the data may not have been PII [personally identifiable information]
from any singular source, the constellation makes it PII. Have any of you guys seen anybody
working to try to detect when PII has emerged, and essentially then encrypt or somehow
address it? Because while it sounds great to just say, put it out there, I'm thinking
about it from a risk management perspective. If I put my risk management hat on for a
minute, as a trustee of a medical establishment, knowing that PII is possible from unique,
even biometric data, can I really just release and open the floodgates? I may have an
ethos that says, hey, I want to be open! And yet there is some responsibility.
I'm just wondering how we manage through this process, and what you guys are seeing in
the domain to help those folks who are duly risk managing overcome this?
>> Jason Payne: We can look at open through a couple different lenses. We can look at
open as any human being in the world having access to the info, or we can look at it as
collaborative communities. It could certainly be the case that in the healthcare space
it's plausible to reverse identities out of something, but for a group of academic researchers,
governmental researchers, etc., working collaboratively in a quasi-open fashion, there
might be the ability to say, hey, trusted agents wouldn't go down that path. Throwing
a wild idea out: could you do the same thing that large-scale software companies do, which
is offer bounties for zero-day vulnerabilities or exploits, so that altruistic hackers
basically get paid for finding these things? And say, if you can reverse-engineer this
data set, we will give you X, and try our best to redact that data set. That might be
a decent investment to make.
>> Paul Light: I think we have time for one last question, and some from our guests here.
Any other questions? Because we want Buzz to have enough time here as well. Any last
comments?
>> Dana Pancrazi: I had just a quick question; it goes back to something you said earlier,
Jason, about how foundations could help this along by setting up some space, money, infrastructure,
whatever. So, mindful of bad outcomes from good intentions, what is your current thinking
on what that looks like? For example, I don't know what the persistence of data and need
in an organization over time looks like. So is it creating (and perhaps the answer is,
it depends on the organization) infrastructure at the organization to do this on an ongoing
basis? Is it creating money and space for them to contract with somebody on an ad hoc,
as-needed basis? What is your thinking? Because I worry that groups who choose to truly
build it in house won't be able to afford the talent that they could access on an ad hoc
basis. But I have no earthly idea if ad hoc is sufficient.
>> Paul Light: Interesting question.
>> Jason Payne: Let me give you a hopeful and naive answer and see how it goes. When thinking
about dollars, at the end of the day it is to some degree a zero-sum game. If you look
at an issue set and fund players in that issue set, you're objectively creating competition,
creating a zero-sum game. If you look at providing data analysis capabilities for that
issue set, it can actually potentially be a unifier. So if funders are thinking about,
how can I help Syria, Sudan, disaster response capability, stopping trafficking, etc.,
and doing it in a way that says, this is for the collective, for people to make a good-faith
effort to contribute X and get Y back, without creating more entrants in an already crowded
space, then hopefully, naively, I would think that data and technology, with the access
control that we spent a lot of time talking about, could actually cause cohesion in some
of these issue sets. Again, that's a very naïve hope, but I think it's plausible.
>> Paul Light: Optimistic. Naïve.
>> Jason Payne: Optimistic, certainly.
>> Paul Light: Yeah, hopeful. Quick?
>> Mayur Patel: Not that quick. Maybe I'll hold it for Buzz, later.
>> Paul Light: Yeah, this is a great panel. Very nice, hopeful people.
>> Tim Durbin: I do have a quick question, just to follow up on that. We've had some conversations
about it in the past, and you alluded to it earlier, talking about the government's view
on opening up data: the average citizen probably doesn't think of that as inherently our
right as taxpayers, but inherently it is, and you guys recognize that. Can you talk about
some of the struggles that would apply to foundations that have that data, in some format
or another, how you've thought about opening that up, and maybe whether there's contract
language or such things to enhance it?
>> Raphael Majma: Definitely. One of the things we're looking at right now, and it's in
the executive order, is grants, procurement, and contracting processes. We have the authority
to ask for underlying data; we don't necessarily have the language or the use cases of
what it looks like to do so. I think it would be really interesting if, as we cultivate
this, foundations could start thinking about how to cultivate this from their grantees
as well. We recognize all the problems. There's a technical deficiency, and I don't mean
in know-how, I mean in actual tools to collect, and then there's the question of, now
we have the data, what do we do with it? There's no data.gov for the philanthropic world.
Is one necessary? I don't want to presume to say, but a platform for accessing data, so
that everyone has that baseline of information for the philanthropic sector, is, I think,
important in some respects. I don't know what that platform looks like, but I think it
would be really interesting to start exploring.
Right now we're making a lot of changes to data.gov. It's going to be called next.data.gov
for the time being, and we're going on a federated model, which is similar to what you would
expect from the foundation world. We're not asking everyone to go to data.gov. We're asking
everyone to manage their own public data listing, and then we federate that up to the data.gov
portal.
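[Editor's note: the federated model described here can be sketched as merging per-publisher listings into one catalog. The listing schema below is invented for illustration and is not data.gov's actual format.]

```python
# Each publisher (an agency, or in the foundation analogue a nonprofit)
# maintains its own public data listing; a central portal merges them.

agency_listings = {
    "agency-a": [{"id": "a-budget", "title": "Budget 2013"}],
    "agency-b": [{"id": "b-grants", "title": "Grant Awards"}],
}

def federate(listings):
    """Merge per-publisher listings into one catalog, tagging provenance."""
    catalog = []
    for publisher, datasets in listings.items():
        for ds in datasets:
            catalog.append({**ds, "publisher": publisher})
    return catalog

catalog = federate(agency_listings)
```

The design point is the one Raphael makes: each publisher keeps ownership of its own listing, and the portal only aggregates, rather than everyone hand-delivering data to one central keeper.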
You can imagine the analogue in the foundation world: instead of having all of the nonprofits
in your portfolio give you all their information, they just manage it publicly online,
whatever is available to be managed publicly without any privacy concerns, and then that
flows up to the foundation to be called upon when necessary. I think that could be an
interesting model, but I wouldn't presume to know that that is the answer.
>> Paul Light: Let's thank the panel.
[applause]