[applause]
>> Paul Light: So nice to see so many of you who are clinging to hope that the answer is
here, and it might be! We've got three really talented people. We are all three—or all
four quite aware that we are what stands between you and freedom, and your rides home. So we
are committed to pith—pithiness? Is that correct? It's a really great panel, we've
heard from all of these characters before. I've got Diego May right here; he's the CEO
of Junar, which is a cloud-based—
>> Diego May: open data platform.
>> Paul Light: open data, I'm going to ask him a tough question about that. We have to
his right Raphael—no, Raphael's at the end, Majma, who's senior advisor to the CIO.
>> Raphael Majma: T-O.
>> Paul Light: CTO—Chief technology—did we create that by statute or is it by order?
>> Raphael Majma: I can go look it up if you like; I don't know off the top of my head.
[laughter]
>> [woman off camera]: We used executive order.
>> Paul Light: Executive order. Because we have CFOs, CIOs, we do a lot of this, but
this is one that has some, some strength to it by executive order, so it's nice having
you here, and you can say a little bit more about what you do. And Jason Payne, from Pinatar.
>> Jason Payne: Palantir.
[laughter]
>> Paul Light: Palantir. [whispering] Pinatar, Palantir. [speaking] It must be a variation,
some word play. Does a wonderful amount of work on the philanthropic, providing software
and support, which is not tax-deductible, so this is good to the heart. This is all
from the heart. Not that giving things that are tax-deductible would be anything but from
the heart. So I'd like each one of you to say a little bit more about what you're doing,
and then I'll ask a question or two, and then I'll open it up and we'll continue and bring
this all to a close with some remarks from Buzz, I believe. Anyway, you want to start?
>> Diego May: Why not? So we started Junar with my co-founder five years ago, and
five years ago we saw that a lot of people were talking about—actually, very few people
were talking about how to publish data. The problem that existed back then was that if
you wanted to use data, if you're a user like you and you wanted to use data, it was very
difficult to find data. Once you found data if you were lucky, then you would find that
data in PDF format or in big Excel files. Then you would find that you cannot use that
data. And on the other hand, if you were an owner of very valuable data and you wanted
to open up that data, you didn't have a clue on how. By the way, if I ask now, how do you
go to start creating an account to publish 140 characters of content, you know where
to go of course. If you want to do the same for a video, you know. But if I tell you,
go out there and find something that provides you an account to open up data, probably you
don't know where to go.
So five years ago we decided we're going to create the easiest-to-use open data platform.
It has to be brutally easy to use, and it has to be amazingly cheap. We didn't get yet
to the brutally and amazingly, but we're on our way.
[laughter]
But we have today about 45 paying clients, most of them governments, usually city governments
that have the mandate to open data. They want to do that because that's the right thing
to do. And they're using our platform to be able to accomplish that and get a lot of benefits
out of that. And that's it.
>> Jason Payne: So my name is Jason Payne; I have the pleasure of leading the philanthropic
engineering team at a software startup in Palo Alto, just down the road, by the name
of Palantir Technologies. So as a company we build data analysis capabilities for enterprise
organizations and work in the public, private and commercial sectors. I have the pleasure
of leading our work in the social sector where we donate our software to data-driven organizations
and help them empirically address the problem sets they're chartered to solve.
A very good tactical example of this in the context of open data is an organization that
we support out of Santa Barbara by the name of Direct Relief, formerly Direct Relief International,
and they're using our software to improve their ability to donate medicines and medical
supplies to those that need them the most. So they are a trusted agent to which big pharma companies
will give in-kind donations of medicines, medical supplies, etc., and then they're using
our software to donate those optimally. So a very concrete example of that would be,
if you have 20 million insulin injections, how do you most efficiently distribute those
in America? What data would you want to use to make those decisions? How are you going
to pull in hurricane track data and flood plain data and diabetes rates data, and social
vulnerability data and census data and that sort of thing, to build a cohesive model and
a cohesive picture to say, these are the ways, these are the counties, these are the federally-funded
community clinics where we can donate these supplies and have the most impact. And so
we're working in, the focus areas that we have today are anti-trafficking efforts, disaster
relief efforts, which we define as both natural as well as political disasters, supporting
veterans, and global health.
>> Raphael Majma: HELLO!
[laughter]
Sorry. I saw faces down so I thought I'd start a little louder. So I'm Raphael, I work at
the Office of Science and Technology Policy over at the White House. Now I'm a senior
advisor to the Chief Technology Officer Todd Park, but before that I was a Presidential
Innovation Fellow. And I'll take a minute just to say what that is, and if you heard
me say it before I apologize for repeating the joke, but it's kind of like Peace Corps
for geeks, but instead of going to like a developing world country, you go into a place
like the Department of Education, USDA, and you go—
>> [woman off camera]: It's a lot like being in the Peace Corps, let me tell you.
[laughter]
>> Raphael Majma: [laughs] And it gives you an opportunity to tackle a discrete problem.
My problem was nonprofit data, and there are a lot of different ways that government currently
collects data about nonprofits and then makes that data available. Not in the most machine-readable
ways possible, but we're trying to work on that. Since staying on board I've continued
to work on our open data executive order, which mandates that all future
information collections are open by default. So that means that in the future, [knocks
on wood] all of our open data is going to be just, it's going to be the presumed default
and agencies will have to actually say why it shouldn't be open, rather than say why
it should be open.
>> Paul Light: Do the classifiers, are they on board with that yet? Can you, the default
position being, the people who have the authority to classify it as national security, there
are about 300,000 of them.
>> Raphael Majma: So I can't speak to the individual...efforts.
>> Paul Light: Agencies?
>> Raphael Majma: But I can tell you from our perspective the executive order actually
lays out the reasoning why that should be the case, and it creates the mandate for doing
so in the future.
>> Paul Light: Good. Good. Well, we'll ask a couple of questions here which is, really,
whether there's any data that you think should never be released?
>> Diego May: I'd start with something... one thing that perhaps the three of us, and
I'm not sure if it's general to this room here, when we think about open data, at least
when I think about open data, I'm thinking spreadsheets. I'm thinking spreadsheets. I'm
thinking tabular data. I'm thinking dashboards with interesting maybe charts, but usually
a lot of tabular data. And as we were discussing before with some people here and there, sometimes
this may be unstructured data first, a lot of data coming from different documents and
then somehow, with a lot of intelligence and a lot of tools, that becomes structured data,
and then that structured data can be opened up. Sometimes it was already structured data
that existed in a system, in a file that someone is working on, or in a database.
With that said, the question of, is there any data that shouldn't be open, more or less?
If you ask me, I'm a fan of all data being open. Of course there are a lot of privacy
issues when we deal with governments. City of Palo Alto, city of Sacramento—they always
ask the same question: "I have a lot of data. I want to open a lot of this data, but there
are some pieces of data that if I open it up, then I'm going to be crushed," and any
open data system has to take that into account, because that's a key thing. Whenever we're
touching privacy aspects, whenever we're talking about who is earning what in the case of
government, and you know your world better, there is some data that doesn't have to be open.
>> Raphael Majma: So in government, we have open data, which is kind of aggregated information;
we collect census information, things like that. And then we have what we call "my data",
which is letting a citizen access their own information. This is really popularized in
areas like health, student education records, and your energy consumption. One of
the things that we really make a hallmark of "My Data" is rigorous privacy protection,
meaning you can choose to download that, and you can choose to send it to a trusted third
party; that is your control; that is your right. We and other institutions, as governed
by laws like HIPAA and FERPA, don't have the right to make that data open and available.
>> Jason Payne: I think when you look at the philanthropic ecosphere, by and large, we're
trying to help those that are most vulnerable, and that same population is the most vulnerable
to exploitation. And granular data that identifies those individuals used in illicit purposes
can radically harm those individuals, so the utmost care must be taken to ensure that those who
have access to that information have a need to know to use that data. I think something that's
very important, but under-discussed, is the need to audit immutably what people are doing
with information. I see the ecosphere as a whole making a huge rush toward the siren
of Big Data solving all of the world's problems, but that data, illicitly used, can open up
significantly more problems. So anytime that something is granularly personally identifiable,
or even possible to be backwards-computed to be personally identifiable, the utmost
care must be taken in how that's shared, who has access to it, and ensuring that it's used
correctly.
>> Paul Light: How do you get organizations that are in the same social impact line of
work to share data so they're not overlapping. Let's say in your particular case, you've
got a number of organizations that want to deliver vaccines and medical supplies. So
that you don't have pileups and big overlaps, how do you share data across those organizations?
Has it been your experience that they're sharing data to avoid the waste?
>> Jason Payne: I've seen good, I've seen bad and I've seen ugly.
[laughter]
I've literally had an NGO say, what's in it for me? I have 80% of the solution. Three
or four other organizations might have the remaining 20%, but I need it to be worth
my while. So I've seen selfishness. I've also seen incredible selflessness.
One thing that's really interesting is the notion of data exhaust: there are
all these organizations that are collecting information to do their day job. They're helping
farmers, they're helping families, etc. For other organizations, like the World Health Organization
or folks trying to stamp out malaria or other tropical diseases, that data about microloans
may actually be a very good data source to make a decision on how to allocate other resources.
So finding organizations that can put data out to the collective and incentivizing that—at
the end of the day, it's actually expensive to do that. To host it on a server; to hire
the right engineers to build those application programming interfaces, etc., cost an organization
money, and I think that funders and foundations inspiring that and frankly [aligning? 13:22]
a relatively small amount, but having things like, we discussed earlier today Kiva making
an effort to make all their data publicly available, that sort of very small investment
can actually have massive returns, looking at this as an ecosphere and not a marketplace.
>> Paul Light: So data exhaust in this particular usage would be the release of data by the
organizations or...?
>> Jason Payne: Yes.
>> Paul Light: OK. as opposed to data exhaustion.
>> Jason Payne: Yes.
[laughter]
>> Raphael Majma: Hopefully after today we're not in data exhaustion. But, I think, it seems
to me that you can't create inherent demand for data. However, there is a demand for it,
so there is an ecosystem chomping at the bit in a couple different areas. And by creating
a business case for why organizations should pool resources, or should create a baseline
of information, you allow easier access for, possibly, competitors, but usually
just people who want to compete in a different space by using that information.
I'll give you an example. iTriage uses GPS data, but they use it in a way that no one
else does, or in a way that fewer others do. They use it to track where you are, and track
where local healthcare providers are that meet your symptoms. They have no way to compete
with TomTom, who also tracks that same GPS data. So by creating that baseline, I think
there's a real business value, not just for the participants who are using the data, but
also for the funders and the nonprofits who actually aggregate it themselves.
>> Diego May: To comment on that: the World Bank usually talks about opening up data
on what we know, what we do, and then on the operations, the impact. And I think that is
interesting for almost anyone that is doing interesting things, learning a lot and exposing
that. And I'm also in the line of, if I have data and if I'm leading organizations, if
I'm one of those that is scared of everything, then I'm not going to share anything. But
if I'm a leader in what I'm doing and I really want to do good things, I should be very open
to share data that can really then benefit others and create a larger value.
In our case, whenever we talk to these organizations that have these reactions that you mentioned,
fine, I know that there are going to be a lot of organizations that don't want to share.
Let's go to the others that want to share.
And I've seen a lot of NGOs, an example of one in Panama that has a lot of entrepreneurship
data, and they create a lot of valuable reports, and then they open up these reports, and then
they do hack-a-thons and they invite people to collaborate, and you see, and you've probably
been to a lot of these hack-a-thons... has anybody been to a hack-a-thon before? Any
of those governments? Amazing. This is an awesome crowd. But the energy that you feel
there, when you see all this data put there, government officials asking questions, I mean,
they need to solve problems, and then you have citizens, journalists, and other people
trying to solve those problems. It's high-energy, but it can only happen if you really open
up and you're ready to share.
>> Paul Light: And allow the data to be manipulated, right?
>> Diego May: Yes. Yep.
>> Paul Light: So, as opposed to releasing it as a dashboard or as a summative kind of
piece, it's available for analysis.
>> Diego May: Yep.
>> Paul Light: And how much do you think is out there to that degree, where we can actually
manipulate it? What have you been seeing?
>> Raphael Majma: Oh boy, well that is a great question. So, we know what we know, but we
have no idea what we don't know. And I think, at least in the nonprofit ecosystem, there's
a lot of organizations who probably collect data and don't even realize it's data—or
rather, collect information, and don't realize it can become data. And I make that distinction
on purpose. As far as government goes, we have tens of thousands of data sets available
to varying usage, so I think that, when we talk about—I couldn't begin to put a number
on it.
>> Paul Light: Does somebody need to say along the way, independently, the quality of the
data and whether we really have data? Have you thought about that question, about sort
of telling the people who might manipulate or use it, well, we've got X confidence rate,
or it has Y validity, or so forth?
>> Diego May: I think one initial piece of the answer to that is, I think in the early
days of the web, of the content web, and I like dividing the web into the content web
and the data web, the content web, initially there was a lot of—and I cannot use the
word; we're in front of the cameras—bad content.
[laughter]
Right? And then the web itself started curating itself and now there are better ways of finding
the right content. We're at the very very very early days of the data web, so I think
that whenever we're asking about quality of data, we need to have data out first. And
once we have a lot of data out that can be used, then there's going to be a lot of ways
of finding the right data, to be able to find the curators of the data, to be able to find
organizations that can define the quality of the data. But today we're at only the tip
of the iceberg of the data web, I think.
>> Raphael Majma: And I mean, it's happening now, to some degree. I can't tell you how
many people in my twitter stream talk about standards data, and the efficacy of it, the
accuracy of it, and places like Stack Overflow have dedicated data communities. So I'd rather
not hear from the data owner about the quality of the data, I'd rather hear from someone
that's been using the data for a while about how great it is. Being able to go to places
like Stack Overflow and Stack Exchange, who kind of, that's where geeks sort of live,
that's where they reside, that's where they talk about these things, I'd trust them much
more than I'd trust some other places.
>> Jason Payne: So, a couple points. The first is, I'm a strong believer that, if you build
open interfaces to access data, the right solutions will materialize from technological
folks using it, rather than, let's get 8,000 folks in a room to build an information exchange
model that will rule all information exchange models. And just release it in a granular,
transparent, easy-to-use fashion, and who knows what will come out of it? New York
City released taxi cab data on pickups and drop-offs, and you don't need an information
exchange model that's ubiquitous across 80 data sources but some NGO can use that to
make New York City a better place.
The second point: I think it was Robert Kirkpatrick at the
UN Global Pulse who introduced this concept, at least when I first got it: the notion of data
philanthropy. Right now philanthropy really can be bucketized into giving time or giving
money. Let's add a third leg to that stool, of giving data. And obviously with correct
data access and auditing and that sort of thing, there are so many data sources out
there that could be very helpful. An example that I find interesting is that if I had Coca-Cola's
data, in an Indonesian tsunami or an earthquake in the developing world, if I looked at sales
rebounding of affected areas, I would basically just put aid where sales had fallen the most,
blindly, and that actually would be a really good metric to decide where to distribute
aid, over anything else. Finding a way that Coca-cola could do that without risk of that
data being used for competitive reasons—if that incentive can be done, that could be
a potentially massively valuable thing.
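As a rough illustration of the sales-rebound heuristic described above (the function, the area names, and all figures are invented for the sketch, not Coca-Cola's or anyone's real data):

```python
# Sketch of the sales-rebound heuristic: rank affected areas by how far
# sales have fallen from a pre-disaster baseline, and send aid to the
# areas with the largest relative drop. All numbers are made up.

def rank_by_sales_drop(baseline, current):
    """Return area names sorted by relative sales drop, largest first."""
    drops = {
        area: (baseline[area] - current.get(area, 0)) / baseline[area]
        for area in baseline
        if baseline[area] > 0
    }
    return sorted(drops, key=drops.get, reverse=True)

baseline = {"Aceh": 1000, "Medan": 800, "Padang": 600}   # pre-disaster sales
current = {"Aceh": 150, "Medan": 700, "Padang": 540}     # post-disaster sales

print(rank_by_sales_drop(baseline, current))  # Aceh's sales fell the most
```

The output ordering is exactly the "put aid where sales fell the most" ranking Jason describes.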
>> Paul Light: That's what you call a bellwether.
>> Diego May: Can I say something?
>> Paul Light: Sure!
>> Diego May: I think that we agree that data has to be out, and I love the concept of the
data philanthropy. One thing that Jason mentioned, and I think it's super important: defining
the quality of the data is difficult, and it's going to happen over time. But for defining how data
can be exposed, there are clear standards: it has to be searchable, so data has to become
SEO-friendly, friendly for the search engines. It has to be usable once you find it. It has
to have an API that allows people to use that data systemically, and I know that not everybody's
technical here, but that's crucial whenever you are defining how you are going to open
up data. Then the quality of the data itself, that's going to progress over time.
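To make the "systemic access via an API" point concrete: a client typically composes HTTP queries against a portal's endpoint. The endpoint URL and parameter names below are invented for illustration; real portals (Socrata- or CKAN-based, for instance) each define their own query interface.

```python
# Sketch of composing a query URL against a hypothetical open data API.
from urllib.parse import urlencode

def build_query(base_url, dataset, filters=None, limit=100):
    """Compose a URL requesting up to `limit` rows of `dataset`
    that match the given filters."""
    params = {"dataset": dataset, "limit": limit}
    if filters:
        params.update(filters)
    return f"{base_url}?{urlencode(params)}"

url = build_query("https://data.example.gov/api/v1/query",
                  "building-permits", {"year": 2014})
print(url)
```

Because dict insertion order is preserved in modern Python, the resulting query string is deterministic, which helps when caching or logging requests.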
>> Paul Light: I think we're going to open up for questions. Dana, did you have one?
>> Dana Pancrazi: I did, I do. I struggle with the idea, because we're interested in
service of the public good, we're also interested in commercial enterprise and maximizing financial
return. Talk to me a little bit, and we spoke about this a bit, what are the bounds of,
where does open data and monetization of data meet? And is that regulated currently on agreement?
Is there a structural barrier? Is there a barrier? I'm unclear. When I say open data
I think I typically mean available to many, but I have personally no idea where the bounds
are on that and where monetization begins. Or doesn't.
[laughter]
>> Raphael Majma: I think it depends on the source of where you're getting the data from.
If it's from the government, it's 99.9% free. It's the taxpayer data: you already paid for
this; you already paid for its collection. It's your data to do with what you wish. If
it comes from other places there might be licensing issues that are related to that,
or contracting issues. How you monetize that might be dictated by the terms of that agreement.
But as far as government data goes, there are many many companies that have been started
on the backbone of government open data, and we love seeing that. That is the best thing
ever, as far as we're concerned.
>> Jason Payne: There are some cases where there's data that's generated as a by-product,
and the open release of it's just a positive externality. There are other cases where a
freemium model is needed, freemium meaning that in some cases the data
can be freely available, or subsets of it can be freely available, and in other cases,
more verbose explanations of the data, or better capabilities to analyze that data would
cost money. Or there are ways, a sort of public-private partnership, where governmental users of the
capability, or corporate users of the capability, might basically fund the public use of that
data or capability.
Unfortunately there's far too much need in the data world for humans in the loop, and
when you're releasing PDFs and when you're releasing scanned paper documents, etc.,
at the end of the day, if it's an hour apiece and you've got 10,000
data points, that's five man-years. Even in a philanthropic context it's hundreds of thousands
of dollars that an organization would have to invest before they could open up that data,
and everyone's like, oh yeah, you have that data, you should open it! But no no no, they
have to get paid for it, otherwise they're insolvent.
Furthering the concept of data philanthropy, I would encourage funders to look at that
as an impact, and as an area of investment: capitalizing those sorts of
efforts for that public good. Because if that can improve an ecosphere, that small investment
is going to pay far more dividends than on-the-ground funding of specific things.
>> Diego May: I think that Mayor Bloomberg would agree that data can be monetized, and
by the way, 80% of the data that Bloomberg provides is open data, right?
>> Dana Pancrazi: Quite clearly!
>> Diego May: Then they curate and package and do great things with that. And then there's
people that pay $1500 a month to use, to have access to that, Capital IQ made a lot of money
out of that as well, and Reuters and some others. I think that there's a lot of, we
were talking with James [Lum, of Guidestar] these days, the business case, the business
models around open data, and they are there. You are putting a lot of value on top of the
data that you are generating, and you have a lot of, in the case of James, a lot of cases
where people are paying to have access to that data. You're subject matter experts.
So once you have data, I agree, a freemium model. Some is going to be for free, it makes
sense, and then for that data where you're putting a lot of effort, or for that data
that is being accessed systemically via an API in a broad way all the time, to always get
the latest values, there are going to be revenue models. And there are even cities already
thinking about how to monetize this data that's going to be accessed all the time on a very
frequent basis.
>> Paul Light: Bloomberg has been buying up a lot of capacity to understand government
data, and they have a very significant investment in monitoring government. Probably the largest
now—larger than the Washington Post and so forth. So how do you feel about the fact
that they're buying the capacity to analyze the data, and that that capacity is very very
expensive to create?
>> Raphael Majma: So I'm not actually familiar with what Bloomberg has done, but the capacity
to analyze data is a good thing; we should be able to analyze data more robustly and
better. I don't really know what they're analyzing for, but I don't see a problem with that.
>> Paul Light: Ok, well, we'll talk.
>> Raphael Majma: Ok!
[laughter]
>> Paul Light: Well, there's something—I mean, you know, weather.com uses NOAA data,
but they're not the only one that does it, but there are some highly specialized data
sets that are open, but they're so expensive to analyze that they can only be analyzed
by somebody with some deep pockets, I guess. Other questions? Yes.
>> James Lum: I had a question about personally identifiable information. So, we're more involved
with organizational data, and obviously open data would put more data out there, but as
long as there's a silo, a box around it, a piece of data, that's open, it's more useful,
but the real value comes when you can tie this data set with that data set, and you
have a common field in between, which for all of us is pretty much the EIN, despite
its limitations.
So you guys were talking at depth for a while about personally identifiable data, when it's
the individual, and to Mauricio's point, if you're measuring impact at the individual
level, how does the industry mash personally identifiable data from one source versus another?
Jason, you said you can, you know, machine compute who this person is, if you're in a
silo, but how do you bring data from three different sources that have some sort of—is
it social security numbers? I have no practical experience. I thought maybe you could shed
some light.
>> Jason Payne: One of the things that I'll say when talking about any data is, a common
set of unique keys is easy with EINs for this ecosphere, but the more unique IDs that
you can have, the easier it is to build these collaborative efforts across data sets. There
are sophisticated algorithms where you can deterministically look at property matches
that define what would be a known match, and so that can be organization name, that can
be address, that can be phone numbers, a plethora of things that would be semi-unique, but putting
them together would be unique. So there are a couple of ways to combine data in that capacity.
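The deterministic property matching described here can be sketched as follows; the normalization rule, the field list, and the two-field threshold are illustrative assumptions, not any particular product's algorithm:

```python
# Sketch of deterministic record matching: declare two records a match
# when enough normalized semi-unique fields agree. The fields and the
# threshold of 2 are illustrative assumptions.
import re

def normalize(value):
    """Lowercase and strip punctuation/whitespace so near-identical
    spellings compare equal."""
    return re.sub(r"[^a-z0-9]", "", str(value).lower())

def is_match(rec_a, rec_b, fields=("name", "address", "phone"), required=2):
    """Match if at least `required` of the given fields agree after
    normalization (missing fields never count as agreement)."""
    agreeing = sum(
        1 for f in fields
        if rec_a.get(f) and rec_b.get(f)
        and normalize(rec_a[f]) == normalize(rec_b[f])
    )
    return agreeing >= required

a = {"name": "Acme Relief, Intl.", "address": "123 Main St.",
     "phone": "(805) 555-0100"}
b = {"name": "Acme Relief Intl", "address": "123 Main Street",
     "phone": "805-555-0100"}
print(is_match(a, b))  # name and phone agree; address does not: a match
```

Each field is only semi-unique, but requiring two or more to agree makes the combination effectively unique, which is exactly the "putting them together would be unique" observation.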
On the PII question, there's an interesting trade-off of what's a known clear value about
that sort of data coming together across organizations, juxtaposed with what are the risks that happen?
And unfortunately, you really do have to say, what's the worst-case scenario? And plan against
that. There have been some well-documented cases where, even at an aggregated level,
someone releasing *** documents about disease outbreaks in certain villages ended up outing
the person that was the patient zero in that village, and resulted in ostracization
and problems there.
One of our disaster relief organizations actually, for lack of a better word to describe it,
had gotten hacked, and had people misrepresent them and misuse the data that came out.
Collecting data from folks to say, what help do you need? Where are you? and geographically
coding it is great if you have great intentions; it's also great if you want to exploit those
people. And so to be honest, I'm terrified, as I see this huge gold rush toward Big Data,
that these questions need to be asked, because as an ecosphere we don't get too many "oopsies"
before this really gets shut down. So I want to be very conservative when working with
those types of data.
>> Paul Light: Can I just interject a question here in terms of when the data should be released?
For example, you get an academic paper that argues that austerity programs are essential
for growth, recovery and so forth. The paper is released; we build an entire regime of
policy to cut budgets, and then the data eventually come out, and they don't prove the case at
all. So is there a standard there about release at time of conclusion drawn, that you all
might embrace, or not?
>> Diego May: You know my answer, probably; it's as soon as possible! I guess, as soon
as someone is ready to publish a paper, all the data behind the paper should be released.
That's my initial reaction. And again, thinking naively of a very constructive world where
you open up data, you expect comments, you expect also to find somebody that says, with
your data, I arrived at a different conclusion, so you're wrong, and then you're happy that
someone found you are wrong.
Not everybody works in that way,
[laughter]
—but my feeling is, whenever you have valuable data, open it up, allow people to use it and
have fun with it.
>> Paul Light: I'm releasing a book and I'm thinking of releasing the data, so you'd say
I should do that, right? But I'm terrified.
[laughter]
Because I know, I just, I think you said that, somebody said that, I'm sure there's a mistake
in there. There's got to be. Because I am close to perfect, but not quite.
[laughter]
So, it's a scary prospect. But you've got to expose yourself to that, right?
>> [man off camera]: You could publish a second update!
[general chatter]
>> Paul Light: We should have the second editions, which say "The I Made a Mistake Edition",
right?
>> Raphael Majma: "What I really meant" version.
>> Paul Light: Yeah, "What I thought I should have found".
>> Jason Payne: You know, what I think is actually far more important than when the
data should be released, is when the data should be forgotten.
>> Paul Light: Interesting.
>> Jason Payne: Again, on this notion of vulnerability. It's valuable to have data about health information,
maybe arrest data, all sorts of data points. But, looking at the
graph we had yesterday, once we move into this green area instead of that green area,
that being an artifact that follows that individual around for decades I feel is inappropriate.
We should also look at when do we delete, destroy or otherwise let data be forgotten?
And that's something that should be a policy, or something that organizations think about
as well.
>> Paul Light: That's interesting. Can it ever be forgotten? Just, exists out there
someplace?
>> Jason Payne: I would say yes. Open data, maybe not, but when it's talking about some
of the data systems, with the keynote, with FII, some of that data could have data retention
policies where it is actually removed from the system. Once it's out on the internet,
no, but in an internal system, yes.
>> Paul Light: Other questions?
>> Brad Langhorst: I wonder what technical things could be applied to things like the
Netflix case, things where people thought they had anonymized data? Was it effective?
Or the case you mentioned with the disease, patient-zero situation. I'm not aware of sophisticated
technology that says, ok, this is sufficiently anonymized, or sufficiently aggregated, or
enough noise has been added that you can't back out individual information. Are you guys
aware of technology that can do that kind of thing and give you some confidence about,
I'm releasing this; it's permanent; you never get it back. How can you decide it's okay?
>> Raphael Majma: I don't think technology alone solves that problem. I think there are
processes around making sure the data set you're about to release is covered. There's
a human component and there's a machine component to it as well, so technology alone probably
shouldn't be the solution to this. In the case of the federal government, there are
processes around making sure that you're doing the best possible job, and making sure personally
identifiable information doesn't get out, or that it isn't included in the first place.
In some cases we also consider the mosaic effect, where releasing a lot of data sets could
lead back to a patient zero, for example. But it isn't just a tech fix, and I don't know
if it will be a tech fix anytime soon.
>> Diego May: I agree 100%.
[laughter]
>> Catherine Havasi: I think that it also becomes very difficult because, as in the
Netflix case, there's always interaction between data sets that ends up revealing things. And
so even if when you release your data set, it doesn't reveal anything, someone may release
something down the road that will cause your data set to reveal a lot of things. And that
becomes something that technology doesn't solve.
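[Editor's note: the interaction between data sets that Catherine describes, often called a linkage or "mosaic" attack, can be sketched in a few lines. Every name and value below is invented; this is a toy illustration, not any real data set discussed here.]

```python
# Two releases that each look harmless on their own can re-identify
# people when joined on shared quasi-identifier columns.

medical = [  # released "anonymized": no names, but quasi-identifiers remain
    {"zip": "02139", "birth_year": 1975, "sex": "F", "diagnosis": "flu"},
]
voter_roll = [  # a separate public release that does include names
    {"name": "Jane Doe", "zip": "02139", "birth_year": 1975, "sex": "F"},
]

QUASI_IDS = ("zip", "birth_year", "sex")

def link(released, public, keys=QUASI_IDS):
    """Join two data sets on their shared quasi-identifier columns."""
    index = {tuple(row[k] for k in keys): row for row in public}
    return [
        {**row, **index[tuple(row[k] for k in keys)]}
        for row in released
        if tuple(row[k] for k in keys) in index
    ]

reidentified = link(medical, voter_roll)  # the diagnosis now has a name
```

The point of the sketch is Catherine's: the medical release reveals nothing by itself, and only becomes identifying once someone else publishes the second table.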
>> Paul Light: Response? ...Agreed.
>> Raphael Majma: That's correct.
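[Editor's note: on Brad's question about whether "enough noise has been added": one technique sometimes used for aggregate releases is differentially private noise. A minimal sketch for a simple counting query follows; it is illustrative only, not a production mechanism, and does not contradict the panel's point that technology alone is insufficient.]

```python
import random

def noisy_count(true_count, epsilon=1.0):
    """Return a count perturbed with Laplace noise of scale 1/epsilon.

    A counting query changes by at most 1 when one person is added or
    removed (sensitivity 1), so the standard Laplace mechanism uses
    scale 1/epsilon. A Laplace draw is the difference of two
    independent exponential draws with rate epsilon.
    """
    return true_count + random.expovariate(epsilon) - random.expovariate(epsilon)
```

Smaller `epsilon` means more noise and stronger privacy; individual answers are deliberately wrong, but averages over many queries stay close to the truth.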
>> Paul Light: Other questions? Is there any indicator that you would use—I know you
know I've asked this question before, but if we're looking at technology or open data,
anything in particular that you've seen that would be a bellwether of a bad system? Or
an indicator that you ought to take a deeper look at whether there's some gaming going
on with the data?
>> Jason Payne: That actually brings up an interesting point: call it Big Data if you
want to use the buzzword of the year, call it data mining; it's very dangerous and problematic
in this ecosphere. Doing Big Data on a Wal-Mart, Amazon, GM, or Ford type data set that's
cleansed, manicured, uniform, complete, you can actually run deterministic algorithms;
you can run predictive algorithms. Doing it on humanitarian development data especially
is a very, very dangerous proposition.
I've looked at certain data in Afghanistan where the data says that violence stops somewhere,
and it turns out that's because folks got pulled out of their humanitarian work in that
area, because it got too violent. Running a machine algorithm on that data set would say,
hey, there's no problem there. So there's sort of a troika required
to be successful in my opinion, and that's data, technology and domain expertise, and
in this space it's incredibly important to have that domain expertise. As an outsider
looking at the data I'm going to make terrible decisions on how to solve a problem, without
that knowledge of that specific problem set.
>> Raphael Majma: I mean, you just described the mix of a good hack-a-thon. It takes people
who understand technology, data experts, and then doers. Without one of those you have
a situation where you're not going to create a minimum viable product. And to your question,
I want to shy away from the sys—well, shy towards a different type of system question,
and I see a lot of people that talk about open data do not understand the systems that
create data, or that structure data. There are systems that create basically proprietary
data sets, meaning, unless you use that same vendor, unless you use that same content management
system, you can't access the data, you can't make sense of it. So focusing on open formats
and focusing on making things machine-readable across various platforms is, I think, where
I'd love to see the nonprofit sector or ecosystem go.
>> Paul Light: Do you think that people who are generating the data have the training,
the kind of ethical grounding you're laying out here, Jason, to know what they're doing?
Do they need some help? And the consumers of it, the policy makers and so forth, do they
have enough technical information to use this wisely?
>> Jason Payne: Hmm.
>> Paul Light: There's that wry smile where you're like, hey, I think I have an answer
to this one!
>> Jason Payne: I'll say, ethically yes, there are certainly bad actors in the ecosphere,
but they're basically trying to rip people off; they're not actually collecting data
and going out and unethically using it. That being said, frankly, the road to hell can
be paved with good intentions.
By and large, I think that there's not enough technical talent in the ecosphere, and when
I say technical talent I don't mean people that ten, twenty years ago wrote some code.
I mean people that know how to use modern APIs, modern data analysis, et cetera. Standing
on a soapbox, I really think that the funders and the foundations have the ability to instill
change, and to instill as a requirement for grants certain processes and procedures.
Take the comment about the Italian restaurant with four employees not being able to do
data: if, as a condition of the grant to that sort of organization, it came with a data
system that actually saved the organization time and made it more efficient and effective,
you'd get two hands up if that was part of it. Here's where I absolutely love what the
SalesForce Foundation has done: they've got 14,000 clients using their software, very
open, very easy-to-use software, and an entire ecosystem around it where, anytime I come
across an organization that is using SalesForce, it makes our data analysis technology applied
to their data so much less risky, because of the quality of that platform. So I think
that foundations donating that as part of a grant could actually significantly help those
small organizations.
>> Paul Light: That would be a form of operating support, which is prohibited in the foundation
world.
[laughter]
Oh! I didn't say that!
[laughter]
I didn't mean that. Are you saying that my COBOL training is no longer valuable? Because—
[laughter]
>> Diego May: Co-co-what?
>> Paul Light: I was very popular at Y2K.
[laughter]
I was very much in demand. Other questions? Yes.
>> Diego May: Can I--? Sorry, super quick on that. I think one message that should be
very clear is that we shouldn't be concerned or scared about open data, or about opening
data either. There are some processes, some common-sense guidelines about what data to
open, but after that, it is much simpler. And if it's not, it's our fault. NGOs, associations,
governments should be able to very easily open up data. It should be as easy as putting
up a blog post or a web page, with the same type of processes, right? But somehow whenever
we talk about data we get so concerned, and then it's, oh, before doing anything wrong,
I prefer to do nothing. And I think that opening data is no big deal. You have data, you
ensure that you're not putting out private or individuals' information, but after that,
it's pretty simple, and it's pretty common sense. It can be done.
>> Renee Westmoreland: You set me up perfectly here, because I think that there's been this
sort of thing going on the last couple of days that's like, "Oh, taking all the data
and running it through something, that's just the super simple part." So, I'm not entirely
sure that most organizations out there have that experience. So I was wondering if you
might be able to give an example of a case of taking an unstructured data set and turning
it into structured data, and how you come to the conclusion that that's pretty simple.
>> Raphael Majma: Actually, can I just jump in? It's going to answer part of your question
I think. When people talk about opening data, usually, especially in the nonprofit ecosphere
I think they say, here's the reports you have: Open that data! And that's difficult, that's
actually pretty difficult, because that means possibly going through an OCR process on the
PDF, or whatever type of format it is, or actually going into paper and transposing
that into some form of excel spreadsheet.
What good open data policy really is, is information management. I'm cribbing from our
new executive order here, but it's looking at the information lifecycle from beginning
to end, and bifurcating that process. So you start with information
collection: how do you manage it for internal use? How do you manage it for that eventual
external publication? And then actually getting it ready to be used internally, and then to
be used publicly.
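[Editor's note: the "transposing into a spreadsheet" step Raphael describes can be sketched for the easy case where report lines follow a regular pattern. The pattern and figures below are invented, and real OCR output is rarely this clean.]

```python
import csv
import io
import re

# Invented example: lines pulled out of a narrative report (or OCR output).
report_text = """\
Grants awarded 2012: 14 totaling $120,000
Grants awarded 2013: 9 totaling $87,500
"""

# A pattern for this specific phrasing; real reports rarely line up this neatly.
LINE = re.compile(r"Grants awarded (\d{4}): (\d+) totaling \$([\d,]+)")

rows = []
for match in LINE.finditer(report_text):
    year, count, total = match.groups()
    rows.append({"year": int(year), "grants": int(count),
                 "total_usd": int(total.replace(",", ""))})

# Emit machine-readable CSV, the external-publication step of the lifecycle.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["year", "grants", "total_usd"])
writer.writeheader()
writer.writerows(rows)
```

This is the happy path; the hard part Raphael flags is that most report data never had this much structure to begin with, which is why managing information for publication from the start beats retrofitting it.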
I've had enough conversations to know that this is not an overnight process for the nonprofit
ecosphere. There's no light switch that will turn this on. But you have to think that
there is a way to trickle that down, or to encourage the release of high-value data. There
are ways to do this surgically, frankly: be really precise about what data you're after,
what organizations can provide that data, and what's a good environment for creating this
culture of better information management, to create that open data ecosystem.
>> Paul Light: Now we can go to Hope, or...? Yeah. Go, go, go, yeah!
>> Diego May: Super quick.
>> Hope Neighbor: I completely agree with Raph that the difficult part is this unstructured
data, but I think what's magical in an example like the World Bank is that the bank has
a very strong internal project development and management process, so every single piece
of project-related data that they publish externally is a piece of information that they
already needed for their internal process. So nothing is re-created; they're simply releasing
it. And I think there's tremendous potential on the foundation side, because there are
standard proposals and reports, to release that data. From there, I think the really tricky
part is how you then think about releasing all the underlying data sets. But if we were
to release just that first strategy, I think we would leap forward in terms of the information
we have access to.
>> Raphael Majma: To your point, there are certain things that all nonprofits do. They
manage information; they manage financial information. There are certain tools that could
be created to help them not just manage that better internally, but aggregate it for eventual
public dissemination. And it's one of those things where I can see there's a vacuum for
what those tools could be. Whether it's open source or proprietary, hopefully open source,
there's a way to share that across nonprofits and across verticals.
>> Paul Light: Diego.
>> Diego May: A quick comment, and I don't want to sound sales-y with this, but I want
to ensure that this is really easy. Either with our open source solution or the SaaS-based
solution, we've seen clients go from zero to having a complete open data portal with a
lot of valuable data in one day, and I'm not trying to make it "wow!" I'm not the only
one; there are other companies that do that. It's really simple. If you have the data
sets that you want to release, setting up an account to have an open data site, and then
having it ready and machine-readable and everything, can be done in one day. It's creating
an account and setting all that up.
>> Paul Light: Elizabeth, right?
>> Elizabeth Dreicer: So, just a question, because a point that you made a moment ago,
Diego, is that people are scared, so instead of doing anything, we'll do nothing. But
I want to articulate that, at least in healthcare, the point that was made, I'm sorry,
by Catherine? Was that while the data may not have been PII [personally identifiable information]
from any singular source, the constellation makes it PII. Have any of you guys seen anybody
working to try to detect when PII has emerged, and essentially then encrypt or somehow
address it? Because while it sounds great to just say, put it out there, I'm thinking
about it from a risk management perspective. If I put my risk management hat on for a
minute, as a trustee of a medical establishment, knowing that PII is possible from unique,
even biometric data, can I really just release and open the floodgates? I may have an
ethos that says, hey, I want to be open! And yet there is some responsibility.
I'm just wondering how we manage through this process, and what you guys are seeing in
the domain to help those folks who are duly risk managing overcome this?
>> Jason Payne: We can look at open through a couple different lenses. We can look at
open as any human being in the world having access to the info, or we can look at it as
collaborative communities. It could certainly be the case that in the healthcare space
it's plausible to reverse identities out of something, but for a group of academic researchers,
governmental researchers, etc., working collaboratively in a quasi-open fashion, there
might be the ability to say, hey, trusted agents wouldn't go down that path. Throwing
a wild idea out: could you do the same thing that large-scale software companies do, which
is offer bounties for zero-day vulnerabilities or exploits, so that altruistic hackers
basically get paid for finding these things? And say, if you can reverse-engineer this
data set, we will give you X, and try our best to redact that data set. That might be
a decent investment to make.
>> Paul Light: I think we have time for one last question, and some from our guests here.
Any other questions? Because we want Buzz to have enough time here as well. Any last
comments?
>> Dana Pancrazi: I had just a quick question; it goes back to something you said earlier,
Jason, about how foundations could help this along by setting up some space, money, infrastructure,
whatever. So, mindful of bad outcomes from good intentions, what is your current thinking
on what that looks like? For example, I don't know what the persistence of data and need
in an organization over time looks like. So is it creating (and perhaps the answer is,
it depends on the organization) infrastructure at the organization to do this on an ongoing
basis? Is it creating money and space for them to contract with somebody on an ad hoc,
as-needed basis? What is your thinking? Because I worry that groups who choose to truly
build it in house won't be able to afford the talent that they could access on an ad hoc
basis. But I have no earthly idea if ad hoc is sufficient.
>> Paul Light: Interesting question.
>> Jason Payne: Let me give you a hopeful and naive answer and see how it goes. When thinking
about dollars, at the end of the day it is to some degree a zero-sum game. If you look
at an issue set and fund players in that issue set, you're objectively creating competition,
creating a zero-sum game. If you look at providing data analysis capabilities for that
issue set, it can actually potentially be a unifier. So if funders are thinking about,
how can I help Syria, Sudan, disaster response capability, stopping trafficking, etc.,
and doing it in a way that says, this is for the collective, for people to make a good-faith
effort to contribute X and get Y back, without creating more entrants in an already crowded
space, then hopefully, naively, I would think that data and technology, with the access
control that we spent a lot of time talking about, could actually cause cohesion in some
of these issue sets. Again, that's a very naïve hope, but I think it's plausible.
>> Paul Light: Optimistic. Naïve.
>> Jason Payne: Optimistic, certainly.
>> Paul Light: Yeah, hopeful. Quick?
>> Mayur Patel: Not that quick. Maybe I'll hold it for Buzz, later.
>> Paul Light: Yeah, this is a great panel. Very nice, hopeful people.
>> Tim Durbin: I do have a quick question, just to follow up on that. We've had some conversations
about it in the past, and you alluded to it earlier, talking about the government's view
on opening up data: the average citizen probably doesn't think of that as inherently our
right as taxpayers, but inherently it is, and you guys recognize that. Can you talk about
some of the struggles that would apply to foundations that have that data, in some format
or another, how you've thought about opening that up, and maybe whether there's contract
language or such things to enhance it?
>> Raphael Majma: Definitely. One of the things we're looking at right now, and it's in
the executive order, is grants, procurement, and contracting processes. We have the authority
to ask for underlying data; we don't necessarily have the language or the use cases of
what it looks like to do so. I think it would be really interesting if, as we cultivate
this, foundations could start thinking about how to cultivate this from their grantees
as well. We recognize all the problems. There's a technical deficiency, and I don't mean
in know-how, I mean in actual tools to collect, and then there's the question of, now
we have the data, what do we do with it? There's no data.gov for the philanthropic world.
Is one necessary? I don't want to presume to say, but a platform for accessing data, so
that everyone has that baseline of information for the philanthropic sector, is, I think,
important in some respects. I don't know what that platform looks like, but I think it
would be really interesting to start exploring.
Right now we're making a lot of changes to data.gov. It's going to be called next.data.gov
for the time being, and we're going on a federated model, which is similar to what you would
expect from the foundation world. We're not asking everyone to go to data.gov. We're asking
everyone to manage their own public data listing, and then we federate that up to the data.gov
portal.
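[Editor's note: the federated model described here can be sketched as merging per-publisher listings into one catalog. The listing schema below is invented for illustration and is not data.gov's actual format.]

```python
# Each publisher (an agency, or in the foundation analogue a nonprofit)
# maintains its own public data listing; a central portal merges them.

agency_listings = {
    "agency-a": [{"id": "a-budget", "title": "Budget 2013"}],
    "agency-b": [{"id": "b-grants", "title": "Grant Awards"}],
}

def federate(listings):
    """Merge per-publisher listings into one catalog, tagging provenance."""
    catalog = []
    for publisher, datasets in listings.items():
        for ds in datasets:
            catalog.append({**ds, "publisher": publisher})
    return catalog

catalog = federate(agency_listings)
```

The design point is the one Raphael makes: each publisher keeps ownership of its own listing, and the portal only aggregates, rather than everyone hand-delivering data to one central keeper.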
You can imagine the analogue in the foundation world: instead of having all of the nonprofits
in your portfolio give you all their information, they just manage it publicly online,
whatever is available to be managed publicly without any privacy concerns, and then that
flows up to the foundation to be called upon when necessary. I think that could be an
interesting model, but I wouldn't presume to know that that is the answer.
>> Paul Light: Let's thank the panel.
[applause]