[APPLAUSE]
SIMON ROGERS: Good afternoon.
I feel like, for once, I'm in a room where lots of people
already get this.
So it goes.
But my job is at the Guardian Datastore and Datablog.
I'm a news editor on paper.
My first day on the news desk was September 10, 2001.
And that huge event has just bookended my career in these
kind of amazing things that have happened since.
And one of the reflections of this being the
enormous growth in data.
I have one of those jobs that not many people understand.
I try to explain it to my parents repeatedly.
They just don't really get it.
But what I'll do is I'll talk a bit about how we do things,
and kind of pinpoint some of the stuff that's already
around there and out there with data already.
So when I was asked to do this, I was thinking, what can
I talk about?
And I was originally going to talk about how data will
change the world, how big data is about to change the world.
And it's happening already.
This stuff is already out there, big data, small data,
everything in between.
And the way that we work as an organization
has completely changed.
One of the things to think about with this is that what
we do now as a news organization
is very much altered.
So when we started, what we used to do is hand our pearls
of wisdom out to the world.
We would print a story, and people would gratefully
receive that information.
That process has completely changed.
Now it's much more two-way.
This is the Guardian Datastore, which is what I
do every day.
I'm working with the news desk, and what we do is we
share raw data.
We publish it using Google Spreadsheets, which we do
because it's very, very easy to share.
It's also an economic thing, because I couldn't get any
development resource to build a database.
And it has actually turned into quite a
useful way to do it.
And one of the things to, I suppose, start
by is to look backwards.
This whole event today is about looking
forward, looking ahead.
But actually, a lot of the stuff that we do has real
echoes with the past. You can go very, very far back with
this to 1202, which is the Bible as a graphic.
And I suppose the difference then is that graphics and
visualization were a way of kind of getting across
information to people who couldn't read.
And now we're using them, in a sense, partly to get
information across to people who can't be bothered to read,
which is a different thing altogether.
But this is the work of a guy called William Playfair.
I don't know if anyone has heard of William Playfair.
He was an inveterate gambler, he took part in the storming
of the Bastille.
But he also invented the line chart, and the bar chart, and
the pie chart.
All these ways we have of seeing the world now come from
a man in the 1780s, and really haven't changed much, except
obviously now the balance of trade graph
goes the other way.
Anybody know who this is?
AUDIENCE: Florence Nightingale.
SIMON ROGERS: Thank you.
Nor would I have complete silence myself.
So Florence Nightingale, The Lady with the Lamp, but also
obsessed with statistics and numbers.
When she came back from the Crimean War, she was
commissioned to produce a report on the conditions of
people in the armed forces.
And this report was like a lot of data at the time, published
as a book, and data, and tables, and numbers, and so on.
So she came up with this as a way of visualizing that data.
Now what that means actually is--
I'll have to explain it for people who can't really see
it-- is the pink bits in the middle are deaths of soldiers
from things you'd expect soldiers to die from, being
in action, being blown up, or whatever.
The blue bars are deaths from preventable disease.
So it's using data, presenting it in a way that makes it real
for people and actually makes a point.
And this report caused a storm when it came out, changed the
way that things worked for the British Army.
And again, changed the world with data.
We have a tradition of this in the Guardian too.
This is the first edition from May 1821.
In those days, adverts were on the front page.
We couldn't do that now for lots of reasons, though some
people would like to do it.
And news was on the back.
And so this is the first news page of the first edition of the
Guardian, a third of that page taken up with a table of data,
a table of numbers.
Now that's what would now appear to be incredibly
uncontroversial data.
It's a list of schools in Manchester, how many pupils it
had, and how much funding they got.
But in 1821 it was very controversial, not
least because it was 60 to 80 years before education became
compulsory, a very political thing.
And education was done on Sundays, because during the
week kids were at work.
Some people would like to bring that back too.
And the people that leaked it to us-- it was leaked
information--
leaked to us because the official data was rubbish.
It was partial, it was compiled by priests, you
couldn't rely on it.
So this idea that, actually, if you have the numbers and
have statistics you can know what's going on in the world,
you can improve things, you can make things better.
And that's something that kind of flows through what
we try to do now.
Because now what we have is the same kind of ideas about
producing data, but we've suddenly got immense amounts
of new ways of presenting that information and presenting it
differently.
So this is something which we do every year in the Guardian.
This is government spending by department.
Each big circle represents a different government
department.
Now if we lived in the States, what we could do is we could
ring up the treasury, and they'd give us a nice
spreadsheet with all this data in.
And we'd just be able to give it to designers
and there you go.
But we don't.
We live in a country where government departments print
their annual reports as PDFs.
And anybody who has tried to extract data from PDFs will
know how much fun that is.
And it's full of kind of interesting bits.
So the top left-hand corner is the Department of Work and
Pensions, which is our biggest spender.
And the bit on the left is benefits.
You probably can't see it from here, that
biggest benefit circle.
Any ideas what that is?
AUDIENCE: Government pensions.
SIMON ROGERS: It's pensions.
Exactly.
Something people don't even think of as a
benefit in some ways.
And then down there in the bottom left-hand corner we've
got the Ministry of Defence.
And when we were doing the MOD we thought, this is odd.
There's nothing there for spending in Afghanistan.
They must be spending some money over there.
So we rang the MOD.
And the figure wasn't in the official report because it
comes from a different budget.
It's voted for by Parliament.
So we had to kind of bring all this stuff together and put it
into one place.
And we're in a position to do that as a news organization.
I guess the difference now is that what we'll also do is
we'll put that data out there, and make it accessible to
people, and see what other people can do with it, and see
if other people can do what we do better.
And there are loads of people doing this stuff now.
This is from Italy.
This is poverty broken down by age group.
This is the composition of a chromosome.
And a lot of these ways are just ways of kind of
expressing really interesting information in new ways.
This is a really good website, which is interactive, that
shows how people move across America and where people move
from and to.
And I guess what a lot of people are increasingly
starting to do now is take these huge bits of data and
bring them together in ways which we can
understand a bit better.
This is the Qur'an and the Bible and the words used, and
how they link together and where their commonalities are.
Obviously, where we are now is a place where there are huge
amounts of data out there.
This is from the great company called ITO World.
I don't know if anybody has heard of them.
They do transport mapping.
And this is 24 hours' worth of flights over Europe.
But there are enormous amounts of data out there.
I was looking at a report that said there are 2.5 quintillion
bytes of data out there every day.
Does anybody know what a quintillion is?
A quintillion is 1,000 quadrillion.
And a quadrillion is 1,000 trillion, if that helps.
Welcome to my world.
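For readers who want the scale spelled out, the arithmetic in the last few sentences can be sketched as follows. This is only an illustration of the short-scale number names used in the talk; the conversion to exabytes assumes decimal (SI) units, which the talk does not specify.

```python
# Short-scale names, as used in the talk: each step up is a factor of 1,000.
TRILLION = 10**12
QUADRILLION = 1_000 * TRILLION      # 10**15
QUINTILLION = 1_000 * QUADRILLION   # 10**18

# The figure quoted from the report: 2.5 quintillion bytes per day.
# Integer arithmetic keeps the value exact.
bytes_per_day = 25 * QUINTILLION // 10

# Assumption: decimal (SI) storage units, where 1 exabyte = 10**18 bytes.
EXABYTE = 10**18
exabytes_per_day = bytes_per_day / EXABYTE

print(f"{bytes_per_day:,} bytes per day")
print(f"{exabytes_per_day} exabytes per day")
```

Under that assumption, the quoted figure works out to 2.5 exabytes per day.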
But there's this enormous amount of data.
And what people are doing are finding increasingly
interesting ways of bringing that stuff together.
This is another thing that ITO World did, which is another
interactive, which is just based on 10 years' worth of
road casualty data in the States.
And they've done the same thing for the UK.
And what that allows you to do is actually zoom in and find
specific incidents, look for things, and then
interact with the data.
So again, it's moving away from the idea of just telling
people what the numbers say, but actually kind of helping
them to find their way through it.
This is something that was produced by a fantastic group
at Oxford University.
They have this very, very small but pretty
decent internet institute.
A guy called Mark Graham works there.
And he specializes now in gathering together geotagged
information because there is tons of this stuff out there.
So this is English language entries in Wikipedia.
And they've also done some in different languages, and you
can see how they change.
They've also done the same thing with Flickr photographs
and how they're posted.
And they're starting to really get a sense of the amount of
information that's out there that's geotagged and how you
can present it in an interesting way.
This is Davos, obviously, World Economic Forum.
Recently solving the world's economic problems. Obviously,
we're all fixed now.
But one thing they also talked about was data.
Data came up at Davos.
There was a whole session on the report published.
And this is something that a group called Tweetminster did
with a PR communication where they monitored tweets in
Africa and who tweets most. And it's interesting.
So this is part of what the Davos report was about.
They were looking at mobile phone use in Africa.
There's 53% of the population in Africa now who have a
mobile phone, compared to maybe 0.5% who access the
internet via fixed broadband.
So you've suddenly got a lot of people out there and you
can monitor that data.
So for instance, you know that if people start having lots of
conversations on Twitter about food, it
reflects food price inflation.
Or they've also worked out that the change in Twitter use
in different parts of the region
reflects population change.
So these are ways you can use this kind of amazing amount of
data that's out there to actually change social policy
and change the way that people interact with the world.
So one of the things that, I suppose, motivates us is that now
there's so much stuff out there that it has become
confusing for people.
Where do they start?
You've got all this competition.
Google's Public Data Explorer is a fantastic resource.
There's a great company called Infochimps, which
works in the States.
And they wanted to become the YouTube of data.
And that's just basically set up by a few developers.
These aren't big companies often.
DataMarket, which was set up in Iceland as a data
[UNINTELLIGIBLE] now moved to the States.
And there's a fantastic company which I've just been
looking at some of their data this week called
[UNINTELLIGIBLE].
I don't know if anybody has used them.
They've got 20 years' worth of company
reports in one website.
That's 20 million reports or something.
It's an incredible amount of data and information which
they had just harvested and pulled
together into one place.
So all this stuff is public, but
accessing it is often difficult.
One of the things we've started to do
is to provide data.
So these are a couple of data searches which we
offer on our site.
The one on the left, World Government Data, brings
together open data sites from different
countries around the world.
So you can search for the crime figures for New York and
crime figures from London, maybe start to compare them.
And we've done the same kind of thing with aid data.
So what we've noticed is we're increasingly dealing with
bigger and bigger data sets.
And often they are about really, really,
really small things.
And specifically about geography often.
So this is something which we did with the Indices of
Multiple Deprivation which is one of these data sets, it
comes out every 4 years, incredibly important.
It works out how wealthy or poor each part of England is.
They don't do it in Scotland and Wales, which is the devolution
as they have it with national statistics.
But it is amazing data.
But it's published in the worst
possible way on a website.
You can't find where you live.
It's really hard to find your way around.
So we just took that data, and we worked with a GIS
specialist at Sheffield University who had the
coordinates of each of these little areas.
And we put it together using Google Fusion Tables, which is
a fantastic map making resource we use a lot, because
it's free and easy to use.
And you can make quite sophisticated
things quite quickly.
And it allows people to go down to, say, four or five
streets and find out how wealthy or poor they are,
which is public information, really important information.
Because it affects everything about public
spending in their area.
And this is another big data set, which is
a little less official.
This is WikiLeaks.
There were three big WikiLeaks releases,
Afghanistan, 92,000 records.
If you think, the Pentagon Papers in the
'60s were 7,000 documents.
So a huge amount of documents.
And then this, which was Iraq, 400,000 documents.
And it turns out soldiers are really good at entering data.
So one of the things we had was these
casualties by incident.
We picked every event where at least one person had died.
And using Fusion maps, in about half an hour we made
this map, which plots out every single incident where
somebody died in Iraq.
And that tells a story much better than thousands of words
could do about what was happening and the sheer cost
of that war.
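The filtering step described here, keeping only the events where at least one person died and handing them to a mapping tool, can be sketched roughly as below. This is an illustration only: the `Incident` fields, the helper names, and the sample rows are all hypothetical placeholders, not the real log schema or real records, and the CSV layout is just one plausible shape for an upload to a mapping tool.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Incident:
    """One logged event. Field names are hypothetical, not the real schema."""
    date: str
    latitude: float
    longitude: float
    deaths: int


def fatal_incidents(incidents):
    """Keep only the events where at least one person died."""
    return [i for i in incidents if i.deaths >= 1]


def to_map_csv(incidents):
    """Return CSV text with latitude/longitude columns for a mapping tool."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "latitude", "longitude", "deaths"])
    for i in incidents:
        writer.writerow([i.date, i.latitude, i.longitude, i.deaths])
    return buffer.getvalue()


# Made-up placeholder rows, not real records.
sample = [
    Incident("day-1", 1.0, 2.0, 0),
    Incident("day-2", 1.5, 2.5, 2),
    Incident("day-3", 3.0, 4.0, 1),
]

fatal = fatal_incidents(sample)
map_csv = to_map_csv(fatal)
print(map_csv)
```

The point of the sketch is how little work the filter is once the data has been entered cleanly; most of the effort goes into getting the records and their coordinates in the first place.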
And I was at an event in San Francisco recently.
This huge guy came up to me at the end of the event.
I saw he had a badge on that said US Army.
I thought, well I'm in trouble now.
I have a WikiLeaks data set.
And he's like, this is fantastic.
We've had all this data for ages.
We haven't been able to show it to people.
We haven't been able to show it to our generals because we
haven't had that kind of technology.
So you're a busy general, you get a massive spreadsheet
stuck in front of you, you don't know where to start.
So increasingly, what we find is these free tools are letting
us present data in new and interesting ways.
So one of the things now is there's this huge kind of
democratization of data sets and visualizing data sets.
And graphic designers hate this, because it basically
means anybody can do this stuff, which people dislike.
And the technology has changed where you have something like
this, the most popular graphic on the web a few years ago, to
be able to do stuff like this using Tableau, which is a
brilliant kind of free--
Tableau Public, anyway, is free.
That's how they suck you in to buy the big one.
But actually, it's enough to release
sophisticated things like this.
And anybody can do this stuff.
It is a kind of big democratization of what
traditionally people here and we used to do on our own.
This is our Flickr group of people every day.
Quite a small group, but they're very active.
They post on visualizations or graphics and things that
they've done.
And sometimes we post them on the main site.
Originally, we didn't have the Flickr group.
We just had to find a way to manage this stuff because it
was so difficult to work out how to do it.
So I'm rapidly running out of time.
So I'm going to *** through what we do have left.
So sometimes data can be seen as a threat.
Bank of America had to take on 20 lawyers just in case
WikiLeaks had anything on them.
It turns out they did have a lot of documents.
But having spoken to people inside WikiLeaks, apparently
they were really dull.
But it didn't stop Bank of America having to spend a lot
of money about it.
And then, of course, what you've got are a lot of
companies whose entire operations are based on data
and the way you use data.
So you have Vodafone, obviously, who can monitor
every single call and know exactly how the users use it.
Google is based entirely on data.
And not just in the obvious Public Data Explorer sense,
but the way that the searches work, the smart email
boxes and so on.
This stuff is just there.
And it is part of the way these organizations work.
Are we moving there?
Let's start with this.
This is LinkedIn.
LinkedIn, 2008, everybody described themselves as gurus
on LinkedIn profiles.
In 2009, they're evangelists, evangelicals, data
evangelists.
2010--
this is still going on now-- they're Jedis.
LinkedIn knows this stuff about you.
And it's how they kind of decide when they're going to
create new products.
Because all they are is their data.
They've got OkCupid, fantastic company, because when people
sign up with Guardian Soulmates, unfortunately we've
agreed not to release their data, which is annoying
because we can't do stuff like this, which OkCupid does.
And yeah, they know everything about the people
that use the site.
Everything about the difference between black
people and white people and how they describe themselves
and so on, and smartphone users.
And obviously the social media, and the way it works,
and the kind of stuff they can tell us has changed too.
So these are the degrees of separation between people and
how Facebook has changed those degrees, hops.
It's gone from five to four.
It's interesting how they take this data every day, and they
use it and they know what makes us tick.
The other thing we've dealt with a bit is crowdsourcing,
or, as we used to call them, surveys.
We did it with MPs' expenses, where we had
400,000 pages of PDFs.
And we thought, how can we go through this stuff?
The Telegraph had this stuff for months because they paid
somebody for it.
And we didn't have that.
So we threw it over to our readers.
And we did it twice.
This is the second exercise.
The second exercise, the whole lot was done in a
week twice over.
Because we kind of gamified it a bit.
We gave people tasks to do, like just do the cabinet or do
the shadow cabinet, make it a bit more fun.
The first task that people helped with was
about 300,000 pages.
One person did 29,000 pages.
So they probably know more about MPs' expenses than our
entire Westminster team.
This is a company called Zooniverse who are brilliant
at crowdsourcing stuff.
And what they do is they take very specific tasks.
So they're crowdsourcing photographs of the
surface of the moon.
See what people can find out about it.
And they're also doing this thing which is called Old
Weather, where they're taking old Royal Naval log books.
And because what people used to record every day was
temperature.
So if you've got all these log books, you've got an amazing
resource of temperature data.
And you can start looking at what's happening
with climate change.
And the other thing they're doing is they make sure that
10 people go through each one.
So you lose those errors that just would happen otherwise.
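The idea of having several people go through each item so that individual mistakes drop out can be sketched as a simple majority vote. This is a minimal illustration under assumptions: the talk does not say how Zooniverse actually combines answers, and the function name, the threshold parameter, and the sample readings below are all hypothetical.

```python
from collections import Counter


def consensus(readings, min_agreement=0.5):
    """Return the most common reading if more than `min_agreement`
    of the volunteers gave it; otherwise return None (no consensus)."""
    if not readings:
        return None
    value, count = Counter(readings).most_common(1)[0]
    if count / len(readings) > min_agreement:
        return value
    return None


# Made-up example: ten volunteers transcribe the same log-book temperature.
# One of them swaps the digits, and the vote discards that error.
readings = [12, 12, 12, 21, 12, 12, 12, 12, 12, 12]
agreed = consensus(readings)
print(agreed)

# When the volunteers all disagree, the entry is flagged instead of guessed.
disputed = consensus([1, 2, 3, 4])
print(disputed)
```

A strict-majority rule is only one possible design choice; a median would be an alternative for numeric readings, and entries with no consensus could be sent out to more volunteers.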
And there's a company called Kaggle who are interesting.
They run a crowdsource data development competition.
People pay them to find developers who will then build
solutions for them.
It's really interesting.
Will they make any money for it?
I don't know.
This is an example of how we do stuff.
I'll just talk briefly through how we ran
stuff through the riots.
Because this was the sort of thing where we didn't have the
luxury of time to produce data analysis and use
some of these tools.
We had to do stuff very, very quickly.
So the 1981 Brixton riot was very, very different in the
sense that we didn't have this flood of information.
We had to wait for [UNINTELLIGIBLE] to come back
and tell us what was going on and what caused it.
2011, we're assaulted with information via Twitter, news,
everything is kind of being thrown at us all the time and
assertions by politicians as to what's causing
what's going on.
So what can we do about it?
Well when the riots were happening, we set this up.
Google Fusion Tables, just a list of verified incidents.
Every day we would update it as soon as reports came
through from our reporters or from the wires.
We'd just add things onto this map.
And it just told people what was going on
where, very, very simple.
We let people download it as well, because
this stuff is open.
It was the biggest thing on the site for
three days in a row.
It had maybe about 700,000 page impressions on one page,
which is great for the Datablog.
But it is also a way of--
because people are desperate for that kind of basic
information.
Now as people started to get arrested, we thought, well
great, we're going to have proper data from the
government now.
What they do though, of course, is they aggregate it.
They put it all together.
And what we were interested in was the individuals.
Who were these people who were out there?
Why were they doing this?
So we started going after these.
These are called the court registers.
Every magistrates' court has these.
At the end of each day, for every person in court, it
tells you who they are, their ages, their addresses, what
they're accused of, and what happened to them each day.
I mean, those questions we had.
This is a week after the riots.
We're a week after people started appearing in court.
And we wanted to know are people being treated more harshly in
court, are they being treated differently to other people in
court, and this kind of stuff.
We wanted to find out what was going on.
So we went to the magistrates' court, said, we'd like all
your court registers for the day, please.
We went to Camberwell, and so on.
And they said no.
Well Camberwell actually asked us to pay five
pounds for each name.
Because we didn't have any people that [UNINTELLIGIBLE]
This is public information.
So we went to the Ministry of Justice.
And they put an instruction around to every court to
release this data.
So a week after the event, we had 1,000 records of people
who had been through the court in riot related cases, which
meant we could prove that actually 2/3 of people came
from very poor places, they were being treated more
harshly than other people in court.
Whether or not you agree with that, it's useful to know
that, because you want to know how the justice process is
actually working, especially when so many people are kind
of being [UNINTELLIGIBLE] through the court.
And then we were inspired by the Detroit Riots Project.
In the late '60s, big riots in Detroit, people died.
It was a big thing.
And a guy called Philip Meyer went out and interviewed
people who were actually involved in the riots to find
out why they did this.
So we used that database as a way to go out and interview
people and find people involved in the riots.
And we asked them important questions
like, was it Twitter?
Did Twitter cause the riots?
As we know, Twitter was blamed for the riots.
Actually, what we found is most people found out about
the riots, the people who went out there, from
TV. Old media, right?
And we did stuff like this monitoring hashtags.
And that big one at the end is riot clean up.
We showed how Twitter was actually used by people as a
way to tell what was going on.
We did this as well, with ITO where we mapped where people
lived and where they were accused of rioting.
And then modelled how people went from
one place to another.
This is London.
In London, people lived a lot closer to where they were
accused of rioting.
In Manchester, which is here, people came from miles away to
come to the center.
And partly because London is full of high streets.
So in London, you're never too far from a [UNINTELLIGIBLE].
Manchester is different.
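The comparison described here, how far people lived from where they were accused of rioting, can be sketched with a great-circle distance and a per-city median. This is an illustration under assumptions, not the method used for the actual maps: the haversine formula and the median are stand-ins chosen for simplicity, the helper names are hypothetical, and the coordinates are made-up placeholders labelled `city_a` and `city_b`, not real court data.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def median_distance_by_city(cases):
    """Group home-to-incident distances by city and take the median of each."""
    distances = {}
    for city, home_lat, home_lon, inc_lat, inc_lon in cases:
        d = haversine_km(home_lat, home_lon, inc_lat, inc_lon)
        distances.setdefault(city, []).append(d)
    return {city: median(values) for city, values in distances.items()}


# Made-up placeholder cases:
# (city, home_lat, home_lon, incident_lat, incident_lon)
cases = [
    ("city_a", 51.50, -0.10, 51.51, -0.10),
    ("city_a", 51.52, -0.12, 51.51, -0.10),
    ("city_b", 53.40, -2.20, 53.48, -2.24),
    ("city_b", 53.55, -2.10, 53.48, -2.24),
]

medians = median_distance_by_city(cases)
for city, km in medians.items():
    print(f"{city}: median {km:.1f} km from home to incident")
```

With real addresses, a geocoding step would be needed first to turn each address into coordinates; that step is omitted here.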
We also looked at rumors on Twitter.
The green dots are rumors starting.
The red dots are rumors being squashed.
That was the rumor that the London Eye was on fire.
But it turns out, you can't burn metal.
So it wasn't.
There was another rumor that there was a children's
hospital in Birmingham which was being attacked.
See that green, it's starting.
And then people start to squash the
rumor down at the end.
Mark Twain said that a lie can be around the world before the
truth has got its boots on.
And with Twitter, you think that's going to be more true.
Actually, the truth can bomb around right after it and make
sure people really know what's going on.
So the next stage of this project is that we're going to
go the other side of the barricades and talk to people
who were on the front lines, the police officers and people
in the courts and see what they thought.
Why they thought the riots started and their experiences
of the whole process as well.
So this is something that I was going to bring up too.
[VIDEO PLAYBACK]
-In fact, there are now over 3.1 million millionaires.
But these are not the richest of all.
The US has over 400 billionaires, more than any
other country in the world.
Who's at the top of that pile?
These three have a combined net worth of $131 billion.
That's just over the combined budget shortfall of every
state in the US for 2011.
More than the cost of the global war on terror in 2010.
[END VIDEO PLAYBACK]
SIMON ROGERS: So that's actually something we did very
recently around the 99% versus 1% debate.
And it's quite a traditional method, using a video and
showing that information.
Now I suppose what I'm gearing up to say is, actually,
although there are 150 years between something like this
and something like this, the distance between what we're
trying to do is very, very similar.
And what we have now is this amazing amount of information
[UNINTELLIGIBLE]
and the amazing capability of presenting information in ways
we never could before.
Thank you very much.
[APPLAUSE]