Think Bigger: Simon Rogers - Data, It's Big and It's Clever

[APPLAUSE] SIMON ROGERS: Good afternoon. I feel like, for once, I'm in a room where lots of people already get this. So it goes. But my job is at the Guardian Datastore and Datablog. I'm a news editor on paper. My first day on the news desk was September 10, 2001. And that huge event has just bookended my career in these kind of amazing things that have happened since. And one of the reflections of this being the enormous growth in data. I have one of those jobs that not many people understand. I try to explain it to my parents repeatedly. They just don't really get it. But what I'll do is I'll talk a bit about how we do things, and kind of pinpoint some of the stuff that's already around there and out there with data already. So when I was asked to do this, I was thinking, what can I talk about? And I was originally going to talk about how data will change the world, how big data is about to change the world. And it's happening already. This stuff is already out there, big data, small data, everything in between. And the way that we work as an organization has completely changed. One of the things to think about with this is that what we do now as a news organization is very much altered. So when we started, what we used to do is hand our pearls of wisdom out to the world. We would print a story, and people would gratefully receive that information. That process has completely changed. Now it's much more two-way. This is the Guardian Datastore, which is what I do every day. I'm working with the news desk, and what we do is we share raw data. We publish it using Google Spreadsheets, which we do because it's very, very easy to share. It's also an economic thing, because I couldn't get any development resource to build a database. And it has actually turned into quite a useful way to do it. And one of the things to, I suppose, start by is to look backwards. This whole event today is about looking forward, looking ahead.
But actually, a lot of the stuff that we do has real echoes with the past. You can go very, very far back with this to 1202, which is the Bible as a graphic. And I suppose the difference then is that graphics and visualization were a way of kind of getting across information to people who couldn't read. And now we're using them, in a sense, partly to get information across to people who can't be bothered to read, which is a different thing altogether. But this is the work of a guy called William Playfair. I don't know if anyone has heard of William Playfair. He was an inveterate gambler, he took part in the storming of the Bastille. But he also invented the line chart, and the bar chart, and the pie chart. All these ways we have of seeing the world now come from a man in the 1780s, and really haven't changed much, except obviously now the balance of trade graph goes the other way. Anybody know who this is? AUDIENCE: Florence Nightingale. SIMON ROGERS: Thank you. Nor would I have complete silence myself. So Florence Nightingale, The Lady with the Lamp, but also obsessed with statistics and numbers. When she came back from the Crimean War, she was commissioned to produce a report on the conditions of people in the armed forces. And this report was like a lot of data at the time, published as a book, and data, and tables, and numbers, and so on. So she came up with this as a way of visualizing that data. Now what that means actually is-- I'll have to explain it for people who can't really see it-- is the pink bits in the middle are deaths of soldiers from things you'd expect soldiers to die from, being in action, being blown up, or whatever. The blue bars are deaths from preventable disease. So it's using data, presenting it in a way that makes it real for people and actually makes a point. And this report caused a storm when it came out, changed the way that things worked for the British Army. And again, changed the world with data.
We have a tradition of this in the Guardian too. This is the first edition from May 1821. In those days, adverts were on the front page. We couldn't do that now for lots of reasons, though some people would like to do it. And news was on the back. And so this is the first news page of the first edition of the Guardian, a third of that page taken up with a table of data, a table of numbers. Now that's what would now appear to be incredibly uncontroversial data. It's a list of schools in Manchester, how many pupils they had, and how much funding they got. But in 1821 it was very controversial, not least because it was 60 to 80 years before education became compulsory, a very political thing. And education was done on Sundays, because during the week kids were at work. Some people would like to bring that back too. And the people that leaked it to us-- it was leaked information-- leaked to us because the official data was rubbish. It was partial, it was compiled by priests, you couldn't rely on it. So this idea that, actually, if you have the numbers and have statistics you can know what's going on in the world, you can improve things, you can make things better. And that's something that kind of flows through what we try to do now. Because now what we have is the same kind of ideas about producing data, but we've suddenly got immense amounts of new ways of presenting that information and presenting it differently. So this is something which we do every year in the Guardian. This is government spending by department. Each big circle represents a different government department. Now if we lived in the States, what we could do is we could ring up the treasury, and they'd give us a nice spreadsheet with all this data in. And we'd just be able to give it to designers and there you go. But we don't. We live in a country where government departments print their annual reports as PDFs. And anybody who has tried to extract data from PDFs will know how much fun that is.
And it's full of kind of interesting bits. So the top left-hand corner is the Department for Work and Pensions, which is our biggest spender. And the bit on the left is benefits. You probably can't see it from here, that biggest benefit circle. Any ideas what that is? AUDIENCE: Government pensions. SIMON ROGERS: It's pensions. Exactly. Something people don't even think of as a benefit in some ways. And then down there in the bottom left-hand corner we've got the Ministry of Defence. And when we were doing the MOD we thought, this is odd. There's nothing there for spending in Afghanistan. They must be spending some money over there. So we rang the MOD. And the figure wasn't in the official report because it comes from a different budget. It's voted for by Parliament. So we had to kind of bring all this stuff together and put it into one place. And we're in a position to do that as a news organization. I guess the difference now is that what we'll also do is we'll put that data out there, and make it accessible to people, and see what other people can do with it, and see if other people can do what we do better. And there are loads of people doing this stuff now. This is from Italy. This is poverty broken down by age group. This is the composition of a chromosome. And a lot of these ways are just ways of kind of expressing really interesting information in new ways. This is a really good website, which is interactive, that shows how people move across America and where people move from and to. And I guess what a lot of people are increasingly starting to do now is take these huge bits of data and bring them together in ways which we can understand a bit better. This is the Qur'an and the Bible and the words used, and how they link together and where their commonalities are. Obviously, where we are now is a place where there are huge amounts of data out there. This is from the great company called ITO World. I don't know if anybody has heard of them.
They do transport mapping. And this is 24 hours' worth of flights over Europe. But there are enormous amounts of data out there. I was looking at a report that said there are 2.5 quintillion bytes of data out there every day. Does anybody know what a quintillion is? A quintillion is 1,000 quadrillion. And a quadrillion is 1,000 trillion, if that helps. Welcome to my world. But there's this enormous amount of data. And what people are doing is finding increasingly interesting ways of bringing that stuff together. This is another thing that ITO World did, which is another interactive, which is just based on 10 years' worth of road casualty data in the States. And they've done the same thing for the UK. And what that allows you to do is actually zoom in and find specific incidents, look for things, and then interact with the data. So again, it's moving away from the idea of just telling people what the numbers say, but actually kind of helping them to find their way through it. This is something that was produced by a fantastic group at Oxford University. They have this very, very small but pretty decent internet institute. A guy called Mark Graham works there. And he specializes now in gathering together geotagged information because there is tons of this stuff out there. So this is English language entries in Wikipedia. And they've also done some in different languages, and you can see how they change. They've also done the same thing with Flickr photographs and how they're posted. And they're starting to really get a sense of the amount of information that's out there that's geotagged and how you can present it in an interesting way. This is Davos, obviously, World Economic Forum. Recently solving the world's economic problems. Obviously, we're all fixed now. But one thing they also talked about was data. Data came up at Davos. There was a whole session on the report published.
And this is something that a group called Tweetminster did with a PR communication where they monitored tweets in Africa and who tweets most. And it's interesting. So this is part of what the Davos report was about. They were looking at mobile phone use in Africa. There's 53% of the population in Africa now who have a mobile phone, compared to maybe 0.5% who access the internet via fixed broadband. So you've suddenly got a lot of people out there and you can monitor that data. So for instance, you know that if people start having lots of conversations on Twitter about food, it reflects food price inflation. Or they've also worked out that the change in Twitter use in different parts of the region reflects population change. So these are ways you can use this kind of amazing amount of data that's out there to actually change social policy and change the way that people interact with the world. So one of the things that, I suppose, motivates us is now there's so much stuff out there that it has become confusing for people. Where do they start? You've got all this competition. Google's Public Data Explorer is a fantastic resource. There's a great company called Infochimps, which works in the States. And they wanted to become the YouTube of data. And that's just basically set up by a few developers. These aren't big companies often. DataMarket, which was set up in Iceland as a data [UNINTELLIGIBLE] now moved to the States. And there's a fantastic company which I've just been looking at some of their data this week called [UNINTELLIGIBLE]. I don't know if anybody has used them. They've got 20 years' worth of company reports in one website. That's 20 million reports or something. It's an incredible amount of data and information which they had just harvested and pulled together into one place. So all this stuff is public, but accessing it is often difficult. One of the things we've started to do is to provide data.
So these are a couple of data searches which we offer on our site. The one on the left, World Government Data, brings together open data sites from different countries around the world. So you can search for the crime figures for New York and crime figures from London, maybe start to compare them. And we've done the same kind of thing with aid data. So what we've noticed is we're increasingly dealing with bigger and bigger data sets. And often they are about really, really, really small things. And specifically about geography often. So this is something which we did with the Indices of Multiple Deprivation, which is one of these data sets, it comes out every 4 years, incredibly important. It works out how wealthy or poor each part of England is. They don't do it in Scotland and Wales, which is the devolution as they have it with national statistics. But it is amazing data. But it's published in the worst possible way on a website. You can't find where you live. It's really hard to find your way around. So we just took that data, and we worked with a GIS specialist at Sheffield University who had the coordinates of each of these little areas. And we put it together using Google Fusion Tables, which is a fantastic map making resource we use a lot, because it's free and easy to use. And you can make quite sophisticated things quite quickly. And it allows people to go down to, say, four or five streets and find out how wealthy or poor they are, which is public information, really important information. Because it affects everything about public spending in their area. And this is another big data set, which is a little less official. This is WikiLeaks. There were three big WikiLeaks releases, Afghanistan, 92,000 records. If you think, the Pentagon Papers in the '60s were 7,000 documents. So a huge amount of documents. And then this, which was Iraq, 400,000 documents. And it turns out soldiers are really good at entering data.
So one of the things we had was these casualties by incident. We picked every event where at least one person had died. And using Fusion maps, in about half an hour we made this map, which plots out every single incident where somebody died in Iraq. And that tells a story much better than thousands of words could do about what was happening and the sheer cost of that war. And I was at an event in San Francisco recently. This huge guy came up to me at the end of the event. I saw he had a badge on that said US Army. I thought, well I'm in trouble now. I have a WikiLeaks data set. And he's like, this is fantastic. We've had all this data for ages. We haven't been able to show it to people. We haven't been able to show it to our generals because we haven't had that kind of technology. So you're a busy general, you get a massive spreadsheet stuck in front of you, you don't know where to start. So increasingly, what we find is these free tools are giving us access to tell data in new and interesting ways. So one of the things now is there's this huge kind of democratization of data sets and visualizing data sets. And graphic designers hate this, because it basically means anybody can do this stuff, which people dislike. And the technology has changed where you have something like this, the most popular graphic on the web a few years ago, to be able to do stuff like this using Tableau, which is a brilliant kind of free-- Tableau Public, anyway, is free. That's how they suck you in to buy the big one. But actually, it's enough to release sophisticated things like this. And anybody can do this stuff. It is a kind of big democratization of what traditionally people here and we used to do on our own. This is our Flickr group of people every day. Quite a small group, but they're very active. They post on visualizations or graphics and things that they've done. And sometimes we post them on the main site. Originally, we didn't have the Flickr group.
We just had to find a way to manage this stuff because it was so difficult to work out how to do it. So I'm rapidly running out of time. So I'm going to rush through what we do have left. So sometimes data can be seen as a threat. Bank of America had to take on 20 lawyers just in case WikiLeaks had anything on them. It turns out they did have a lot of documents. But having spoken to people inside WikiLeaks, apparently they were really dull. But it didn't stop Bank of America having to spend a lot of money on it. And then, of course, what you've got are a lot of companies whose entire operations are based on data and the way you use data. So if you have Vodafone, obviously who can monitor every single call and know exactly how the users use it. Google is based entirely on data. And not just in the obvious Public Data Explorer sense, but the way that the searches work, the smart email boxes and so on. This stuff is just there. And it is part of the way these organizations work. Are we moving there? Let's start with this. This is LinkedIn. LinkedIn, 2008, everybody described themselves as gurus on LinkedIn profiles. In 2009, they're evangelists, evangelicals, data evangelists. 2010-- this is still going on now-- they're Jedis. LinkedIn knows this stuff about you. And it's how they kind of decide when they're going to create new products. Because all they are is their data. They've got OkCupid, fantastic company, because when people sign up with Guardian Soulmates, unfortunately we've agreed not to release their data, which is annoying because we can't do stuff like this, which OkCupid does. And yeah, they know everything about the people that use the site. Everything about the difference between black people and white people and how they describe themselves and so on, and smartphone users. And obviously the social media, and the way it works, and the kind of stuff they can tell us has changed too.
So these are the degrees of separation between people and how Facebook has changed those degrees, hops. It's gone from five to four. It's interesting how they take this data every day, and they use it and they know what makes us tick. The other thing we've dealt with a bit is crowdsourcing, or, as we used to call them, surveys. We did it with MPs' expenses, where we had 400,000 pages of PDFs. And we thought, how can we go through this stuff? The Telegraph had this stuff for months because they paid somebody for it. And we didn't have that. So we threw it over to our readers. And we did it twice. This is the second exercise. The second exercise, the whole lot was done in a week twice over. Because we kind of gamified it a bit. We gave people tasks to do, like just do the cabinet or do the shadow cabinet, make it a bit more fun. The first task that people helped with was about 300,000 pages. One person did 29,000 pages. So they probably know more about MPs' expenses than our entire Westminster team. This is a company called Zooniverse who are brilliant at crowdsourcing stuff. And what they do is they take very specific tasks. So they're crowdsourcing photographs of the surface of the moon. See what people can find out about it. And they're also doing this thing which is called Old Weather, where they're taking old Royal Naval log books. And because what people used to record every day was temperature. So if you've got all these log books, you've got an amazing resource of temperature data. And you can start looking at what's happening with climate change. And the other thing they're doing is they make sure that 10 people go through each one. So you lose those errors that just would happen otherwise. And there's a company called Kaggle who are interesting. They run a crowdsource data development competition. People pay them to find developers who will then build solutions for them. It's really interesting. Will they make any money for it? I don't know.
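[EDITOR'S NOTE: The point above about having 10 people go through each log-book page, so that individual transcription errors drop out, can be illustrated with a simple majority vote. This is only a minimal sketch under stated assumptions and not a description of how Zooniverse or Old Weather actually reconcile entries: the function name `consensus_reading`, the `min_agreement` threshold, and the sample temperatures are all hypothetical.]

```python
from collections import Counter


def consensus_reading(readings, min_agreement=0.5):
    """Return the value most volunteers agree on, or None if agreement is too low.

    `readings` is a list of independent transcriptions of the same entry.
    `min_agreement` is a hypothetical threshold: the winning value must be
    given by more than this fraction of volunteers to be accepted.
    """
    if not readings:
        return None
    # most_common(1) returns a one-element list holding the (value, count)
    # pair with the highest count.
    value, count = Counter(readings).most_common(1)[0]
    if count / len(readings) > min_agreement:
        return value
    return None


# Ten made-up transcriptions of one temperature entry. Two volunteers
# misread the handwriting (45 and 64), but the majority still agrees.
transcriptions = [54, 54, 54, 45, 54, 54, 64, 54, 54, 54]
consensus = consensus_reading(transcriptions)
print(consensus)
```

[With redundancy like this, an isolated misreading is outvoted, and an entry where the volunteers disagree widely comes back as `None`, so it can be flagged for a second look rather than silently accepted.]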
This is an example of how we do stuff. I'll just talk briefly through how we ran stuff through the riots. Because this was the sort of thing where we didn't have the luxury of time to produce data analysis and use some of these tools. We had to do stuff very, very quickly. So the 1981 Brixton riot was very, very different in the sense that we didn't have this flood of information. We had to wait for [UNINTELLIGIBLE] to come back and tell us what was going on and what caused it. 2011, we're assaulted with information via Twitter, news, everything is kind of being thrown at us all the time and assertions by politicians about what's causing what's going on. So what can we do about it? Well when the riots were happening, we set this up. Google Fusion Tables, just a list of verified incidents. Every day we would update it as soon as reports came through from our reporters or from the wires. We'd just add things onto this map. And it just told people what was going on where, very, very simple. We let people download it as well, because this stuff is open. It was the biggest thing on the site for three days in a row. It had maybe about 700,000 page impressions on one page, which is great for the Datablog. But it is also a way of-- because people are desperate for that kind of basic information. Now as people started to get arrested, we thought, well great, we're going to have proper data from the government now. What they do though, of course, is they aggregate it. They put it all together. And what we were interested in was the individuals. Who were these people who were out there? Why were they doing this? So we started going after these. These are called the court registers. Every magistrates' court has these. At the end of each day, for every person in court, it tells you who they are, their ages, their addresses, what they're accused of, and what happened to them each day. I mean, those questions we had. This is a week after the riots.
We're a week after people started appearing in court. And we wanted to know are people being treated more harshly in court, are they being treated differently to other people in court, and this kind of stuff. We wanted to find out what was going on. So we went to the magistrates' court, said, we'd like all your court registers for the day, please. We went to Camberwell, and so on. And they said no. Well Camberwell actually asked us to pay five pounds for each name. Because we didn't have any people that [UNINTELLIGIBLE] This is public information. So we went to the Ministry of Justice. And they put an instruction around to every court to release this data. So a week after the event, we had 1,000 records of people who had been through the court in riot related cases, which meant we could prove that actually 2/3 of people came from very poor places, they were being treated more harshly than people in court. Whether or not you agree with that, it's useful to know that, because you want to know how the justice process is actually working, especially when so many people are kind of being [UNINTELLIGIBLE] through the court. And then we were inspired by the Detroit Riots Project. In the late '60s, big riots in Detroit, people died. It was a big thing. And a guy called Philip Meyer went out and interviewed people who were actually involved in the riots to find out why they did this. So we used that database as a way to go out and interview people and find people involved in the riots. And we asked them important questions like, was it Twitter? Did Twitter cause the riots? As we know, Twitter was blamed for the riots. Actually, what we found is most people found out about the riots, the people who went out there, from TV. Old media, right? And we did stuff like this monitoring hashtags. And that big one at the end is riot clean up. We showed how Twitter was actually used by people as a way to tell what was going on.
We did this as well, with ITO, where we mapped where people lived and where they were accused of rioting. And then modelled how people went from one place to another. This is London. In London, people lived a lot closer to where they were accused of rioting. In Manchester, which is here, people came from miles away to come to the center. And partly because London is full of high streets. So in London, you're never too far from a [UNINTELLIGIBLE]. Manchester is different. We also looked at rumors on Twitter. The green dots are rumors starting. The red dots are rumors being squashed. That was the rumor that the London Eye was on fire. But it turns out, you can't burn metal. So it wasn't. There was another rumor that there was a children's hospital in Birmingham which was being attacked. See that green, it's starting. And then people start to squash the rumor down at the end. Mark Twain said that a lie can be around the world before the truth has got its boots on. And with Twitter, you think that's going to be more true. Actually, the truth can bomb around right after it and make sure people really know what's going on. So the next stage of this project is that we're going to go to the other side of the barricades and talk to people who were on the front lines, the police officers and people in the courts, and see what they thought. Why they thought the riots started and their experiences of the whole process as well. So this is something that I was going to bring up too. [VIDEO PLAYBACK] -In fact, there are now over 3.1 million millionaires. But these are not the richest of all. The US has over 400 billionaires, more than any other country in the world. Who's at the top of that pile? These three have a combined net worth of $131 billion. That's just over the combined budget shortfall of every state in the US for 2011. More than the cost of the global war on terror in 2010.
[END VIDEO PLAYBACK] SIMON ROGERS: So that's actually something we did very recently around the 99% versus 1% debate. And it's quite a traditional method, using a video and showing that information. Now I suppose what I'm gearing up to say is, actually, although there are 150 years between something like this and something like this, the distance between what we're trying to do is very, very similar. And what we have now is this amazing amount of information [UNINTELLIGIBLE] and the amazing capability of presenting information in ways we never could before. Thank you very much. [APPLAUSE]