[ MUSIC ]
LANINGHAM: This is developerWorks Interviews.
I'm Scott Laningham.
I am joined over a Skype audio call this time by Jim Kobielus, an IBM Big Data evangelist
and former analyst for Forrester Research.
Also joining us is Stephen O'Grady, an analyst with Redmonk, self-described as the first
and only developer-focused analyst firm.
We're here to talk about the Unconference that will be going on at Information
on Demand 2012 held the week of October 21 through 25 at the Mandalay Bay Resort
in Las Vegas, Nevada, U.S.A. Gentlemen, thanks for joining today.
Appreciate it.
KOBIELUS: Sure.
O'GRADY: Yes, thanks for having me.
LANINGHAM: You know, first thing we should do is get a little basic information out of the way
about IOD 2012, the Unconference, so people know what's going on, the day that it's happening.
Is it open to everybody and anyone?
What's the story there?
Jim, you want to chime in on that?
KOBIELUS: Yes.
We're trying to attract people to the Unconference who wouldn't normally come to, or participate heavily in, the main body of IOD.
So it will be on the final day, the 25th, which is the Thursday.
And I believe it's from 10 a.m. to 2 p.m. Pacific, of course, it would be in Las Vegas.
And the format is really up to the attendees: in advance, they submit discussion topics through a portal on UserVoice -- you might want to give the URL for the portal later on, or I might -- and then they vote on the topics.
And really we will discuss at the Unconference itself whatever the attendees
in advance vote as their priorities.
So we're going to keep it nice and ad hoc and very much attendee driven, and hopefully we'll get highly spontaneous, interesting, and insightful contributions from everybody, not just IBM.
I hope that we can get a broad swath of customers and partners
and really non-traditional IOD attendees to come out and be aggressively interesting.
LANINGHAM: Yes, great.
It sounds like what people are coming to expect from an Unconference: an opportunity to turn the tables a little bit, so that instead of being talked to, the participants get to decide what's going to be talked about and do a lot of the talking, right?
KOBIELUS: Yes, we're not going to market at them, far from it.
We're just going to open it up for discussion.
LANINGHAM: Fantastic.
And you mentioned the link that we want to share, and that's iod.uservoice.com,
iod.U-S-E-R-V-O-I-C-E.com, which is where you can nominate and vote on topics ahead
of the Unconference, and there's a lot of good activity going on right now.
From what I hear, over 100 people are already participating in building the topic selection for the Unconference.
KOBIELUS: It's amazing, how much...what a diversity of ideas we already have up there.
LANINGHAM: Stephen, you know James Governor was at Impact,
I'm sure, for one of these last May.
And I've worked with him before.
I'm just wondering, are you going to be bringing
that same Redmonk rabble-rousing energy to this thing?
O'GRADY: We hope so.
Yes, it's certainly our voice and our viewpoint.
As you mentioned, we're very developer focused, so we're very concerned with the voice of the practitioner, and we absolutely hope to inject that into the Unconference coming up.
LANINGHAM: Fantastic.
And never a dull moment when a Redmonk participant is there,
so it will promise to be both informative and fun.
KOBIELUS: As somebody who used to compete against them,
I want to tell you, those guys are sensational.
Opinionated, smart.
Can't wait to engage them face to face.
LANINGHAM: Have you ever met an analyst, Jim, that is not opinionated?
[ LAUGHTER ]
KOBIELUS: Well, only the failed analyst, yes.
[ LAUGHTER ]
LANINGHAM: Now, why don't we touch on a few of the topics that are already bubbling up in some
of this nominating and things that I know you guys are interested in talking about,
just to give people a little taste of what's going to be happening.
You okay with that?
O'GRADY: Sounds like a plan.
LANINGHAM: Now, Jim, I know something that you and I had talked about is this idea
of having enormous amounts of data in the palm of your hand.
I mean, soon it's going to be petabytes in your palm.
Will you want to...are people going to want to cache their entire life on their gadgets?
I mean, how far can this go?
KOBIELUS: You know, I mean, the whole notion of petabytes in the palm
of your hand is both figurative and literal, at least in the way that I'm pitching it.
I pitched that idea, by the way.
It's up there.
We may not necessarily discuss it if people don't want it, but here is my take on it.
Increasingly, more data is on demand in the cloud. And as the cost of storage comes down radically -- it will continue to drop, especially as flash memory and so forth becomes dirt cheap -- you can get great volumes of storage.
So it's, I mean, right now, terabytes, in the low terabytes,
that's where the average data warehouse is in terms of storage capacity.
But the whole Big Data space has blown wide open, because storage is getting so cheap that it's no longer inconceivable that you might want to -- and be able to afford to -- build out a data warehousing infrastructure with petabytes of capacity.
And by the end of this decade, it will be far less expensive.
So you've got to start thinking ahead to the year 2020.
Vint Cerf, the venerable father of the Internet, or one of them...
LANINGHAM: Right.
KOBIELUS: ...prophesied that by the end of this decade the cost of raw storage would be something like 1/100th of what it is now.
LANINGHAM: Amazing.
KOBIELUS: And so if you do some mental calculations based on the retail cost of storage, denominated in petabytes, and you look at the ongoing improvements in compression, and things like deduplication technologies, then in usable storage terms a petabyte could conceivably be cheaper than 100 bucks by the end of the decade, possibly even far cheaper than that.
And you've got to think that if it gets really cheap, it will be petabytes, potentially...
LANINGHAM: Right.
KOBIELUS: ...will be embedded in your handhelds, will be embedded in appliances
and whatnot that will change the way we live.
So if it's embedded, let's say, in your handhelds and your tablets and all the devices you use in your life, won't people increasingly cache more and more of their own information on platforms like the handheld, where it's literally handy?
You can get access to it and you can use it for everything: your medical needs,
to store a complete record of your life, you know, and so forth.
I mean, people like to hoard data.
So really by the end of this decade it will be cost effective to have a petabyte
of literal storage in the palm of your hand.
Would you want to?
Would you want to keep that much data on any handheld -- any easily lose-able or, if that's a word, steal-able device? Clearly you have to tighten security. And you have to worry about identity theft; it becomes a serious issue.
LANINGHAM: Yes.
KOBIELUS: An increasingly serious issue.
Maybe only the geeks of the world -- say, Stephen Wolfram, who I think is one of the quantified-life aficionados -- those are the sort of people who really get into being completely self-documenting. They want to hoard every last scrap of information about themselves, including what they did every second of the day, and chart it all out. But most human beings are not that excessively geeky about analyzing their lives.
So, you know, some of us will be on the forefront of the continuously self-documenting new wave and will want to keep it all handy, and the rest of us will say, that's crazy.
LANINGHAM: Yes.
Stephen, do you want to chime in on this one at all?
O'GRADY: Well, the only addition I would have is that most people who have a smartphone today are already accessing petabytes of information, right?
Basically, if I show up in a new city and I open Google Maps, as an example, that's me accessing petabytes of data -- not just map data, but data about locations and so on.
So the reality is that in many cases, whether or not it's present on the device, we're already leveraging massive data stores as we walk around every day.
LANINGHAM: Yes.
So it's amazing.
You said Vint said the price of storage would drop to 1/100th by the end of the decade? Is that what you're saying?
KOBIELUS: Yes.
And last year he said the retail price for a petabyte -- he went out to Best Buy or somewhere -- was something like $128. And he said that based on historical trends, storage could be anywhere from 10 times to 100 times cheaper by the end of this decade. So if you use that 100-times-cheaper outside estimate, we're really talking about a petabyte being cheaper than a night out at a fine restaurant.
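Jim's back-of-the-envelope math is easy to make concrete. The sketch below uses only the $128 baseline and the 10x-100x decline range quoted in the conversation; the constant-annual-rate breakdown is an added assumption, for illustration only.

```python
# Back-of-the-envelope projection of the storage-cost figures quoted above.
# The $128 baseline and 10x-100x decline range come from the conversation;
# the constant-per-year decline is an illustrative assumption.

BASELINE_2012 = 128.0   # quoted retail price (USD)
YEARS = 8               # roughly 2012 -> 2020, "end of the decade"

def projected_price(baseline, total_decline):
    """Price after the overall decline factor is applied."""
    return baseline / total_decline

def implied_annual_decline(total_decline, years=YEARS):
    """Constant per-year decline factor that compounds to total_decline."""
    return total_decline ** (1.0 / years)

for factor in (10, 100):
    yearly = 1 - 1 / implied_annual_decline(factor)
    print(f"{factor}x cheaper: ${projected_price(BASELINE_2012, factor):.2f} "
          f"(~{yearly:.0%} cheaper each year)")
```

At the 100x outside estimate, the quoted $128 drops to $1.28 -- which is the "cheaper than 100 bucks, possibly far cheaper" range Jim describes.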
LANINGHAM: I know.
I know. It's amazing, isn't it?
I looked back once in a blog post at one of the early hard drives that was marketed. If you took what someone paid for a 10-gig hard drive when they were first available and adjusted those dollars against today's storage prices, it would be millions of dollars for that drive.
So we've already seen that huge drop in cost, and I guess it will just continue.
But it will reach a point where maybe people don't even pay for storage anymore.
Do you think?
KOBIELUS: Yes.
It's a distinction between what Stephen was saying, which is essentially petabytes
at your fingertips, meaning on demand, versus petabytes literally
in the palm of your hand for you to...
LANINGHAM: Right.
KOBIELUS: ...carry around with you and possibly lose.
It will all be accessible in one way or another, and we -- human beings -- will increasingly be entirely oblivious to where a given bit is being persisted at any moment: whether it's locally, whether it's in the cloud, whether it's being replicated between the two. We won't care, and we won't need to care, because it will always be accessible.
And if we lose, let's say, our smartphone, we can get another one fairly quickly
and then download as much data to it cost effectively
over broadband connections as we wish.
So it will be kind of irrelevant where it's physically stored.
LANINGHAM: Yes.
Let's run through a few others here real quick.
We don't want to give away too much because we want people to come
to the Unconference and participate in these.
But Big Data's optimal deployment model.
Federate everything?
What do we think?
[ LAUGHTER ]
KOBIELUS: Oh, in a nutshell, I don't think there's any one optimal deployment model for Big Data.
It's going to be centralized, to some degree, in ever-larger scale-up platforms.
And we at IBM have just released a new platform called PureData.
Conceivably it could all be stored on those platforms,
or it could be in a three-tier architecture.
You have an in-memory columnar store on the front end for fast access.
You have say a relational database in the hub tier for governance.
And let's say back in the staging
or preprocessing layer you've got Hadoop or NoSQL or whatnot.
And they all play together through the power of virtualization and abstraction layers -- all these different SQL dialects -- as if they were one unified Big Data resource.
Really, there's no hard-and-fast rule for which deployment model is optimal; it depends on your environment, the data sources, the kind of analytic jobs you're running, and the access patterns you're trying to support.
So that's my take on it.
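The three-tier layout Jim describes can be caricatured as a simple query router that picks a tier by access pattern. This is a toy sketch only: the tier names, the `Query` type, and the routing rules are hypothetical stand-ins for a real virtualization layer, not any actual product's API.

```python
# Toy sketch of the three-tier idea: dispatch each query to the tier suited
# to its access pattern. Tier names and rules are invented for illustration.

from dataclasses import dataclass

@dataclass
class Query:
    kind: str   # "interactive", "governed", or "batch" (hypothetical labels)
    sql: str

TIERS = {
    "interactive": "in-memory columnar front end",      # fast dashboard access
    "governed":    "relational hub (system of record)",  # governance tier
    "batch":       "Hadoop/NoSQL staging layer",         # preprocessing, ETL
}

def route(query: Query) -> str:
    """Pick the tier for a query by its access pattern."""
    try:
        return TIERS[query.kind]
    except KeyError:
        raise ValueError(f"unknown access pattern: {query.kind!r}")

print(route(Query("interactive", "SELECT region, SUM(sales) FROM t GROUP BY region")))
```

In a real federated deployment the abstraction layer does far more -- dialect translation, pushdown, replication -- but the routing decision it makes is essentially this one.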
LANINGHAM: Stephen?
O'GRADY: Yes, I would basically kind of agree with that.
I would say that ultimately when we look at what people are trying to do with data,
it's really going to come down to, you know, what does the data look like?
Where does it sit?
Where are they getting it from?
One of the examples we talk to people about: during the worst of the financial crisis, there was a hedge fund that tried to correlate general financial market performance with taxicab dispatch data from the Wall Street area.
So ultimately, when you look at Big Data problems in terms of trying to correlate very, very different datasets, some of which are very large...
LANINGHAM: Right.
O'GRADY: ...again, there is no one way you're going to do this.
Right? It's going to be, all right, where are the different data sources?
Where do they sit?
What kind of access do I have to them?
What are the models that I need to apply to correlate or to relate them to one another?
And in some cases that's going to mean cramming everything onto the same racks, from a Hadoop perspective, for performance reasons; in other cases it's going to mean pulling together a variety of disparate remote sources.
LANINGHAM: Let me throw this next one to you first, Stephen.
You know, for those coming to the NoSQL world
from a relational background, what's the role of SQL today?
Talk about that.
O'GRADY: Yes, it's been one of the more interesting questions that we've seen.
Obviously in the last, say, 24 months, 36 at the outside, interest in, quote-unquote, non-relational storage -- what people might refer to as NoSQL storage -- has really exploded.
Right? And there are all kinds of different projects -- Hadoop is a highly visible example, as are other projects like Mongo.
There's a whole variety of different types as well: key-value stores, document stores, graph databases, distributed filesystems, and so on.
None of these are replacing the relational database.
They're augmenting or complementing the relational database in a vast majority of cases.
But one of the problems everybody is having is that there are a lot of people who know SQL and a lot of people who know relational databases; there are, on the other hand, comparatively fewer people who know approaches or paradigms like MapReduce.
So, as a result, one of the things organizations find is that even when Hadoop is a superior solution to a given problem, procuring the necessary resources to make it work can be difficult.
So what we've seen over the past couple of years is that a lot of these projects -- ironically, given that the category is referred to as NoSQL -- have been going back and adding SQL-like functionality.
So Hive and Pig are two projects which allow people with a SQL background to access data from Hadoop without having to know Java, without having to know MapReduce.
Cassandra has recently introduced the Cassandra Query Language. UnQL is the product of efforts between some of the folks who worked on Couchbase as well as a couple of other databases.
So the point is that, for people coming at the NoSQL problem from a relational background, from a SQL background, one of the interesting things they'll find -- certainly today, and increasingly over the next couple of years -- is that this world is going to be increasingly accessible to them, because many of these projects have found that, as much as they may have been frustrated with the relational model or the SQL model in the past, at the end of the day one of the most important questions is: where can I find people? And if organizations can find more relational people, they're going to find a way to put those relational people to work, even if they're not working on relational backends.
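One way to see why SQL layers like Hive lower the barrier: the same aggregation a SQL-trained analyst writes declaratively must otherwise be expressed as explicit map and reduce steps. A minimal Python illustration -- the table name and data here are invented, and the hand-rolled function only mimics the MapReduce shape on a single machine:

```python
# The same word-count aggregation written two ways: the declarative SQL a
# Hive user would write (shown as a string for comparison), and the explicit
# map/shuffle/reduce steps a MapReduce programmer implements by hand.

from collections import defaultdict

HIVE_STYLE_SQL = """
SELECT word, COUNT(*) AS n
FROM words        -- hypothetical table
GROUP BY word
"""  # roughly what Hive lets a SQL-trained analyst write

def word_count(lines):
    """Hand-rolled equivalent of the GROUP BY above: map each line to
    (word, 1) pairs, then reduce by key -- the MapReduce shape."""
    mapped = ((w, 1) for line in lines for w in line.split())
    counts = defaultdict(int)
    for word, one in mapped:   # "shuffle" and reduce, in one local pass
        counts[word] += one
    return dict(counts)

print(word_count(["big data big", "data"]))  # {'big': 2, 'data': 2}
```

The SQL version is three lines a relational person already knows how to write; the imperative version is what Hive compiles down to MapReduce jobs on their behalf.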
LANINGHAM: Okay.
Should we move on, Jim?
Or did you want to chime in with anything on that?
KOBIELUS: No, that was great.
I can't add anything to that.
LANINGHAM: I want to close with this idea of crowdsourcing Big Data.
And for those of you who haven't heard about Kaggle, it's pretty interesting really.
I was looking on their website how they define what they do,
and they say Kaggle is the leading platform for predictive modeling competitions.
Companies, governments, and researchers present datasets and problems, and the world's best data scientists then compete to produce the best solutions. At the end of the competition, the competition host pays prize money in exchange for the intellectual property behind the winning model. Crowdsourcing clearly brings a lot of power to predictive modeling.
Is this thing, in your guys' opinion, going to become more and more common?
KOBIELUS: I think the whole notion, yes.
I think the whole notion of an open source market for data science expertise is coming into its own everywhere.
People, first of all -- and when I say "people," I mean organizations, businesses -- don't have either enough data scientists or enough of the right data scientists in-house.
And the ones they might need for any given project in a bleeding-edge area -- let's say it's something to do with customer experience optimization based on experience-graph modeling, or whatever -- to the extent they can find these people on a consulting basis, they're expensive.
So more and more, companies are going to be outsourcing the modeling itself on a project-by-project basis: the exploration of the data, the building of these models.
And of course, if you outsource it to some expert, you want to retain rights
to whatever the results are that are deliverable.
So a market like Kaggle seems like the foundation for an emerging collective or open source market for this kind of expertise, in an era where data science expertise is in short supply.
And it may even be beyond that.
It really seems like as data scientists get established in their careers, at a certain point they will say: I am good, I am a star -- or, these people tell me I'm a star. Let me go out, essentially like a superstar athlete, and be a free agent, and see how much I can get for my services in a marketplace such as this.
LANINGHAM: Stephen.
Thoughts?
O'GRADY: Yes, I would add Joy's Law, named after Bill Joy, the co-founder of Sun Microsystems, who used to say: no matter who you are, most of the smartest people work for someone else.
[ LAUGHTER ]
O'GRADY: And by and large, I think that's true. In other words, if you're choosing between betting on one organization or individual and betting on the field, you pretty much have to take the field. So by extension, I think crowdsourcing is a model that clearly has potential.
The execution, I think, is often uneven, in terms of how you surface the right conversations.
And I think Netflix may be an example there, with what they called the Netflix Prize. The Netflix Prize was a $1 million award given to the first team that could beat their in-house recommendation algorithm by, I believe, more than 10 percent.
And that told you a couple of things.
First of all, that Netflix certainly understood this law.
LANINGHAM: Right.
O'GRADY: Just as importantly,
that this improvement was worth more than $1 million to them.
Right? Because you're not just talking about the prize itself; you're talking about the effort that goes into running the contest, maintaining it, and evaluating and assessing the results.
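The "beat the baseline by 10 percent" test is simple to state precisely. Here is a small sketch using RMSE, the error metric the Netflix Prize used; the specific scores in the example roughly match the published Cinematch baseline and winning figures, but treat them as illustrative.

```python
# Sketch of "paying for results": a submission wins only if it beats the
# host's baseline by the required margin (10% for the Netflix Prize).

import math

def rmse(predicted, actual):
    """Root-mean-square error between predictions and ground truth."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

def beats_baseline(candidate_rmse, baseline_rmse, required_improvement=0.10):
    """True if the candidate's error is at least required_improvement
    (as a fraction) lower than the baseline's."""
    improvement = (baseline_rmse - candidate_rmse) / baseline_rmse
    return improvement >= required_improvement

# Scores roughly matching the published Cinematch baseline and winning entry.
print(beats_baseline(candidate_rmse=0.8567, baseline_rmse=0.9525))  # True
```

The host only pays when this predicate is satisfied on held-out data, which is what makes the contest a results market rather than a consulting engagement.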
LANINGHAM: Yes.
O'GRADY: I think that the return for organizations that are able to leverage outside expertise will be there. What remains unclear, as I said, is how and where you execute.
LANINGHAM: Yes, you could probably say, fairly accurately, that the experts on crowdsourcing are those in the crowd, in many ways, aren't they?
O'GRADY: Pretty much, yes.
LANINGHAM: It's a very powerful model, and I think we're going to see more and more of it.
If it's good enough for choosing a lawnmower or a washing machine, or which movie or song to pick, it clearly is going to be important for a lot of things.
KOBIELUS: Yes, with this kind of model -- the Netflix Prize model, which is essentially also the Kaggle model -- you, the client paying the prize money for the contest, are paying for results. You're not just paying for a bunch of geeks to build a statistical model; they've got to prove it works better at prediction before you cough up the funds. So that's really the model of this kind of competitive market: you're paying for results.
LANINGHAM: Guys, this has been a great chat and a little prelim to what will be happening
at the Unconference at IOD 2012 coming up October 21 through 25 in Las Vegas.
Jim, remind me of the day of the Unconference during that week.
KOBIELUS: Yes.
That's Thursday, October the 25th from 10 a.m. to 2 p.m. Pacific time, Las Vegas.
LANINGHAM: So people will be loaded with all the things they've been hearing
and pondering during the week, and they'll be ripe and ready for an unconference
at the end of the old conference.
And again, we should mention iod.uservoice.com, which is where you can go to nominate and vote
on topics for that Unconference right now to get ready to attend and make the most
of that date on the 25th and the whole week.
We hope to see everybody there.
Again, my guests have been Jim Kobielus, IBM Big Data evangelist,
and Stephen O'Grady, an analyst with Redmonk.
They will both be there making this Unconference happen with you.
And, again, thanks, guys, for joining today.
O'GRADY: It has been a pleasure.
KOBIELUS: Thank you.
LANINGHAM: I'm Scott Laningham.
This has been developerWorks Interviews.
[ MUSIC ]