So now that we feel like we've got at least a decent
understanding of MapReduce from a high level, and some of you
probably had been dealing with MapReduce
before even this summer school session started,
let's actually start to talk a little bit about a cluster.
So what exactly is a cluster?
Some of you may have had clusters in your cereal this morning.
Others of you may have been actually experiencing clusters
from a computing perspective. So if you go from this kind of
local notion of what a cluster is, starting with my laptop:
now if my problems are starting to get big enough that
I actually can't compute them within my own little laptop environment,
there's this whole concept of high-performance computing,
high-throughput computing, distributed computing.
We're starting to get this mode of, "I think I'm probably going to need
more than one computer to do my work. I'm gonna need to put my data
across multiple machines, I'm going to need to use multiple CPUs."
And so it starts to become a lot more complex,
now machines need to communicate with each other.
I need to be able to actually
have my data accessible from every single computer.
So it becomes a harder problem. And you say,
"Wait a sec, I'm not even a computer scientist.
All I want to do is solve my Bioinformatics problem, all I want to do
is solve my Linguistics problem. I just have a lot of data to deal with.
I want to go out and pull all this Twitter data
and I'm going to figure out something interesting out of it."
Well, how do I do that on a cluster? So again, these are complicated
problems which are maybe... dealing with a particular domain.
So let's dig a little bit into what exactly is a cluster.
So from a high level, a cluster is just a set of computers that are
located in somewhat close proximity. They have networking that's
making them act as kind of a single unified thing; this is where
this kind of term of 'supercomputer' can sometimes come into play.
There's typically storage that either can be distributed
across the cluster, or at least, in a distributed manner,
can be accessible from the other computers.
And you probably will have a resource manager tied to that.
What this means is that this resource manager is
actually keeping track of my other computers:
what are they doing, how much CPU usage do they have,
how much memory are they using, what does the disk
I/O footprint look like on those computers? Because I may actually move work around.
And you can think of this as almost like a traffic cop:
if I've got a computer that's busy, maybe I'm gonna
shuffle work over to a different one. So from a high level,
whether we're in kind of a high-performance computing space
or in this MapReduce space, some of these high-level
properties of a cluster now play into this Hadoop framework
that we were just talking about in the previous presentation.
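The "traffic cop" role of the resource manager can be sketched in a few lines of Python. This is a toy illustration, not any real scheduler's API; the node names and the `cpu_load` metric are hypothetical.

```python
# Toy "traffic cop" resource manager: route the next task to the
# least-busy node. Node names and load values are made up.

def pick_node(nodes):
    """Return the node with the lowest CPU load to receive the next task."""
    return min(nodes, key=lambda n: n["cpu_load"])

cluster = [
    {"name": "node1", "cpu_load": 0.90},  # busy
    {"name": "node2", "cpu_load": 0.15},  # mostly idle
    {"name": "node3", "cpu_load": 0.55},
]

print(pick_node(cluster)["name"])  # node2
```

A real resource manager (HTCondor, SLURM, YARN) weighs memory, disk I/O, and queue policy as well, but the core idea is the same.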
I'm again solving some sort of an application problem,
and I don't necessarily wanna have to know all of these details,
I just want my application to work.
That's the beauty of sometimes having underlying frameworks
that can take advantage of some of these things for you
and actually putting the data in the right spot
and running on the right computer and storing it in a proper place.
So here's a very simple way to look at this.
I've got this computer, which is sometimes called a head node.
And I've got these other computers that are sometimes
called the execute nodes. And we'll see slightly different
terminology in this Hadoop space for these MapReduce clusters.
But a head node is typically... if anybody's ever logged into a
high performance computer, these are the computers
that you're going to log into, and you're kind of interacting on.
What you can do then is you can schedule work out
to these execute nodes, and they'll actually handle your task.
So what you'll also see here is that you've got
a file system that can distribute data across those nodes,
so that I can say, "Well, what if I want my application
to be accessible across every single one of the nodes?"
You may have it stored in a spot where it's
distributed across the entire environment.
So if we take this kind of view of a cluster, and say,
"Well, what can I actually do with this?",
let's take a high level look at a very simple paradigm
of breaking down large problems and running them on a cluster.
So there's a... kind of a programming paradigm out there called Scatter/Gather.
And you take some input data. So let's take, as an example,
a file that I want to get the WordCount on,
back to this WordCount example:
I've got a million rows of data in this text.
I could have a very long... it could be a very long paper,
it could be... the Bible, it could be anything, very long set of text.
What if I want to do a WordCount across that? So I'll split that
across every single node that I have in my cluster,
maybe, in an even fashion, and now I can start to do WordCounts
across every single one of those nodes in this environment.
So this kind of diamond shape is a very nice model of it:
I've taken this input data, I've split it across all of my nodes,
and then I've gathered my output, whatever that is,
whatever operation I'm doing, back to a final landing spot.
And so if we take a step back and look at our cluster,
I can do this same mentality with this. All of a sudden now
I've split all of my data across all of my nodes in a cluster,
I'm operating on each one of them independently,
and now I'm going to gather back my result, and whatever that result was.
So you start to see this in this example now.
I'm taking this Scatter/Gather and I'm actually now mapping that
into how it would work on a cluster. So I've got a file system then,
and the data gets split and scattered across all of the nodes.
I execute on each one of those things independently
and then I gather my data back into an output.
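The Scatter/Gather WordCount just described can be sketched in Python. This is a single-machine simulation where each list chunk stands in for a node; it is not a real distributed job.

```python
from collections import Counter

def scatter(lines, n_nodes):
    """Split the input lines roughly evenly across n_nodes 'nodes'."""
    return [lines[i::n_nodes] for i in range(n_nodes)]

def count_words(chunk):
    """The work each node does independently: a local WordCount."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

def gather(partials):
    """Merge the per-node counts back into one final result."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

text = ["the quick brown fox", "the lazy dog", "the fox"]
partials = [count_words(chunk) for chunk in scatter(text, 2)]
print(gather(partials)["the"])  # 3
```

The key property is that each `count_words` call touches only its own chunk, so the per-node work is fully independent until the final gather.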
So from this... view here, what if I have a lot of data in this mentality?
So all of a sudden this is where... actually, this is why
Google changed their mentality of a cluster and started to
introduce this MapReduce framework. Because for them
what ended up happening in their problems,
because they were so large, was that the network actually
started to become a bottleneck in this framework.
So instead of trying to map data from a distributed file system
that's easily accessible from all of the nodes up to an execute node,
performing work on it, and then sending it back down to an output file,
they were saying that it's gonna be too costly
and our network and network technologies aren't keeping up
with the amount of data that we're dealing with.
And I think you can go out and look on Wikipedia
or something like that, but Google is now stating
publicly that they process about 24 petabytes a day.
So you start to think about how much data that actually is.
So what becomes your bottleneck in this?
Probably gonna be your network. So in this case,
if I'm gonna take this setup of what I see in a kind of a traditional cluster,
how can I start to change that mentality up?
I still need to be doing large computing,
how do I change my mentality, how do I change my cluster setup
so that I can actually operate on the data?
And so what they actually did then was they said,
"Well, why don't we make a file system that actually
puts the data essentially on the nodes?"
So look at the difference in the file systems: we see this says
'file system' across each one of these execute nodes.
That's a different setup than having a networking connection
where the file system's actually accessible from all of the nodes,
what they've done now is they've said, "Well, why don't we just
go ahead and put the data on the nodes,
and we'll operate on each one of those independently.
We're already doing that anyway in this kind of scatter-gather mentality.
Why don't we go ahead and just rethink how
we want to do our file systems and how we want to actually
build our clusters, and let's put the data actually out there."
So now I'm going to actually distribute my data across all of my nodes.
I'll probably do some sort of replication so that way then
if I have a node failure, again it still sits on another node,
so that's how you're getting this prevention of data loss.
And then I'll have the file system actually keep track of this
metadata for me so that I know which node's actually storing it.
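As a sketch of that idea, here is a toy version of block placement with replication plus the metadata map that tracks it. The block and node names are hypothetical, and a real system like HDFS uses rack-aware placement rather than simple round-robin.

```python
import itertools

def place_blocks(blocks, nodes, replication=2):
    """Assign each block to `replication` nodes, round-robin, and
    return the metadata map: block -> list of nodes holding a copy."""
    ring = itertools.cycle(nodes)
    metadata = {}
    for block in blocks:
        metadata[block] = [next(ring) for _ in range(replication)]
    return metadata

meta = place_blocks(["blk1", "blk2", "blk3"], ["nodeA", "nodeB", "nodeC"])
print(meta["blk1"])  # ['nodeA', 'nodeB']
```

Because each block lives on two nodes, losing any single node still leaves a copy of every block, which is the data-loss prevention just described.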
So now I can actually say, "Send my compute actually to those
nodes where the data resides." And now I've kind of
changed my mentality a little bit of I still have a head node,
and I still have these execute nodes, and I still have this resource manager
that's saying, "Hey, where should I be scheduling my work?"
Well, schedule your work to where the data resides and then it'll be simpler.
Now I can actually minimize my network traffic because it says,
"Well, your 'hello, world' is already over there on that node,
let's go ahead and go do a WordCount on it."
So the resource scheduler now changes its job:
not only does it look at which nodes are the most memory-intensive,
I/O bound, or CPU bound, but also where my data resides,
and it actually schedules tasks to that node
so that I can reduce my network traffic.
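That locality-aware scheduling decision can be sketched like this. Again the names are hypothetical, and a real scheduler such as YARN juggles many more constraints.

```python
def schedule(block, metadata, load):
    """Prefer the least-loaded node among those already holding `block`;
    fall back to the least-loaded node overall if no node has it."""
    holders = metadata.get(block, [])
    candidates = holders if holders else list(load)
    return min(candidates, key=lambda n: load[n])

metadata = {"blk1": ["nodeA", "nodeB"]}            # which nodes hold which data
load = {"nodeA": 0.8, "nodeB": 0.2, "nodeC": 0.0}  # current CPU load per node

print(schedule("blk1", metadata, load))  # nodeB: data-local and less busy
```

Note that nodeC is the idlest machine, but it does not hold the data, so shipping the task there would mean shipping the data too; locality wins over raw idleness.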
So again if we go back and look at this, what we've done now is,
back to this kind of splitting of data, we've actually done this split step
and put it out into the file system as the first thing we're gonna do.
Back to this example, I've split my data, and I went ahead and just
put it on the file system before I even started operating on it.
And that's what you start to see here. I've scattered
my data across all of these nodes, and now I'm just gonna
come in with a map task. I've already put the data out there;
I haven't had the nodes load it from the file system, because that was done before.
So if you picture Google indexing the web,
they're grabbing all this data from web pages, and they're starting to
put the data across all their nodes in their data centers,
and now say, "What do I want to do on that data?
What's my next step that I want to do in this map task?"
And so that gets us then into some of our next steps.
So... in our kind of checklist of things that we're doing here,
we've obviously just kind of talked about clusters.
We've gone a little bit from HPC clusters into
what exactly is a MapReduce cluster, with this kind of
little bit different setup. So I'll take you through a couple of
different topics then, specifically virtualization and the Cloud.