So again, why is the Cloud interesting?
We talked a little bit about this earlier, about...
I'll kind of hit both industry and academia.
So in industry, from a scalability... I can't get the scale I want
internally in my data center because I'm limited to the size of it.
Okay, well, why don't I just rent some computers
by the hour in an external data center?
So this is an example that I can go out and use 5,000 cores
to run a high-performance computing experiment.
Not on my computers. Elasticity... great example
right now that's going on for us. Let's picture nbcolympics.com.
How... what... let's wait... about three weeks and say,
"I wonder what the web traffic on nbcolympics.com's gonna be."
Or wait about two years. So here's a case where you got these
kind of bursts in demand, and I want scale when I need it,
and less scale when I don't. So I'm gonna decrease my scale.
So I would guess that the web traffic on nbcolympics.com's
a lot higher right now than it will be in... two years from now.
So wait another four years, that scale's gonna go back up.
So there's a lot of... information out there, we're talking about
the infrastructure the people are putting behind the scenes
to get live streaming audio, live streaming video, web traffic,
websites, all ready to go for this big event obviously
that's happening over in London right now.
So could the Cloud be used to help scale out and back
in this elastic model of, well, all of a sudden I just had
everybody in the world that's interested in the Olympics
hit my website because they want to see who's winning
the 100-meter IM swimming event tonight,
or whatever it ends up being tonight. So again, back to
the utility computing, this is very similar to renting an apartment
and paying the electric bill. You've got a utility bill.
I can do that same thing with this utility model: I can go out
and literally pay for the computers I use by the hour.
It's a totally different model than, "I'm going to buy
all this hardware and put it into my data center tomorrow,
and it's gonna cost me a million bucks. And now I'll start using it."
This is a totally different charge model. So, as you can see,
here's some examples that I've already started to touch on,
but high-performance and high-throughput computing:
could we actually use these clusters and build scale
that maybe is not present today in a data center?
I'm a small biotech company, and I don't want to pay
for a large data center, so I'm gonna use the Cloud for my analyses.
Online game development, actually you may find
some of your biggest Cloud users are Facebook app developers.
I'd better be prepared for the scale: if my game becomes popular,
am I ready to scale it? Because if I've got 10% of the Facebook users
hitting my application, all of a sudden I could quickly hit, what,
50 million people hitting my [unknown], so I'm not gonna
build the infrastructure, I'm gonna build it on top of
a Cloud that allows me to scale in that mentality.
So, and again, back to scalable web development,
nbcolympics.com, Indianapolis 500 probably has access
only right around the month of May, or something like this.
These are... examples of where the scale in this burst in demand
goes up and down, and I want to be able to appropriately
handle those events. Is there a question out there?
[pause]
Okay.
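To put rough numbers on the pay-by-the-hour model and the bursts in demand described above, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names, the per-server capacity, the hourly price, and the traffic figures are hypothetical and do not come from the talk or from any provider's price list.

```python
import math


def servers_needed(requests_per_sec, capacity_per_server):
    """Smallest server count that can absorb the load (never fewer than one)."""
    return max(1, math.ceil(requests_per_sec / capacity_per_server))


def elastic_cost(hourly_load, capacity_per_server, price_per_server_hour):
    """Scale out and back every hour, paying only for what that hour needs."""
    return sum(
        servers_needed(load, capacity_per_server) * price_per_server_hour
        for load in hourly_load
    )


def fixed_cost(hourly_load, capacity_per_server, price_per_server_hour):
    """Size the fleet for the peak hour and leave it running the whole time."""
    peak = max(servers_needed(load, capacity_per_server) for load in hourly_load)
    return peak * price_per_server_hour * len(hourly_load)


# Hypothetical inputs: three quiet hours and one burst (requests per second).
hourly_load = [200, 200, 5000, 200]
capacity = 100   # assumed requests per second one server can handle
price = 0.5      # assumed price per server-hour, in arbitrary currency units

print("elastic:", elastic_cost(hourly_load, capacity, price))
print("fixed:  ", fixed_cost(hourly_load, capacity, price))
```

With these made-up inputs the elastic fleet costs 28.0 and the peak-sized fleet costs 100.0. The gap widens as the burst gets shorter relative to the quiet hours, which is the case the Olympics example describes.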
(new speaker) It's... everything is nice, but what about
data security? For example, how can a company upload
its data if it cares about security? Secrets, I meant.
(Jonathan Klinginsmith) Sure, so security's always a big topic when we
talk about the Cloud. There are common mechanisms
that can be used for dealing with secure data in the Cloud.
I mean, obviously the #1 case is encrypt your data on disk.
And so there's, I mean, public and private key
cryptography is a pretty standard way of doing this.
The, I mean, the other thing that I would say out there is...
there are varying levels of data that a company deals with.
And so some companies that say, "I'm not trying to
get into it right away, well, maybe I'll learn from my easy use cases."
So there's industries, for example, that collaborate with academia,
I can look at public data, there's plenty of companies,
so again, go back to the pharmaceutical industry
and there's public cases out there of people talking about using the Cloud,
but there's public, there's tons of public datasets,
so we'll take one example of the 1000 Genomes Project.
Amazon's published that data... excuse me, with the NIH,
the 1000 Genomes Project's been published
out into Amazon as a public dataset. So here's a case
where a company, a pharmaceutical company,
could actually get rich knowledge out of a public dataset
with no security or intellectual property risks with using it.
So here's a case of, I'm gonna actually get analysis
off my internal cluster and run it in the Cloud
as opposed to run it internally because I've got little chance of...
if the data's stolen, so what, it's public data, but I can get
some analyses out of it, and actually run it out there.
So these are just examples of how people are...
trying to get into the Cloud without having to
maybe take on some of the security risks. But again,
like I said, default is you come in with your security thoughts
on how are you going to secure your data out of the gate,
you don't make security your last topic of conversation,
you make it one of your first. What am I gonna do
out of the gate to encrypt my data, what am I gonna do
out of the gate to keep my keys secure, and work through
these non... kind of technical things, but these would be the things
that the lawyers and all of your other people are more interested in.
The other thing that I said to people out there is...
there's kind of a change in mentality, I've been working
at Starbucks, and I'm on a virtual private network.
Technology's advanced enough now that I can be
on IU's network sitting there literally in a secure session
working in a public internet area. So if you start to think about it
from that mentality, I can start to translate some of that into the Cloud.
Maybe a little bit different technology, but some of the same
overlaying principles of, you have to continue to learn
how much risk am I willing to take and what am I
gonna do to actually secure my risks in that.
There's also things called virtual private Clouds that allow
people to get these machines within their dataset within...
excuse me, within their IPs, so you can be actually in
Amazon's data center, but within your own network
because you've created this virtual private Cloud.
So your data may still reside on block storage
that's stored in the Cloud, but you can also say,
"Look, I'm gonna say as a principle, I'm never gonna store data
on block storage. I'm only gonna put data in this bursting,"
I believe bursting is another topic that's gonna be covered
through this, but I'm gonna literally burst out from
my internal cluster out to the Cloud, do some... work,
and then bring my results back in and never persistently
store that in the Cloud. So there's use cases that try to...
get around some of these security concerns people have.
Hopefully that helps.
[pause]
So we talked about industry, let's talk a little bit about academia.
So why would the Cloud be interesting for us as academics?
One of 'em is, and we'll talk a little bit more about this,
Stephen will come up and talk a little bit about reproducibility.
So this goes back to, I'm an application developer,
I have a specific domain that I like to deal with,
I'm not a computer scientist; can I share my work with others
so that they can either reproduce my experiment
or we as a group are working together
across academic institutions to collaborate on something?
Well, the Cloud allows you to have, then,
the same computer running, the same instance types,
working through all the details then that says,
"Well look, all I want to be dealing with is an application up here,
I don't want to have to deal with how do I install it,
how do I configure it, how do I work through all that stuff?"
Well, back to my machine image example.
As soon as you get it installed and configured,
you can take a snapshot of it now and say,
"Here's my software version 1.0. Any of us that want to
use it on the project team, go run that machine image,
and it's got the software installed and configured for you."
So you're already a lot further along with trying to bring
a new project member onto your team because you already
have software installed and configured for them to use.
Now, I could argue that you actually want to have that knowledge
down deeper and say, "How do I recreate it?",
than to just give them this... y'know, here's a snapshot of it, but...
and we can talk about that a little bit more in detail,
but really it just allows you to then say we can
reuse machine images across, or it can have
something registered out there that [unknown] can use
and we're working off the same kind of base.
So I'm looking at a paper as a graduate student, I say,
"Well, how do I reproduce that experiment that that person did?
I know I need to include it in my related work section.
How do I reproduce their experiment?" Well, the beauty of the Cloud
and machine images and all this now is it gives me the ability
to now actually go recreate somebody else's experiment.
How do I get it installed and configured and everything
back to the sample dataset that they used so that I can say,
"Do I have an apples-to-apples comparison with the work
that they're doing and the research work that I'm doing?"
The Cloud helps you do that with its virtualization technology as well.
So virtual environments allow you to then go in
and try a wide variety of uses and configurations.
I think there was a question, like I said earlier,
about how do you know how many map tasks you should do
in a MapReduce cluster? Well, now you can actually
experiment with that. I can actually build a cluster,
change my parameters, how many cluster nodes should I have,
how did it work, scale it down, redo it again,
let's add nodes, let's decrease the nodes, whatever.
And you can start to look through and say,
"How did my algorithm scale? Did it actually work for me or not?"
And you can work your way up and say, "Oh, I got a bottleneck
at 100 nodes! Why did I get that bottleneck, what was my problem?"
You've got this Cloud environment now that allows you to do that.
So you just... and for those that aren't computer scientists,
you can actually dig into the foundational components
of operating systems and all that, I mean, this is literally
a test bed that allows you to understand more
about the systems that you're using.
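The node-count experiment described above (build a cluster, change the number of nodes, rerun, and look for the point where scaling stops paying off) can be summarized with a small sketch in Python. This is one possible way to read the results, not a method given in the talk: the function names, the 0.5 efficiency threshold, and the timings are all hypothetical placeholders, and real run times would come from your own cluster runs.

```python
def scaling_report(runtimes):
    """Return (nodes, speedup, efficiency) rows relative to the smallest run.

    runtimes maps a node count to the wall-clock seconds measured at that size.
    """
    base_nodes = min(runtimes)
    base_time = runtimes[base_nodes]
    rows = []
    for nodes in sorted(runtimes):
        speedup = base_time / runtimes[nodes]
        # Efficiency of 1.0 means the job scaled perfectly with added nodes.
        efficiency = speedup / (nodes / base_nodes)
        rows.append((nodes, speedup, efficiency))
    return rows


def first_bottleneck(runtimes, min_efficiency=0.5):
    """Smallest node count whose efficiency falls below the threshold, or None."""
    for nodes, _speedup, efficiency in scaling_report(runtimes):
        if efficiency < min_efficiency:
            return nodes
    return None


# Placeholder timings in seconds, invented for illustration only.
timings = {1: 1000.0, 10: 110.0, 50: 30.0, 100: 25.0}

for nodes, speedup, efficiency in scaling_report(timings):
    print(f"{nodes:>4} nodes  speedup {speedup:6.2f}  efficiency {efficiency:.2f}")
print("first bottleneck at:", first_bottleneck(timings))
```

With these placeholder timings the efficiency falls to 0.40 at 100 nodes, so `first_bottleneck` reports 100. That is the size at which you would start asking why the extra nodes are not helping.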