So now we'll look at those optimal clusterings
in more detail. We've finished looking at
the original labels and comparing them with the clustering,
and we'll start with three clusters, which is in fact
what we just did. These particular clusters
were found by a methodology called
deterministic annealing, which we'll say a few
words about towards the end of this lesson.
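As a preview, deterministic annealing replaces K-means' hard assignments with soft, temperature-dependent ones that gradually harden as the system is cooled, which helps avoid poor local minima. This is only a minimal sketch of the idea in NumPy, not the actual code used to produce these clusters; the `da_cluster` helper and its temperature schedule are illustrative assumptions:

```python
import numpy as np

def da_cluster(points, k, t_start=100.0, t_end=0.01, cooling=0.7,
               iters=10, init=None, seed=0):
    """Annealed soft K-means sketch: Gibbs responsibilities at
    temperature t, hardened as t is lowered toward zero."""
    rng = np.random.default_rng(seed)
    if init is None:
        init = points[rng.choice(len(points), k, replace=False)]
    centers = np.array(init, dtype=float)
    t = t_start
    while t > t_end:
        for _ in range(iters):
            # squared distances to each center, shape (n, k)
            d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            logits = -d2 / t
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            p = np.exp(logits)
            p /= p.sum(axis=1, keepdims=True)            # soft assignments
            # centers become responsibility-weighted means
            w = np.maximum(p.sum(axis=0), 1e-12)
            centers = (p.T @ points) / w[:, None]
        t *= cooling                                     # cool down
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)                    # hard labels at the end
```

At high temperature every point belongs a little to every cluster; as t falls, the assignments sharpen into ordinary K-means labels.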
[pause]
I pointed out that we have the case of three clusters,
which we just saw as part of the comparison
with the original labeling, and we also have
a case of 28 clusters. You could ask
why on earth I chose a funny number like 28.
Well, I actually asked for 30, but as I mentioned
in the case of the Wikipedia data, sometimes in K-means,
and in fact in the sophisticated K-means using
deterministic annealing, you get cases where clusters
either have no members or are too small to be interesting.
That's what happened here: the algorithm
did find 30 clusters, but there's a
cleanup step where it removes small clusters,
and it removed two, leaving 28.
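That cleanup step is easy to sketch: count the members of each cluster, discard the clusters that fall below a size threshold, and renumber the survivors. This is a minimal illustration, assuming a hypothetical `min_size` cutoff (the actual threshold used for these 30 clusters isn't stated):

```python
import numpy as np

def drop_small_clusters(labels, min_size=10):
    """Discard clusters with fewer than min_size members and renumber
    the survivors 0..k'-1.  Points from dropped clusters are marked -1
    (they could instead be reassigned to the nearest surviving center)."""
    labels = np.asarray(labels)
    ids, counts = np.unique(labels, return_counts=True)
    keep = ids[counts >= min_size]                      # surviving cluster ids
    remap = {old: new for new, old in enumerate(keep)}  # renumber survivors
    return np.array([remap.get(l, -1) for l in labels])
```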
And we will see that in the case of the 28 clusters,
the points are divided into geometrically
compact regions which are significantly smaller
than the three clusters. This is something
which can clearly be used in the classification algorithms
needed for recommender systems.
So here is what we just looked at
with the three clusters, and we have two screenshots
here; in this one the structure is very clear
for this particular clustering.
[pause]
And here's another case where it's still
reasonably clear with a different view.
So now we'll look at that again in PlotViz.
[pause]
So with PlotViz, let's try to find...
[pause]
I just closed the case with three clusters, so
we won't look at three again,
since we already looked at it.
Well, I can show you how to actually open PlotViz.
If you have a PlotViz file on your computer,
as we do here, and we want to look at
this file, we just click on it;
the .pviz extension is associated with PlotViz,
so PlotViz opens it. We get rid of the axes,
as I told you to do before; we set the glyphs to be
auto-oriented, and then we make it bigger.
We go to the full-screen view,
and usually the plot is too small when you do that,
so we have to enlarge it,
which we can do in several ways...
[pause]
for example with the left mouse button. And now we can, of course,
look at it as we just did. It's a little bigger
because previously I had two plots on the same screen.
Now I can look at this group of a thousand points
in a more optimal fashion. And of course,
in PlotViz you can move it...
[pause]
resize it...
[pause]
With Shift plus the mouse, you can just move it
without rotation; Ctrl rotates, and so on.
These are all nice PlotViz commands.
[pause]
Now we'll finish with that and move on,
staying in PlotViz. Here we have the same
1000 points, now divided into
28 clusters; this one is already set up correctly.
So what we have to do is to go to full-screen...
[pause]
rotate it, choosing a nice orientation,
so you can make it as big as possible
and still keep it on the screen.
Something like that. Now just look at this
division of the 1000 points into 28 clusters.
Well, it looks pretty messy, but if you actually look at it...
take this blue set here,
which has 51 points in it.
Its points are clearly close to each other.
[pause]
So you can see that these points here are
nice and close to each other, and these points here
are nice and close to each other.
The overall picture looks a mess
because you're seeing 28 clusters at once, and in any view
some of those clusters are going to sit on top of each other,
since we've chopped the space up in each of the dimensions.
You can see, however, that each of the clusters
is pretty compact; here's a nice green cluster
which is compact. I should say that with 28 clusters
it's not easy to find unique colors,
so some of these colors look pretty similar;
here's another, slightly different, shade of green
over here, which is again pretty compact.
So that's what K-means does for you: it takes a region
and divides it into geometrically compact regions.
So let's again think of this as a recommendation system.
You've clustered these 1000 points;
you know how they're ranked and
how they're viewed by different people.
Now along comes a new point,
whether it be a person or an item.
You can interpolate that point
into this space, see what it's near,
and if it lands near this point here,
you can expect that item to be
similar to the other items in the same green cluster.
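That interpolation step amounts to a nearest-center lookup: place the new point in the clustered space and find the closest cluster center; that cluster's members are the recommendation candidates. A minimal sketch, where the centers and the new point are made-up illustrative values, not the actual data:

```python
import numpy as np

def nearest_cluster(new_point, centers):
    """Return the index of the cluster center closest to new_point;
    members of that cluster are the recommendation candidates."""
    d2 = ((np.asarray(centers) - np.asarray(new_point)) ** 2).sum(axis=1)
    return int(d2.argmin())

# hypothetical cluster centers in a 2-D projected space, and a new point
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
print(nearest_cluster([4.5, 5.2], centers))  # closest to center 1
```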
So clustering is a nice way of dividing the space
into regions that are relatively compact,
giving you a rather precise identification
of similar items. That's why I showed you
the 28-cluster case: to illustrate this other view of
clustering, not as trying to find well-separated
groups, but simply as classifying regions.
So that's the end of that one.
So that's the end of that one.
[pause]
We need to come out of full screen,
and we get rid of that one.
Now we can go back to the PowerPoint.
[pause]
This is just the same thing I did in real time,
now shown as a screenshot; I just took that
clustering and screenshotted it. Again,
you can see the compactness: here's a nice
compact brown cluster, and so on.
Here's a light blue cluster, also pretty compact, and so on.
[pause]
So this note at the top left just summarizes
what I've said, and here is a different view, which again
shows the various clusters to be clearly
compact and geometrically close together.