So now we'll look at those optimal clusterings
in more detail. We've finished looking at
the original labels and comparing them with the clustering,
and we'll start with three clusters, which is in fact
what we just did. These particular clusters
were found by a methodology called
deterministic annealing, which we'll say a few
words about towards the end of this lesson.
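As a preview, deterministic annealing replaces K-means' hard assignments with soft, temperature-dependent ones that gradually harden as the system is cooled, which helps avoid poor local minima. This is only a minimal sketch of the idea in NumPy, not the actual code used to produce these clusters; the `da_cluster` helper and its temperature schedule are illustrative assumptions:

```python
import numpy as np

def da_cluster(points, k, t_start=100.0, t_end=0.01, cooling=0.7,
               iters=10, init=None, seed=0):
    """Annealed soft K-means sketch: Gibbs responsibilities at
    temperature t, hardened as t is lowered toward zero."""
    rng = np.random.default_rng(seed)
    if init is None:
        init = points[rng.choice(len(points), k, replace=False)]
    centers = np.array(init, dtype=float)
    t = t_start
    while t > t_end:
        for _ in range(iters):
            # squared distances to each center, shape (n, k)
            d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            logits = -d2 / t
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            p = np.exp(logits)
            p /= p.sum(axis=1, keepdims=True)            # soft assignments
            # centers become responsibility-weighted means
            w = np.maximum(p.sum(axis=0), 1e-12)
            centers = (p.T @ points) / w[:, None]
        t *= cooling                                     # cool down
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)                    # hard labels at the end
```

At high temperature every point belongs a little to every cluster; as t falls, the assignments sharpen into ordinary K-means labels.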
[pause]
I pointed out that we have the case of three clusters,
which we just saw as part of the comparison
with the original labeling, and we also have
a case of 28 clusters. You could ask
why on earth I chose a funny number like 28.
Well, I actually asked for 30, but as I mentioned
in the case of the Wikipedia data, sometimes in K-means,
and in fact in the sophisticated K-means using
deterministic annealing, you get cases where clusters
either have no members or are too small to be interesting.
That's what happened here: the algorithm
did find 30 clusters, but there's a
cleanup step where it removes small clusters,
and it removed two, leaving 28.
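That cleanup step is easy to sketch: count the members of each cluster, discard the clusters that fall below a size threshold, and renumber the survivors. This is a minimal illustration, assuming a hypothetical `min_size` cutoff (the actual threshold used for these 30 clusters isn't stated):

```python
import numpy as np

def drop_small_clusters(labels, min_size=10):
    """Discard clusters with fewer than min_size members and renumber
    the survivors 0..k'-1.  Points from dropped clusters are marked -1
    (they could instead be reassigned to the nearest surviving center)."""
    labels = np.asarray(labels)
    ids, counts = np.unique(labels, return_counts=True)
    keep = ids[counts >= min_size]                      # surviving cluster ids
    remap = {old: new for new, old in enumerate(keep)}  # renumber survivors
    return np.array([remap.get(l, -1) for l in labels])
```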
And we will see that in the case of the 28 clusters,
the points are divided into geometrically
compact regions which are significantly smaller
than the three clusters. This is something
which can clearly be used in the classification algorithms
needed for recommender systems.
So here is what we just looked at
with the three clusters, and we have two screenshots
here; in this one the structure is very clear
for this particular clustering.
[pause]
And here's another case where it's still
reasonably clear with a different view.
So now we'll look at that again in PlotViz.
[pause]
So with PlotViz, let's try to find...
[pause]
I just closed the case with three clusters, so
we won't look at three again,
since we already looked at it.
Well, I can show you how to actually open PlotViz.
If you have a PlotViz file on your computer,
as we do here, and we want to look at
this file, we just click on it;
the .pviz extension is associated with PlotViz,
so PlotViz opens it. We get rid of the axes,
as I told you to do before; we set the glyphs to be
auto-oriented, and then we make it bigger.
We go to the full-screen view,
and usually the plot is too small when you do that,
so we have to enlarge it,
which we can do in several ways...
[pause]
for example with the left mouse button. And now we can, of course,
look at it as we just did. It's a little bigger
because previously I had two plots on the same screen.
Now I can look at this group of a thousand points
in a more optimal fashion. And of course,
in PlotViz you can move it...
[pause]
resize it...
[pause]
With Shift plus the mouse, you can just move it
without rotation; Ctrl rotates, and so on.
These are all nice PlotViz commands.
[pause]
Now we'll finish with that and move on,
staying in PlotViz. Here we have the same
1000 points, now divided into
28 clusters; this one is already set up correctly.
So what we have to do is to go to full-screen...
[pause]
rotate it, choosing a nice orientation,
so you can make it as big as possible
and still keep it on the screen.
Something like that. Now just look at this
division of the 1000 points into 28 clusters.
Well, it looks pretty messy, but if you actually look at it...
take this blue set here,
which has 51 points in it.
Its points are clearly close to each other.
[pause]
So you can see that these points here are
nice and close to each other, and these points here
are nice and close to each other.
The overall picture looks a mess
because you're seeing 28 clusters at once, and in any view
some of those clusters are going to sit on top of each other,
since we've chopped the space up in each of the dimensions.
You can see, however, that each of the clusters
is pretty compact; here's a nice green cluster
which is compact. I should say that with 28 clusters
it's not easy to find unique colors,
so some of these colors look pretty similar;
here's another, slightly different, shade of green
over here, which is again pretty compact.
So that's what K-means does for you: it takes a region
and divides it into geometrically compact regions.
So let's again think of this as a recommendation system.
You've clustered these 1000 points;
you know how they're ranked and
how they're viewed by different people.
Now along comes a new point,
whether it be a person or an item.
You can interpolate that point
into this space, see what it's near,
and if it lands near this point here,
you can expect that item to be
similar to the other items in the same green cluster.
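That interpolation step amounts to a nearest-center lookup: place the new point in the clustered space and find the closest cluster center; that cluster's members are the recommendation candidates. A minimal sketch, where the centers and the new point are made-up illustrative values, not the actual data:

```python
import numpy as np

def nearest_cluster(new_point, centers):
    """Return the index of the cluster center closest to new_point;
    members of that cluster are the recommendation candidates."""
    d2 = ((np.asarray(centers) - np.asarray(new_point)) ** 2).sum(axis=1)
    return int(d2.argmin())

# hypothetical cluster centers in a 2-D projected space, and a new point
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
print(nearest_cluster([4.5, 5.2], centers))  # closest to center 1
```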
So clustering is a nice way of dividing the space
into regions that are relatively compact,
giving you a rather precise identification
of similar items. That's why I showed you
the 28-cluster case: to illustrate this other view of
clustering, not as trying to find well-separated
groups, but simply as classifying regions.
So that's the end of that one.
So that's the end of that one.
[pause]
We need to come out of full screen,
and we get rid of that one.
Now we can go back to the PowerPoint.
[pause]
This is just the same thing I did in real time,
now shown as a screenshot; I just took that
clustering and screenshotted it. Again,
you can see the compactness: here's a nice
compact brown cluster, and so on.
Here's a light blue cluster, also pretty compact, and so on.
[pause]
So this note at the top left just summarizes
what I've said, and here is a different view, which again
shows the various clusters to be clearly
compact and geometrically close together.