Info2040x Mod5 Kleinberg Power Laws V1

Now that we've been talking about mechanisms that lead people to copy the behavior of others, let's try applying these ideas to study the phenomenon of popularity: how famous, or how well-known, people or items are. This is obviously a concept that's of great interest to a lot of people. Why do celebrities become famous? Why do certain books or movies or songs become hits? Why do some politicians rise to prominence? All of this is about how certain things command the public's attention. Now one thing about popularity, part of its very essence, is that it is characterized by extreme imbalances. A few very popular people and things command a huge fraction of the attention, and most things go relatively unnoticed. And the question is, is that inevitable? Is it inherent in the very definition of popularity that it should have to turn out that way, with these big imbalances? We're going to argue that it is, and that the idea of copying and imitating the decisions of others is going to be at the heart of the reason why.
OK. But first, before we get into mechanisms, and get into why copying might lead to these popularity imbalances, let's just ask a more basic question, a question about the state of the world. How is popularity actually distributed? Now we can ask this in a bunch of different domains. We could, for example, look at name recognition: how many people recognize the names of different famous individuals. But that's, of course, very hard to collect data on. And so, instead, we're going to ask about something where we have very detailed data and where we see similar kinds of popularity effects. And that's the web. In particular, we're going to go back and look at the web as a graph. So again, we're going to have nodes that are pages and links between the pages, and fame here, or popularity, is going to mean having many, many links pointing into your site.
So a site like Google, or Amazon, or Facebook, or the site of a big news organization: all of these are very, very popular pages, with many, many incoming links.
OK. So let's go through a sequence of definitions that will let us make this notion of popularity very concrete. First of all, I'll say that the in-degree of a page is the number of incoming links. That's my term for that concept. So if a page has 2 links pointing to it, it has an in-degree of 2. Now the question is, as a function of k, what fraction of pages have an in-degree of k? That's really the question of how popularity is distributed. Take large values of k: an in-degree of a million means you have a million links pointing to you. What fraction of web pages actually have such a level of popularity, versus what fraction of pages have maybe 10 links coming in? We'll let f of k be this fraction. It's just a function of k, the fraction of pages with that level of popularity.
Now we want to think about what f of k looks like, how in-degree is distributed over the population of pages. And when you're in this situation, a natural first guess would always be a Gaussian, or normal, distribution. That's a natural guess when you're thinking about how almost anything is distributed across the natural or social worlds. For example, a whole bunch of students all take a test. And then a week later, you've computed all the scores, and you draw a histogram of what the scores were. And you see this nice bell-curve shape, this normal distribution. You often find that. Or you look at a field where a crop has been planted, and you see all the plants have slightly different heights. And if you were to measure their heights very carefully, you might see, again, a Gaussian distribution, this sort of concentration in the middle, and then the bell-curve shape around it. And there's a reason why Gaussian distributions are so, so prevalent.
The reason Gaussian distributions are so prevalent is this thing called the central limit theorem, which roughly says that if the quantity I'm observing is the sum of many small, independent, random increments, then as the number of increments grows very, very large, the distribution I'm dealing with should converge to a Gaussian distribution. So for example, on the test, if many students all answer a bunch of questions, and everyone gets each question wrong with some small probability, then each score is the sum of all those random outcomes, right or wrong on each question, and the scores end up following a Gaussian distribution. Or I have these plants out in the field, all subjected to the same environmental conditions. Each of them grows by a small, slightly random amount each day. And so the sum of all those random effects gives me a Gaussian distribution. So that's, in effect, why these kinds of distributions are ubiquitous in the world.
But a Gaussian distribution would make a particular prediction about popularity, which is that popular pages would be extremely rare. In particular, if f of k followed a Gaussian distribution, it would decrease exponentially fast in k. So large values of k would be very, very hard to find. You would very rarely encounter popular pages. And we feel like that's not quite right, because we see popular pages on the web quite a bit. And in fact, that's what people found when they first began measuring popularity distributions in the web graph. What they found was that the fraction of pages with in-degree k does not fall off exponentially fast in k. It falls off much more gently, roughly like 1 over k squared. Actually a little faster than that, something like 1 over k to the 2.1. But we can think of it as approximately 1 over k squared for right now. OK. So we refer to such a distribution as a power law, because it's 1 divided by k to some fixed power: 2, in this case.
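To get a feel for how different these two predictions are, here is a small Python sketch that compares a Gaussian-shaped f(k) with a power law proportional to 1/k^2. The mean of 10 and standard deviation of 5 for the Gaussian are made-up values chosen only for illustration, and the function names are hypothetical. Nothing here comes from real web data.

```python
import math


def gaussian_fraction(k, mean=10.0, std=5.0):
    """Normal density evaluated at k. The mean and std are illustrative assumptions."""
    z = (k - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def power_law_fraction(k):
    """f(k) proportional to 1 / k^2 for k >= 1.

    The constant 6 / pi^2 makes the fractions sum to 1 over k = 1, 2, 3, ...
    because the sum of 1 / k^2 over all positive integers is pi^2 / 6.
    """
    return (6.0 / math.pi ** 2) / (k ** 2)


# Compare the two candidate shapes at increasingly large in-degrees.
for k in (10, 100, 1000, 1_000_000):
    print(f"k={k:>9}  gaussian={gaussian_fraction(k):.3e}  "
          f"power_law={power_law_fraction(k):.3e}")
```

Under these assumptions the Gaussian value is already vanishingly small by k = 100, while the power law still gives a tiny but nonzero fraction of pages an in-degree of a million.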
And it turns out that, if we start looking at popularity across many, many different domains, power laws turn out to be ubiquitous. They are just very, very good approximations for popularity distributions in many settings. So, for example, I could ask about books: the number of books that sell k copies. Or the number of phone numbers that receive k phone calls during the day. Or the number of scientific papers that receive k citations. In each of these domains, there's some measure of popularity: incoming links on the web, book sales, calls to phone numbers, citations to papers, box office grosses for movies. And in all these cases, we see that the fraction of things with popularity k can be approximated by a power law: 1 over k to some power. It's also one of the simplest and quickest things you can do when handed a novel data set: identify some measure of popularity in it and check to see if it follows a power law. Someone says, I'm running a music download site, I have a bunch of songs, here are all the download records for them. A very quick thing you could do would be to look at all the songs, look at the most popular, and the next most popular, and so on, and ask, does their popularity follow a power law? It almost surely will. It's just a very, very robust phenomenon.
But now the question is, if power laws are everywhere in the study of popularity, why is that? There must be some simple, robust explanation. And it can't be the central limit theorem, because the central limit theorem leads to Gaussian distributions. So it can't be that popularity is a sum of small independent random increments, the way you get with Gaussians. But in fact, that's the whole point. The point is that a person's or an item's rise to popularity was precisely not a sum of independent random effects. Rather, it was about a bunch of dependent effects. A few people made some early decisions, and then people began copying what they saw.
And the crowd all moved and built on itself. And there was a feedback effect, where people imitated the decisions that had come earlier. So in fact, popularity arises from a sequence of non-independent decisions, in which later decisions build on earlier ones. And so what we're going to do next is take this idea and turn it into a model. We're going to define a process in which later decisions depend on earlier ones. And because of that feedback effect, we're going to get, in a very robust way, a notion of popularity, and show that it follows a power law.
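The quick check described earlier, taking some measure of popularity in a data set and seeing whether it follows a power law, can be sketched roughly as follows. If f(k) is proportional to 1 over k to some power c, then log f(k) plotted against log k is a straight line with slope -c, so a simple least-squares fit on the logs gives an estimate of the exponent. The data below is synthetic and constructed to follow 1/k^2, purely as an illustration. The function names are hypothetical, and a plain log-log fit like this is only a first look, not a rigorous statistical test.

```python
import math
from collections import Counter


def popularity_distribution(popularity):
    """Return {k: fraction of items whose popularity equals k}."""
    counts = Counter(popularity)
    total = len(popularity)
    return {k: n / total for k, n in sorted(counts.items())}


def loglog_slope(f):
    """Least-squares slope of log f(k) against log k, using only k >= 1."""
    ks = [k for k in f if k >= 1]
    xs = [math.log(k) for k in ks]
    ys = [math.log(f[k]) for k in ks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic data (an assumption, not real measurements): for each k from 1 to 20,
# include about 10000 / k^2 items whose popularity is exactly k.
popularity = [k for k in range(1, 21) for _ in range(round(10000 / k ** 2))]

f = popularity_distribution(popularity)
slope = loglog_slope(f)
print(f"estimated power-law exponent: {-slope:.2f}")
```

With real data, such as in-degrees of pages or download counts of songs, the same two functions could be applied to the list of observed counts. The tail of a real distribution is usually noisy, so the fitted slope should be read as a rough indication only.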