Info2040x Mod4 Kleinberg Hubs And Authorities V1

We're talking about how we can use the network structure of the web to find high quality pages. And we're discovering that it's actually much more subtle than just using links to cast votes. We can actually get a lot more out of the link structure. What I'm going to be doing here is actually talking through the process that we did informally, by hand, in the last segment. So one thing I'd like you do is make sure that that's fresh in your mind. So go back and watch it again, if necessary, so that you remember the example that we were talking about, that we can start from. We were looking for museums, in our particular example. And pages cast votes. And we then decided that certain pages were really good lists because they linked to things that got a lot of votes. So we weighted them more highly and we voted again, but weighting those pages' votes by their quality as a list. In that way we got new scores. What's actually going on here is that, essentially, we've identified that there are really two kinds of quality measures for web pages. One is, what you might call, its value as an Authority, which is essentially the thing we thought we were looking for originally-- the level of endorsement it gets from the web. So an Authority on museums would be the Smithsonian, or the Prado, some page that really is the item you're looking for. There was this second thing also, which was a different kind of quality. Its quality as what we would call a Hub, a collection of outbound links pointing to the authorities. In other words, the way we think about Authority is that something which is endorsed, and particularly something which is endorsed by good Hubs. And if you asked me, what's a good Hub? I'd say, well that's something which links to a lot of good authorities. You see there's something a little worrisome about this pair of definitions when you first hear it, right? An Authority is something that gets links from Hubs. A Hub is something that points to authorities. It's sound circular. Each thing is defined in terms of the other. But, in fact, we kind of break that circularity by just using each definition in terms of the other, and going back and forth. That's effectively what we were doing. Start with a guess for how good something is as a Hub on each page, and then update our guess about [? everything's ?] score as an Authority. Take that guess about how good everything is as an Authority and use it to update our estimate for how good things are as a Hub. If we just go back and forth then we can actually, in a sense, unwind the circularity in the definition and end up at reasonable answers that make sense. Here's how we're going to do this in a more formal, precise way. For each page on the web, call it page p, we're going to create two scores. One will be called Authority of p, one will be called Hub of p. Those are its scores as a Hub and as an Authority. I should mention that, in our simple example here, each page is drawn so that it seems to be only clearly a Hub or clearly an Authority. But of course, there are things which are both. So for example, the Cornell athletics page might be both an Authority on the Cornell athletics program but also a Hub page linking to many different Cornell sports teams, for example. So a page could be both, and that would be fine. Now, we want to figure out how to update these scores in terms of each other, and in effect, we've already done that in our work example. We have these two update rules. So the Authority Update Rule says, take the current estimate of the scores you have and let's update the Authority of page p to be the sum of the Hub scores of all pages that point to p. So there is a bunch of links coming into p, they each have Hub scores. Make the Authority score be the sum of all those scores. That's what we did with our museums. The quality of the lists that pointed to it gave us the new score. On the other side, we have the Hub Update Rule. And remember, Hubs get better when they point a lot of authorities, so we should update the Hub score of a page p to be the sum of the Authority scores of all pages that p points to. Notice that really, the Hub Update Rule and the Authority Update Rule, they're very similar to each other. The Hub Update Rule sort of looks like the Authority Rule with the links going the other way. Now, to unwind this circular relationship between the two, we just run these definitions in sequence, alternately, over and over. We start with estimates of 1 for each page, for all scores, indicating that when we start out our search algorithm knows nothing. It knows nothing about the world it just has it's big sample of pages. And now, we're going to run these two things in turn. So in fact, that's what we did in our example. If every page has a Hub score of 1 to start with, then when we run the Authority Update Rule, then the initial Authority score is just the total number of links coming into you. That was our voting scheme. Now that's their new Authority score. Let's run the Hub Update Rule. When we run the Hub Update Rule, everybody gets a Hub score equal to the total amount of Authority it points to, which was, in fact, our little heuristic for finding good lists on the web. Now we run the Authority Update Rule again. When we do that, we're re-weighting all the pages by their scores as a Hub. So this is the way in which, for example, the Prado acquired a score of 18, which was 8 plus 10, the two Hub scores that pointed to it. The Smithsonian acquired a score of 26, which was actually the sum of the four Hub scores the pointed to it. And again, we don't have to stop there. We could now, in fact, update the Hub scores again. So that, for example, a page the points to the Louvre, and the Prado, and the Smithsonian, now gets a very, very large Hub score numerically, because we're adding up 18 plus 18 plus 26. This is something we can do, we can run it. And of course, the numbers are getting big. But it's sort of an illusion that the numbers are getting big because all we really care about if we're trying to rank these pages is the relative ordering of numbers. And so what we can do is, after these iterations, we can just divide all the scores down so they add up to 1. So we're actually going to be thinking about working with a bunch of decimals here, because it doesn't matter what the actual magnitudes of the numbers are. Only which ones are bigger or smaller relative to which other ones. And that is how we're going to do computations involving Hub and Authority scores. We're going to start with this flat set of scores and we're just going to keep updating, in turn, Authority Update, Hub Update, back and forth. One thing you can actually show mathematically, which we're not going to prove here, is that if you keep doing this, and you keep dividing down the scores so that they don't grow-- they always add up to 1-- then the scores will shift around but they will actually converge. So they'll converge to a stable set of values that, over time, change less, and less, and less, over the iterations of this Hub and Authority Update Rule. And actually, at the point they're converging to, the scores won't change at all, they'll actually remain fixed. You apply the Hub Update Rule, the Authority Update Rule, they stay the same. That's a stable, self consistent set of Hub and Authority values. After rescaling, the Authority value is the sum of the Hub values that point to it. And the Hub values are the sum of the authority> values that they point to, just like we wanted. And in the end, that becomes a way of ranking web pages. We get two kinds of good pages, the authorities themselves, and these lists that point to them. And they've been found by this back-and-forth process in the link structure where each one is supporting the other.