So the point made here is that collaborative filtering
by itself doesn't solve the scalability problem directly.
In the Amazon paper from 2003, a pre-processing approach
is described where you calculate all the pair-wise
similarities between items in advance. You might do that,
say, at midnight, although it doesn't really matter when
you do it, because these sites are online 24 hours a day;
there is always some part of the world
where they are being accessed.
[pause]
Of all those pair-wise similarities, you don't actually
use all of them, of course; you only need the ones
involving items that the user has ranked.
[pause]
Here's some philosophy, which again I have to take
on face value: it is better to use items than users,
because the items are more stable than the users.
[pause]
And you don't actually have to calculate everything:
if you have N items, you don't have to calculate
N-squared numbers, because many pairs of items have
no users in common who ranked both, so for those pairs
there is no similarity at all. And you can also,
obviously, apply various restrictions,
which can cut out even more cases.
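A minimal sketch of that pre-computation, assuming ratings are stored as a user-to-item dictionary (the data layout and the cosine measure here are my assumptions for illustration, not Amazon's actual implementation). The key point is that item pairs are only generated via users who rated both, so pairs with no raters in common are never even considered:

```python
from collections import defaultdict
from math import sqrt

def precompute_item_similarities(ratings):
    """Pre-compute cosine similarities for item pairs that share
    at least one rater; all other pairs are implicitly zero."""
    # ratings: {user: {item: score}}
    item_users = defaultdict(dict)          # item -> {user: score}
    for user, items in ratings.items():
        for item, score in items.items():
            item_users[item][user] = score

    # Enumerate candidate pairs through each user's rated items,
    # so a pair with no co-raters is never generated at all.
    pairs = set()
    for user, items in ratings.items():
        rated = sorted(items)
        for i in range(len(rated)):
            for j in range(i + 1, len(rated)):
                pairs.add((rated[i], rated[j]))

    sims = {}
    for a, b in pairs:
        ua, ub = item_users[a], item_users[b]
        common = ua.keys() & ub.keys()
        dot = sum(ua[u] * ub[u] for u in common)
        na = sqrt(sum(v * v for v in ua.values()))
        nb = sqrt(sum(v * v for v in ub.values()))
        sims[(a, b)] = dot / (na * nb)
    return sims
```

This table can then be built overnight (or at any time) and looked up at recommendation time.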
[pause]
Here are some remarks which actually apply
to both item-based and user-based filtering.
[pause]
The question is how you do ratings or rankings.
If you go to any of these sites, you can see
how they do it. They use stars or some other
way of doing this. Likert is famous for having
defined a whole set of response scales:
a 1-to-3 scale might be 'bad', 'good', 'best',
or something like that, and a 1-to-7 scale goes from
'violently dislike' to 'violently like' with various
gradations in between, and so on. These
numerical scales, which do not have much granularity,
turned out to be reasonably successful,
because it's probably true that most users
really can't rank much more accurately
than this 1-to-5 or similar type of scale.
[pause]
For movies, I gather that a 10-point scale was
accepted. And there is something we mentioned earlier,
the Joker website, which we discussed in a
previous talk; there they actually had a
continuous scale with some sort of input bar.
[pause]
And obviously we can also use rankings of different
aspects of an item. If you go to Audible.com,
where you rank books, which I sometimes do,
you rank the story, the reader, and so on
separately. So there are multiple rankings
you can make. Although again, you often only
look at the overall ranking, unless there's some
particular case where, say, the reader of the book
has been given a strongly negative review,
which could then drag down the overall reviews.
So you may wish to have this, but again,
I doubt it affects things too much.
Users are not willing to rank many items,
and people like me rarely rank anything.
I ignore the emails from Amazon asking me to rank
how a particular marketplace vendor performed;
I rarely respond to those, so I'm not helping
Amazon give feedback to its users.
But these sites have so many users
that they still get lots and lots of rankings.
And I consider it one of the miracles of the Internet,
one I wouldn't have predicted,
and a very good example of Big Data
and the data-oriented approach, that these rankings
are an enormously powerful way of making very
precise statements as to which items a user would like.
That's not what I would have guessed;
I would have guessed the more obvious content-based
recommendations, where you look at the actual
physical features of an item and use those to
try to predict what other items a user would like.
[pause]
And here are some comments on ratings,
[unknown] when you actually click
and say this is 1 star through 5 stars.
But just the fact that you clicked through to
a particular seller on Amazon
is already telling you something.
And when I spend a long time on a page,
I probably like it better than when I immediately
dismiss it and go to another page.
So there are a lot of so-called implicit ratings.
Sometimes there's a question of context, and,
as they point out here, the fact that I bought a book
doesn't mean I liked it, because often I buy a book
which I don't actually like so much. But here is a
characteristic which works: if I buy a book I don't like,
I don't buy another book by that author.
So you can look at an implicit rating such as
how many authors I buy multiple books by.
That is a strong statement that I like those books.
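The repeat-author signal just described can be sketched in a few lines, assuming a purchase log of (user, author, title) records (the record layout is my assumption for illustration):

```python
from collections import defaultdict

def repeat_author_signal(purchases):
    """Derive an implicit 'likes this author' signal: an author
    counts as liked by a user only if that user bought more than
    one distinct title by that author."""
    # purchases: iterable of (user, author, title) tuples
    titles = defaultdict(set)               # (user, author) -> {titles}
    for user, author, title in purchases:
        titles[(user, author)].add(title)
    return {(u, a) for (u, a), ts in titles.items() if len(ts) > 1}
```

A single purchase contributes nothing, which is exactly the point: one book bought may be a mistake, two by the same author rarely is.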
[pause]
Now remember, though, we're doing Big Data,
but some aspects of Big Data
don't have so much data, so that's worth
discussing. Obviously, when an item newly arrives,
it's not so easy to recommend it, because it has
no rankings. You can have a sort of
forced march, where you actually drag users along
to rate these items. You can also, of course, use
the content-based approach initially, and then
gradually phase in the ranking-based approach.
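One simple way to phase in the ranking-based approach is a blend whose weight grows with the number of ratings an item has. This is a sketch under my own assumptions (the half-life parameter `k` and the linear blend are illustrative, not a published formula):

```python
def hybrid_score(cf_score, content_score, num_ratings, k=20):
    """Blend a content-based score with a collaborative-filtering
    score: with few ratings, trust the content score; as ratings
    accumulate, shift the weight toward the CF score."""
    w = num_ratings / (num_ratings + k)     # 0 for a brand-new item
    return w * cf_score + (1 - w) * content_score
```

A brand-new item (zero ratings) is scored purely by content; once it has `k` ratings, the two signals are weighted equally.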
[pause]
There are, presumably, always better algorithms.
Remember, with Netflix, I think the winning algorithm
combined 100 different approaches,
or some number like that. That's obviously
an area of research. Again,
I suspect most of this research is going on
within Amazon and Netflix and so on,
and they don't tell you what they're doing.
[pause]
One approach, which is illustrated on the next slide,
is transitivity: if you have a user B near user A,
and another set of users C near B,
then probably the users in C are near A.
That can be a non-trivial deduction,
because A and some of the users in C may not have
the rankings in common to allow a direct calculation.
However, there may be rankings in common
that allow you to know that A is near B, and a different
set of rankings in common that B is near C.
So this is a very interesting feature
of the sparseness of these vectors,
because this transitivity is trivial if the vectors are
all fully populated; it's just a property of
distances. If A is near B and B is near C,
and these are real distances, then A is pretty near C.
That's the triangle inequality. But we know
that the triangle inequality is definitely not true here,
and cannot be used directly. So you have to make
some slightly non-trivial assumptions to use it.
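The sparseness problem can be seen in a tiny example (the item names are made up for illustration): two rating vectors can each overlap with a middle vector while having no items in common with each other, so their direct similarity simply cannot be computed:

```python
# Sparse rating vectors: A overlaps B, B overlaps C,
# but A and C share no rated items, so sim(A, C) has
# no direct value even if sim(A, B) and sim(B, C) are high.
A = {"I1": 5, "I2": 4}
B = {"I2": 4, "I3": 5}
C = {"I3": 5, "I4": 3}

def common(u, v):
    """Items rated by both users."""
    return u.keys() & v.keys()
```

With fully populated vectors this situation cannot arise, which is why transitivity is only interesting in the sparse case.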
[pause]
So here the recursive collaborative filtering algorithm
is briefly discussed, and this is an example of transitivity
where you effectively apply collaborative filtering
recursively. In this example, User1 is near Alice
and they have a strong similarity, but unfortunately
we can't use User1 directly to help Alice rank Item5,
because User1 hasn't ranked it. We may be able to find
other users near User1 who have rankings in common
with User1, so we can estimate a ranking for User1,
and then use that estimated ranking to help Alice.
The whole point is that we might have
identified a set of wonderful users near Alice,
but those users have not directly ranked
some of the items we care about;
however, those users near Alice have friends,
or whatever you want to call these people
who are near them, who have done the relevant rankings.
Obviously this could be applied in very complicated
iterative fashions. It says recursive here;
recursion and iteration mean more or less
the same thing in this context.
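The recursion just described can be sketched as follows. This is my own minimal illustration, not the algorithm from the slide: a similarity-weighted average over neighbours, where a neighbour's missing rating is itself estimated one level deeper, with a depth limit to keep the recursion bounded:

```python
def predict(ratings, sims, user, item, depth=2):
    """Recursive collaborative filtering sketch: predict `user`'s
    rating of `item` from nearby users' ratings; if a neighbour
    hasn't rated the item, recurse one level to estimate the
    neighbour's rating first."""
    if item in ratings[user]:
        return ratings[user][item]
    if depth == 0:
        return None
    num = den = 0.0
    for other, sim in sims.get(user, {}).items():
        r = ratings[other].get(item)
        if r is None and depth > 1:
            # Neighbour hasn't rated the item: estimate it from
            # the neighbour's own neighbours (one level deeper).
            r = predict(ratings, sims, other, item, depth - 1)
        if r is not None:
            num += sim * r
            den += sim
    return num / den if den else None
```

In the slide's example, User1's rating of Item5 would be estimated from User1's own neighbours first, and that estimate then feeds into Alice's prediction.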