>> WIRTH: This is some work that I did. So, you see, the slide gives Venkat Guruswami's affiliation as University of Washington, which he no longer is at; he's at Carnegie Mellon now. But he was there when I did this work with him, and with Amit Chakrabarti, who was visiting there from Dartmouth. And then I went home and continued working on this, and eventually worked my father into it. So, that's why he is, sort of, another author on the end. Okay. Now, I should probably start with a little bit of motivation before the definition. The motivation is that there was a flurry of interest in evaluating search engines--retrieval systems and that kind of thing--in the information retrieval community maybe five or seven or eight years ago. I think it's still of interest, but not at quite the level of papers that we saw at that stage. And one thing that happened was that some people said, "Well, maybe the metrics people are using to evaluate these systems are not right and we should try something different," which also inspired me to think, well, maybe I could work on these sorts of questions theoretically. Okay. So, the
question that we're interested in is this: we're given a vector A of weights between zero and one, okay? [INDISTINCT]. And these are non-increasing weights, okay? And they are known to us in advance--that's very important. But what's not known are some other quantities, Ys, between zero and one, which somehow represent something like the relevance of something to a search query. So, Yi represents whether or not the i-th item returned by this retrieval system is in fact relevant to our query. And what we want to compute is the score. I'm interested in linear scores, which is to say the weighted average of these things. Now, why weighted? Because you care much more about whether the first few things the system returns are relevant than whether the 100th or the 300th is [INDISTINCT]. And this makes the problem of estimation inherently, actually, easier--or at least, depending on the weight family, easier--than if it were just, sort of, a mean; than if we just said, "Well, let's look at how many of the documents returned are relevant." That's actually a [INDISTINCT] problem, as I'll show you, compared with the Weighted Mean, okay? So, what we want to do is come up with a good estimate for this Weighted Mean, and we want to look at as few Y values as possible, because those represent asking somebody to actually read a document and assess whether or not it's relevant, and that's expensive. So, it's sort of a query complexity question. And what we're going to do is produce a good estimate with the...
>> N and A are given.
>> WIRTH: N and A are given, indeed. Although we're going to be doing some sort of asymptotic analysis, so once N is given, we're going to imagine it growing, in some sense.
>> But it's not increasing as queries are issued?
>> WIRTH: No. Okay.
>> Are you going to give a more precise definition of the model later?
>> WIRTH: Not necessarily.
>> So, in that case, can I ask you a question?
>> WIRTH: Yes.
>> So, do you get to choose which of the Ys you want to look at?
>> WIRTH: Yes.
>> Or is it coming from some...?
>> WIRTH: No, no, no. So, you're told the As and the N, there are all these hidden Ys, and you can choose whichever ones you want to look at.
>> Okay.
>> WIRTH: And it's unit cost to look at each Y.
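To pin down the object being estimated--this formalization is my reading of the setup as just described, not a slide from the talk:

```latex
% Known in advance: non-increasing weights a_1 >= a_2 >= ... >= a_N, each in [0,1].
% Hidden: relevance values y_1, ..., y_N in [0,1], each revealed only by a
% unit-cost query. The score to be estimated is the weighted mean
W(A,Y) \;=\; \frac{\sum_{i=1}^{N} a_i \, y_i}{\sum_{i=1}^{N} a_i}.
```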
>> And your notion of approximation--what is it, exactly?
>> WIRTH: I'll explain that in a moment, yeah, sure. That, I will define really carefully.
Okay. So, I've sort of already spoken a little bit about why we're doing this, but I thought that query complexity was a natural thing to look at, and I have papers from years ago on communication complexity, which isn't terribly far away. So, I thought, why not pair it with something that worked in the past? And one thing I should point out is this issue of the metric: what I've been implying is a sort of linear metric here. But that's not necessarily what the community uses. They often use things like Mean Average Precision, which is a heavily quadratic kind of quantity, where you multiply some of the Yis by other Yjs. I mean, they've got other justifications for it, but it also means it turns out to be a quadratic metric. So, much of the research you see is in trying to get good sampling strategies for these slightly more complicated metrics. But I thought, "Well, we don't really understand the linear ones yet. Let's first try and get some better appreciation of those." Okay.
So, you could imagine a kind of decision tree--a randomized decision tree--so you can make choices at all the nodes of the tree as to which item you're going to look at; you've got some coins or dice or whatever random generator, and you say, "Okay, I want to look at this document; I want to look at that one," and so forth. At a leaf of the tree, you say, "Okay, I believe that the Weighted Mean is this." So, we've got this randomized decision tree, and we accept that it might fail with probability delta, and we also accept that we might not get exactly the right answer: we're going to measure a relative--not relative--an absolute error between the estimate we obtain and the actual value, okay? So, we're going to say it's an epsilon-delta approximator if, with probability one minus delta, we reach a leaf whose absolute difference from the true value is at most epsilon. So, one of the bugs in my slide is on the next line, where it says these people showed that this many samples suffice to estimate the Mean. It should read that you need that many to estimate the Mean. I don't know why it says suffice, but, anyway. So, there's a lower bound, which is one over epsilon squared, log one over delta, which is something that people have seen before in different contexts, okay?
>> Is there some bound on the Ys?
>> WIRTH: Sorry?
>> Is there some bound on the values of the Ys?
>> WIRTH: Oh, yes. So, the Ys--everything is constrained to be between zero and one.
>> Zero and one, okay.
>> WIRTH: So, you could imagine that in some instances they are strictly zero or one. But you can also say that it won't hurt you to allow a half, or a little of this, in a sense.
>> And one other thing: what's the probability underlying this random process?
>> WIRTH: So, it's up to you. You can choose to sample this or sample that. So you can imagine a sort of decision tree with probabilities at the nodes of the tree.
>> I thought the Ys are drawn from some distribution.
>> WIRTH: No, no. The As and Ys are fixed. So, I'll just go back--I mean, one more thing--okay, so I'll just retreat a little. So the A vector is given to you, N is given to you, the Yis are given but they're hidden, and it's up to you to choose which Yis you want to look at. And, in a sense, the probability is entirely to do with your choice of samples, your choice of the Ys that you'll look at.
>> So, you're choosing them randomly?
>> WIRTH: That's right.
>> I see. Okay.
>> WIRTH: So that you have a relatively high--like one minus delta--chance of succeeding.
>> Thanks.
>> WIRTH: And then there's the epsilon, which is the error you're allowed, which is very small.
>> Okay. Thanks. Sure.
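For reference, the guarantee under discussion can be stated as follows; the lower-bound form is how I understand the Bar-Yossef result mentioned above:

```latex
% A randomized strategy is an (epsilon, delta)-approximator if, on every
% fixed input Y, its output \hat{W} satisfies
\Pr\bigl[\,|\hat{W} - W(A,Y)| \le \varepsilon\,\bigr] \;\ge\; 1 - \delta,
% and for the plain Mean (all weights equal) the quoted lower bound is
\Omega\!\left(\tfrac{1}{\varepsilon^{2}}\log\tfrac{1}{\delta}\right) \text{ queries.}
```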
>> WIRTH: Now, it's important to get this right. And I admit that each time I present this, I get slightly different questions about the model, so sometimes it's good to stop, right, and make sure that people are, you know, with me. Okay, great. So, the complexity of the Mean question is known, okay? But, as I said, it might be that because some of these documents are more important than others, somehow we can afford not to look at quite so many things as if we were just estimating the Mean. Okay.
Now, at this point let me describe some results of Bar-Yossef. He's a sort of Google person, but he's not here. He described two things in the one paper. One is block sensitivity, which says, "Okay, there are certain parts of the input that we have to look at"--and I'll explain more about that later. And they also talked about another technique for getting a lower bound on the Mean, which [INDISTINCT]. Okay.
So, let me show you what we did. So, okay, the Weighted Mean problem, in general, is a more general problem than the Mean, and if you don't assume anything specific about the weights, then we show [INDISTINCT] in the same number of queries you can match the lower bounds for the Mean. But that's not particularly interesting, because maybe there's something special about the family of the weight series. So we looked at three different weight families. One is Geometric, which is inspired by some work from my colleagues, who believe that you ought to have geometrically decreasing weights to evaluate retrieval systems. Another, peculiar, case is where the weights sort of diverge. So, you could imagine that N is really, really large and these As form a divergent series. And that turns out to be as bad as the Mean case. I mean, the Mean is included in the divergent-series case, because its weights are just one, one, one, one, repeated: the sum of a constant, repeated.
>> The weights of the Ys?
>> WIRTH: The weights of the As.
>> Okay.
>> WIRTH: So, the problem changes depending on what--did you say you were trying to estimate?
>> No, I'm...
>> WIRTH: So, if you've got these weights that go down very quickly, it's sort of easy, because you can look at only the things with heavy weights. On the other hand, if the weights form a divergent series, then it's as bad as the Mean. And so--I mean, it would be nice to have a general theory here, but the best we could do is something between those cases: the power law, which is the most interesting one, which converges, but doesn't converge as quickly as the Geometric
case. Okay. So, here is the particular flavor of the Chernoff-Hoeffding bound that is most useful to us. And a little aside that I've been giving when presenting this is that, you know, I did an undergraduate statistics degree and yet never learned about Chernoff bounds. I don't know where the [INDISTINCT]. Did you hear about Chernoff bounds?
>> Yes.
>> WIRTH: You did? I came to do my PhD and I was like, "What's this? There's Chernoff's inequality, [INDISTINCT]. What's this?" Chernoff bounds. And I guess, you know, theory people--computer science theory people--are sometimes interested in slightly different questions from the statisticians. Okay. So, I actually went back and looked up the form I wanted. I honestly went to JSTOR and retrieved an old paper from 1950-something [INDISTINCT]. Okay. So, the form I want is this, where you've got these bounded independent variables, which are in this case between zero and one. And they don't actually have to be--I've seen [INDISTINCT] and I don't believe they actually have to be identically distributed; they just have to be bounded and independent in this case. So, you add them together, you look at the sample mean, and the divergence from the expectation is nicely bounded, [INDISTINCT] exponentially. Okay, so what do we do? The As are essentially going to be a probability distribution for us, in that we normalize them, and we do sampling with replacement with unequal probabilities, which we can do quite easily. Okay. In each trial, we choose a particular item with the probability specified by its normalized weight, in some sense. And then, if you do some algebra, you find that with one over epsilon squared, log one over delta--approximately--trials, you can get the right accuracy. Okay. So that was nice, but not very informative so far.
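A minimal sketch of this generic scheme in Python--my code, not the paper's; query_y is a hypothetical callback standing in for asking an assessor to judge item i:

```python
import math
import random

def estimate_weighted_mean(weights, query_y, eps, delta, rng=random):
    """Estimate sum(a_i * y_i) / sum(a_i) by sampling with replacement,
    choosing index i with probability proportional to its weight a_i.
    Each draw is then an unbiased, [0, 1]-bounded estimate of the
    weighted mean, so a Chernoff-Hoeffding bound fixes the trial count."""
    total = sum(weights)
    probs = [a / total for a in weights]
    # Hoeffding: O((1/eps^2) * log(1/delta)) trials suffice; the exact
    # constants here are one standard choice, an assumption of this sketch.
    trials = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
    draws = rng.choices(range(len(weights)), weights=probs, k=trials)
    return sum(query_y(i) for i in draws) / trials
```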
But then we think, "Well, what more can you do with these As?" right? So, that was--yes?
>> Just a comment. It seems like kind of a lot of computation if N is large, computing all of these. I was thinking there'd be something more adaptive, I guess, that [INDISTINCT] different probabilities, and then choosing the Ys with those probabilities.
>> WIRTH: Oh, I see.
>> I was thinking something more like, find the largest--have you got to the power law part yet, or not yet?
>> WIRTH: No. That's just purely generic. This works for any family of weights. It's just showing that even though Weighted Mean seems like a more general form of the Mean, it really is of the same complexity, in some sense.
>> I see. I think I'm just jumping ahead, then.
>> WIRTH: No, that's a reasonable thing to do. Okay. So, maybe some of these As are more important--because of the weights of the As, some of these items are more important than others. Okay. So what we're going to do is say that some of the items are what we call heavy, and we definitely sample those. So maybe we decide that somehow we ought to look at the top 10 documents, or something like that. And then, afterwards, we'll run this generic scheme that I described, except that we'll re-weight everything--we'll renormalize the weights, okay, based on what's left, okay? And this scheme appears in two other instances.
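A sketch of that heavy/light scheme, reusing estimate_weighted_mean from the previous sketch; num_heavy is the knob discussed below, and the epsilon handling is simplified:

```python
def estimate_heavy_light(weights, query_y, eps, delta, num_heavy, rng=random):
    """Query every heavy item outright (the weights are non-increasing,
    so the heavy items form a prefix), then run the generic sampling
    scheme on the light suffix with its weights renormalized."""
    total = sum(weights)
    heavy_part = sum(weights[i] * query_y(i) for i in range(num_heavy))
    light_weights = weights[num_heavy:]
    light_total = sum(light_weights)
    if light_total == 0:
        return heavy_part / total
    # A careful version would enlarge eps here by total / light_total,
    # since the light part's error is scaled down by that same factor.
    light_mean = estimate_weighted_mean(
        light_weights, lambda j: query_y(num_heavy + j), eps, delta, rng)
    return (heavy_part + light_total * light_mean) / total
```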
So, I discovered only fairly recently that people at Stanford in 2006 worked on a somewhat similar problem. And another strange thing--another aside about this paper--is that it hasn't actually appeared anywhere yet. We submitted it somewhere and it got rejected. And, you see, what the guys at AT&T said was that it wasn't fair: because [INDISTINCT] was my advisor, I never got the experience of having a paper rejected.
>> Yeah.
>> WIRTH: And this is an important thing to learn when you're a PhD student. So, this was my first paper rejection. I was shocked. I said, "What do I do?" So I still haven't got over that yet, and we still haven't re-submitted it. But that's also because I need to read that thing [INDISTINCT] to know exactly what [INDISTINCT] from the--okay. So, they have a sort of similar scheme, with some things that are heavier and some things that are lighter, and so forth. Okay. This also appears, actually, in the statistics literature.
So, it took me a while to work out what the relevant statistics literature was [INDISTINCT]. It turns out that "survey sampling" is the keyword you want. They have situations where they want [INDISTINCT] some unbiased estimator for something. And they describe situations with what's called a list-sequential strategy, which is what we're also doing here [INDISTINCT]. But anyway, that's the situation where you walk along each item and say, "Okay, do I look at this? Do I look at this? Yes or no?"--and the items have different probabilities. And they have schemes where some things get probability one at the beginning, and then other things drift down to something lower. So, this has appeared in all these circumstances. Okay. So we query the heavy ones, and we run the generic scheme on the light ones with their weights rescaled. So what do we get? We get h--little h, which is the number of heavy things--plus this one over epsilon squared, log one over delta, appropriately scaled. Okay. And it seems like a good strategy. But what's the appropriate H to use? So, one idea is to say, "Well, maybe the heavy items ought to be all of the terms that have a normalized weight of at least epsilon over two," or something of the order of epsilon anyway, right? Because if there's something with a weight substantially larger than epsilon and you don't sample it, then you're guaranteed to be out--I mean, not guaranteed, but there's no way you can guarantee that you're within epsilon of the right answer, all right? So, okay, you think, well, let's look at those things. And then...
>> Because it's an additive epsilon, right?
>> WIRTH: Because it's an additive error, right. It's not a relative error. Some people have worked on this with relative errors, actually--some of the papers have. Okay, but we want an additive error. So, you have to look at some of these terms, and then somehow, you know, run the generic scheme on everything else. The problem is that, if you look back at this formula here, we might end up with a situation where the second term is significantly larger than the first. And one of the rules of thumb that you learn from doing this kind of research is: if you're adding two things and trying to minimize their sum, it's probably a good idea if the two things are roughly equal. So, somehow, you don't want the second one to dominate the first. So, in fact, what we're going to do is make H larger and thereby [INDISTINCT] contribution [INDISTINCT]. So, we try this with the power law case, for instance. It's a convergent series; we sort everything out, we take H heavy items, and the weight of the light things turns out to be roughly--if you do the integration--roughly H to the one minus P, over P minus one. You just want to think of it as approximately that, okay, assuming that everything works nicely with the N and the H and the P and so forth. And if you do the [INDISTINCT], then H should be roughly epsilon to the power of two over one minus two P. And don't ask me why; it sort of makes sense, but it just pops out, in some sense [INDISTINCT].
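A back-of-envelope version of that balancing, assuming weights a_i proportional to i^(-p) with p > 1 and suppressing constants--my reconstruction, not the slide's algebra:

```latex
% Tail weight of the light items after the top H:
w_L \;\approx\; \sum_{i > H} i^{-p} \;\approx\; \frac{H^{1-p}}{p-1}.
% Estimating the light contribution to within epsilon costs about
% (w_L / \varepsilon)^2 samples, so balancing the two terms gives
H \;\approx\; \Bigl(\frac{H^{1-p}}{\varepsilon}\Bigr)^{2}
\;\Longrightarrow\; H^{2p-1} \approx \varepsilon^{-2}
\;\Longrightarrow\; H \approx \varepsilon^{2/(1-2p)}.
```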
It's an interesting quantity, but it certainly balances the two terms, and maybe it's indeed the right answer. Okay. So we then claim that if you query roughly this many things, multiplied by log one over delta, then you will get a good bound. Now, note that this is lower--this is better--than epsilon to the minus two, right? When P equals one, for instance, what do you get here? You get two over one minus two, which is epsilon to the minus two, all right? And as P increases, this quantity gets smaller. All right, so that seems like a useful thing to do. But the question was: is it, in fact, the right answer? Could we get a matching lower bound? And that actually took quite a while to work out. So--okay. So,
I refer to the paper I mentioned before. Assuming that epsilon and delta behave nicely, and the N-over-four part doesn't dominate, it gives this sort of, I think, precise formula for a lower bound if you want to estimate the Mean, okay? So, we're going to use this result fairly heavily, because we're going to somehow fake the Mean problem as a Weighted Mean problem. Even though--somebody said to me yesterday, "Well, wait a minute, isn't the Weighted Mean problem more general, so the Mean isn't literally included?"--we're going to do a reduction from Mean to specific kinds of Weighted Mean: Weighted Mean with a certain kind of A vector. Okay, so, let's
look at block sensitivity for a moment, which seems to be a fairly powerful tool. Imagine that you take some part of the input--these hidden Ys, say--and some adversary changes all of their values from whatever they are to something else. What could this potentially do to the output? Well, if it could change the output by at least two epsilon, then you know that you have to look at something in that block; otherwise, there's no way you can guarantee that you're within additive error epsilon of the truth, okay? Then you say, "Okay, maybe there are several blocks like this. Here's one part of the input you must look into. And here's another, disjoint part of the input where, if you change its values, the output might change by at least two epsilon," and so forth. And, you know, if you've got these disjoint subsets of the input, you know that you have to look at something in each of them. Okay. Now, here you've got to multiply by one minus two delta--and there's a slight difference because you've got expected query complexity--but, in any case, it's still a reasonably powerful technique: you can multiply the block sensitivity by one minus two delta, and that's a lower bound. Now, if you think about the Weighted Mean problem, at the very least you've got to look at every item whose normalized weight is at least two epsilon, right? Because if its value changed from one to zero, or zero to one, the output could potentially move by that two epsilon.
>> So, it's a block by itself.
>> WIRTH: Yeah, so that's a block by itself. But that's a rather extreme case.
>> So, a block is a part where [INDISTINCT] the worst-case change would cause at least two epsilon?
>> WIRTH: That's right.
>> Okay.
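Stated a little more carefully, as I understand the technique:

```latex
% If B_1, ..., B_k are disjoint blocks of the input such that changing the
% values inside any single B_j can move the output by at least 2*epsilon,
% then an (epsilon, delta)-approximator must read something in each block
% with probability at least 1 - 2*delta, so by linearity
\text{expected queries} \;\ge\; (1 - 2\delta)\,k.
```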
>> WIRTH: Okay. So, that was one tool we're going to use. The other thing I mentioned was that we're going to use some reductions, okay? So, again, there's a slight flaw in this slide: the term Weighted Mean at the top should probably have an accompanying A vector, right? Because Weighted Mean in general is, of course, a more general problem than Mean. But what we're going to do is reduce from the Mean problem--we're going to fake the Mean problem with a Weighted Mean problem with a specific A vector. So, okay, if we can solve this instance of Weighted Mean, that almost solves, or approximately solves, the Mean problem. And then, because the Mean problem is hard, this Weighted Mean problem must also be hard, okay? So what we're going to do is something fairly obvious: we're going to partition the weights in the Weighted Mean problem into some subsets that are roughly equal, okay? If you glue the weights in the Weighted Mean problem together somehow, and the glued sets become almost the same, then somehow you can fake the Mean problem, okay? We're going to associate each subset--and the accompanying Y values--with a particular X value up here. We're going to say, "Whatever this X value is, the corresponding Y values down here are going to be the same. And maybe some of the weights don't actually appear in any subset; we just throw them away, setting their Y to zero." Okay. Now, note that we don't actually copy the X values into the Ys, of course; we're just making references, so that when you go to query a Y, you query the appropriate X. Moreover, every time you look at a new X term, it forces you to look at a new subset and therefore a new Y term; and so any lower bound on the Xs is also going to apply to the Ys. And the other point is that we're not using anything about the Xs to determine these subsets. It's purely the weights and their nature that determine the subsets and, therefore, how hard the accompanying Weighted Mean problem is. Okay.
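The reduction template, in symbols--a sketch, with the subsets S_k being whatever the partitioning scheme produces:

```latex
% Given a Mean instance X = (x_1, ..., x_M), partition (most of) the
% Weighted Mean indices into disjoint subsets S_1, ..., S_M of nearly
% equal total weight, and answer queries by
y_j = x_k \quad \text{for } j \in S_k, \qquad y_j = 0 \ \text{for leftover } j.
% A query to any y_j in S_k is served by querying x_k, so each new subset
% touched costs a new X query, and the Mean lower bound transfers.
```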
So, we're going to choose the subsets so that the Weighted Mean we end up with is very close--within an epsilon--to the Mean of the original problem, and, therefore, anything that is an epsilon approximator for the Weighted Mean is also going to be a two-epsilon approximator for the Mean. So that's how we're going to obtain our lower bound. So we're going to start with a series of warm-up cases, and then we're going to get to the more interesting one, the Power Law Weights. Yes?
>> Is it easy to choose such subsets? That kind of flashed by really quickly.
>> WIRTH: I mean, we'll see. In some cases, yes; in the power law case, it took me a few months, on and off, thinking about how to do it. It turned out to be kind of nice in the end, but I couldn't quite get my head around what one needed to do. So that's the interesting part. I'm leaving that as the dessert of the talk.
>> Okay.
>> WIRTH: Okay. So, let's look at some easy
stuff. So, Geometric Weights. I mean, it seems kind of obvious what you do here, because you've got these very heavy weights at the beginning and then this tail that sort of doesn't matter. And somehow the size of the tail is proportional to, you know, the weight itself--the next weight, let's say--or something like that, right? So, seemingly, the obvious strategy is to sample things down to some point and then just throw the rest away; but you might as well make it rigorous. So, my colleagues published something that says, well, we ought to be using these geometrically decreasing weights because this reflects how humans grade things. It's like, "Oh, do I follow this link? Well, maybe I'll follow the next one, and then this one." So, a human grading the outputs of a retrieval system has a sort of geometrically decreasing interest, in some sense. I mean, that's reasonable. Anyway, if you look at
decreasing interest in some sense. I mean that's reasonable. Anyway, if you look at--if
you were [INDISTINCT] simple lower bound, of course, you must look at every Bi at least
two epsilon and you do some slightly annoying things and you end up with something that's
approximately log two epsilon over log Q, and log Q is equals negative. So you essentially
got log of one over epsilon, okay? Now, remember that epsilon is small, so one over epsilon
is large. And so log of one over epsilon is a lot better than whatever is on squared,
which is our generic lower bound, okay? And in particular, it's also an obvious upper
bound because you can just, at some point, throw away the tail, right, so if you look
at the tail terms and if the total weight is at most epsilon over two, so [INDISTINCT]
kind of affect the answer very much fairly enough to look there. And somehow that--the
position of where that tail starts to--where the sort of lower bound in some sense--position
ends as a winner, the constant determined by queue, right? So, this is obviously--these
upper and lower bounds match to each other in a constant. So instead of epsilon--one
over epsilon squared, we've got log of one over epsilon being appropriate number of samples
we need to take. And in particularly in the geometric case, you just look at the stuff--at
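In symbols, for weights a_i = q^i with 0 < q < 1--my parameterization of the geometric family:

```latex
% The tail after the first T items has total weight
\sum_{i > T} q^{i} \;=\; \frac{q^{T+1}}{1-q} \;\le\; \frac{\varepsilon}{2}
\quad\text{once}\quad
T \;\gtrsim\; \frac{\log(1/\varepsilon)}{\log(1/q)},
% so querying the first O(log(1/epsilon)) items and ignoring the tail
% already keeps the truncation error below epsilon/2.
```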
Not a deep result, maybe, but it's worth stating. Okay. So, something a little bit more interesting is: what happens when the weights diverge? Okay. And it turns out in this case that the problem is as hard as the Mean problem. And so, what do you do? Well, you're going to fake the Mean problem by collecting together blocks of weights
that are roughly the same. In the divergent-series case--and we're assuming that N is really, really big here, okay--we can do this because we know we'll never run out of weight, since the weight series diverges. And we also know that the weights themselves are bounded between zero and one. So, by taking lots and lots of weights, we can get these clumps that are roughly equal. What we do is keep taking weights until we've got a sum that's somewhere between two over epsilon and two over epsilon plus one. We know that we can do this, okay, because of the bound and the fact that the series diverges. So we carve that off, and then we take another chunk of weight between two over epsilon and two over epsilon plus one, and so forth. And if you add things together, and use the one-over-twenty-four or whatever the constant is, and you get exactly that many subsets, then you can estimate the Mean within epsilon over two--sorry, [INDISTINCT] we're doing epsilon over two plus epsilon over two for the Mean, and so forth. So it's not too hard to fake the Mean problem here, because you're just taking successive chunks. So, in fact, here, choosing the subsets is quite easy.
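The grouping, spelled out--a sketch of how I read it:

```latex
% Walk along the weights, closing off a subset as soon as its running
% total reaches 2/epsilon. Since every a_i <= 1, each subset S_k satisfies
\frac{2}{\varepsilon} \;\le\; \sum_{i \in S_k} a_i \;\le\; \frac{2}{\varepsilon} + 1,
% and since the series diverges we never run out of weight; the subsets
% are equal up to a 1 + O(epsilon) factor, so the Mean bound carries over.
```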
Okay. Now, the thing that was a bit more interesting is: what do you do in the power law case? So, we thought, okay, well, let's start with block sensitivity, which was the first tool. You group together contiguous sequences. So you start with the weights that are individually really big--at least two epsilon--and you treat those as blocks by themselves. Then, afterwards, you form contiguous blocks of weights that add up to at least two epsilon.
>> Okay. Can I go back one slide?
>> WIRTH: Sure.
>> If I am reading the results correctly, you said for every fixed delta--so you fix delta and then reduce...
>> WIRTH: Yes.
>> ...the Mean to it, but you don't obtain a lower bound in terms of delta or [INDISTINCT].
>> WIRTH: The delta is not the particularly interesting part here; it's the epsilon that's the sort of...
>> I can believe that. I was just confirming it.
>> WIRTH: Yes, yes, yes. So, obviously, you fix the delta and then it's the epsilon that you play with, right? So, if the gap between the Mean and the Weighted Mean is epsilon over two, and you've estimated the Weighted Mean within epsilon over two, then you've estimated the Mean within epsilon. You know that's hard, so therefore you know this other thing must be hard.
>> And delta is fixed. Yes. Okay. Sure, okay.
>> WIRTH: So--all right. So, in this case, recall, we've got these very heavy weights; we have to look at each of those items individually. And then we form blocks of contiguous weights that add up to at least two epsilon. Now, note that those blocks are going to have weight between two epsilon and four epsilon, because by that stage we know that all the remaining weights are less than two epsilon, right? So we don't overshoot by too much, in some sense, okay? This gives us a bound on the block sensitivity: the number of very heavy terms, plus the remaining weight divided by four epsilon, which is the number of blocks you obtain. But unfortunately, if you do this--I've left out the delta part, the expectation, and so forth--the important point is that you end up with a bound that's something like epsilon to the minus one over P, which isn't quite what we hoped for.
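A rough account of why the count tops out there, again assuming a_i proportional to i^(-p):

```latex
% An item is a block by itself when its normalized weight is about
% epsilon or more, i.e. roughly when
i^{-p} \gtrsim \varepsilon \;\Longleftrightarrow\; i \lesssim \varepsilon^{-1/p},
% and the weight remaining after those items is about
% (\varepsilon^{-1/p})^{1-p} = \varepsilon^{(p-1)/p}, so the contiguous
% light blocks contribute about
\varepsilon^{(p-1)/p} / \varepsilon \;=\; \varepsilon^{-1/p} \ \text{more blocks,}
% giving roughly \varepsilon^{-1/p} blocks in total.
```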
We wanted to get epsilon to the two over one minus two P--assuming that what we derived before is correct, and I believed it was; I know now that it is--and we didn't quite get that. Okay, so what then? I spent some time thinking, "Well, maybe I can use some property of these power law terms." I must have been naive or something: I thought that maybe I could use something about the properties of these sequences to add certain terms together so they'd end up being about the same--and that was no good. And, of course, one does not actually need to construct this thing; one just needs to describe a method for constructing equal-weight subsets, even if one doesn't carry it out explicitly, all right? And it took me a while to appreciate that point. And then I started thinking, "Well, what is this? What does this look like?"--oh, sorry, I've missed it on this slide, but what does this look like?
I'll say it in a moment. So, just a little bit more precisely, what we want to construct are subsets of weights whose totals are somewhere between rho and rho minus epsilon over two M. So, we've got these M subsets, and we want them to be almost equal. And then it means that--oh, I see, I can sort of draw this. Okay. So it then means that the Weighted Mean we obtain is within epsilon over two of rho multiplied by the total sum of the Xs, okay? And then, if we do the appropriate division, we're getting, with probability one minus delta, an approximation to the Mean of about epsilon over M times rho, okay? There are really too many formulas on this slide, actually. Okay. [INDISTINCT] it's a bit old. Okay.
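The error chain on that slide, as I read it:

```latex
% If each of the M subsets S_k has normalized weight in the window
\rho - \frac{\varepsilon}{2M} \;\le\; w(S_k) \;\le\; \rho,
% then, since each x_k lies in [0,1],
\Bigl|\,W(A,Y) - \rho \sum_{k=1}^{M} x_k\,\Bigr|
\;\le\; M \cdot \frac{\varepsilon}{2M} \;=\; \frac{\varepsilon}{2},
% so an (epsilon/2)-estimate of this Weighted Mean is within epsilon of
% rho times the sum of the x_k; dividing by rho*M estimates the Mean of X
% within epsilon/(rho*M), which is O(epsilon) whenever the subsets
% capture a constant fraction of the total weight.
```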
So, what's the interesting thing? The interesting thing is: how do you actually arrange these weights--collect the weights together--to be roughly equal? And I remember thinking to myself, "Well, maybe it's something like bin packing." And there was a conference deadline I was aiming for with this, and I still couldn't work out what to do--maybe if I told myself I'd make the deadline, then I could do it--and it sort of went by. [INDISTINCT] Then I went to my father and said, "What is this?" And so the reason he got brought into this was through a discussion with him. He said, "No, I don't think that's actually what you want to be doing. What you really want to be doing is a scheduling kind of problem." Oh, okay. And we talked about what might be good, and it turns out--if you look at the last line of this--that the thing that inspired us was the problem of maximizing the minimum completion time. So, you've got some jobs and you've got M machines; you assign these jobs to the machines, and you want the machines to all finish at around the same time. So, we're going to maximize the finishing time of the machine that finishes first, in some sense--we're going to make that as late as possible--and then somehow that will mean that everybody finishes close to everybody else. That's a rather strange objective, and there are papers on it. I talked to somebody [INDISTINCT] about this, and he's one of the people who worked on it. He said it has something to do with airplane construction--we never finished our discussion, for some reason. I'll get to the bottom of why he cared one day; I cared for a different reason: I was just trying to make these equal subsets. And it turns out that this longest processing time strategy is a good thing. So, you sort the jobs in non-increasing order of duration, and, as you proceed in time, you assign the next job to the first machine that becomes available. So you sort them and then assign the jobs, okay? And this is exactly what we've got, because we've got these As in decreasing order, and somehow we want to assign them to these subsets so they end up being almost the same at the end.
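A sketch of that LPT strategy in Python--the standard greedy formulation, my code; here the "jobs" are the weights a_i and the "machines" are the subsets:

```python
import heapq

def lpt_partition(jobs, num_machines):
    """Longest Processing Time heuristic: sort jobs in non-increasing
    order, then repeatedly give the next job to the machine with the
    smallest load so far, i.e. the first machine to become available."""
    loads = [(0.0, m) for m in range(num_machines)]  # (current load, id)
    heapq.heapify(loads)
    subsets = [[] for _ in range(num_machines)]
    for job in sorted(jobs, reverse=True):
        load, m = heapq.heappop(loads)  # least-loaded machine so far
        subsets[m].append(job)
        heapq.heappush(loads, (load + job, m))
    return subsets
```

For the talk's purpose one would feed in the power-law weights and read off M near-equal subsets.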
>> Can I add a comment?
>> WIRTH: Yes.
>> In the scheduling problem, don't you want to minimize the maximum, not maximize the minimum?
>> WIRTH: Well, it depends. Normally one does want to do that, yes. But this is an unusual scheduling problem in which we want to maximize the minimum.
>> What is your reason for this one?
>> If you put too much weight on one of the buckets, that's okay.
>> WIRTH: Well, no. I'm just saying that there are many scheduling problems around, and this is a somewhat quirky one that only a handful of people have looked at. And it so happens that it relates to what I want to do. But, you know, it was just fortunate that someone had thought of this before.
>> So is this the reason why minimizing the maximum would not be good for you?
>> WIRTH: Um, well, it doesn't necessarily make things equal, right? Though I see what you mean [INDISTINCT]. So, you could bound the maximum and then--I guess it's sort of the same thing, isn't it? I don't know. I mean, I don't know why I liked looking at this rather than that; you'd have to have asked me three years ago. That's [INDISTINCT] a good point. Actually, I don't know. I have this feeling that maybe you could still do that and somehow end up with something very uneven. Perhaps not. Anyway, there
time heuristic, and with all, maybe this helps us. So what do we do here? So, we have a--we
have these jobs and what we're concerned about is the tail, okay? So, imagine we have a job
that has a very large title after it. So large that if we add up all the jobs of [INDISTINCT]
weights afterwards and the tail, and divide it by M minus one--divide the number of machine
minus 1. If that's larger than the duration of the current job then somehow we know that
the difference of the machine completion times is at most the duration of the next job. Why
do I say that? Okay, so you'll see why that's true in a moment.
>> Otherwise the [INDISTINCT] there in that gap?
>> WIRTH: Well, there's sort of not enough time to finish everything otherwise, basically. So--okay, here's what happens. Consider the last job that finishes. When that job started, all the machines were busy--or perhaps it was the very first job, okay? Now, imagine you've got this large tail of jobs that you still have to do, and that large tail, divided by M minus one, is greater than the duration of the current job. Okay, so we're looking at this last job to finish: here's this job l, the last job to finish, and we've got this large tail of stuff, which we're going to assign to all the other machines while this one is working. But the tail is so large that there's not enough machine capacity, in the time within which everything is supposed to finish, to do it all: there are M minus one other machines, and we've only got time Tl, if you like--time Tl on each of M minus one machines--to finish everything else. So, if indeed the tail divided by M minus one were larger than Tl, then there would not be enough machine capacity to finish everything; in other words, this job l could not in fact be the last job. And so, something can only be the last job when its tail is sufficiently small; and this capital N was kind of an indication of which jobs had very large tails, okay?
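The argument in one line--my paraphrase:

```latex
% Let job l, of duration T_l, be the last to finish, and let tail(l) be
% the total duration of the jobs after l in sorted order. When l starts,
% all M machines are busy; if we had
\frac{\mathrm{tail}(l)}{M-1} \;>\; T_l,
% the other M-1 machines could not absorb the whole tail before l
% finishes, contradicting l being last. So whenever the tail is that
% large, the machines' finishing times differ by at most the duration
% of the next job in the order.
```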
So, this is just one final slide--almost the final slide, in some sense. So, what is this meant to--look at, this is good. So, using this LPT, I can now place a bound on the finishing times and guarantee that--knowing some properties about the tail and so forth--I can say what the difference in finishing times between these machines is; in other words, what the error between the subsets is. And we wanted to get epsilon to the two over one minus two P, okay? And we know that at some stage--at least up to some capital N--the tail is large enough. So we set the R, the M, and the rho--anyway, this is a somewhat confusing slide--we set them to do almost what we want; we can't quite get it right. We can estimate the size of the tail, because it's just a sort of [INDISTINCT] and it's a power law, okay? And when we set everything up, we find that we almost get what we want. We do, in fact, get a lower bound of epsilon to the two over one minus two P, but unfortunately we also have to divide by P squared, which, depending on how you look at it, may or may not be important, okay? So, you can in fact fake the Mean problem with the Weighted Mean problem by applying this LPT heuristic and getting all of the subsets right. These are somewhat crude estimates, right, based on integrals, so it's probably the crudeness of the estimates that's causing the P squared to be in there, in some sense. Okay.
So, I guess one thing I'm interested in is what to do next. One thing is that I'm actually looking at something more pragmatic--so, this was a somewhat abstract thing that I'm not sure the retrieval people would care about in depth. I'm trying to look at some estimation techniques that minimize the variance of these estimates: unbiased estimators, and the more direct [INDISTINCT] to do in that case. The other two things are--well, okay, we've got divergent weights, we've got very convergent weights, and we've got this power law that is something in between, but maybe this doesn't actually--you know, it would be good to capture something more generic. Like, why is it that the mode changes from log one over epsilon, then to this epsilon to the two over one minus two P region, and then to epsilon to the minus two? And how sensitive is it to slight changes in the weights--because some of these series look somewhat similar. What's happening, and why? You'd need to do something more generic. And, of course, the third point--the most interesting question--is when we actually compare multiple systems at once, right? This is where you need to consider [INDISTINCT], not just measure the performance of one system. So, of course, you've got lots of common documents, and they appear with different weights according to the different systems--one system may give a document a very high weight, another a very low weight, and so forth. And maybe what we want to do is minimize the sum of the variances of all these things, or--it's not clear what you want to do. One thought I had about the right thing to aim for is that what you're ultimately producing is a ranking of the systems. So, you want to be able to say, "With high probability, we got the ranking correct." And that's ultimately the question: how many things do we need to sample so that, with 95% probability, we actually got the ranking of these systems right? So, I just want to thank those people for hosting me, although
no one's actually there--I know there's no one actually there anymore. And my colleagues: Alistair Moffat and Justin Zobel, and William Webber, who is a PhD student of Alistair Moffat. And just one last little story: William and I used to see each other at the department for a year and not talk to each other--just grunt at each other, as males often do--and in fact the thing that got us started talking, which then led to some of this, was that I discovered he went to the same school as me, and Lawrence, for that matter, too. And suddenly that made us start talking. It's a very noble story, I think. Okay.