>> WIRTH: This is some work that I did. So, you see, the slide gives Venkat Guruswami's affiliation as University of Washington, which he no longer is at; he's at Carnegie Mellon now. But he was there when I did this work with him, and with Amit Chakrabarti, who was visiting there from Dartmouth. And then I went home and continued working on this, and eventually worked my father into it. So, that's why he is, sort of, another author on the end. Okay. Now, I should probably start with a little bit of motivation before the definition. The motivation is that there was a flurry of interest in evaluating search engines--retrieval systems and that kind of thing--in the information retrieval community maybe five or seven or eight years ago. I think it's still of interest, but not at quite the level of papers that we saw at that stage. And one thing that happened was that some people said, "Well, maybe the metrics people are using to evaluate these systems are not right and we should try something different," which also inspired me to think, well, maybe I could work on these sorts of questions theoretically. Okay. So, the
question that we're interested in is this: we're given a vector A of weights between zero and one, okay? [INDISTINCT]. And these are non-increasing weights, okay? And they are known to us in advance--that's very important. But what's not known are some other quantities, Ys, between zero and one, which somehow represent something like the relevance of something to a search query. So, Yi represents whether or not the i-th item returned by this retrieval system is in fact relevant to our query. And what we want to compute is the score. I'm interested in linear scores, which is to say the weighted average of these things. Now, why weighted? Because you care much more about whether the first few things the system returns are relevant than whether the 100th or the 300th is [INDISTINCT]. And this makes the problem of estimation inherently, actually, easier--or at least, depending on the weight family, easier--than if it were just, sort of, a mean; than if we just said, "Well, let's look at how many of the documents returned are relevant." That's actually a [INDISTINCT] problem, as I'll show you, compared with the Weighted Mean, okay? So, what we want to do is come up with a good estimate for this Weighted Mean, and we want to look at as few Y values as possible, because those represent asking somebody to actually read a document and assess whether or not it's relevant, and that's expensive. So, it's sort of a query complexity question. And what we're going to do is produce a good estimate with the...
>> N and A are given.
>> WIRTH: N and A are given, indeed. Although we're going to be doing some sort of asymptotic analysis, so once N is given, we're going to imagine it growing, in some sense.
>> But it's not increasing as queries are issued?
>> WIRTH: No. Okay.
>> Are you going to give a more precise definition of the model later?
>> WIRTH: Not necessarily.
>> So, in that case, can I ask you a question?
>> WIRTH: Yes.
>> So, do you get to choose which of the Ys you want to look at?
>> WIRTH: Yes.
>> Or is it coming from some...?
>> WIRTH: No, no, no. So, you're told the As and the N, there are all these hidden Ys, and you can choose whichever ones you want to look at.
>> Okay.
>> WIRTH: And it's unit cost to look at each Y.
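To pin down the object being estimated--this formalization is my reading of the setup as just described, not a slide from the talk:

```latex
% Known in advance: non-increasing weights a_1 >= a_2 >= ... >= a_N, each in [0,1].
% Hidden: relevance values y_1, ..., y_N in [0,1], each revealed only by a
% unit-cost query. The score to be estimated is the weighted mean
W(A,Y) \;=\; \frac{\sum_{i=1}^{N} a_i \, y_i}{\sum_{i=1}^{N} a_i}.
```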
>> And your notion of approximation--what is it, exactly?
>> WIRTH: I'll explain that in a moment, yeah, sure. That, I will define really carefully.
Okay. So, I've sort of already spoken a little bit about why we're doing this, but I thought that query complexity was a natural thing to look at, and I have papers from years ago on communication complexity, which isn't terribly far away. So, I thought, why not pair it with something that worked in the past? And one thing I should point out is this issue of the metric: what I've been implying is a sort of linear metric here. But that's not necessarily what the community uses. They often use things like Mean Average Precision, which is a heavily quadratic kind of quantity, where you multiply some of the Yis by other Yjs. I mean, they've got other justifications for it, but it also means it turns out to be a quadratic metric. So, much of the research you see is in trying to get good sampling strategies for these slightly more complicated metrics. But I thought, "Well, we don't really understand the linear ones yet. Let's first try and get some better appreciation of those." Okay.
So, you could imagine a kind of decision tree--a randomized decision tree--so you can make choices at all the nodes of the tree as to which item you're going to look at; you've got some coins or dice or whatever random generator, and you say, "Okay, I want to look at this document; I want to look at that one," and so forth. At a leaf of the tree, you say, "Okay, I believe that the Weighted Mean is this." So, we've got this randomized decision tree, and we accept that it might fail with probability delta, and we also accept that we might not get exactly the right answer: we're going to measure a relative--not relative--an absolute error between the estimate we obtain and the actual value, okay? So, we're going to say it's an epsilon-delta approximator if, with probability one minus delta, we reach a leaf whose absolute difference from the true value is at most epsilon. So, one of the bugs in my slide is on the next line, where it says these people showed that this many samples suffice to estimate the Mean. It should read that you need that many to estimate the Mean. I don't know why it says suffice, but, anyway. So, there's a lower bound, which is one over epsilon squared, log one over delta, which is something that people have seen before in different contexts, okay?
>> Is there some bound on the Ys?
>> WIRTH: Sorry?
>> Is there some bound on the values of the Ys?
>> WIRTH: Oh, yes. So, the Ys--everything is constrained to be between zero and one.
>> Zero and one, okay.
>> WIRTH: So, you could imagine that in some instances they are strictly zero or one. But you can also say that it won't hurt you to allow a half, or a little of this, in a sense.
>> And one other thing: what's the probability underlying this random process?
>> WIRTH: So, it's up to you. You can choose to sample this or sample that. So you can imagine a sort of decision tree with probabilities at the nodes of the tree.
>> I thought the Ys are drawn from some distribution.
>> WIRTH: No, no. The As and Ys are fixed. So, I'll just go back--I mean, one more thing--okay, so I'll just retreat a little. So the A vector is given to you, N is given to you, the Yis are given but they're hidden, and it's up to you to choose which Yis you want to look at. And, in a sense, the probability is entirely to do with your choice of samples, your choice of the Ys that you'll look at.
>> So, you're choosing them randomly?
>> WIRTH: That's right.
>> I see. Okay.
>> WIRTH: So that you have a relatively high--like one minus delta--chance of succeeding.
>> Thanks.
>> WIRTH: And then there's the epsilon, which is the error you're allowed, which is very small.
>> Okay. Thanks. Sure.
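For reference, the guarantee under discussion can be stated as follows; the lower-bound form is how I understand the Bar-Yossef result mentioned above:

```latex
% A randomized strategy is an (epsilon, delta)-approximator if, on every
% fixed input Y, its output \hat{W} satisfies
\Pr\bigl[\,|\hat{W} - W(A,Y)| \le \varepsilon\,\bigr] \;\ge\; 1 - \delta,
% and for the plain Mean (all weights equal) the quoted lower bound is
\Omega\!\left(\tfrac{1}{\varepsilon^{2}}\log\tfrac{1}{\delta}\right) \text{ queries.}
```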
>> WIRTH: Now, it's important to get this right. And I admit that each time I present this, I get slightly different questions about the model, so sometimes it's good to stop, right, and make sure that people are, you know, with me. Okay, great. So, the complexity of the Mean question is known, okay? But, as I said, it might be that because some of these documents are more important than others, somehow we can afford not to look at quite so many things as if we were just estimating the Mean. Okay.
Now, at this point let me describe some results of Bar-Yossef. He's a sort of Google person, but he's not here. He described two things in the one paper. One is block sensitivity, which says, "Okay, there are certain parts of the input that we have to look at"--and I'll explain more about that later. And they also talked about another technique for getting a lower bound on the Mean, which [INDISTINCT]. Okay.
So, let me show you what we did. So, okay, the Weighted Mean problem, in general, is a more general problem than the Mean, and if you don't assume anything specific about the weights, then we show [INDISTINCT] in the same number of queries you can match the lower bounds for the Mean. But that's not particularly interesting, because maybe there's something special about the family of the weight series. So we looked at three different weight families. One is Geometric, which is inspired by some work from my colleagues, who believe that you ought to have geometrically decreasing weights to evaluate retrieval systems. Another, peculiar, case is where the weights sort of diverge. So, you could imagine that N is really, really large and these As form a divergent series. And that turns out to be as bad as the Mean case. I mean, the Mean is included in the divergent-series case, because its weights are just one, one, one, one, repeated: the sum of a constant, repeated.
>> The weights of the Ys?
>> WIRTH: The weights of the As.
>> Okay.
>> WIRTH: So, the problem changes depending on what--did you say you were trying to estimate?
>> No, I'm...
>> WIRTH: So, if you've got these weights that go down very quickly, it's sort of easy, because you can look at only the things with heavy weights. On the other hand, if the weights form a divergent series, then it's as bad as the Mean. And so--I mean, it would be nice to have a general theory here, but the best we could do is something between those cases: the power law, which is the most interesting one, which converges, but doesn't converge as quickly as the Geometric
case. Okay. So, here is the particular flavor of the Chernoff-Hoeffding bound that is most useful to us. And a little aside that I've been giving when presenting this is that, you know, I did an undergraduate statistics degree and yet never learned about Chernoff bounds. I don't know where the [INDISTINCT]. Did you hear about Chernoff bounds?
>> Yes.
>> WIRTH: You did? I came to do my PhD and I was like, "What's this? There's Chernoff's inequality, [INDISTINCT]. What's this?" Chernoff bounds. And I guess, you know, theory people--computer science theory people--are sometimes interested in slightly different questions from the statisticians. Okay. So, I actually went back and looked up the form I wanted. I honestly went to JSTOR and retrieved an old paper from 1950-something [INDISTINCT]. Okay. So, the form I want is this, where you've got these bounded independent variables, which are in this case between zero and one. And they don't actually have to be--I've seen [INDISTINCT] and I don't believe they actually have to be identically distributed; they just have to be bounded and independent in this case. So, you add them together, you look at the sample mean, and the divergence from the expectation is nicely bounded, [INDISTINCT] exponentially. Okay, so what do we do? The As are essentially going to be a probability distribution for us, in that we normalize them, and we do sampling with replacement with unequal probabilities, which we can do quite easily. Okay. In each trial, we choose a particular item with the probability specified by its normalized weight, in some sense. And then, if you do some algebra, you find that with one over epsilon squared, log one over delta--approximately--trials, you can get the right accuracy. Okay. So that was nice, but not very informative so far.
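A minimal sketch of this generic scheme in Python--my code, not the paper's; query_y is a hypothetical callback standing in for asking an assessor to judge item i:

```python
import math
import random

def estimate_weighted_mean(weights, query_y, eps, delta, rng=random):
    """Estimate sum(a_i * y_i) / sum(a_i) by sampling with replacement,
    choosing index i with probability proportional to its weight a_i.
    Each draw is then an unbiased, [0, 1]-bounded estimate of the
    weighted mean, so a Chernoff-Hoeffding bound fixes the trial count."""
    total = sum(weights)
    probs = [a / total for a in weights]
    # Hoeffding: O((1/eps^2) * log(1/delta)) trials suffice; the exact
    # constants here are one standard choice, an assumption of this sketch.
    trials = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
    draws = rng.choices(range(len(weights)), weights=probs, k=trials)
    return sum(query_y(i) for i in draws) / trials
```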
But then we think, "Well, what more can you do with these As?" right? So, that was--yes?
>> Just a comment. It seems like kind of a lot of computation if N is large, computing all of these. I was thinking there'd be something more adaptive, I guess, that [INDISTINCT] different probabilities, and then choosing the Ys with those probabilities.
>> WIRTH: Oh, I see.
>> I was thinking something more like, find the largest--have you got to the power law part yet, or not yet?
>> WIRTH: No. That's just purely generic. This works for any family of weights. It's just showing that even though Weighted Mean seems like a more general form of the Mean, it really is of the same complexity, in some sense.
>> I see. I think I'm just jumping ahead, then.
>> WIRTH: No, that's a reasonable thing to do. Okay. So, maybe some of these As are more important--because of the weights of the As, some of these items are more important than others. Okay. So what we're going to do is say that some of the items are what we call heavy, and we definitely sample those. So maybe we decide that somehow we ought to look at the top 10 documents, or something like that. And then, afterwards, we'll run this generic scheme that I described, except that we'll re-weight everything--we'll renormalize the weights, okay, based on what's left, okay? And this scheme appears in two other instances.
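A sketch of that heavy/light scheme, reusing estimate_weighted_mean from the previous sketch; num_heavy is the knob discussed below, and the epsilon handling is simplified:

```python
def estimate_heavy_light(weights, query_y, eps, delta, num_heavy, rng=random):
    """Query every heavy item outright (the weights are non-increasing,
    so the heavy items form a prefix), then run the generic sampling
    scheme on the light suffix with its weights renormalized."""
    total = sum(weights)
    heavy_part = sum(weights[i] * query_y(i) for i in range(num_heavy))
    light_weights = weights[num_heavy:]
    light_total = sum(light_weights)
    if light_total == 0:
        return heavy_part / total
    # A careful version would enlarge eps here by total / light_total,
    # since the light part's error is scaled down by that same factor.
    light_mean = estimate_weighted_mean(
        light_weights, lambda j: query_y(num_heavy + j), eps, delta, rng)
    return (heavy_part + light_total * light_mean) / total
```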
So, I discovered only fairly recently that people at Stanford in 2006 worked on a somewhat similar problem. And another strange thing--another aside about this paper--is that it hasn't actually appeared anywhere yet. We submitted it somewhere and it got rejected. And, you see, what the guys at AT&T said was that it wasn't fair: because [INDISTINCT] was my advisor, I never got the experience of having a paper rejected.
>> Yeah.
>> WIRTH: And this is an important thing to learn when you're a PhD student. So, this was my first paper rejection. I was shocked. I said, "What do I do?" So I still haven't got over that yet, and we still haven't re-submitted it. But that's also because I need to read that thing [INDISTINCT] to know exactly what [INDISTINCT] from the--okay. So, they have a sort of similar scheme, with some things that are heavier and some things that are lighter, and so forth. Okay. This also appears, actually, in the statistics literature.
So, it took me a while to work out what the relevant statistics literature was [INDISTINCT]. It turns out that "survey sampling" is the keyword you want. They have situations where they want [INDISTINCT] some unbiased estimator for something. And they describe situations with what's called a list-sequential strategy, which is what we're also doing here [INDISTINCT]. But anyway, that's the situation where you walk along each item and say, "Okay, do I look at this? Do I look at this? Yes or no?"--and the items have different probabilities. And they have schemes where some things get probability one at the beginning, and then other things drift down to something lower. So, this has appeared in all these circumstances. Okay. So we query the heavy ones, and we run the generic scheme on the light ones with their weights rescaled. So what do we get? We get h--little h, which is the number of heavy things--plus this one over epsilon squared, log one over delta, appropriately scaled. Okay. And it seems like a good strategy. But what's the appropriate H to use? So, one idea is to say, "Well, maybe the heavy items ought to be all of the terms that have a normalized weight of at least epsilon over two," or something of the order of epsilon anyway, right? Because if there's something with a weight substantially larger than epsilon and you don't sample it, then you're guaranteed to be out--I mean, not guaranteed, but there's no way you can guarantee that you're within epsilon of the right answer, all right? So, okay, you think, well, let's look at those things. And then...
>> Because it's an additive epsilon, right?
>> WIRTH: Because it's an additive error, right. It's not a relative error. Some people have worked on this with relative errors, actually--some of the papers have. Okay, but we want an additive error. So, you have to look at some of these terms, and then somehow, you know, run the generic scheme on everything else. The problem is that, if you look back at this formula here, we might end up with a situation where the second term is significantly larger than the first. And one of the rules of thumb that you learn from doing this kind of research is: if you're adding two things and trying to minimize their sum, it's probably a good idea if the two things are roughly equal. So, somehow, you don't want the second one to dominate the first. So, in fact, what we're going to do is make H larger and thereby [INDISTINCT] contribution [INDISTINCT]. So, we try this with the power law case, for instance. It's a convergent series; we sort everything out, we take H heavy items, and the weight of the light things turns out to be roughly--if you do the integration--roughly H to the one minus P, over P minus one. You just want to think of it as approximately that, okay, assuming that everything works nicely with the N and the H and the P and so forth. And if you do the [INDISTINCT], then H should be roughly epsilon to the power of two over one minus two P. And don't ask me why; it sort of makes sense, but it just pops out, in some sense [INDISTINCT].
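A back-of-envelope version of that balancing, assuming weights a_i proportional to i^(-p) with p > 1 and suppressing constants--my reconstruction, not the slide's algebra:

```latex
% Tail weight of the light items after the top H:
w_L \;\approx\; \sum_{i > H} i^{-p} \;\approx\; \frac{H^{1-p}}{p-1}.
% Estimating the light contribution to within epsilon costs about
% (w_L / \varepsilon)^2 samples, so balancing the two terms gives
H \;\approx\; \Bigl(\frac{H^{1-p}}{\varepsilon}\Bigr)^{2}
\;\Longrightarrow\; H^{2p-1} \approx \varepsilon^{-2}
\;\Longrightarrow\; H \approx \varepsilon^{2/(1-2p)}.
```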
It's an interesting quantity, but it certainly balances the two terms, and maybe it's indeed the right answer. Okay. So we then claim that if you query roughly this many things, multiplied by log one over delta, then you will get a good bound. Now, note that this is lower--this is better--than epsilon to the minus two, right? When P equals one, for instance, what do you get here? You get two over one minus two, which is epsilon to the minus two, all right? And as P increases, this quantity gets smaller. All right, so that seems like a useful thing to do. But the question was: is it, in fact, the right answer? Could we get a matching lower bound? And that actually took quite a while to work out. So--okay. So,
I refer to the paper I mentioned before. Assuming that epsilon and delta behave nicely, and the N-over-four part doesn't dominate, it gives this sort of, I think, precise formula for a lower bound if you want to estimate the Mean, okay? So, we're going to use this result fairly heavily, because we're going to somehow fake the Mean problem as a Weighted Mean problem. Even though--somebody said to me yesterday, "Well, wait a minute, isn't the Weighted Mean problem more general, so the Mean isn't literally included?"--we're going to do a reduction from Mean to specific kinds of Weighted Mean: Weighted Mean with a certain kind of A vector. Okay, so, let's
look at block sensitivity for a moment, which seems to be a fairly powerful tool. Imagine that you take some part of the input--these hidden Ys, say--and some adversary changes all of their values from whatever they are to something else. What could this potentially do to the output? Well, if it could change the output by at least two epsilon, then you know that you have to look at something in that block; otherwise, there's no way you can guarantee that you're within additive error epsilon of the truth, okay? Then you say, "Okay, maybe there are several blocks like this. Here's one part of the input you must look into. And here's another, disjoint part of the input where, if you change its values, the output might change by at least two epsilon," and so forth. And, you know, if you've got these disjoint subsets of the input, you know that you have to look at something in each of them. Okay. Now, here you've got to multiply by one minus two delta--and there's a slight difference because you've got expected query complexity--but, in any case, it's still a reasonably powerful technique: you can multiply the block sensitivity by one minus two delta, and that's a lower bound. Now, if you think about the Weighted Mean problem, at the very least you've got to look at every item whose normalized weight is at least two epsilon, right? Because if its value changed from one to zero, or zero to one, the output could potentially move by that two epsilon.
>> So, it's a block by itself.
>> WIRTH: Yeah, so that's a block by itself. But that's a rather extreme case.
>> So, a block is a part where [INDISTINCT] the worst-case change would cause at least two epsilon?
>> WIRTH: That's right.
>> Okay.
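Stated a little more carefully, as I understand the technique:

```latex
% If B_1, ..., B_k are disjoint blocks of the input such that changing the
% values inside any single B_j can move the output by at least 2*epsilon,
% then an (epsilon, delta)-approximator must read something in each block
% with probability at least 1 - 2*delta, so by linearity
\text{expected queries} \;\ge\; (1 - 2\delta)\,k.
```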
>> WIRTH: Okay. So, that was one tool we're going to use. The other thing I mentioned was that we're going to use some reductions, okay? So, again, there's a slight flaw in this slide: the term Weighted Mean at the top should probably have an accompanying A vector, right? Because Weighted Mean in general is, of course, a more general problem than Mean. But what we're going to do is reduce from the Mean problem--we're going to fake the Mean problem with a Weighted Mean problem with a specific A vector. So, okay, if we can solve this instance of Weighted Mean, that almost solves, or approximately solves, the Mean problem. And then, because the Mean problem is hard, this Weighted Mean problem must also be hard, okay? So what we're going to do is something fairly obvious: we're going to partition the weights in the Weighted Mean problem into some subsets that are roughly equal, okay? If you glue the weights in the Weighted Mean problem together somehow, and the glued sets become almost the same, then somehow you can fake the Mean problem, okay? We're going to associate each subset--and the accompanying Y values--with a particular X value up here. We're going to say, "Whatever this X value is, the corresponding Y values down here are going to be the same. And maybe some of the weights don't actually appear in any subset; we just throw them away, setting their Y to zero." Okay. Now, note that we don't actually copy the X values into the Ys, of course; we're just making references, so that when you go to query a Y, you query the appropriate X. Moreover, every time you look at a new X term, it forces you to look at a new subset and therefore a new Y term; and so any lower bound on the Xs is also going to apply to the Ys. And the other point is that we're not using anything about the Xs to determine these subsets. It's purely the weights and their nature that determine the subsets and, therefore, how hard the accompanying Weighted Mean problem is. Okay.
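The reduction template, in symbols--a sketch, with the subsets S_k being whatever the partitioning scheme produces:

```latex
% Given a Mean instance X = (x_1, ..., x_M), partition (most of) the
% Weighted Mean indices into disjoint subsets S_1, ..., S_M of nearly
% equal total weight, and answer queries by
y_j = x_k \quad \text{for } j \in S_k, \qquad y_j = 0 \ \text{for leftover } j.
% A query to any y_j in S_k is served by querying x_k, so each new subset
% touched costs a new X query, and the Mean lower bound transfers.
```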
So, we're going to choose the subsets so that the Weighted Mean we end up with is very close--within an epsilon--to the Mean of the original problem, and, therefore, anything that is an epsilon approximator for the Weighted Mean is also going to be a two-epsilon approximator for the Mean. So that's how we're going to obtain our lower bound. So we're going to start with a series of warm-up cases, and then we're going to get to the more interesting one, the Power Law Weights. Yes?
>> Is it easy to choose such subsets? That kind of flashed by really quickly.
>> WIRTH: I mean, we'll see. In some cases, yes; in the power law case, it took me a few months, on and off, thinking about how to do it. It turned out to be kind of nice in the end, but I couldn't quite get my head around what one needed to do. So that's the interesting part. I'm leaving that as the dessert of the talk.
>> Okay.
>> WIRTH: Okay. So, let's look at some easy
stuff. So, Geometric Weights. I mean, it seems kind of obvious what you do here, because you've got these very heavy weights at the beginning and then this tail that sort of doesn't matter. And somehow the size of the tail is proportional to, you know, the weight itself--the next weight, let's say--or something like that, right? So, seemingly, the obvious strategy is to sample things down to some point and then just throw the rest away; but you might as well make it rigorous. So, my colleagues published something that says, well, we ought to be using these geometrically decreasing weights because this reflects how humans grade things. It's like, "Oh, do I follow this link? Well, maybe I'll follow the next one, and then this one." So, a human grading the outputs of a retrieval system has a sort of geometrically decreasing interest, in some sense. I mean, that's reasonable. Anyway, if you look at
decreasing interest in some sense. I mean that's reasonable. Anyway, if you look at--if
you were [INDISTINCT] simple lower bound, of course, you must look at every Bi at least
two epsilon and you do some slightly annoying things and you end up with something that's
approximately log two epsilon over log Q, and log Q is equals negative. So you essentially
got log of one over epsilon, okay? Now, remember that epsilon is small, so one over epsilon
is large. And so log of one over epsilon is a lot better than whatever is on squared,
which is our generic lower bound, okay? And in particular, it's also an obvious upper
bound because you can just, at some point, throw away the tail, right, so if you look
at the tail terms and if the total weight is at most epsilon over two, so [INDISTINCT]
kind of affect the answer very much fairly enough to look there. And somehow that--the
position of where that tail starts to--where the sort of lower bound in some sense--position
ends as a winner, the constant determined by queue, right? So, this is obviously--these
upper and lower bounds match to each other in a constant. So instead of epsilon--one
over epsilon squared, we've got log of one over epsilon being appropriate number of samples
we need to take. And in particularly in the geometric case, you just look at the stuff--at
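In symbols, for weights a_i = q^i with 0 < q < 1--my parameterization of the geometric family:

```latex
% The tail after the first T items has total weight
\sum_{i > T} q^{i} \;=\; \frac{q^{T+1}}{1-q} \;\le\; \frac{\varepsilon}{2}
\quad\text{once}\quad
T \;\gtrsim\; \frac{\log(1/\varepsilon)}{\log(1/q)},
% so querying the first O(log(1/epsilon)) items and ignoring the tail
% already keeps the truncation error below epsilon/2.
```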
Not a deep result, maybe, but it's worth stating. Okay. So, something a little bit more interesting is: what happens when the weights diverge? Okay. And it turns out in this case that the problem is as hard as the Mean problem. And so, what do you do? Well, you're going to fake the Mean problem by collecting together blocks of weights
that are roughly the same. In the divergent-series case--and we're assuming that N is really, really big here, okay--we can do this because we know we'll never run out of weight, since the weight series diverges. And we also know that the weights themselves are bounded between zero and one. So, by taking lots and lots of weights, we can get these clumps that are roughly equal. What we do is keep taking weights until we've got a sum that's somewhere between two over epsilon and two over epsilon plus one. We know that we can do this, okay, because of the bound and the fact that the series diverges. So we carve that off, and then we take another chunk of weight between two over epsilon and two over epsilon plus one, and so forth. And if you add things together, and use the one-over-twenty-four or whatever the constant is, and you get exactly that many subsets, then you can estimate the Mean within epsilon over two--sorry, [INDISTINCT] we're doing epsilon over two plus epsilon over two for the Mean, and so forth. So it's not too hard to fake the Mean problem here, because you're just taking successive chunks. So, in fact, here, choosing the subsets is quite easy.
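The grouping, spelled out--a sketch of how I read it:

```latex
% Walk along the weights, closing off a subset as soon as its running
% total reaches 2/epsilon. Since every a_i <= 1, each subset S_k satisfies
\frac{2}{\varepsilon} \;\le\; \sum_{i \in S_k} a_i \;\le\; \frac{2}{\varepsilon} + 1,
% and since the series diverges we never run out of weight; the subsets
% are equal up to a 1 + O(epsilon) factor, so the Mean bound carries over.
```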
Okay. Now, the thing that was a bit more interesting is: what do you do in the power law case? So, we thought, okay, well, let's start with block sensitivity, which was the first tool. You group together contiguous sequences. So you start with the weights that are individually really big--at least two epsilon--and you treat those as blocks by themselves. Then, afterwards, you form contiguous blocks of weights that add up to at least two epsilon.
>> Okay. Can I go back one slide?
>> WIRTH: Sure.
>> If I am reading the results correctly, you said for every fixed delta--so you fix delta and then reduce...
>> WIRTH: Yes.
>> ...the Mean to it, but you don't obtain a lower bound in terms of delta or [INDISTINCT].
>> WIRTH: The delta is not the particularly interesting part here; it's the epsilon that's the sort of...
>> I can believe that. I was just confirming it.
>> WIRTH: Yes, yes, yes. So, obviously, you fix the delta and then it's the epsilon that you play with, right? So, if the gap between the Mean and the Weighted Mean is epsilon over two, and you've estimated the Weighted Mean within epsilon over two, then you've estimated the Mean within epsilon. You know that's hard, so therefore you know this other thing must be hard.
>> And delta is fixed. Yes. Okay. Sure, okay.
>> WIRTH: So--all right. So, in this case, recall, we've got these very heavy weights; we have to look at each of those items individually. And then we form blocks of contiguous weights that add up to at least two epsilon. Now, note that those blocks are going to have weight between two epsilon and four epsilon, because by that stage we know that all the remaining weights are less than two epsilon, right? So we don't overshoot by too much, in some sense, okay? This gives us a bound on the block sensitivity: the number of very heavy terms, plus the remaining weight divided by four epsilon, which is the number of blocks you obtain. But unfortunately, if you do this--I've left out the delta part, the expectation, and so forth--the important point is that you end up with a bound that's something like epsilon to the minus one over P, which isn't quite what we hoped for.
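A rough account of why the count tops out there, again assuming a_i proportional to i^(-p):

```latex
% An item is a block by itself when its normalized weight is about
% epsilon or more, i.e. roughly when
i^{-p} \gtrsim \varepsilon \;\Longleftrightarrow\; i \lesssim \varepsilon^{-1/p},
% and the weight remaining after those items is about
% (\varepsilon^{-1/p})^{1-p} = \varepsilon^{(p-1)/p}, so the contiguous
% light blocks contribute about
\varepsilon^{(p-1)/p} / \varepsilon \;=\; \varepsilon^{-1/p} \ \text{more blocks,}
% giving roughly \varepsilon^{-1/p} blocks in total.
```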
We wanted to get epsilon to the two over one minus two P--assuming that what we derived before is correct, and I believed it was; I know now that it is--and we didn't quite get that. Okay, so what then? I spent some time thinking, "Well, maybe I can use some property of these power law terms." I must have been naive or something: I thought that maybe I could use something about the properties of these sequences to add certain terms together so they'd end up being about the same--and that was no good. And, of course, one does not actually need to construct this thing; one just needs to describe a method for constructing equal-weight subsets, even if one doesn't carry it out explicitly, all right? And it took me a while to appreciate that point. And then I started thinking, "Well, what is this? What does this look like?"--oh, sorry, I've missed it on this slide, but what does this look like?
I'll say it in a moment. So, just a little bit more precisely, what we want to construct are subsets of weights whose totals are somewhere between rho and rho minus epsilon over two M. So, we've got these M subsets, and we want them to be almost equal. And then it means that--oh, I see, I can sort of draw this. Okay. So it then means that the Weighted Mean we obtain is within epsilon over two of rho multiplied by the total sum of the Xs, okay? And then, if we do the appropriate division, we're getting, with probability one minus delta, an approximation to the Mean of about epsilon over M times rho, okay? There are really too many formulas on this slide, actually. Okay. [INDISTINCT] it's a bit old. Okay.
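The error chain on that slide, as I read it:

```latex
% If each of the M subsets S_k has normalized weight in the window
\rho - \frac{\varepsilon}{2M} \;\le\; w(S_k) \;\le\; \rho,
% then, since each x_k lies in [0,1],
\Bigl|\,W(A,Y) - \rho \sum_{k=1}^{M} x_k\,\Bigr|
\;\le\; M \cdot \frac{\varepsilon}{2M} \;=\; \frac{\varepsilon}{2},
% so an (epsilon/2)-estimate of this Weighted Mean is within epsilon of
% rho times the sum of the x_k; dividing by rho*M estimates the Mean of X
% within epsilon/(rho*M), which is O(epsilon) whenever the subsets
% capture a constant fraction of the total weight.
```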
So, what's the interesting thing? The interesting thing is: how do you actually arrange these weights--collect the weights together--to be roughly equal? And I remember thinking to myself, "Well, maybe it's something like bin packing." And there was a conference deadline I was aiming for with this, and I still couldn't work out what to do--maybe if I told myself I'd make the deadline, then I could do it--and it sort of went by. [INDISTINCT] Then I went to my father and said, "What is this?" And so the reason he got brought into this was through a discussion with him. He said, "No, I don't think that's actually what you want to be doing. What you really want to be doing is a scheduling kind of problem." Oh, okay. And we talked about what might be good, and it turns out--if you look at the last line of this--that the thing that inspired us was the problem of maximizing the minimum completion time. So, you've got some jobs and you've got M machines; you assign these jobs to the machines, and you want the machines to all finish at around the same time. So, we're going to maximize the finishing time of the machine that finishes first, in some sense--we're going to make that as late as possible--and then somehow that will mean that everybody finishes close to everybody else. That's a rather strange objective, and there are papers on it. I talked to somebody [INDISTINCT] about this, and he's one of the people who worked on it. He said it has something to do with airplane construction--we never finished our discussion, for some reason. I'll get to the bottom of why he cared one day; I cared for a different reason: I was just trying to make these equal subsets. And it turns out that this longest processing time strategy is a good thing. So, you sort the jobs in non-increasing order of duration, and, as you proceed in time, you assign the next job to the first machine that becomes available. So you sort them and then assign the jobs, okay? And this is exactly what we've got, because we've got these As in decreasing order, and somehow we want to assign them to these subsets so they end up being almost the same at the end.
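A sketch of that LPT strategy in Python--the standard greedy formulation, my code; here the "jobs" are the weights a_i and the "machines" are the subsets:

```python
import heapq

def lpt_partition(jobs, num_machines):
    """Longest Processing Time heuristic: sort jobs in non-increasing
    order, then repeatedly give the next job to the machine with the
    smallest load so far, i.e. the first machine to become available."""
    loads = [(0.0, m) for m in range(num_machines)]  # (current load, id)
    heapq.heapify(loads)
    subsets = [[] for _ in range(num_machines)]
    for job in sorted(jobs, reverse=True):
        load, m = heapq.heappop(loads)  # least-loaded machine so far
        subsets[m].append(job)
        heapq.heappush(loads, (load + job, m))
    return subsets
```

For the talk's purpose one would feed in the power-law weights and read off M near-equal subsets.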
>> Can I add a comment?
>> WIRTH: Yes.
>> In the scheduling problem, don't you want to minimize the maximum, not maximize the minimum?
>> WIRTH: Well, it depends. Normally one does want to do that, yes. But this is an unusual scheduling problem in which we want to maximize the minimum.
>> What is your reason for this one?
>> If you put too much weight on one of the buckets, that's okay.
>> WIRTH: Well, no. I'm just saying that there are many scheduling problems around, and this is a somewhat quirky one that only a handful of people have looked at. And it so happens that it relates to what I want to do. But, you know, it was just fortunate that someone had thought of this before.
>> So is this the reason why minimizing the maximum would not be good for you?
>> WIRTH: Um, well, it doesn't necessarily make things equal, right? Though I see what you mean [INDISTINCT]. So, you could bound the maximum and then--I guess it's sort of the same thing, isn't it? I don't know. I mean, I don't know why I liked looking at this rather than that; you'd have to have asked me three years ago. That's [INDISTINCT] a good point. Actually, I don't know. I have this feeling that maybe you could still do that and somehow end up with something very uneven. Perhaps not. Anyway, there
time heuristic, and with all, maybe this helps us. So what do we do here? So, we have a--we
have these jobs and what we're concerned about is the tail, okay? So, imagine we have a job
that has a very large title after it. So large that if we add up all the jobs of [INDISTINCT]
weights afterwards and the tail, and divide it by M minus one--divide the number of machine
minus 1. If that's larger than the duration of the current job then somehow we know that
the difference of the machine completion times is at most the duration of the next job. Why
do I say that? Okay, so you'll see why that's true in a moment.
>> Otherwise the [INDISTINCT] there in that gap?
>> WIRTH: Well, there's sort of not enough time to finish everything otherwise, basically. So--okay, here's what happens. Consider the last job that finishes. When that job started, all the machines were busy--or perhaps it was the very first job, okay? Now, imagine you've got this large tail of jobs that you still have to do, and that large tail, divided by M minus one, is greater than the duration of the current job. Okay, so we're looking at this last job to finish: here's this job l, the last job to finish, and we've got this large tail of stuff, which we're going to assign to all the other machines while this one is working. But the tail is so large that there's not enough machine capacity, in the time within which everything is supposed to finish, to do it all: there are M minus one other machines, and we've only got time Tl, if you like--time Tl on each of M minus one machines--to finish everything else. So, if indeed the tail divided by M minus one were larger than Tl, then there would not be enough machine capacity to finish everything; in other words, this job l could not in fact be the last job. And so, something can only be the last job when its tail is sufficiently small; and this capital N was kind of an indication of which jobs had very large tails, okay?
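The argument in one line--my paraphrase:

```latex
% Let job l, of duration T_l, be the last to finish, and let tail(l) be
% the total duration of the jobs after l in sorted order. When l starts,
% all M machines are busy; if we had
\frac{\mathrm{tail}(l)}{M-1} \;>\; T_l,
% the other M-1 machines could not absorb the whole tail before l
% finishes, contradicting l being last. So whenever the tail is that
% large, the machines' finishing times differ by at most the duration
% of the next job in the order.
```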
So, this is just one final slide--almost the final slide, in some sense. So, what is this meant to--look at, this is good. So, using this LPT, I can now place a bound on the finishing times and guarantee that--knowing some properties about the tail and so forth--I can say what the difference in finishing times between these machines is; in other words, what the error between the subsets is. And we wanted to get epsilon to the two over one minus two P, okay? And we know that at some stage--at least up to some capital N--the tail is large enough. So we set the R, the M, and the rho--anyway, this is a somewhat confusing slide--we set them to do almost what we want; we can't quite get it right. We can estimate the size of the tail, because it's just a sort of [INDISTINCT] and it's a power law, okay? And when we set everything up, we find that we almost get what we want. We do, in fact, get a lower bound of epsilon to the two over one minus two P, but unfortunately we also have to divide by P squared, which, depending on how you look at it, may or may not be important, okay? So, you can in fact fake the Mean problem with the Weighted Mean problem by applying this LPT heuristic and getting all of the subsets right. These are somewhat crude estimates, right, based on integrals, so it's probably the crudeness of the estimates that's causing the P squared to be in there, in some sense. Okay.
So, I guess one thing I'm interested in is what to do next. One thing is that I'm actually looking at something more pragmatic--so, this was a somewhat abstract thing that I'm not sure the retrieval people would care about in depth. I'm trying to look at some estimation techniques that minimize the variance of these estimates: unbiased estimators, and the more direct [INDISTINCT] to do in that case. The other two things are--well, okay, we've got divergent weights, we've got very convergent weights, and we've got this power law that is something in between, but maybe this doesn't actually--you know, it would be good to capture something more generic. Like, why is it that the mode changes from log one over epsilon, then to this epsilon to the two over one minus two P region, and then to epsilon to the minus two? And how sensitive is it to slight changes in the weights--because some of these series look somewhat similar. What's happening, and why? You'd need to do something more generic. And, of course, the third point--the most interesting question--is when we actually compare multiple systems at once, right? This is where you need to consider [INDISTINCT], not just measure the performance of one system. So, of course, you've got lots of common documents, and they appear with different weights according to the different systems--one system may give a document a very high weight, another a very low weight, and so forth. And maybe what we want to do is minimize the sum of the variances of all these things, or--it's not clear what you want to do. One thought I had about the right thing to aim for is that what you're ultimately producing is a ranking of the systems. So, you want to be able to say, "With high probability, we got the ranking correct." And that's ultimately the question: how many things do we need to sample so that, with 95% probability, we actually got the ranking of these systems right? So, I just want to thank those people for hosting me, although
no one's actually there--I know there's no one actually there anymore. And my colleagues: Alistair Moffat and Justin Zobel, and William Webber, who is a PhD student of Alistair Moffat. And just one last little story: William and I used to see each other at the department for a year and not talk to each other--just grunt at each other, as males often do--and in fact the thing that got us started talking, which then led to some of this, was that I discovered he went to the same school as me, and Lawrence, for that matter, too. And suddenly that made us start talking. It's a very noble story, I think. Okay.