So, before we embark on the analysis, what are we hoping to understand? Well, what seems intuitively clear is that there's going to be some trade-off between the two resources of the bloom filter. One resource is space consumption; the other is, essentially, correctness. The more space we use, that is, the larger the number of bits, the fewer errors we'd hope to make. And as we compress the table more and more, reusing the same bits for more and more different objects, presumably the error rate is going to increase. So the goal of the analysis we're about to do is to understand this trade-off precisely, at a qualitative level. Once we understand how these two resources trade off against each other, we can ask: is there a sweet spot that gives us a useful data structure, with quite small space and a quite manageable error probability? The way we're going to proceed with the analysis will be familiar to those of you who watched the open addressing video about hash tables. To make the mathematical analysis tractable, I'm going to make a heuristic assumption, a strong assumption which is not really satisfied by the hash functions you would use in practice. We're going to use that assumption to derive a performance guarantee for bloom filters, but as with any implementation, you should check that your implementation is actually getting performance comparable to what the idealized analysis suggests. That said, if you use good hash functions and you have non-pathological data, the hope, and this is borne out in many empirical studies, is that you will see performance comparable to what this heuristic analysis suggests. So, what is the heuristic assumption? Well, it's going to be familiar from our hashing discussions.
We're just going to assume that all of the hashing is totally random. That is, for each choice of a hash function h_i and for each possible object x, the slot, the position of the array, that the hash function assigns to that object is uniformly random, and it's independent of all other outputs of all hash functions on all objects. So the setup, then, is that we have n bits, and we have a data set S which we have inserted into our bloom filter. Our eventual goal is to understand the error rate, the false positive probability: the chance that an object we haven't inserted into the bloom filter looks as if it has been. But as a preliminary step, I want to ask about the population of 1s after we've inserted the data set S into the bloom filter. Specifically, let's focus on a particular position of the array; by symmetry, it doesn't matter which one. What is the probability that a given bit, a given position of this array, has been set to 1 after we've inserted the entire data set S? Alright, so this is actually a somewhat difficult quiz question. The correct answer is the second one: it's 1 - (1 - 1/n)^(k|S|), where k is the number of hash functions and |S| is the number of objects. That's the probability that, say, the first bit of the bloom filter has been set to 1 after the data set S has been inserted. Maybe the easiest way to see this is to first focus on the first answer, (1 - 1/n)^(k|S|). I claim that this is the probability that the first bit is 0 after the entire data set has been inserted; the probability that it's 1 is then just 1 minus that quantity, which is the second answer. So why is the first choice the probability that the first bit equals 0? Well, the bit is initially 0, and remember, bits only ever get set from 0 to 1. So we really need to analyze the probability that this first bit survives all of the darts that get thrown at the bloom filter over the course of the entire data set being inserted. There are |S| objects, and each insertion throws k darts, uniformly at random and independently of each other, at the array. Whatever position a dart hits gets set to 1: if it was 0, it becomes 1, and if it was 1 already, it stays 1. So how is this first bit going to stay 0? It has to be missed by all of the darts. A given dart is equally likely to land on any of the n bits, so the probability that it hits this particular bit is only 1/n, and the probability that it lands somewhere else is 1 - 1/n. So the bit survives a single dart with probability 1 - 1/n, and there are k|S| darts being thrown in all.
Right, k darts per object inserted. So the overall probability of eluding all of the darts is (1 - 1/n)^(k|S|), where k is the number of hash functions and |S| is the number of insertions. And then the probability that the bit is 1 is 1 minus that quantity, which is the second option in the quiz. So, let's go ahead and resume our analysis using the answer to that quiz. What did we discover? The probability that a given bit is 1 is 1 - (1 - 1/n)^(k|S|), where n is the number of positions, k is the number of hash functions, and |S| is the number of insertions. That's kind of a messy quantity, so let's recall a simple estimation fact that we used once earlier; you saw it when we analyzed the randomized contraction algorithm and the benefit of multiple repetitions of it. The trick is to estimate a quantity of the form 1 + x or 1 - x by e^x or e^(-x), as the case may be. You take the function 1 + x, which passes through the points (-1, 0) and (0, 1).
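As a quick sanity check on this estimation fact, here is a small Python snippet (my own illustration, not part of the lecture) that compares 1 + x against e^x at a few sample points:

```python
import math

# The estimation fact: 1 + x <= e^x for every real x, with equality exactly at x = 0.
for x in [-2.0, -0.5, -1e-3, 0.0, 1e-3, 0.5, 2.0]:
    assert 1 + x <= math.exp(x), x

# The two curves touch ("kiss") at the point (0, 1).
assert math.exp(0.0) == 1 + 0.0 == 1.0
print("1 + x <= e^x holds at all sampled points")
```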
Of course, 1 + x is a straight line. Then you also look at the function e^x. These two functions kiss at the point (0, 1), and everywhere else e^x is strictly above 1 + x. So for any real value of x, we can always upper bound the quantity 1 + x by e^x. Let's apply this fact to our quantity, (1 - 1/n)^(k|S|): taking x to be -1/n, we get that 1 - 1/n is at most e^(-1/n), and so the probability that a given bit is 1 is approximately 1 - e^(-k|S|/n), okay? (Strictly speaking, since we've replaced 1 - 1/n by the slightly larger e^(-1/n), this approximation slightly underestimates the probability, but for large n it is essentially exact.)
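To see how tight this approximation is, here's a small simulation, a sketch of my own under the totally-random-hashing assumption, with made-up sizes n = 1000, k = 3, |S| = 200. It compares the exact probability 1 - (1 - 1/n)^(k|S|), the approximation 1 - e^(-k|S|/n), and the empirical fraction of 1s:

```python
import math
import random

def fraction_of_ones(n, k, num_objects, trials=20, seed=1):
    """Insert num_objects objects, each hashed to k uniformly random
    positions (the heuristic assumption), and return the average
    fraction of the n bits that end up set to 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        bits = [0] * n
        for _ in range(num_objects):
            for _ in range(k):
                bits[rng.randrange(n)] = 1  # each dart sets its slot to 1
        total += sum(bits) / n              # by symmetry, every bit has the same marginal
    return total / trials

n, k, s = 1000, 3, 200                      # hypothetical sizes, for illustration only
exact = 1 - (1 - 1 / n) ** (k * s)          # the quiz answer
approx = 1 - math.exp(-k * s / n)           # via 1 + x <= e^x with x = -1/n
print(exact, approx, fraction_of_ones(n, k, s))
```

With these numbers, all three values come out around 0.45, and the exact and approximated probabilities agree to three decimal places.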
Let's simplify a little bit further by introducing some notation. I'm going to let b denote the number of bits we're using per object; this is the quantity I was telling you to think of as 8 previously. It's the ratio of n, the total number of bits, to |S|, the number of objects. So this green expression becomes 1 - e^(-k/b), where b is the number of bits per object. And now we're already seeing the type of trade-off we were expecting. Remember, we expect that as we use more and more space, the error rate should go down, and if we compress the table a lot, reusing bits for lots of different objects, that's when we should start seeing a lot of false positives. In this light blue expression, if you take the number of bits per object, the amount of space b, to be very large, tending to infinity, then the exponent -k/b tends to 0, e^0 is 1, and so the probability of a given bit being 1 tends to 0. That is, the more bits you have, the bigger the space, the smaller the fraction of 1s and the bigger the fraction of 0s. That should translate into a smaller false positive probability, as we'll make precise on the next and final slide. So let's rewrite the upshot from the last slide: the probability that a given bit equals 1 is approximately 1 - e^(-k/b), where k is the number of hash functions and b is the number of bits per object. Now, this is not the quantity we actually care about. The quantity we care about is the false positive probability, where something looks like it's in the bloom filter even though it was never inserted. So let's focus on some object, say some IP address, which has never been inserted into this bloom filter. For a given object x which is not in the data set, what has to happen for us to have a false positive for this object? Well, each one of its k bits has to be set to 1. We already computed the probability that a given bit is set to 1, and what has to happen is that all k of the bits that indicate x's membership in the bloom filter are set to 1. Heuristically, then, we just take the quantity we computed on the previous slide and raise it to the kth power, indicating that it has to happen k different times. So, believe it or not, we now have exactly what we wanted.
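Here's that error rate written out as a short Python function, a sketch of my own of the heuristic formula, together with a check that for a fixed k it decreases as the space b grows:

```python
import math

def false_positive_rate(b, k):
    """Heuristic false positive probability of a bloom filter that uses
    b bits per object and k hash functions: (1 - e^{-k/b})^k."""
    return (1 - math.exp(-k / b)) ** k

# For any fixed number of hash functions, more bits per object means fewer errors.
for k in (2, 4, 6):
    rates = [false_positive_rate(b, k) for b in (4, 8, 16, 32)]
    assert rates == sorted(rates, reverse=True)

print(false_positive_rate(8, 5))  # roughly 0.02 with 8 bits/object and 5 hash functions
```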
What we set out to do was derive a qualitative understanding of the intuitive trade-off between, on the one hand, the space used and, on the other hand, the error probability, the false positive probability. So, let's give this green circled quantity a name: we'll call it epsilon, the error rate, and again, all errors are false positives. As b goes to infinity, as we use more and more space, the exponent -k/b goes to 0, so 1 - e^(-k/b) goes to 0 as well, and of course, once we raise that to the kth power, it gets even closer to 0. So the bigger b gets, the smaller this error rate epsilon gets. Now let's get to the punch line. Remember, the question is: is this data structure actually useful? Can we set all of the parameters so that we get both really usefully small space and a tolerable error rate epsilon? And of course, we wouldn't be giving this video if the answer wasn't yes. Now, one thing I've been glossing over all along is how we set k. How do we choose the number of hash functions?
I told you at the very beginning to think of k as a small constant, like 2, 3, 4, or 5. Now that we have this really nice quantitative version of how the error rate and the space trade off against each other, we can answer how to set k, namely, to set k optimally. What do I mean? Well, fix the number of bits you're using per object: 8, 16, 24, whatever. For fixed b, you can just choose the k that minimizes this green quantity, that minimizes the error rate epsilon. How do you minimize it? You do it just like you learned in calculus, and I'll leave that as an exercise for you to do in the privacy of your own home. But for fixed b, the way to make the green quantity epsilon as small as possible is to set the number of hash functions k to be roughly (ln 2) * b; ln 2 is a number less than 1, about 0.693. So, in other words, the number of hash functions in the optimal implementation of the bloom filter scales linearly with the number of bits you're using per object: it's about 0.693 times the bits per object. Of course, this is generally not going to be an integer, so you just take k to be this number rounded up or rounded down. Continuing the heuristic analysis, now that we know how to set k optimally to minimize the error for a given amount of space, we can plug that value of k back in and see how the space and the error rate trade off against each other, and we get a very nice answer. Specifically, we get that the error rate epsilon, under an optimal choice of the number of hash functions k, decreases exponentially in the number of bits you use per object: it's roughly (1/2) raised to the power (ln 2) * b, that is, one half to the power 0.693 times the number of bits per object. The key qualitative point here is that epsilon goes down really quickly as you scale b. If you double the number of bits you're allocating per object, you're squaring the error rate, and for small error rates, squaring makes them much, much smaller. And of course, this is just one equation in two variables; if you prefer, you can solve it to express b, the space requirement, as a function of the error requirement. So if you know that the tolerance for false positives in your application is, say, one percent, you can just solve for b and figure out how many bits per object you need to allocate. Rewriting, what you get is that the number of bits per object you need is roughly 1.44 times log base 2 of 1/epsilon. As expected, as epsilon gets smaller and smaller, as you want fewer and fewer errors, the space requirement increases. So, the final question: is this a useful data structure? Can you set all the parameters so that you get a really interesting space-error trade-off? And the answer is: totally. Let me give you an example. Let's go back to having eight bits of storage per object, so that corresponds to b = 8. Then what this formula indicates is that we should use five or six hash functions, and already you have an error probability of something like two percent, which for a lot of the motivating applications we talked about is already good enough. And if you double the number of bits to, say, 16 per object, then this error probability gets really small, pushing one in 2,000 or so. So, to conclude: at least in this idealized analysis, which again you should check against any real-world implementation, although empirically it is definitely achievable with a well implemented bloom filter and non-pathological data to get this kind of performance, even with a really ridiculously minuscule amount of space per object, generally much less than storing the objects themselves, you can get fast inserts and fast lookups. You do have to put up with false positives, but with a very controllable error rate, and that's what makes bloom filters a win in a number of applications.
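Putting the whole analysis together, here is a short Python sketch of these closed-form rules (my own illustration of the heuristic formulas, not production code) that reproduces the numbers from the example:

```python
import math

def false_positive_rate(b, k):
    """Heuristic error rate (1 - e^{-k/b})^k with b bits/object, k hash functions."""
    return (1 - math.exp(-k / b)) ** k

def optimal_k(b):
    """The k minimizing the error rate is roughly b * ln 2 (about 0.693 * b);
    in practice we round to a nearby integer."""
    return max(1, round(b * math.log(2)))

def bits_per_object_needed(eps):
    """Space needed for a target error rate eps: roughly 1.44 * log2(1/eps)."""
    return math.log2(1 / eps) / math.log(2)

print(optimal_k(8), false_positive_rate(8, optimal_k(8)))     # 6 hash functions, ~2% errors
print(optimal_k(16), false_positive_rate(16, optimal_k(16)))  # 11 hash functions, ~1 in 2,000
print(bits_per_object_needed(0.01))                           # ~9.6 bits/object for 1% errors
```

Note that doubling b from 8 to 16 takes the error rate from about 2% down to about 0.05%, the "squaring" effect described above.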