Tip:
Highlight text to annotate it
X
So we clearly have something to be excited about.
There's clearly this opportunity to design binary prefix-free codes which
improve over the obvious fixed link solution.
So, we'd like to have in some sense, optimal algorithm for this problem and
for that, we of course need a crisp problem definition.
So, to do that it turns out to be useful to think of codes as binary trees.
So, this video will develop that connection concluding with the final
formal problem statement. So, the last video introduced us to this
very interesting computational problem, namely given characters from an alphabet
with frequencies, find the best binary prefix-free encoding,
find the code which minimizes the average number of bits needed to encode a
character. Crucial, the reasoning about this problem
is thinking of binary codes as binary trees. So to give you an idea about this
correspondence, let's just revisit three of the binary codes we saw in the
previous video and see what kind of trees they correspond to.
So let's just continue with the four symbol alphabet A B C D.
The obvious fixed length code where we encode A, B, C, D was 0, 0, 0, 1, 1, 0
and 1, 1 is just going to correspond to the complete binary tree with four
leaves. So let me label this complete binary tree
as follows. I'm going to label the leaves A through D from left to right, and I'm
going to label each edge of the tree with a 0 if it corresponds to a left-child
relationship and with a 1 if it corresponds to a right-child
relationship. And now what you see is there's a
correspondence between the bits along root to leaf paths and the fixed length
encoding. So for example for the symbol C, if we
follow the path from the root to the leaf labeled C, we first encounter a 1 because
it's a right-child, then we encounter a 0 because it's a left-child.
That gives us a 1 and a 0, that's the same as our encoding of the
symbol C in this fixed length encoding and the same of course is true for the
other three leaves. Next, when we first started playing
around with variable-length encodings to motivate the prefix-free property, we
studied a code where we replaced the double 0 for an A with a single 0 and the
double 1 for a D with a single 1. Now this code was not prefix-free,
but we can still represent it as a binary tree.
It's just not going to be a complete one. So I'm going to label the edges of this
tree the same way as before. Left-child edges will be given a label of
0, right-child edges will be given a label
of 1. I'm going to label the left and right
children of the root A and D respectively and the two leaves will be given the
labels B and C. The reason I labeled the nodes in this
way is, because then we have the same kind of correspondence between the
encodings that we proposed for the various symbols and the bits that you see
along a path from the root to nodes with those symbols.
So for example, if you at the node labeled D, the path from the root only
has a single bit 1 and that coincides with the proposed encoding of the symbol
D. Now, remember, this code is not
prefix-free and so therefore, as we saw, it had ambiguity.
So if you're wondering what some encoded message means and you see a 0, you're not
sure that 0 might be representing the symbol A or alternatively it might be the
first half of an encoding of the symbol B.
So, this ambiguity is also noticeable in the tree.
And the property in the tree that tips you off to the ambiguity is that there
are symbols at internal nodes of the tree.
The symbols are not merely at the leaves as they were with the first tree with the
fixed length in coding, but there are also two internal nodes
that have symbols. So let's draw the tree for our final
example which, was the variable length but prefix-free code that we looked at in
the previous video. So this is going to correspond to a tree
which is not perfectly balanced, but it will have labels only at the leaves of
the tree. So, if you label the edges of this tree
the way we've been doing, all the left-child edges get a 0, all the
right-left edges get a 1, and we label the leaves of the tree from
A to D going from left to right. You'll see we have the same
correspondence as in the previous two trees.
the sequence of bits from the root to a leaf coincide with the proposed encoding
for it. So, for example, if you look at the leaf
labeled C, because you traverse a right-child, another right-child and a
left-child to get to it, that's the sequence 1, 1, 0 and that is precisely
the proposed encoding for the symbol C. So in general, any binary code can be
expressed as a tree in this way, with the left-child pointers being labeled with
0's, the right child pointers being labeled with 1's,
and the various nodes being labeled the symbols of the given alphabets, and the
bits from the root down to the node labeled with the given symbol
corresponding to the proposed encoding for that symbol.
And what's cool about thinking about codes as trees is that the really
important prefix-free condition, which seems like a nuisance to check in the
abstract, shows up in a really clean way in these trees,
namely the prefix-free condition is the same as leaves being the only nodes that
have labels. No internal nodes are allowed to have a
label in a prefix-free code. The reason for this is that we've set it
up so that the encodings correspond to the bits along paths from the root to the
labeled node. So being a prefix of another corresponds
to one node being an ancestor of the other, and so, if all the labels are at
the leaves, then of course nobody is an ancestor of the other and we have no
prefixes. The other things that's really cool about
this tree representation of codes is, it becomes pictorially obvious how you do
the coding given this sequence of 0's and 1's from a prefix-free binary code,
namely, you start at the beginning and you start at the root of your tree,
whenever you see a 0 you go left, whenever you see a 1 you go right.
At some point you'll hit a leaf, that leaf will have some label and that's the
symbol that's being encoded, and after you hit a leaf, you just start all over
again back to the root. So for example, if you were using our
variable-length prefix-free code for the four letter alphabet, as in our running
example, and you were given the sequence of 0s and
1s, 0, 1, 1, 0, 1, 1, 1. What would you do?
Well, you'd start at the root and you see a 0, so you follow the left-child
pointer, and you immediately get to a leaf.
It's labeled A, so you're going to output and A as the first symbol.
Now you start all over. You return to the root, now what do you see?
You see a 1, so you go right, you see another one, so you go right, and now you
see a 0, so you go left and that gets you to the leaf labeled C.
Now you start all over again. You see a 1, you go right, you see a 1,
you go right again, and then, finally you see yet one more 1 and you wind up at the
lead labelled D. So by repeated traversals through the
tree, you decode the sequence of 0s and 1s as a C, D.
There's never any ambiguity, because when you hit a label, you know you're going to
leave, there's no where to go. And every internal note, it's unlabeled,
you know to expect another bit, another traversal further down in the tree.
So a final important point about this correspondence is that the encoding
lengths of the symbols, the number of bits needed to encode the various
symbols, are just the depths of the corresponding leaves in the tree that
corresponds to the code. So for example, in our running example
the symbol A is the only one that needs only one bit to encode and it's also the
only leaf in level one of the tree. And similarly B needs two bits and shows
up in the next level, the C and the D which need three bits
show up in the third level. So this correspondence is, really just by
construction, so how do you encode a given symbol.
Well, it's just the bits on the path from the root, and that the number of such
bits is just the number of pointer traversals you need to get from the root
down to that leaf, and that's just the depth of that leaf in the tree.
So we're now in a great position to really have a crisp definition of the
problem. The input of course is just the
frequencies for a bunch different symbols i from some alphabet capital sigma.
I'm going to use P sub i as notation for the frequency of symbol i.
So we know what it is we want to optimize.
We want to minimize the expected number of bits needed to encode a symbol, where
the average of the expectation is taken over the provided frequencies of the
various symbols. Well, let's express this objective
function using our newfound correspondence with binary trees.
In particular, the fact that we can think about encoding lengths as depths of
leaves in trees. So, given a tree, T, which corresponds to
a prefix-free binary code. That is it should be a binary tree and
the leaves of this tree should be in one-to-one correspondence with the
symbols of sigma. We're going to define capital L(T) as the
average encoding length. It's an average in the sense that we sum
over all of the symbols of the alphabet, we weight each symbol by the frequency,
and remember, this is part of the input, so whether the provided frequency P
survived that symbol i. And then how many bits do we need to
encode that symbol i? Well, it's just the depth of the leaf
which is labelled i in the given tree, capital T.
So this is what we want to make as small as possible.
So, for instance, using the data from the previous video, the letters A, B, C, D
with frequencies 60, 25, 10, and 5%. Then if we use the complete binary tree,
that is the fixed length encoding, we just get two bits per character.
While if we use the lopsided tree optimized, so that each A only takes one
bit while suffering three bits for C and D, then the average encoding length drops
to 1.55, as we saw in the last video. So what then is the goal,
what's the responsibility of our algorithm?
Well, amongst all binary trees, which have leaves and correspondence to the
symbols of sigma, we want to compute the one which makes this average encoding
length as small as possible which minimizes our objective function capital
L. Turns out Huffman's greedy algorithm does
it. More details to come.
[SOUND]