8.2 - Problem definition - Huffman codes - [dsa 2] by tim roughgarden

So we clearly have something to be excited about. There's clearly this opportunity to design binary prefix-free codes which improve over the obvious fixed link solution. So, we'd like to have in some sense, optimal algorithm for this problem and for that, we of course need a crisp problem definition. So, to do that it turns out to be useful to think of codes as binary trees. So, this video will develop that connection concluding with the final formal problem statement. So, the last video introduced us to this very interesting computational problem, namely given characters from an alphabet with frequencies, find the best binary prefix-free encoding, find the code which minimizes the average number of bits needed to encode a character. Crucial, the reasoning about this problem is thinking of binary codes as binary trees. So to give you an idea about this correspondence, let's just revisit three of the binary codes we saw in the previous video and see what kind of trees they correspond to. So let's just continue with the four symbol alphabet A B C D. The obvious fixed length code where we encode A, B, C, D was 0, 0, 0, 1, 1, 0 and 1, 1 is just going to correspond to the complete binary tree with four leaves. So let me label this complete binary tree as follows. I'm going to label the leaves A through D from left to right, and I'm going to label each edge of the tree with a 0 if it corresponds to a left-child relationship and with a 1 if it corresponds to a right-child relationship. And now what you see is there's a correspondence between the bits along root to leaf paths and the fixed length encoding. So for example for the symbol C, if we follow the path from the root to the leaf labeled C, we first encounter a 1 because it's a right-child, then we encounter a 0 because it's a left-child. That gives us a 1 and a 0, that's the same as our encoding of the symbol C in this fixed length encoding and the same of course is true for the other three leaves. Next, when we first started playing around with variable-length encodings to motivate the prefix-free property, we studied a code where we replaced the double 0 for an A with a single 0 and the double 1 for a D with a single 1. Now this code was not prefix-free, but we can still represent it as a binary tree. It's just not going to be a complete one. So I'm going to label the edges of this tree the same way as before. Left-child edges will be given a label of 0, right-child edges will be given a label of 1. I'm going to label the left and right children of the root A and D respectively and the two leaves will be given the labels B and C. The reason I labeled the nodes in this way is, because then we have the same kind of correspondence between the encodings that we proposed for the various symbols and the bits that you see along a path from the root to nodes with those symbols. So for example, if you at the node labeled D, the path from the root only has a single bit 1 and that coincides with the proposed encoding of the symbol D. Now, remember, this code is not prefix-free and so therefore, as we saw, it had ambiguity. So if you're wondering what some encoded message means and you see a 0, you're not sure that 0 might be representing the symbol A or alternatively it might be the first half of an encoding of the symbol B. So, this ambiguity is also noticeable in the tree. And the property in the tree that tips you off to the ambiguity is that there are symbols at internal nodes of the tree. The symbols are not merely at the leaves as they were with the first tree with the fixed length in coding, but there are also two internal nodes that have symbols. So let's draw the tree for our final example which, was the variable length but prefix-free code that we looked at in the previous video. So this is going to correspond to a tree which is not perfectly balanced, but it will have labels only at the leaves of the tree. So, if you label the edges of this tree the way we've been doing, all the left-child edges get a 0, all the right-left edges get a 1, and we label the leaves of the tree from A to D going from left to right. You'll see we have the same correspondence as in the previous two trees. the sequence of bits from the root to a leaf coincide with the proposed encoding for it. So, for example, if you look at the leaf labeled C, because you traverse a right-child, another right-child and a left-child to get to it, that's the sequence 1, 1, 0 and that is precisely the proposed encoding for the symbol C. So in general, any binary code can be expressed as a tree in this way, with the left-child pointers being labeled with 0's, the right child pointers being labeled with 1's, and the various nodes being labeled the symbols of the given alphabets, and the bits from the root down to the node labeled with the given symbol corresponding to the proposed encoding for that symbol. And what's cool about thinking about codes as trees is that the really important prefix-free condition, which seems like a nuisance to check in the abstract, shows up in a really clean way in these trees, namely the prefix-free condition is the same as leaves being the only nodes that have labels. No internal nodes are allowed to have a label in a prefix-free code. The reason for this is that we've set it up so that the encodings correspond to the bits along paths from the root to the labeled node. So being a prefix of another corresponds to one node being an ancestor of the other, and so, if all the labels are at the leaves, then of course nobody is an ancestor of the other and we have no prefixes. The other things that's really cool about this tree representation of codes is, it becomes pictorially obvious how you do the coding given this sequence of 0's and 1's from a prefix-free binary code, namely, you start at the beginning and you start at the root of your tree, whenever you see a 0 you go left, whenever you see a 1 you go right. At some point you'll hit a leaf, that leaf will have some label and that's the symbol that's being encoded, and after you hit a leaf, you just start all over again back to the root. So for example, if you were using our variable-length prefix-free code for the four letter alphabet, as in our running example, and you were given the sequence of 0s and 1s, 0, 1, 1, 0, 1, 1, 1. What would you do? Well, you'd start at the root and you see a 0, so you follow the left-child pointer, and you immediately get to a leaf. It's labeled A, so you're going to output and A as the first symbol. Now you start all over. You return to the root, now what do you see? You see a 1, so you go right, you see another one, so you go right, and now you see a 0, so you go left and that gets you to the leaf labeled C. Now you start all over again. You see a 1, you go right, you see a 1, you go right again, and then, finally you see yet one more 1 and you wind up at the lead labelled D. So by repeated traversals through the tree, you decode the sequence of 0s and 1s as a C, D. There's never any ambiguity, because when you hit a label, you know you're going to leave, there's no where to go. And every internal note, it's unlabeled, you know to expect another bit, another traversal further down in the tree. So a final important point about this correspondence is that the encoding lengths of the symbols, the number of bits needed to encode the various symbols, are just the depths of the corresponding leaves in the tree that corresponds to the code. So for example, in our running example the symbol A is the only one that needs only one bit to encode and it's also the only leaf in level one of the tree. And similarly B needs two bits and shows up in the next level, the C and the D which need three bits show up in the third level. So this correspondence is, really just by construction, so how do you encode a given symbol. Well, it's just the bits on the path from the root, and that the number of such bits is just the number of pointer traversals you need to get from the root down to that leaf, and that's just the depth of that leaf in the tree. So we're now in a great position to really have a crisp definition of the problem. The input of course is just the frequencies for a bunch different symbols i from some alphabet capital sigma. I'm going to use P sub i as notation for the frequency of symbol i. So we know what it is we want to optimize. We want to minimize the expected number of bits needed to encode a symbol, where the average of the expectation is taken over the provided frequencies of the various symbols. Well, let's express this objective function using our newfound correspondence with binary trees. In particular, the fact that we can think about encoding lengths as depths of leaves in trees. So, given a tree, T, which corresponds to a prefix-free binary code. That is it should be a binary tree and the leaves of this tree should be in one-to-one correspondence with the symbols of sigma. We're going to define capital L(T) as the average encoding length. It's an average in the sense that we sum over all of the symbols of the alphabet, we weight each symbol by the frequency, and remember, this is part of the input, so whether the provided frequency P survived that symbol i. And then how many bits do we need to encode that symbol i? Well, it's just the depth of the leaf which is labelled i in the given tree, capital T. So this is what we want to make as small as possible. So, for instance, using the data from the previous video, the letters A, B, C, D with frequencies 60, 25, 10, and 5%. Then if we use the complete binary tree, that is the fixed length encoding, we just get two bits per character. While if we use the lopsided tree optimized, so that each A only takes one bit while suffering three bits for C and D, then the average encoding length drops to 1.55, as we saw in the last video. So what then is the goal, what's the responsibility of our algorithm? Well, amongst all binary trees, which have leaves and correspondence to the symbols of sigma, we want to compute the one which makes this average encoding length as small as possible which minimizes our objective function capital L. Turns out Huffman's greedy algorithm does it. More details to come. [SOUND]