Tip:
Highlight text to annotate it
X
[Walkthrough - Problem Set 6]
[Zamyla Chan - Harvard University]
[This is CS50. - CS50.TV]
Hello, everyone, and welcome to Walkthrough 6: Huff'n Puff.
In Huff'n Puff what we are doing is going to be dealing with a Huffman compressed file
and then puffing it back up, so decompressing it,
so that we can translate from the 0s and 1s that the user sends us
and convert it back into the original text.
Pset 6 is going to be pretty cool because you're going to see some of the tools
that you used in pset 4 and pset 5 and kind of combining them into 1 pretty neat concept
when you come to think about it.
Also, arguably, pset 4 and 5 were the most challenging psets that we had to offer.
So from now, we have this 1 more pset in C,
and then after that we're on to web programming.
So congratulate yourselves for getting over the toughest hump in CS50.
Moving on for Huff'n Puff, our toolbox for this pset are going to be Huffman trees,
so understanding not only how binary trees work but also specifically Huffman trees,
how they're constructed.
And then we're going to have a lot of distribution code in this pset,
and we'll come to see that actually some of the code
we might not be able to fully understand yet,
and so those will be the .c files, but then their accompanying .h files
will give us enough understanding that we need so that we know how those functions work
or at least what they are supposed to do--their inputs and outputs--
even if we don't know what's happening in the black box
or don't understand what's happening in the black box within.
And then finally, as usual, we are dealing with new data structures,
specific types of nodes that point to certain things,
and so here having a pen and paper not only for the design process
and when you're trying to figure out how your pset should work
but also during debugging.
You can have GDB alongside your pen and paper while you take down what the values are,
where your arrows are pointing, and things like that.
First let's look at Huffman trees.
Huffman trees are binary trees, meaning that each node only has 2 children.
In Huffman trees the characteristic is that the most frequent values
are represented by the fewest bits.
We saw in lecture examples of Morse code, which kind of consolidated some letters.
If you're trying to translate an A or an E, for example,
you're translating that often, so instead of having to use the full set of bits
allocated for that usual data type, you compress it down to fewer,
and then those letters who are represented less often are represented with longer bits
because you can afford that when you weigh out the frequencies that those letters appear.
We have the same idea here in Huffman trees
where we are making a chain, a kind of path to get to the certain characters.
And then the characters who have the most frequency
are going to be represented with the fewest bits.
The way that you construct a Huffman tree
is by placing all of the characters that appear in the text
and calculating their frequency, how often they appear.
This could either be a count of how many times those letters appear
or perhaps a percentage of out of all the characters how many each one appears.
And so what you do is once you have all of that mapped out,
then you look for the 2 lowest frequencies and then join them as siblings
where then the parent node has a frequency which is the sum of its 2 children.
And then you by convention say that the left node,
you follow that by following the 0 branch,
and then the rightmost node is the 1 branch.
As we saw in Morse code, the one gotcha was that if you had just a beep and the beep
it was ambiguous.
It could either be 1 letter or it could be a sequence of 2 letters.
And so what Huffman trees does is because by nature of the characters
or our final actual characters being the last nodes on the branch--
we refer to those as leaves--by virtue of that there can't be any ambiguity
in terms of which letter you're trying to encode with the series of bits
because nowhere along the bits that represent 1 letter
will you encounter another whole letter, and there won't be any confusion there.
But we'll go into examples that you guys can actually see that
instead of us just telling you that that's true.
Let's look at a simple example of a Huffman tree.
I have a string here that is 12 characters long.
I have 4 As, 6 Bs, and 2 Cs.
My first step would be to count.
How many times does A appear? It appears 4 times in the string.
B appears 6 times, and C appears 2 times.
Naturally, I'm going to say I'm using B most often,
so I want to represent B with the fewest number of bits, the fewest number of 0s and 1s.
And then I'm also going to expect C to require the most amount of 0s and 1s as well.
First what I did here is I placed them in ascending order in terms of frequency.
We see that the C and the A, those are our 2 lowest frequencies.
We create a parent node, and that parent node doesn't have a letter associated with it,
but it does have a frequency, which is the sum.
The sum becomes 2 + 4, which is 6.
Then we follow the left branch.
If we were at that 6 node, then we would follow 0 to get to C
and then 1 to get to A.
So now we have 2 nodes.
We have the value 6 and then we also have another node with the value 6.
And so those 2 are not only the 2 lowest but also just the 2 that are left,
so we join those by another parent, with the sum being 12.
So here we have our Huffman tree
where to get to B, that would just be the bit 1
and then to get to A we would have 01 and then C having 00.
So here we see that basically we're representing these chars with either 1 or 2 bits
where the B, as predicted, has the least.
And then we had expected C to have the most, but since it's such a small Huffman tree,
then the A is also represented by 2 bits as opposed to somewhere in the middle.
Just to go over another simple example of the Huffman tree,
say you have the string "Hello."
What you do is first you would say how many times does H appear in this?
H appears once and then e appears once and then we have l appearing twice
and o appearing once.
And so then we expect which letter to be represented by the least number of bits?
[student] l. >>l. Yeah. l is right.
We expect l to be represented by the least number of bits
because l is used most in the string "Hello."
What I'm going to do now is draw out these nodes.
I have 1, which is H, and then another 1, which is e, and then a 1, which is o--
right now I'm putting them in order--and then 2, which is l.
Then I say the way that I build a Huffman tree is to find the 2 nodes with the least frequencies
and make them siblings by creating a parent node.
Here we have 3 nodes with the lowest frequency. They're all 1.
So here we choose which one we're going to link first.
Let's say I choose the H and the e.
The sum of 1 + 1 is 2, but this node doesn't have a letter associated with it.
It just holds the value.
Now we look at the next 2 lowest frequencies.
That's 2 and 1. That could be either of those 2, but I'm going to choose this one.
The sum is 3.
And then finally, I only have 2 left, so then that becomes 5.
Then here, as expected, if I fill in the encoding for that,
1s are always the right branch and 0s are the left one.
Then we have l represented by just 1 bit and then the o by 2
and then the e by 2 and then the H falls down to 3 bits.
So you can transmit this message "Hello" instead of actually using the characters
by just 0s and 1s.
However, remember that in several cases we had ties with our frequency.
We could have either joined the H and the o first maybe.
Or then later on when we had the l represented by 2
as well as the joined one represented by 2, we could have linked either one.
And so when you send the 0s and 1s, that actually doesn't guarantee
that the recipient can fully read your message right off the bat
because they might not know which decision you made.
So when we're dealing with Huffman compression,
somehow we have to tell the recipient of our message how we decided--
They need to know some kind of extra information
in addition to the compressed message.
They need to understand what the tree actually looks like,
how we actually made those decisions.
Here we were just doing examples based on the actual count,
but sometimes you can also have a Huffman tree
based on the frequency at which letters appear, and it's the exact same process.
Here I'm expressing it in terms of percentages or a fraction,
and so here the exact same thing.
I find the 2 lowest, sum them, the next 2 lowest, sum them,
until I have a full tree.
Even though we could do it either way, when we're dealing with percentages,
that means we're dividing things and dealing with decimals or rather floats
if we're thinking about data structures of a head.
What do we know about floats?
What's a common problem when we're dealing with floats?
[student] Imprecise arithmetic. >>Yeah. Imprecision.
Because of floating point imprecision, for this pset so that we make sure
that we don't lose any values, then we're actually going to be dealing with the count.
So if you were to think of a Huffman node, if you look back to the structure here,
if you look at the green ones it has a frequency associated with it
as well as it points to a node to its left as well as a node to its right.
And then the red ones there also have a character associated with them.
We're not going to make separate ones for the parents and then the final nodes,
which we refer to as leaves, but rather those will just have NULL values.
For every node we'll have a character, the symbol that that node represents,
then a frequency as well as a pointer to its left child as well as its right child.
The leaves, which are at the very bottom, would also have node pointers
to their left and to their right, but since those values aren't pointing to actual nodes,
what would their value be? >>[student] NULL. >>NULL. Exactly.
Here's an example of how you might represent the frequency in floats,
but we're going to be dealing with it with integers,
so all I did is change the data type there.
Let's go on to a little bit more of a complex example.
But now that we've done the simple ones, it's just the same process.
You find the 2 lowest frequencies, sum the frequencies
and that's the new frequency of your parent node,
which then points to its left with the 0 branch and the right with the 1 branch.
If we have the string "This is cs50," then we count how many times is T mentioned,
h mentioned, i, s, c, 5, 0.
Then what I did here is with the red nodes I just planted,
I said I'm going to have these characters eventually at the bottom of my tree.
Those are going to be all of the leaves.
Then what I did is I sorted them by frequency in ascending order,
and this is actually the way that the pset code does it
is it sorts it by frequency and then alphabetically.
So it has the numbers first and then alphabetically by the frequency.
Then what I would do is I would find the 2 lowest. That's 0 and 5.
I would sum them, and that's 2. Then I would continue, find the next 2 lowest.
Those are the two 1s, and then those become 2 as well.
Now I know that my next step is going to be joining the lowest number,
which is the T, the 1, and then choosing one of the nodes that has 2 as the frequency.
So here we have 3 options.
What I'm going to do for the slide is just visually rearrange them for you
so that you can see how I'm building it up.
What the code and your distribution code is going to do would be join the T one
with the 0 and 5 node.
So then that sums to 3, and then we continue the process.
The 2 and the 2 now are the lowest, so then those sum to 4.
Everyone following so far? Okay.
Then after that we have the 3 and the 3 that need to be added up,
so again I'm just switching it so that you can see visually so that it doesn't get too messy.
Then we have a 6, and then our final step is now that we only have 2 nodes
we sum those to make the root of our tree, which is 10.
And the number 10 makes sense because each node represented,
their value, their frequency number, was how many times they appeared in the string,
and then we have 5 characters in our string, so that makes sense.
If we look up at how we would actually encode it,
as expected, the i and the s, which appear the most often
are represented by the fewest number of bits.
Be careful here.
In Huffman trees the case actually matters.
An uppercase S is different than a lowercase s.
If we had "This is CS50" with capital letters, then the lowercase s would only appear twice,
would be a node with 2 as its value, and then uppercase S would only be once.
So then your tree would change structures because you actually have an extra leaf here.
But the sum would still be 10.
That's what we're actually going to be calling the checksum,
the addition of all of the counts.
Now that we've covered Huffman trees, we can dive into Huff'n Puff, the pset.
We're going to start with a section of questions,
and this is going to get you accustomed with binary trees and how to operate around that:
drawing nodes, creating your own typedef struct for a node,
and seeing how you might insert into a binary tree, one that's sorted,
traversing it, and things like that.
That knowledge is definitely going to help you when you dive into the Huff'n Puff portion
of the pset.
In the standard edition of the pset, your task is to implement Puff,
and in the hacker version your task is to implement Huff.
What Huff does is it takes text and then it translates it into the 0s and 1s,
so the process that we did above where we counted the frequencies
and then made the tree and then said, "How do I get T?"
T is represented by 100, things like that,
and then Huff would take the text and then output that binary.
But also because we know that we want to allow our recipient of the message
to recreate the exact same tree, it also includes information about the frequency counts.
Then with Puff we are given a binary file of 0s and 1s
and given also the information about the frequencies.
We translate all of those 0s and 1s back into the original message that was,
so we're decompressing that.
If you're doing the standard edition, you don't need to implement Huff,
so then you can just use the staff implementation of Huff.
There are instructions in the spec on how to do that.
You can run the staff implementation of Huff upon a certain text file
and then use that output as your input to Puff.
As I mentioned before, we have a lot of distribution code for this one.
I'm going to start going through it.
I'm going to spend most of the time on the .h files
because in the .c files, because we have the .h
and that provides us with the prototypes of the functions,
we don't fully need to understand exactly--
If you don't understand what's going on in the .c files, then don't worry too much,
but definitely try to take a look because it might give some hints
and it's useful to get used to reading other people's code.
Looking at huffile.h, in the comments it declares a layer of abstraction for Huffman-coded files.
If we go down, we see that there is a maximum of 256 symbols that we might need codes for.
This includes all the letters of the alphabet--uppercase and lowercase--
and then symbols and numbers, etc.
Then here we have a magic number identifying a Huffman-coded file.
Within a Huffman code they're going to have a certain magic number
associated with the header.
This might look like just a random magic number,
but if you actually translate it into ASCII, then it actually spells out HUFF.
Here we have a struct for a Huffman-encoded file.
There's all of these characteristics associated with a Huff file.
Then down here we have the header for a Huff file, so we call it Huffeader
instead of adding the extra h because it sounds the same anyway.
Cute.
We have a magic number associated with it.
If it's an actual Huff file, it's going to be the number up above, this magic one.
And then it will have an array.
So for each symbol, of which there are 256,
it's going to list what the frequency of those symbols are within the Huff file.
And then finally, we have a checksum for the frequencies,
which should be the sum of those frequencies.
So that's what a Huffeader is.
Then we have some functions that return the next bit in the Huff file
as well as writes a bit to the Huff file, and then this function here, hfclose,
that actually closes the Huff file.
Before, we were dealing with straight just fclose,
but when you have a Huff file, instead of fclosing it
what you're actually going to do is hfclose and hfopen it.
Those are specific functions to the Huff files that we're going to be dealing with.
Then here we read in the header and then write the header.
Just by reading the .h file we can kind of get a sense of what a Huff file might be,
what characteristics it has, without actually going into the huffile.c,
which, if we dive in, is going to be a bit more complex.
It has all of the file I/O here dealing with pointers.
Here we see that when we call hfread, for instance, it's still dealing with fread.
We're not getting rid of those functions entirely, but we're sending those to be taken care of
inside the Huff file instead of doing all of it ourselves.
You can feel free to scan through this if you're curious
and go and peel the layer back a little bit.
The next file that we're going to look at is tree.h.
Before in the Walkthrough slides we said we expect a Huffman node
and we made a typedef struct node.
We expect it to have a symbol, a frequency, and then 2 node stars.
In this case what we're doing is this is essentially the same
except instead of node we're going to call them trees.
We have a function that when you call make tree it returns you a tree pointer.
Back to Speller, when you were making a new node
you said node* new word = malloc(sizeof) and things like that.
Basically, mktree is going to be dealing with that for you.
Similarly, when you want to remove a tree,
so that's essentially freeing the tree when you're done with it,
instead of explicitly calling free on that, you're actually just going to use the function rmtree
where you pass in the pointer to that tree and then tree.c will take care of that for you.
We look into tree.c.
We expect the same functions except to see the implementation as well.
As we expected, when you call mktree it mallocs the size of a tree into a pointer,
initializes all of the values to the NULL value, so 0s or NULLs,
and then returns the pointer to that tree that you've just malloc'd to you.
Here when you call remove tree it first makes sure that you're not double freeing.
It makes sure that you actually have a tree that you want to remove.
Here because a tree also includes its children,
what this does is it recursively calls remove tree on the left node of the tree
as well as the right node.
Before it frees the parent, it needs to free the children as well.
Parent is also interchangeable with root.
The first ever parent, so like the great-great-great-great-grandfather
or grandmother tree, first we have to free down the levels first.
So traverse to the bottom, free those, and then come back up, free those, etc.
So that's tree.
Now we look at forest.
Forest is where you place all of your Huffman trees.
It's saying that we're going to have something called a plot
that contains a pointer to a tree as well as a pointer to a plot called next.
What structure does this kind of look like?
It kind of says it over there.
Right over here.
A linked list.
We see that when we have a plot it's like a linked list of plots.
A forest is defined as a linked list of plots,
and so the structure of forest is we're just going to have a pointer to our first plot
and that plot has a tree within it or rather points to a tree
and then points to the next plot, so on and so forth.
To make a forest we call mkforest.
Then we have some pretty useful functions here.
We have pick where you pass in a forest and then the return value is a Tree*,
a pointer to a tree.
What pick will do is it will go into the forest that you're pointing to
then remove a tree with the lowest frequency from that forest
and then give you the pointer to that tree.
Once you call pick, the tree won't exist in the forest anymore,
but the return value is the pointer to that tree.
Then you have plant.
Provided that you pass in a pointer to a tree that has a non-0 frequency,
what plant will do is it will take the forest, take the tree, and plant that tree inside of the forest.
Here we have rmforest.
Similar to remove tree, which basically freed all of our trees for us,
remove forest will free everything contained in that forest.
If we look into forest.c, we'll expect to see at least 1 rmtree command in there,
because to free memory in the forest if a forest has trees in it,
then eventually you're going to have to remove those trees too.
If we look into forest.c, we have our mkforest, which is as we expect.
We malloc things.
We initialize the first plot in the forest as NULL because it's empty to begin with,
then we see pick, which returns the tree with the lowest weight, the lowest frequency,
and then gets rid of that particular node that points to that tree and the next one,
so it takes that out of the linked list of the forest.
And then here we have plant, which inserts a tree into the linked list.
What forest does is it nicely keeps it sorted for us.
And then finally, we have rmforest and, as expected, we have rmtree called there.
Looking at the distribution code so far, huffile.c was probably by far the hardest to understand,
whereas the other files themselves were pretty simple to follow.
With our knowledge of pointers and linked lists and such,
we were able to follow pretty well.
But all we need to really make sure that we fully understand is the .h files
because you need to be calling those functions, dealing with those return values,
so make sure that you fully understand what action is going to be performed
whenever you call one of those functions.
But actually understanding inside of it isn't quite necessary because we have those .h files.
We have 2 more files left in our distribution code.
Let's look at dump.
Dump by its comment here takes a Huffman-compressed file
and then translates and dumps all of its content out.
Here we see that it's calling hfopen.
This is kind of mirroring to file* input = fopen,
and then you pass in the information.
It's almost identical except instead of a file* you're passing in a Huffile;
instead of fopen you're passing in hfopen.
Here we read in the header first, which is kind of similar to how we read in the header
for a bitmap file.
What we're doing here is checking to see whether the header information
contains the right magic number that indicates that it's an actual Huff file,
then all of these checks to make sure that the file that we open is an actual huffed file or not.
What this does is it outputs the frequencies of all of the symbols that we can see
within a terminal into a graphical table.
This part is going to be useful.
It has a bit and reads bit by bit into the variable bit and then prints it out.
So if I were to call dump on hth.bin, which is the result of huffing a file
using the staff solution, I would get this.
It's outputting all of these characters and then putting the frequency at which they appear.
If we look, most of them are 0s except for this: H, which appears twice,
and then T, which appears once.
And then here we have the actual message in 0s and 1s.
If we look at hth.txt, which is presumably the original message that was huffed,
we expect to see some Hs and Ts in there.
Specifically, we expect to see just 1 T and 2 Hs.
Here we are in hth.txt. It indeed has HTH.
Included in there, although we can't see it, is a newline character.
The Huff file hth.bin is also encoding the newline character as well.
Here because we know that the order is HTH and then newline,
we can see that probably the H is represented by just a single 1
and then the T is probably 01 and then the next H is 1 as well
and then we have a newline indicated by two 0s.
Cool.
And then finally, because we're dealing with multiple .c and .h files,
we're going to have a pretty complex argument to the compiler,
and so here we have a Makefile that makes dump for you.
But actually, you have to go about making your own puff.c file.
The Makefile actually doesn't deal with making puff.c for you.
We're leaving that up to you to edit the Makefile.
When you enter a command like make all, for instance, it will make all of them for you.
Feel free to look at the examples of Makefile from the past pset
as well as going off of this one to see how you might be able to make your Puff file
by editing this Makefile.
That's about it for our distribution code.
Once we've gotten through that, then here's just another reminder
of how we're going to be dealing with the Huffman nodes.
We're not going to be calling them nodes anymore; we're going to be calling them trees
where we're going to be representing their symbol with a char,
their frequency, the number of occurrences, with an integer.
We're using that because it's more precise than a float.
And then we have another pointer to the left child as well as the right child.
A forest, as we saw, is just a linked list of trees.
Ultimately, when we're building up our Huff file,
we want our forest to contain just 1 tree--
1 tree, 1 root with multiple children.
Earlier on when we were just making our Huffman trees,
we started out by placing all of the nodes onto our screen
and saying we're going to have these nodes,
eventually they're going to be the leaves, and this is their symbol, this is their frequency.
In our forest if we just have 3 letters, that's a forest of 3 trees.
And then as we go on, when we added the first parent,
we made a forest of 2 trees.
We removed 2 of those children from our forest and then replaced it with a parent node
that had those 2 nodes as children.
And then finally, our last step with making our example with the As, Bs, and Cs
would be to make the final parent,
and so then that would bring our total count of trees in the forest to 1.
Does everyone see how you start out with multiple trees in your forest
and end up with 1? Okay. Cool.
What do we need to do for Puff?
What we need to do is ensure that, as always, they give us the right type of input
so that we can actually run the program.
In this case they're going to be giving us after their first command-line argument
2 more: the file that we want to decompress and the output of the decompressed file.
But once we make sure that they pass us in the right amount of values,
we want to ensure that the input is a Huff file or not.
And then once we guarantee that it's a Huff file, then we want to build our tree,
build up the tree such that it matches the tree that the person who sent the message built.
Then after we build the tree, then we can deal with the 0s and 1s that they passed in,
follow those along our tree because it's identical,
and then write that message out, interpret the bits back into chars.
And then at the end because we're dealing with pointers here,
we want to make sure that we don't have any memory leaks
and that we free everything.
Ensuring proper usage is old hat for us by now.
We take in an input, which is going to be the name of the file to puff,
and then we specify an output,
so the name of the file for the puffed output, which will be the text file.
That's usage. And now we want to ensure that the input is huffed or not.
Thinking back, was there anything in the distribution code that might help us
with understanding whether a file is huffed or not?
There was information in huffile.c about the Huffeader.
We know that every Huff file has a Huffeader associated with it with a magic number
as well as an array of the frequencies for each symbol
as well as a checksum.
We know that, but we also took a peek at dump.c,
in which it was reading into a Huff file.
And so to do that, it had to check whether it really was huffed or not.
So perhaps we could use dump.c as a structure for our puff.c.
Back to pset 4 when we had the file copy.c that copied in RGB triples
and we interpreted that for Whodunit and Resize,
similarly, what you could do is just run the command like cp dump.c puff.c
and use some of the code there.
However, it's not going to be as straightforward of a process
for translating your dump.c into puff.c,
but at least it gives you somewhere to start
on how to ensure that the input is actually huffed or not
as well as a few other things.
We have ensured proper usage and ensured that the input is huffed.
Every time that we've done that we have done our proper error checking,
so returning and quitting the function if some failure occurs, if there's a problem.
Now what we want to do is build the actual tree.
If we look in Forest, there are 2 main functions
that we're going to want to become very familiar with.
There's the Boolean function plant that plants a non-0 frequency tree inside our forest.
And so there you pass in a pointer to a forest and a pointer to a tree.
Quick question: How many forests will you have when you're building a Huffman tree?
Our forest is like our canvas, right?
So we're only going to have 1 forest, but we're going to have multiple trees.
So before you call plant, you're presumably going to want to make your forest.
There is a command for that if you look into forest.h on how you can make a forest.
You can plant a tree. We know how to do that.
And then you can also pick a tree from the forest,
removing a tree with the lowest weight and giving you the pointer to that.
Thinking back to when we were doing the examples ourselves,
when we were drawing it out, we simply just added the links.
But here instead of just adding the links,
think of it more as you're removing 2 of those nodes and then replacing it by another one.
To express that in terms of picking and planting,
you're picking 2 trees and then planting another tree
that has those 2 trees that you picked as children.
To build Huffman's tree, you can read in the symbols and frequencies in order
because the Huffeader gives that to you,
gives you an array of the frequencies.
So you can go ahead and just ignore anything with the 0 in it
because we don't want 256 leaves at the end of it.
We only want the number of leaves that are characters
that are actually used in the file.
You can read in those symbols, and each of those symbols that have non-0 frequencies,
those are going to be trees.
What you can do is every time you read in a non-0 frequency symbol,
you can plant that tree in the forest.
Once you plant the trees in the forest, you can join those trees as siblings,
so going back to planting and picking where you pick 2 and then plant 1,
where that 1 that you plant is the parent of the 2 children that you picked.
So then your end result is going to be a single tree in your forest.
That's how you build your tree.
There are several things that could go wrong here
because we're dealing with making new trees and dealing with pointers and things like that.
Before when we were dealing with pointers,
whenever we malloc'd we wanted to make sure that it didn't return us a NULL pointer value.
So at several steps within this process there are going to be several cases
where your program could fail.
What you want to do is you want to make sure that you handle those errors,
and in the spec it says to handle them gracefully,
so like print out a message to the user telling them why the program has to quit
and then promptly quit it.
To do this error handling, remember that you want to check it
every single time that there could be a failure.
Every single time that you're making a new pointer
you want to make sure that that's successful.
Before what we used to do is make a new pointer and malloc it,
and then we would check whether that pointer is NULL.
So there are going to be some instances where you can just do that,
but sometimes you're actually calling a function
and within that function, that's the one that's doing the mallocing.
In that case, if we look back to some of the functions within the code,
some of them are Boolean functions.
In the abstract case if we have a Boolean function called foo,
basically, we can assume that in addition to doing whatever foo does,
since it's a Boolean function, it returns true or false--
true if successful, false if not.
So we want to check whether the return value of foo is true or false.
If it's false, that means that we're going to want to print some kind of message
and then quit the program.
What we want to do is check the return value of foo.
If foo returns false, then we know that we encountered some kind of error
and we need to quit our program.
A way to do this is have a condition where the actual function itself is your condition.
Say foo takes in x.
We can have as a condition if (foo(x)).
Basically, that means if at the end of executing foo it returns true,
then we can do this because the function has to evaluate foo
in order to evaluate the whole condition.
So then that's how you can do something if the function returns true and is successful.
But when you're error checking, you only want to quit if your function returns false.
What you could do is just add an == false or just add a *** in front of it
and then you have if (!foo).
Within that body of that condition you would have all of the error handling,
so like, "Could not create this tree" and then return 1 or something like that.
What that does, though, is that even though foo returned false--
Say foo returns true.
Then you don't have to call foo again. That's a common misconception.
Because it was in your condition, it's already evaluated,
so you already have the result if you're using make tree or something like that
or plant or pick or something.
It already has that value. It's already executed.
So it's useful to use Boolean functions as the condition
because whether or not you actually execute the body of the loop,
it executes the function anyway.
Our second to last step is writing the message to the file.
Once we build the Huffman tree, then writing the message to the file is pretty straightforward.
It's pretty straightforward now to just follow the 0s and 1s.
And so by convention we know that in a Huffman tree the 0s indicate left
and the 1s indicate right.
So then if you read in bit by bit, every time that you get a 0
you'll follow the left branch, and then every time you read in a 1
you're going to follow the right branch.
And then you're going to continue until you hit a leaf
because the leaves are going to be at the end of the branches.
How can we tell whether we've hit a leaf or not?
We said it before.
[student] If the pointers are NULL. >>Yeah.
We can tell if we've hit a leaf if the pointers to both the left and right trees are NULL.
Perfect.
We know that we want to read in bit by bit into our Huff file.
As we saw before in dump.c, what they did is they read in bit by bit into the Huff file
and just printed out what those bits were.
We're not going to be doing that. We're going to be doing something that's a bit more complex.
But what we can do is we can take that bit of code that reads in to the bit.
Here we have the integer bit representing the current bit that we're on.
This takes care of iterating all of the bits in the file until you hit the end of the file.
Based on that, then you're going to want to have some kind of iterator
to traverse your tree.
And then based on whether the bit is 0 or 1,
you're going to want to either move that iterator to the left or move it to the right
all the way until you hit a leaf, so all the way until that node that you're on
doesn't point to any more nodes.
Why can we do this with a Huffman file but not Morse code?
Because in Morse code there's a bit of ambiguity.
We could be like, oh wait, we've hit a letter along the way, so maybe this is our letter,
whereas if we continued just a bit longer, then we would have hit another letter.
But that's not going to happen in Huffman encoding,
so we can rest assured that the only way that we're going to hit a character
is if that node's left and right children are NULL.
Finally, we want to free all of our memory.
We want to both close the Huff file that we've been dealing with
as well as remove all of the trees in our forest.
Based on your implementation, you're probably going to want to call remove forest
instead of actually going through all of the trees yourself.
But if you made any temporary trees, you'll want to free that.
You know your code best, so you know where you're allocating memory.
And so if you go in, start by even Control F'ing for malloc,
seeing whenever you malloc and making sure that you free all of that
but then just going through your code,
understanding where you might have allocated memory.
Usually you might just say, "At the end of a file I'm just going to remove forest on my forest,"
so basically clear that memory, free that,
"and then I'm also going to close the file and then my program is going to quit."
But is that the only time that your program quits?
No, because sometimes there might have been an error that happened.
Maybe we couldn't open a file or we couldn't make another tree
or some kind of error happened in the memory allocation process and so it returned NULL.
An error happened and then we returned and quit.
So then you want to make sure that any possible time that your program can quit,
you want to free all of your memory there.
It's not just going to be at the very end of the main function that you quit your code.
You want to look back to every instance that your code potentially might return prematurely
and then free whatever memory makes sense.
Say you had called make forest and that returned false.
Then you probably won't need to remove your forest
because you don't have a forest yet.
But at every point in the code where you might return prematurely
you want to make sure that you free any possible memory.
So when we're dealing with freeing memory and having potential leaks,
we want to not only use our judgment and our logic
but also use Valgrind to determine whether we've freed all of our memory properly or not.
You can either run Valgrind on Puff and then you have to also pass it
the right number of command-line arguments to Valgrind.
You can run that, but the output is a bit cryptic.
We've gotten a bit used to it with Speller, but we still need a bit more help,
so then running it with a few more flags like the leak-check=full,
that will probably give us some more helpful output on Valgrind.
Then another useful tip when you're debugging is the diff command.
You can access the staff's implementation of Huff, run that on a text file,
and then output it to a binary file, a binary Huff file, to be specific.
Then if you run your own puff on that binary file,
then ideally, your outputted text file is going to be identical
to the original one that you passed in.
Here I'm using hth.txt as the example, and that's the one talked about in your spec.
That's literally just HTH and then a newline.
But definitely feel free and you are definitely encouraged to use longer examples
for your text file.
You can even take a shot at maybe compressing and then decompressing
some of the files that you used in Speller like War and Peace
or Jane Austen or something like that--that would be kind of cool--or Austin Powers,
kind of dealing with larger files because we wouldn't come down to it
if we used the next tool here, ls -l.
We're used to ls, which basically lists all the contents in our current directory.
Passing in the flag -l actually displays the size of those files.
If you go through the pset spec, it actually walks you through creating the binary file,
of huffing it, and you see that for very small files
the space cost of compressing it and translating all of that information
of all the frequencies and things like that outweighs the actual benefit
of compressing the file in the first place.
But if you run it on some longer text files, then you might see that you start to get some benefit
in compressing those files.
And then finally, we have our old pal GDB, which is definitely going to come in handy too.
Do we have any questions on Huff trees or the process perhaps of making the trees
or any other questions on Huff'n Puff?
Okay. I'll stay around for a bit.
Thanks, everyone. This was Walkthrough 6. And good luck.
[CS50.TV]