Tip:
Highlight text to annotate it
X
So, in these next few videos, we're going to continue our discussion of the
minimum cost spanning tree problem. And I'm going to focus on a second good
algorithm, good greedy algorithm for the problem,
namely Kruskal's algorithm. Now, we already have an excellent
algorithm for computing the minimum cost spanning tree in Prim's algorithm.
So you might wonder, you know, why bother spending the time learning a
second one? Well, let me give you three reasons.
So the first reason is this is just a, it's a cool algorithm.
It's definitely a candidate for the greatest hits.
It's something I think you should know. It's competitive with Prim's algorithm
both in theory and in practice. So it's another great greedy solution for
the minimum cost spanning problem. The second reason is it'll give us an
opportunity to learn a new data structure,
one we haven't discussed yet in this course.
It's called the union-find data structure.
So, in exactly the same way or in a similar way to how we used heaps to get a
super fast implementation of Prim's algorithm.
We use a unionfind data structure to get a super fast implementation of Kruskal's
algorithm. So that'll be a fun topic just in its own
right. The third reason is, is there's some very
cool connections between Kruskal's algorithm and certain types of clustering
algorithms. So that's a nice application that we'll
spend some time talking about. I'll discuss how natural greedy
algorithms in a clustering context are best understood as a variance of
Kruskal's minimum spanning tree algorithm.
So let me just briefly review some of the things I expect you to remember about the
minimum cost spanning tree problem. So the input of course is an undirected
graph, G and each edge has a cost. And what we're trying to output, the
responsibility of the algorithm is a spanning tree.
That is a subgraph which has no cycles and is connected.
There's a path between each pairs of vertices and amongst all potentially
exponential many spanning trees, the algorithm is supposed to output the one
with the smallest cost, smallest sum of edge costs.
So let me reiterate the, the standing assumptions that we've got through the
minimum spanning tree lectures. So first of all, in assuming if the graph
is connected, that's obviously necessary for it to have
any spanning trees. that said, Kruskal's algorithm actually
extends in a really, just easy, elegant way to the case where G is disconnected.
But I'm not going to talk about that here.
secondly, remember, we're going to assume that all of the edge costs are distincts,
that is there are no ties. Now don't worry, Kruskal's algorithm is
just as correct. If there are ties amongst the edge costs.
I'm just not going to give you a proof that covers that case, but don't worry, a
proof does, indeed exist. Finally, of the various machinery that we
developed to prove the correctness of Prim's algorithm perhaps the most
important and most subtle point was what's called the cut property.
So this is a condition which guarantees you're not making a mistake in a greedy
algorithm. It guarantees that a given edge is indeed
in the minimum spanning tree. And remember, the cut property says,
if you have an edge of a graph and you could find just a single cut for which
this edge is the cheapest one crossing it.
Okay? So, the edge E crosses is cut and every
other edge that crosses it is more expensive.
That certifies the presence of this edge in the minimum spanning tree,
guarantees that it's a safe edge to include.
So we'll definitely be using that again in Kruskal's algorithm when we prove it's
correct. So as with Prim's algorithm, before I hit
you with the pseudocode, let me just show you how the algorithm works in an
example. I think you'll find it very natural.
So let's look at the following graph with five vertices.
So the graph has seven edges and I've annotated them with their edge costs in
blue. So here's the big difference in
philosophy between Prim's algorithm and Kruskal's algorithm.
In Prim's algorithm, you insisted on growing like a mold from a starting
point, always maintaining connectivity, and spanning one new vertex in each
iteration. Kruskal's going to just throw out the
desire to have a connected subgraph at each step of the iteration.
Kruskal's algorithm will be totally content to grow a tree in parallel with
lots of simultaneous little pieces, only having them coalesce at the very end of
the algorithm. So in Prim's algorithm, while we were
only allowed to pick the cheapest edge subject to this constraint of spanning
some new vertex. In Kruskal's algorithm we're just going to pick the cheapest
edge that we haven't looked at yet. Now, there is an issue, of course,
we want to construct a spanning tree at the end.
So, we certainly don't want to create any cycles,
so we'll skip over edges that will create cycles.
But other than that constraint, we'll just look at the cheapest edge next in
Kruskal's algorithm and pick it if there is no cycles.
So let's look at this five vertex example.
Again, there is no starting point. We're just going to look at the cheapest
edge overall. So that's obviously this unit cost edge
and we're going to include that in our tree.
Right? Why not? Why not pick the cheapest edge? It's a
greedy algorithm. So what do we do next?
Well, now we have this edge of cost two, that looks good, so let's go ahead and
pick that one. Cool.
Notice these two edges are totally disjoint.
Kay.' So we are not maintaining a connectivity
of our subgraph at each iteration of Kruskal's algorithm.
Now, it just so happens that when we look at the next edge, the edge of cost 3, we
will fuse together the two disjoint pieces that we had previously. Now, we
happen to have one connected piece. Now, here's where it gets interesting.
When we look at the next edge, the edge of cost 4, we notice that we're not
allowed to pick the edge of cost 4. Why?
Well, that would create a triangle with the edges of costs 2 and 3,
and that of course is a no-no. We want to span the tree at the end of
the day, so we can't have any cycles. So we skip over the 4 because we have no
choice, we can't pick it, we move on to the 5 and the 5 is fine.
So when we pick the edge of cost 5, there's no cycles,
we go ahead and include it. And now we have a spanning tree and we
stop or if you prefer, you could think of it that we do, we do consider the edge of
cost 6. That would create a triangle with the
edges of costs 3 and 5, so we skip the 6. And then, for completeness, we think
about considering the 7, but that would form a triangle with the
edges of costs 1 and 5, so we skip that. So after this single scan through the
edges in assorted order, we find ourselves with these four pink edges.
In this case, it's a spanning tree and as we'll see, not just in this graph but in
every graph it's actually the minimum cost spanning tree.
So, with the intuition hopefully solidly in place, I don't think the following
pseudocode will surprise you. We want to get away with a single scan
through the edges in short order. So, obviously in the preprocessing step,
we want to take the unsorted array of edges and sort them by edge cost.
To keep the notation and the pseudocode simple, let me just, for the purposes of
the algorithm, description only, rename the edges 1, 2, 3, 4, all the way up to m
conforming to this sorted order, right? So, the algorithm's just going to scan
through the edges in this newly found sorted order.
So we're going to call the tree in progress capital T, like we did in Prim's
algorithm. And now, we're just going to zip through
the edges once in sorted order. And we take an edge, unless it's
obviously a bad idea. And here a bad idea means it creates a
cycle, that's a no-no, but as long as there's no cycle, we go
ahead and include the edge. And that's it, after you finish up the
for loop you just return the tree capital T.
It's easy to imagine various optimizations that you could do.
So for example, once you've added enough edges to have a spanning tree. So n - 1
edges, where n is the number of vertices, you can get ahead, go ahead and abort
this for loop and terminate the algorithm.
But let's just keep things simple and analyze this three line version of
Kruskal's algorithm in this lecture. So just like in our discussion of Primm's
algorithm, what I want to do is first just focus on why is Kruskal's algorithm
correct? Why does it output a spanning tree at all? And then, the spanning tree
that it outputs? Why on earth should it have the minimum cost amongst all
spanning trees? That's when we're, once we're convinced
of the correctness, we'll move on to a naive running time implementation and
then finally, a fast implementation using suitable data structures.