5.1 - Algorithm - Kruskal's minimum spanning tree algorithm - [DSA 2] by Tim Roughgarden

So, in these next few videos, we're going to continue our discussion of the minimum cost spanning tree problem. And I'm going to focus on a second good greedy algorithm for the problem, namely Kruskal's algorithm. Now, we already have an excellent algorithm for computing the minimum cost spanning tree in Prim's algorithm. So you might wonder, why bother spending the time learning a second one? Well, let me give you three reasons. The first reason is that it's just a cool algorithm. It's definitely a candidate for the greatest hits. It's something I think you should know. It's competitive with Prim's algorithm both in theory and in practice. So it's another great greedy solution for the minimum cost spanning tree problem. The second reason is that it'll give us an opportunity to learn a new data structure, one we haven't discussed yet in this course. It's called the union-find data structure. So, in a similar way to how we used heaps to get a super fast implementation of Prim's algorithm, we'll use a union-find data structure to get a super fast implementation of Kruskal's algorithm. So that'll be a fun topic just in its own right. The third reason is that there are some very cool connections between Kruskal's algorithm and certain types of clustering algorithms. So that's a nice application that we'll spend some time talking about. I'll discuss how natural greedy algorithms in a clustering context are best understood as variants of Kruskal's minimum spanning tree algorithm.
So let me just briefly review some of the things I expect you to remember about the minimum cost spanning tree problem. The input, of course, is an undirected graph G, and each edge has a cost. And what we're trying to output, the responsibility of the algorithm, is a spanning tree. That is, a subgraph which has no cycles and is connected.
Connected means there's a path between each pair of vertices. And amongst all the potentially exponentially many spanning trees, the algorithm is supposed to output the one with the smallest cost, that is, the smallest sum of edge costs.
So let me reiterate the standing assumptions that we've had throughout the minimum spanning tree lectures. First of all, we're assuming that the graph is connected. That's obviously necessary for it to have any spanning trees. That said, Kruskal's algorithm actually extends in a really easy, elegant way to the case where G is disconnected. But I'm not going to talk about that here. Secondly, remember, we're going to assume that all of the edge costs are distinct, that is, there are no ties. Now don't worry, Kruskal's algorithm is just as correct if there are ties amongst the edge costs. I'm just not going to give you a proof that covers that case, but don't worry, a proof does indeed exist. Finally, of the various machinery that we developed to prove the correctness of Prim's algorithm, perhaps the most important and most subtle point was what's called the cut property. This is a condition which guarantees you're not making a mistake in a greedy algorithm. It guarantees that a given edge is indeed in the minimum spanning tree. And remember, the cut property says: suppose you have an edge e of a graph and you can find just a single cut for which this edge is the cheapest one crossing it. Okay? So the edge e crosses this cut and every other edge that crosses it is more expensive. That certifies the presence of this edge in the minimum spanning tree; it guarantees that it's a safe edge to include. So we'll definitely be using that again when we prove that Kruskal's algorithm is correct.
So as with Prim's algorithm, before I hit you with the pseudocode, let me just show you how the algorithm works in an example. I think you'll find it very natural. So let's look at the following graph with five vertices.
So the graph has seven edges and I've annotated them with their edge costs in blue. So here's the big difference in philosophy between Prim's algorithm and Kruskal's algorithm. In Prim's algorithm, you insisted on growing like a mold from a starting point, always maintaining connectivity, and spanning one new vertex in each iteration. Kruskal's algorithm is going to just throw out the desire to have a connected subgraph at each step of the iteration. Kruskal's algorithm will be totally content to grow a tree in parallel with lots of simultaneous little pieces, only having them coalesce at the very end of the algorithm. So whereas in Prim's algorithm we were only allowed to pick the cheapest edge subject to this constraint of spanning some new vertex, in Kruskal's algorithm we're just going to pick the cheapest edge that we haven't looked at yet. Now, there is an issue, of course: we want to construct a spanning tree at the end. So we certainly don't want to create any cycles, so we'll skip over edges that would create cycles. But other than that constraint, we'll just look at the cheapest edge next in Kruskal's algorithm and pick it if it creates no cycle.
So let's look at this five-vertex example. Again, there is no starting point. We're just going to look at the cheapest edge overall. That's obviously this unit cost edge, and we're going to include that in our tree. Right? Why not? Why not pick the cheapest edge? It's a greedy algorithm. So what do we do next? Well, now we have this edge of cost 2, that looks good, so let's go ahead and pick that one. Cool. Notice these two edges are totally disjoint. Okay? So we are not maintaining connectivity of our subgraph at each iteration of Kruskal's algorithm. Now, it just so happens that when we look at the next edge, the edge of cost 3, we will fuse together the two disjoint pieces that we had previously. Now we happen to have one connected piece. Now, here's where it gets interesting.
When we look at the next edge, the edge of cost 4, we notice that we're not allowed to pick the edge of cost 4. Why? Well, that would create a triangle with the edges of costs 2 and 3, and that of course is a no-no. We want a spanning tree at the end of the day, so we can't have any cycles. So we skip over the 4 because we have no choice, we can't pick it. We move on to the 5, and the 5 is fine. When we pick the edge of cost 5, there are no cycles, so we go ahead and include it. And now we have a spanning tree and we stop. Or, if you prefer, you could think of it as though we do consider the edge of cost 6. That would create a triangle with the edges of costs 3 and 5, so we skip the 6. And then, for completeness, we think about considering the 7, but that would form a triangle with the edges of costs 1 and 5, so we skip that. So after this single scan through the edges in sorted order, we find ourselves with these four pink edges. In this case, it's a spanning tree and, as we'll see, not just in this graph but in every graph, it's actually the minimum cost spanning tree.
So, with the intuition hopefully solidly in place, I don't think the following pseudocode will surprise you. We want to get away with a single scan through the edges in sorted order. So, obviously, in the preprocessing step, we want to take the unsorted array of edges and sort them by edge cost. To keep the notation and the pseudocode simple, let me just, for the purposes of the algorithm description only, rename the edges 1, 2, 3, 4, all the way up to m, conforming to this sorted order, right? So the algorithm's just going to scan through the edges in this newly found sorted order. We're going to call the tree in progress capital T, like we did in Prim's algorithm. And now, we're just going to zip through the edges once in sorted order. And we take an edge, unless it's obviously a bad idea.
And here a bad idea means it creates a cycle. That's a no-no, but as long as there's no cycle, we go ahead and include the edge. And that's it. After you finish up the for loop, you just return the tree capital T. It's easy to imagine various optimizations that you could do. So for example, once you've added enough edges to have a spanning tree, so n - 1 edges, where n is the number of vertices, you can go ahead and abort this for loop and terminate the algorithm. But let's just keep things simple and analyze this three-line version of Kruskal's algorithm in this lecture.
So just like in our discussion of Prim's algorithm, what I want to do is first just focus on why Kruskal's algorithm is correct. Why does it output a spanning tree at all? And then, the spanning tree that it outputs, why on earth should it have the minimum cost amongst all spanning trees? Once we're convinced of the correctness, we'll move on to a naive implementation and its running time, and then finally, a fast implementation using suitable data structures.
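To make the three-line version concrete, here is a minimal Python sketch of it, not the fast union-find implementation. A few things in it are assumptions rather than anything stated in the lecture: the names kruskal_naive and has_path are made up for illustration, the cycle test is done with a plain graph search over the tree built so far (one simple way to do it), and the vertex labels a through e are just one labeling consistent with the walk-through of the five-vertex example, since the figure itself isn't reproduced here.

```python
from collections import defaultdict


def has_path(adj, source, target):
    """Return True if target is reachable from source via edges in adj."""
    stack, seen = [source], {source}
    while stack:
        u = stack.pop()
        if u == target:
            return True
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False


def kruskal_naive(edges):
    """edges: iterable of (cost, u, v) tuples of a connected undirected graph.

    Assumes distinct edge costs, as in the lecture.
    """
    tree = []                 # the tree in progress, T
    adj = defaultdict(list)   # adjacency lists of T only, used for the cycle test
    for cost, u, v in sorted(edges):   # preprocessing: sort by edge cost
        # Adding (u, v) creates a cycle exactly when T already connects u and v.
        if has_path(adj, u, v):
            continue
        tree.append((cost, u, v))
        adj[u].append(v)
        adj[v].append(u)
    return tree


# Hypothetical labeling of the example graph: edges 1 and 2 are disjoint,
# 3 joins them, 4 closes a triangle with 2 and 3, 6 with 3 and 5, 7 with 1 and 5.
example_edges = [
    (1, "a", "b"), (2, "c", "d"), (3, "a", "c"), (4, "a", "d"),
    (5, "a", "e"), (6, "c", "e"), (7, "b", "e"),
]
mst = kruskal_naive(example_edges)
print([cost for cost, _, _ in mst])   # [1, 2, 3, 5]
```

On this labeling the scan picks the edges of costs 1, 2, 3, and 5 and skips 4, 6, and 7, matching the walk-through. The sketch leaves out the early exit after n - 1 edges to stay close to the simple version.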