11.2 - A dynamic programming algorithm - Sequence alignment - [dsa 2] - By Tim Roughgarden

So, now that we understand that the optimal solution to the sequence alignment problem has to be one of only three candidates, we can easily formulate a recurrence, identify the relevant subproblems, and derive an efficient dynamic programming algorithm for the problem. So, here is the culmination of our work. In the previous video, we thought about an optimal alignment of some pair of strings X and Y, and we noticed that there are three cases for the contents of the final position. Either there are no gaps, or there's a gap on top, or there's a gap on the bottom. In case one, where there are no gaps, little x sub m and little y sub n get matched, and we proved that the induced alignment of the smaller strings X prime and Y prime has to be optimal in its own right. In the second case, where the character little x sub m gets matched with a gap, the induced alignment, this time of X prime and Y, has to be optimal in its own right. And in the third case, where little y sub n gets matched with a gap, the induced alignment, now of X and Y prime, must be optimal. So one way to think about this kind of assertion is that the optimal solution to a sequence alignment problem depends only on the solutions to three smaller subproblems: one involving X prime and Y prime, with characters peeled off of both strings, one involving X prime and Y, and one involving X and Y prime. In all of the cases, all that we care about are subproblems in which a single character was peeled off from the right of one or both of the strings that we started with. The situation is very similar to our previous two examples, independent sets of line graphs and the knapsack problem. In the independent set problem, we only cared about subproblems obtained by plucking either one or two vertices off of the given line graph. So all we ever cared about were prefixes of the original line graph.
In the knapsack problem, we got subproblems by plucking off the last item and perhaps also reducing the knapsack capacity by some integral amount. So there were two dimensions in the knapsack problem along which subproblems could decrease in size: the number of items and the residual knapsack capacity. So we used two parameters to keep track of the subproblems, and what we cared about were all possible prefixes of the items and all possible residual integral capacities, at most the original knapsack capacity. So what's up in the sequence alignment problem? Well, here, subproblems get smaller by plucking a character off of the first string and/or the second string. So again there are two ways in which a subproblem can get smaller: through the first string or through the second string. So we'll again use two different parameters, one to keep track of how much we've plucked off of the first string, and a second to keep track of how much we've plucked off of the second string. But the only relevant subproblems involve prefixes of the two original input strings X and Y. That is, the only subproblems that we care about have the form X sub i, Y sub j, where X sub i denotes the first i letters of capital X, some prefix of X, and Y sub j denotes the first j letters of capital Y, some prefix of Y. So let's now move from the subproblems we're going to use in our dynamic programming algorithm to the recurrence that we're going to use. Really, all the recurrence does is compile our understanding of the optimal solution, and how it depends on the solutions to smaller subproblems, into an easy-to-use mathematical formula. So I'll use the notation P sub i j for the value of the optimal solution to the corresponding subproblem, the one involving the prefix X sub i and the prefix Y sub j. So for given positive values of i and j, what is P sub i j? Well, there are three possibilities.
Case one is where the final position of the optimal alignment doesn't have any gaps, so it matches the final character of capital X sub i, that is little x sub i, with the final character of the prefix capital Y sub j, that is little y sub j. It pays the penalty for matching those two characters together and then just reuses an optimal alignment of the smaller strings, X sub i minus 1 and Y sub j minus 1. Case two is where the last letter of the first string, that is little x sub i, gets matched with a gap. In that case the penalty of the corresponding alignment is the penalty of a gap plus the penalty of an optimal alignment of the first i minus 1 letters of capital X with the first j letters of capital Y. Symmetrically, in case three we pay for a gap and then we pay the penalty of an optimal alignment of all of the first i letters of capital X with the first j minus 1 letters of capital Y. This is the case where the last letter of the second string gets matched with a gap in the final position of the optimal alignment. So we know that the optimal solution has to look like one of these three things. We don't know which, so in the recurrence we'll just, in effect, do brute-force search among the three outcomes: we just choose the minimum of the three possibilities. So that's the formal recurrence. Its correctness really just follows immediately from the work we already did to understand what the optimal solution has to look like. So, before we state the algorithm, where we systematically solve all of the subproblems using this magical formula, let's just make sure we get the edge cases, the base cases where i or j equals zero, correctly sorted out. So specifically, what is the value of P sub i, 0, and also, it turns out, of P sub 0, i, where i here is just some non-negative integer? All right. So the answer to this question is the second one, B. And I hope that if you could keep the notation straight, then the answer was fairly clear. So let's remember, what does P sub i j mean?
That's the total penalty of an optimal alignment between the first i letters of X and the first j letters of Y. So consider P sub i, 0. Now we're asking about aligning the first i letters of X with the first zero letters of Y, that is, with the empty string. Well, the optimal way to match any string to the empty string is just to insert gaps into the empty string to equalize the lengths. So if your string has length i, you need to insert i gaps. The penalty of that alignment is just i times the penalty for a single gap, and that's the answer here in B. So we're ready now to give the algorithm, and as with all of these dynamic programming algorithms, once you know the subproblems and once you know the recurrence that relates their solutions, there's really nothing to do. All you do is systematically solve all of the subproblems, moving from smallest to largest. So we're going to use an array A to keep track of the solutions to all of these subproblems. A is going to have two dimensions. The reason for two dimensions is that we have two independent parameters keeping track of the subproblem size: one for how many letters of X we're dealing with, and a second for how many letters of Y we're dealing with. That's analogous to the knapsack problem, where we also had two dimensions to keep track of: the number of items in play and the residual knapsack capacity. We just figured out what the base cases are, so we just solve those in a preprocessing step: if one of the two indices is zero, then the optimal solution value is just the gap penalty times the other index. And now we just go to our double for loop. It's a double for loop because we have two indices into our array. And whenever we get to a subproblem, we just evaluate the recurrence, invoking the solutions to the already-computed smaller subproblems.
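The forward pass just described can be sketched in Python roughly as follows. This is a minimal sketch, not the lecture's own code: the function name `alignment_table` and the parameters `gap_penalty` and `mismatch_penalty` are hypothetical, and it assumes one simple penalty scheme (zero for matching two equal characters, a single fixed penalty for matching two different ones), whereas the problem in general allows a separate penalty for every pair of characters.

```python
def alignment_table(x, y, gap_penalty, mismatch_penalty):
    """Fill A so that A[i][j] is the minimum total penalty of aligning
    the first i letters of x with the first j letters of y.

    Assumed penalty scheme (an illustration only): equal characters
    cost 0, different characters cost mismatch_penalty.
    """
    m, n = len(x), len(y)
    A = [[0] * (n + 1) for _ in range(m + 1)]

    # Base cases: aligning a prefix with the empty string takes only gaps.
    for i in range(m + 1):
        A[i][0] = i * gap_penalty
    for j in range(n + 1):
        A[0][j] = j * gap_penalty

    # Double for loop: every entry on the right-hand side has a smaller
    # i or a smaller j, so it was filled in by an earlier iteration.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair_cost = 0 if x[i - 1] == y[j - 1] else mismatch_penalty
            A[i][j] = min(
                A[i - 1][j - 1] + pair_cost,  # case 1: match x_i with y_j
                A[i - 1][j] + gap_penalty,    # case 2: match x_i with a gap
                A[i][j - 1] + gap_penalty,    # case 3: match y_j with a gap
            )
    return A
```

Under these assumptions, `alignment_table(x, y, gap_penalty, mismatch_penalty)[len(x)][len(y)]` is the optimal total penalty for the two full strings.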
So one sanity check you should always apply when you're writing out the code for a dynamic programming algorithm: when you look at the right-hand side of your recurrence, at these purportedly already-solved subproblems whose solutions you're using to solve the current subproblem, make sure you have actually already solved those subproblems. In this case we're good to go, because each of those subproblems has indices no larger than those of the entry we're filling in right now, with at least one strictly smaller. So indeed all three of the relevant subproblems, A[i-1][j-1], A[i-1][j], and A[i][j-1], have already been computed in earlier iterations of our double for loop. So they're just hanging out, waiting to get looked up in constant time. And as usual, once you've actually figured out the key ingredients of the dynamic programming solution, namely the subproblems and the recurrence, it's pretty much self-evident why the thing is going to work, and it's also self-evident exactly what its running time is going to be. So why is the algorithm correct? That is, why does it terminate with every entry A[i][j] equal to the true optimal penalty P sub i j of the corresponding subproblem? Well, this just follows because our recurrence is correct, that's where all the hard work was, and then we're just systematically solving all of the subproblems. So, formally, if you like, it would be a proof by induction. And the running time is completely trivial to evaluate. In each iteration of this double for loop, we do a constant amount of work: we just need to look up three things in constant time and make a couple of comparisons. How many iterations are there? Well, m iterations of the outer for loop, n iterations of the inner for loop. So we suffer the product, m times n. That is, the running time is O(mn), proportional to the product of the lengths of the two strings.
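For reference, the recurrence and base cases that this correctness argument relies on can be written out as below. The symbols are notational assumptions rather than something stated in this video: \alpha_{\mathrm{gap}} stands for the penalty of a single gap, and \alpha_{x_i y_j} for the penalty of matching the characters x_i and y_j.

```latex
\[
P_{ij} = \min
\begin{cases}
\alpha_{x_i y_j} + P_{i-1,\,j-1} & \text{(case 1: } x_i \text{ matched with } y_j\text{)} \\
\alpha_{\mathrm{gap}} + P_{i-1,\,j} & \text{(case 2: } x_i \text{ matched with a gap)} \\
\alpha_{\mathrm{gap}} + P_{i,\,j-1} & \text{(case 3: } y_j \text{ matched with a gap)}
\end{cases}
\qquad \text{for } 1 \le i \le m,\ 1 \le j \le n,
\]
\[
P_{i,0} = i \cdot \alpha_{\mathrm{gap}}, \qquad P_{0,j} = j \cdot \alpha_{\mathrm{gap}}.
\]
```

The double for loop evaluates the first formula once for each of the mn pairs (i, j), doing constant work each time, and the m + n + 1 base cases cost constant work each, which gives the O(mn) total.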
So depending on the application, you may be content to just have an algorithm compute for you the NW score, the total penalty of an optimal alignment, or perhaps you're actually interested in the alignment itself. And just as we discussed with independent sets of line graphs, by tracing back through the filled-in table, you can indeed reconstruct an optimal solution. So let me just give you the high-level idea of how it works. It's going to follow the same template, and I'll let you think through the details of why this really works in the privacy of your own home. So assume that you've already run the algorithm on the previous slide, that you've filled in all the entries of this 2-D array capital A. Now we want to trace back. So where are we going to start tracing back through this filled-in table? Well, we're going to start with the problem that we actually care about, namely the largest problem, A[m][n]. That's what we want the alignment for. Now, we know the optimal alignment is one of three candidates: we know there are three possible situations for the contents of the final position of that alignment. Moreover, when we filled in this entry of the table, we explicitly compared the three possibilities to figure out which one was the best. So perhaps on the forward pass we actually cached the result of the comparison, or in the worst case we can just go back and recompute it, and figure out which of those three cases was used to fill in this entry. And depending on which one of the three candidates won, that tells us what the contents of the final position of the optimal alignment should be. If case one was used to fill in this entry, we should match little x sub m and little y sub n. If case two was used to fill in this entry, we should match little x sub m with a gap. If case three was used to fill in this entry, we should match little y sub n with a gap. If there was a tie, we get to pick any of them.
Any arbitrary choice works, since all of them will lead to optimal alignments. Then of course, after figuring out what to do in this final position, we have an induced subproblem involving X prime and/or Y prime. That tells us a previous entry of the table to go to, and we just repeat the process. We again figure out which of the three cases was used to fill in this entry. That tells us how to fill in the next rightmost position of the alignment. And we just keep going until we fall off the table. So what do you do when you fall off the table? Well, once one of the indices i or j gets all the way down to zero, you have no choice. Now one of the strings is empty and the other one has some number of symbols, so you should just insert the appropriate number of gaps to equalize the lengths. One thing that's pretty neat is that this traceback procedure is efficient. In fact, it's way more efficient, in general, than the forward pass. For the forward pass, you have to fill in every single one of the m times n entries. But in this traceback procedure, each time you trace back one step, at least one of the two indices gets decremented. So that says you're going to complete this traceback in O(m + n) time, ending up with an optimal alignment of the original two strings.
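The traceback can be sketched in Python along the following lines. Again this is only a sketch under stated assumptions, not the lecture's code: the name `align` and its parameters are hypothetical, the penalties are assumed to be integers (so the equality tests are exact), equal characters are assumed to cost zero and different characters a single `mismatch_penalty`, and `-` is assumed not to occur in either input, since it is used to display gaps. Instead of caching the winning case on the forward pass, this version recomputes it at each step, and it breaks ties in favor of case one, then case two.

```python
def align(x, y, gap_penalty, mismatch_penalty):
    """Return (penalty, top, bottom): the optimal total penalty and one
    optimal alignment of x and y, with '-' marking gaps."""
    m, n = len(x), len(y)

    def pair_cost(a, b):
        # Assumed scheme: 0 for equal characters, else a fixed penalty.
        return 0 if a == b else mismatch_penalty

    # Forward pass: fill the whole (m + 1) x (n + 1) table.
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        A[i][0] = i * gap_penalty
    for j in range(n + 1):
        A[0][j] = j * gap_penalty
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            A[i][j] = min(
                A[i - 1][j - 1] + pair_cost(x[i - 1], y[j - 1]),
                A[i - 1][j] + gap_penalty,
                A[i][j - 1] + gap_penalty,
            )

    # Traceback: start at A[m][n] and build the alignment right to left.
    top, bottom = [], []
    i, j = m, n
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j - 1] + pair_cost(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1])  # case 1
            i -= 1; j -= 1
        elif A[i][j] == A[i - 1][j] + gap_penalty:
            top.append(x[i - 1]); bottom.append("-")       # case 2
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])       # case 3
            j -= 1

    # Fell off the table: one prefix is empty, so pad with gaps.
    while i > 0:
        top.append(x[i - 1]); bottom.append("-"); i -= 1
    while j > 0:
        top.append("-"); bottom.append(y[j - 1]); j -= 1

    return A[m][n], "".join(reversed(top)), "".join(reversed(bottom))
```

For example, with a gap penalty of 1 and a mismatch penalty of 2, `align("ABC", "AC", 1, 2)` returns `(1, "ABC", "A-C")`. Every pass through any of the three while loops decrements i or j, so the traceback takes at most m + n steps.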