Tip:
Highlight text to annotate it
X
So, now that we understand that the optimal solution to the sequence element
problem has to be only one of three candidates, we're going to be easily able
to formulate a recurrence, identify the relevant sub-problems and derive an
efficient dynamic programming algorithm for the problem.
So, here is the culmination of our work. On the previous video, we thought about
an optimal alignment of some pair of strings X and Y, and we notice that there
are three cases for the contents of the final position.
Either there's no gaps or there's a gap on top or there's a gap on the bottom.
in case one, where there's no gaps, XM and YM get matched.
And we proved that the induced alignment which is of the smaller strings X prime
and Y prime has to be optimal in its own right.
In the second case where the character little X sub M gets matched with a gap
induced alignment this time of X prime and Y.
Has to be optimal in its own right, and the third case where little y sub n gets
matched with the gap, the induced alignment now of x and y prime.
Must be optimal. So one way to think about this kind of
assertion is it says that the optimal solution to a problem, to a sequence of a
lot of problem depends only on the solutions to three different smaller
sub-problems, one involving x, x prime and y prime with characters peeled off of
both of the strings. One involving x prime and y and one
involving x and y prime. But in all of the cases, all that we care
about are sub-problems in which a single character was peeled off from the right
from one or both of the strings that we started with.
The situation is very similar to in our previous two examples.
We have independent sets on line graphs and the nap sack problem and the
independent set problem whenever we, we only cared about sub problems obtained by
plucking off either one or two vertices from the given line graph.
So all we ever cared about were prefixes of the original line graph.
In the nap sack problem we got sub problems by plucking off the last item
and perhaps also reducing the nap sack capacity by some interval amount.
So there were two dimensions in the nap sack problem for which sub problems could
decrease in size, then number of items in the residual nap sack capacity.
So we use two parameters to keep track of the sub problems.
And what we cared about were all possible prefixes of the items and all possible
residual integral capacities, at most the original knapsack capacity.
So what's up in the sequence alignment problem?
Well here, sub problems get smaller by plucking a character off of the first
string and or the second string. So again there are two ways in which the
sub problem can get smaller, either the first string or the second string.
So we'll again use two different parameters, one to figure out how much
we've plucked off of the first string, the second one to figure out how much
we've plucked off of the second string. Right.
But all we care about. The only relevant sub problems involved.
Prefixes of the two original input strings X and Y.
That is, the only sub problems that we care about have the form x I y j, where x
I denotes the first I letters of capital x, some prefix of x, and y j denotes some
prefix of y, the first j letters of y. So lets now move from the sub-problems
we're going to use in our dynamic programming algorithm to the recurrence
that we're going to use. And the recurrence really all it does is
compile our understanding of the optimal solution and how it depends on the
solution of the smaller sub-problems into an easy to use mathematical formula.
So I'll use the notation P sub i j for the value of the optimal solution to the
corresponding sub problem, the one involving the prefix X i and the prefix Y
j So for a given set of positive values for i and j, what is Pij?
Well, there are three possibilities. Case one is where the final position of
the optimal alignment doesn't have any gaps, so it matches the final character
of X sub i, that is little x sub i and the final character of the prefix capital
Y sub j, that is the character little y sub j.
It matches them together and just reuses an optimal alignment for the smaller
strings, Xi-1 and Yi-1 Case two is where the last letter of the first string, that
is little x of i gets matched with a gap. And that case the penalty of the
corresponding alignment is the penalty of a gap plus whatever the optimal alignment
is of the first i minus one letter of capital X plus the first J letter of
Capital Y. The symmetrically case three we pay for a
gap and then we pay whatever the optimal alignment is of all of the first I
letters of capital X with the first j menace one letters of Y.
This is the case where the last letter of the second string gets matched with the
gap in the final position of the optimal alignment.
So we know that the optimal solution has to look like one of these three things,
we don't know which, so in the recurrence we'll just in effect do brute force
search among the three outcomes. We just remember, we just choose the
minimum of the three possibilities. So that's the formal recurrence.
It's correctness really just follows immediately from the work we already did,
understand what the optimal solution has to look like.
So, before we state the algorithm where we systematically solve all of the sub
problems using this magical formula. Let's just make sure we get the edge
cases, the base cases where i or j equals zero correctly sorted out.
So specifically what is the value of p I,0 and also it turns out p of zero, i
where i here is just some non-negative integer.
Alright. So the answer to this question is the
second one, is B. And I hope if you could keep the notation
straight then the answer was fairly clear.
So let's remember what, what does PIJ mean?
That's the total penalty of an optimal alignment between the first i letters of
X, and the first j letters of Y. So consider Pi zero.
So now we're asking about aligning the first zero letters of X with the first
zero letters of Y. That is with the empty string.
Well the optimal way to match any string to the empty string is you're just going
to insert gaps into the empty string to equalize their lengths.
And so if your string has length i, you need to insert i gaps.
What's the? Penalty of that alignment is just i times
the penalty for a single gap, and that's the answer here in B.
So we're ready now to give the algorithm, and as with all these dynamic programming
algorithms once you know the sub-problems and once you know the recurrence that
relates their solutions there's really nothing to do.
All you do is systematically answer solve all of the sub-problems moving from
smallest to largest. So we're going to use an array A to keep
track of the solutions of all of these sub-problems.
A is going to have two dimensions. The reason for two dimensions is we have
two independent parameters which are keeping track of the sub-problem size.
One for how many letters of X we're dealing with, and the second dimension
for how many letters of Y that we're dealing with.
That's analogous to the knapsack problem, where we also had two dimensions to keep
track of. The number of items in play, and the
residual knapsack capacity. We just figured out what the base case
is, so we just solved those in a pre-processing step.
So if one of the two indices is zero, then the optimal solution value is just
the gap penalty times the non-zero index. And, now we just go to our double four
loops. It's double four loops because we have
two indices into out array. And whenever we get into a sub problem,
we just evaluate the recurrence invoking of these solution to the already computed
smaller sub problems. So one sanity check you should always
apply when you're writing out the code for a dynamic programming algorithm: when
you look at the right hand side of your recurrence, when you look at these
purportedly already solved subproblems whose solutions you're using to solve the
current subproblem, make sure you have actually already solved those
subproblems. So in this case we're good to go because
the indices of the subproblems are only less than the entry that we're filling in
right now. So indeed all three of the relevant
subproblems, A-I - 1 j - 1 A-I - 1 j, and A-I j - 1 they've already been computed
in earlier iterations of our double four loop.
So they're just hanging out, waiting to get looked up in constant time.
And as usual once you've actually figured out the key ingredients for the dynamic
programming solution, namely the sub-problems and the recurrence, it's
pretty much self evident why the things going to work and it's also self evident
exactly what its running time is going to be.
So why is the algorithm correct? That is, why does it terminate with every
entry Aij equal to the true optimal penalty Pij of the corresponding
sub-problem. Well, this just follows because our
recurrent is correct, that's where all the hard work was, and then we're just
systematically solving all of the sub-problems.
So, formally, if you like, it would be proof by induction.
So, the running time is completely trivial to evaluate.
In each iteration of this double four loop, we do a constant amount of work.
We just need to look up three things in constant time and make a couple of
comparisons. How many four loops are there?
Well, M iterations of the outer four loop, N iterations of the inner four
loop. So we suffer the product, M times N.
That is, the running time is proportional to the product of the lengths of the two
strings. So depending on the application, you may
be content to just have an algorithm compute for you the nw score, the total
penalty of an optimal alignment or perhaps you're actually interested in the
alignment itself. And just as we discussed with independent
sets of line graphs, by tracing back through the filled in table, you can
indeed reconstruct an optimal solution. So let me just give you the high level
idea of how it works. It's going to to follow the same template
and all you think through the details of why this really works in the privacy of
your own home. So assume that you've already run the
algorithm on the previous slide. That you've filled in all the entries of
this two d array capital A. Now we want to trace back.
So where are we going to start tracing back this filled in table?
Well, we are going to start with a problem that we actually care about,
namely the largest problem A of M comma N, that's what we want the alignment for.
Now we know the softball alignment looks, has one of three candidates, we know
there's three possible situations for the contents of the final position of that
alignment. More over, when we filled in this entry
of the table, we explicitly compared the three possibilities to figure out which
one was the best. So you know, perhaps on the forward pass
we actually cached the result of the comparison, or in the worst case we can
just go back and re-compute, and figure out which of those three.
Pieces was used to fill in this entry, and depending on which one of the three
candidates won, that tells us, what should be, the contents of the final
position of the optimal alignment. If case one was used to fill in this
entry, we should match, little x sub n and little y sub n.
If case two was used to fill in this entry, we should match little x sub n
with the gap. If case three was used, to fill in this
entry, we should match little y sub n with the gap.
If there was a tie, we get to pick any of them.
Arbitrarily, all of them will lead to optimal alignments.
Then of course, after figuring out what to do in this final position.
We have an induced sub problem involving x prime and or y prime.
That tells us a, a previous entry of the table to go to.
And we just repeat the process. We, again, figure out which of the three
cases was used to fill in this entry. That tells us how to fill in the next
right most position of the alignment. And we just keep going until we fall off
the table. So what do you do when you fall off the
table? Well, once one of the indices I or J gets
all the way down to zero, now you have no choice.
So now one of the strings is empty and the other one has some number of symbol.
So you should just insert the appropriate number of gaps to equalize the lengths.
One thing that's pretty neat is that this trace back procedure is efficient.
In fact, it's way more efficient, in general, than the forward pass.
For the forward pass, you have to fill in every single one of the m * n entries.
But in this trace back procedure, each time you track back.
One of the two indices, at least, will get decremented.
So that says, you're going to complete this trace back in O of m + n time with
an optimal alignment of the original two strings.