Tip:
Highlight text to annotate it
X
This is the lecture on adaptive learning rate. We have already discussed various types of
feed-forward networks and the training that we have used is based on gradient descent
technique, where the learning rate eta is a fixed quantity. In this class, we will talk
about what is the purpose of making this learning rate adaptive and how we can achieve it.
The subjects to be covered in this class are motivation for adaptive learning rate, Lyapunov
stability theory. We have already discussed about stability theory. This is just to familiarize
you in this class so that you can appreciate how to derive this adaptive learning rate
using Lyapunov function. Then, some simulations and discussions.
This is a feed-forward network. This feed-forward network can be a multilayer network or RBFN
network. There are various architectures of feed-forward networks. It can be a multilayer
network or it can be a radial basis function network. Once you have this kind of structure,
the goal is to identify this W, which is the weight vector. What we have done is that we
have taken all these weights in the first layer, second layer and as many layers we
have in the network, and put them in a vector format, which is W. That means the total number
of weights in this network is capital N.
The training data consists of, say, N patterns x p and y p. x p is the vector here, that
is, the input vector and y p is the output vector. It is not necessary that we have only
a single output or two inputs. This is simply a schematic diagram. The network can have
as many inputs as and as many outputs and of course, as many hidden units. The update
law for such a multilayer network, radial basis function network has this structure,
where W t plus 1 is W t minus eta del E upon del W, where eta is the learning rate. This
is a gradient descent, del E by del W is the gradient descent, that is, E is the error
cost function and that is minimized by this update rule. Let us look at the motivation
for adaptive learning rate.
In general, if I am looking at a function f xů. Let us say y is equal to f x. This
is my f x and this is x. When f is nonlinear, then the usual curve can look like this. This
is just a representative, this need not be exact. In principle, what this curve says
is that this curve has many ů. This is one minimum, this is another minimum and this
is another minimum. In this curve, there are three minimums out of which two are local
minimum and this is global minimum, but there can be another function f x for which there
can be so many local minimum and many global minimum. So, many local minima and many global
minima.
These are the normal characteristics of a nonlinear function f x if I plot with respect
to x. The objective is find x such that y is minimum. The objective is how I reach here,
not here, starting from any initial condition. If you see, I use a gradient descent rule
and if I start from this point, this red circle and then from here, I glide down and come
here at this point.
But if I can change this initial position to this point, then I can come here by using
the rule x new is x old minus eta del y by del x. This is my normal principle of gradient
descent that I can come to the local minimum that is nearest to the starting initial point.
But such a weight update does not guarantee that I can reach here. Of course, in this
case, you can easily see that if I am starting from here, if I increase my learning rate
step eta, then this initial position may go to this position by simply changing this eta,
a bigger value and then by lowering eta, I can easily reach this. We have discussed these
in previous classes. With adaptive learning rate, one can employ a higher learning rate
when the error is far from global minimum and a smaller learning rate when it is near
it, but how do you do it? There are certain heuristics methods that people have tried.
But today, we would like to give a comprehensive approach on how to compute this adaptive eta
such that even if I start from the initial weight, which is nearer to a local minimum,
I can still reach the global minimum ľ that is the objective. The objective is to achieve
global convergence for a non-quadratic, non-convex, nonlinear function without increasing the
computational complexity. In gradient descent, the learning rate is fixed. If one can have
a larger learning rate for a point far away from a global minimum and a smaller learning
rate for a point closer to the global minimum, then it would be possible to avoid local minima
and ensure global convergence. This necessitates the need for adaptive learning rate. Today,
in this class, we will derive how to comprehensively compute eta, the adaptive learning rate such
that we reachů. Of course, our algorithm does not guarantee that we will have a global
convergence, but it is better than gradient descent. We will be using a Lyapunov function.
What is a Lyapunov function?
For a system that is defined in the state space x t, V x is a Lyapunov function if it
is positive definite and it grows, that is, V x t increases as x t increases. Once we
have selected such a Lyapunov function and if I compute the rate derivative of that Lyapunov
function V dot x, if it is negative definite, then the system is asymptotically stable.
This is the Lyapunov function. Another condition is that if V dot becomes 0, then x t also
converges to 0.
Weight update law using Lyapunov based approach. Normally, given a network that is characterized
by Wů. We have already discussed all these things. The network is characterized by the
weight vector W and the input vector is x p, then the network output is y p hat. This
is network prediction and p stands for pattern. I have N such patterns. The usual quadratic
function E given for this kind of network for training the weight vector W is y p, which
is desired, and y hat p is the actual one.
y p is desired and y hat p is predicted by the network. You can see that this is the
usual quadratic function that we have already discussed. This is for N patterns and if I
have also multiple outputs, then I can put another index here and make a sum there. For
such a network, let us choose a Lyapunov function candidate for the system as below. V is half
y tilde transpose. It implies that here, I have this y tilde is actually the error for
the pattern 1; y tilde is a vector of error function ľ the error that occurs when a network
predicts given some input x p with specific W. This is the error for the pattern 1, this
is the error for pattern p and error for pattern N.
We have capital N sets of patterns and for each pattern, we compute the error and each
error is an element of y vector y tilde. Obviously, V isů. You can easily see that whatever I
am writing here and this E are the same. In this case, V is the same as E ľ we have just
written in a vector format. This expression has been written in a vector format in V and
we say let this be a Lyapunov function. Then I try to differentiate V dot. You check here.
This is my Lyapunov function and in this, there is a term called y hat p and this y
hat p is at W. The objective is how can I find out W dot? The objective is to find W
dot such that V dot is negative definite ľ this is the objective. Find W dot such that
if I differentiate V, V dot is negative definite. That is how we deal with a nonlinear function.
You can easily see this is a nonlinear dynamics, because V is a nonlinear function and y hat
p is a function of W, nonlinear function and I am now saying W dot, that is, the evaluation
of W follows certain dynamic law. Then, this becomes a dynamic equation and if V dot can
be negative definite, then I can say that the W dot that is given to me is the weight
update law that stability.
The time derivative of the Lyapunov function V is given now ľ V dot isů. You see that
V was half y tilde transpose y. This is a scalar function and y is a vector, where y
tilde is the vector of all the errors for each pattern. V dot is minus y tilde del y
hat by del W dot. You can easily see here that if I differentiate this V dot, y tilde
and then differentiation of this quantity with respect to W and then, W dot. I can write
this equation as minus y tilde transpose J W dot. You can easily see that thisů. I hope
this expression is clear to you, where I simply differentiate this V. The negative sign comes
because y tilde is y actual minus y hat. When I differentiate that, I get minus del y hat
by del W and W dot.
The relation that I have added is dV by dt is dV by dW into dW by dt. What is this? W
dot is dW by dt and this particular expression, this part, you can easily see that this is
a vector and this is another vector, but this is a row vector. Of course, we can always
say that y is a vector, W is another vector, so obviously this is actually a matrix, which
is this. I define this quantity as J and J is equal to del y upon del W ľ this quantity.
The dimension of this matrix is N into M because W has dimension capital M and y has dimension
N. I hope now you understood what V dot is. V dot is this quantity.
Now, the theorem is if W is updated by this equation, the W t dash is W 0 plus 0 to t
dash W dot dt. At every time t, I compute this relationship, where this W dot is given
by this expression, which you are probably not so familiar with, but y tilde is norm
square. Norm square means y tilde is a vector. So you take each element, square it, add them
and take the root of that ľ that is the norm. If you do not take a root, norm square, so
W dot is this norm square. J transpose y tilde is another vector, again norm square, plus
this epsilon is a small positive constant and then J transpose y tilde. If this is the
case, this is W dot, then, y tilde converges to 0 under the condition that W dot exists
along the convergence trajectory. That is what we need. We need that this y tilde, the
vector y tilde should converge to a zero vector under the condition that W dot exists along
the convergence trajectory. This is the theorem. Once we have a theorem, we must prove that
theorem. How do you prove it?
We will talk of this theorem as Lyapunov function I, because we prove it using the Lyapunov
theory, Lyapunov function concept. What we do is we take this equation 6 and you replace
this in 4 and compute V dot. If we do that, we get V dot to be y tilde square and this
quantity. You can easily see that this quantity is always either positiveů. you take the
negative the other quantity. This is a positive quantity, this is another positive quantity
and this is also another positive quantity. Hence, this quantity is always either negative,
because negative sign is there or 0. That assures us if V1 dot goes to 0, then this
quantity also will go to 0.
If V1 dot is uniformly continuous and bounded, then as t tends to infinity, V1 dot tends
to 0. If V1 dot tends to 0, you can easily see that y tilde will go to 0. That is the
proof. The objective was that our error cost function, which is simply half y tilde transpose
y tildeů. If y tilde goes to 0, then the cost function becomes 0, that is, the function
is minimized.
The weight update law is a batch update law that we discussed now. The instantaneous LF-I
learning algorithm can be derived asů. For instantaneous, what we do is we do not take
the vector y tilde, we simply take a scalar. y tilde is simply y minus y p. This is no
more a vector. This is for instantaneous quantity. W dot is this quantity. If I consider what
should be the weight update law for instantaneous updateů. We have already discussed earlier
that there is batch update, there is instantaneous update. For batch update, we consider all
the patterns and for instantaneous update, we consider the recent pattern, current pattern.
This becomes the update law using the same Lyapunov function approach.
If I look at this W dot, how will I update this? If I take a discrete equivalent of this
equation, then this is the discrete equivalent ľ the difference equation form of the continuous
update law, which is W hat t plus 1 is W hat t plus mu W dot t. This is how the weight
update is done, where mu is a suitable small constant such that this equation approximates
equation 8.
We derived the weight update law W dot. That would ensure that y tilde converges to 0 if
W dot exists along the convergence trajectory. That was our theorem and we proved it. Let
us go in depth of this algorithm. If you look at the BP algorithm, back propagation algorithm,
delta W is minus eta del E upon del W. What is delta W? I always write W t plus 1 is W
t plus delta W. You can see that this is the gradient descent law, gradient descent method
and this gradient descent method can be written as eta Ji transpose y tilde where this Ji
is the same as the Ji is del yi by del W. You can easily that this is Ji Ji is del y
by del W. This is a row vector because y is a scalar. There is no i there, so del y hat
by del W is here. If I write the equation, this equation is the same as this equation.
This eta Ji transpose y tilde is the same as minus eta del E upon del W. Then, I write
this equation and using the recent algorithm which I call LF-I, we got an equation which
is Wt plus 1 is Wt plus mu y tilde norm square by Ji transpose y tilde norm square into Ji
transpose y tilde.
You can easily see that this back propagation algorithm was very popular and the algorithm
we have derived is remarkable same identity, except that eta is replaced. In back propagation,
this eta is a fixed quantity and in this Lyapunov function based weight update law that we just
now derived is this quantity, whereas this quantity is a time-dependent quantity. That
gives us a notion that using the Lyapunov function, we can actually derive an adaptive
learning rate, which is eta. Comparing the above two equations, we find that the fixed
learning rate eta in BP algorithm is replaced by its adaptive version, which is eta adaptive.
This is the first time (actually, this is our own work) we derived an adaptive learning
rate usingů. Comprehensively, we found this an expression for adaptive learning rate using
Lyapunov function method.
The convergence of LF-I is that we said thatů we showed that y tilde converges to 0, but
there was a condition that if W dot exists along the convergence trajectory, then our
theorem is right. But W dot may not exist along the convergence trajectory. Then, our
thing does not hold good. The theorem states that the global convergence of LF-I is guaranteed,
provided W dot exists along the convergence trajectory. This, in turn, necessitates that
del V1 by del W, the norm of that, is J transpose y tilde is not equal to 0.
This says that the W dot exists along the convergence trajectory and for this existing
convergence trajectory meaning that this should always exist ľ J transpose y tilde should
always exist. This implies that J transpose y tilde should not be 0 or del V1 upon del
W is 0, which indicates a local minimum of the error function. Thus, the theorem only
says that the global minimum is reached only when local minima are avoided during the training.
Since instantaneous update rule introduces noise, it may be possible to reach global
minimum in some cases. However, global convergence is not guaranteed. This is the implication
of the theorem that W dot exists along the convergence trajectory.
This is a second version of that same algorithm where our Lyapunov function V2 is the earlier
Lyapunov function. This was the earlier Lyapunov function we took and we added a term, which
is lambda into W dot transpose W dot. This term was added into the error function V2.
Again, if we follow the same method, we find out V2 dot, then this can written in this
form, which is minus y tilde transpose into (J minus D) W dot, where J is del y upon del
W and D is this quantity, that is, lambda into 1 upon y tilde norm square y tilde W
double dot transpose. If I put this, then this V2 dot, which is computed hereů. This
is V2, this is V2 dot and this is the same quantity as this.
Now, we propose another theorem. If the update law for weight vector W follows a dynamics
given by the following nonlinear differential equation, which is 14, where this alpha W
has this form y tilde norm square upon J transpose y tilde norm square plus epsilon is a scalar
function of weight vector W and epsilon is a small positive constant, then y tilde converges
to 0 under the condition that J minus D transpose y tilde is non-zero along the convergence
trajectory. This should be non-zero along the convergence trajectory.
We take this W dot and this W dot can be written in this form now. From here, I can write this
equation in this form. Substituting for W dot from this equation, if I replace this
W dot here, which is V2 dot, then V2 dot becomes this quantity. You can easily see that this
is a positive quantity or 0. This is again a positive quantity and this is another positive
quantity or 0. In that sense, since (J minus D) transpose y tilde is non-zero, V2 dot is
less than 0 for all y tilde less than or equal to 0 and V2 dot is 0 if y tilde equal to 0.
If V2 dot is uniformly continuous and bounded, then according to Barbalat's lemma as t tends
to infinity, V2 dot tends to 0, implying y tilde will converge to 0, which was the theorem.
The theorem is proved that y tilde you will again converge to 0, provided the conditionů.
The condition is (J minus D) transpose y tilde is non-zero along the convergence trajectory.
This is necessary because if this is not 0, if this becomes 0, then W dot does not get
updated. We say that W dot should exist. For W dot existing here, this quantity should
not become 0. If this becomes 0, then obviously W dot becomes 0 and then weights cannot move
along. The trajectory of weight update becomes stagnant and you cannot have further. So,
this should not become 0; if this becomes 0, then this theorem does not hold good. This
can be 0 in local minima, that is, for LF-I; for LF-II, we showed now the same thing that
this quantity also should not become 0. As long as this is not 0, y tilde will converge
to 0. We derived two Lyapunov function based algorithms,
but in both the theorems, we showed that y tilde will converge to 0, but there is a condition.
In the first one, the condition was that Ji transpose y tilde should not become 0. Here,
(J minus D) transpose y tilde should not become 0 and because of these conditions, global
convergence is not achieved. Anyway, the weight update law that we talked about, the difference
equation version of that weight update law is this particular format, which is equation
17.
This can be written in terms of two forms. W t plus 1 is W t and there is some quantity
here, which is a time-varying quantity, then J transpose P y tilde. This is the instantaneous
and that is why P has come ľ P for each pattern. Minusů again, this is another quantity, where
mu1 is mu plus lambda and that quantity is written ľ again, another time-varying quantity
into W double dot. This is a gradient term that has some input and this is W double dot
into some constant, where W double dot is computed using this formula, which you can
easily derive using Euler's method.
If I use the same cost function and apply back propagation, I get this equation. The
recursive term is this, equation 18. Now also, we are getting the same thing, but we are
now getting eta dash and mu dash are time-varying quantities, that is, the adaptive quantity
using Lyapunov function. The Lyapunov function or the method that we have described gives
us weight update law, which is similar to gradient descent, butů. that but is very
important because the learning rate associated is the adaptive map. Normally, in back propagation,
we call this term the acceleration term, but our mu dash is an adaptive acceleration term
and eta dash in back propagation is a fixed learning rate, whereas in our case, it is
an adaptive learning rate.
I will not go into details of this because this requires a little bit of more introspection
and you really need to understand and go into detail about it. Simple classroom teaching
will not help you to understand this. In this particular class, I am just telling you or
I am giving you or I am delivering this lecture to just let you know or to inspire you to
think about the methodology that will help you to get a better convergence algorithm.
The convergence of LF-II is again the same thing that this should not become 0 and if
this does not become 0, then it ensures global convergence, but we cannot guarantee that
these cannot become 0. This can become 0.
You can easily see that J transpose y tilde is D transpose y tilde. If this is the case,
then the solution of the above equation represent local minima. When this is 0, this is the
local minima it reaches. We said that this should exist. That means this should not become
0 for global convergence, but W dot vanishes whenever this is 0. If we assume J is not
equal to D, then rank of J minus D equal to n ensures global convergence because if J
is not equal to D, it implies this is 0 only when y tilde is 0 and that is global convergence.
At global convergence, y tilde is 0. The other condition is that when J transpose y tildeů,
but this case of course is forů when this is true, the solution of the above equation
represent local minima and then, the convergence is not global.
For a neural network, n is much much less than m. There are at least m minus n vectors
for which solutions do not exist. Hence, local minima do not occur. Increasing the numbers
of hidden layers or hidden neurons, chances of encountering local minima can be reduced
because of this particular statement. Increasing the number of output neurons increases m and
n, as well as n by m. Thus, for MIMO systems, there are more local minima as compared to
single output system. So more complexity, more local minima.
This is one of the interesting parts. I will not go in detail here but all that I will
tell you is that using Lyapunov function, the second Lyapunov function, LF-II algorithm
(the first one is LF-I, the second one is LF-II), one can show that because of the acceleration
term that we are adding, most of the timeů because whenever this quantity del V1 upon
del W is 0, that does not ensure that W double dotů Whenever del V1 by del W is 0, it does
not imply that W double dot is 0. This fact actually ensures that this term can actually
drive away this local minimum. It can bring the weight from this local minima and it can
come here. We can avoid local minima, not always but in some cases using LF-II algorithm
by adding that extra term, which is half W dot transpose into W. That extra term helps
us because that introduces an acceleration term in our learning algorithm. That acceleration
term can take away the weight from being stuck in the local minima towards the global minima.
We talked about adaptive learning rate. What is this adaptive learning rate? Let us consider
the LF-I algorithm and in this algorithm, if we apply this algorithm to an XOR functionů.
All of you know what an XOR function is ľ when both input are equal, then the output
is 0 and unequal, the output is 1. That is the XOR function. We have four patterns in
the XOR function for two-input XOR function; four patterns for a two-input XOR function, that is, I have
this. This is XOR, x1, x2 and y. If I train this function, if I train a network to learn
this function, then when I plot this eta functionů.
You see for each pattern, I have four curves here ľ one, two, three and four and you can
easily see this eta is increasing and decreasing until it goes to 0. What you are seeing are
two important things: learning rate is not fixed, which in BP is a fixed quantity, and
learning rate goes to 0 as error goes to 0 ľ this is important. That is what is important
ľ how the learning should be done. Learning should be doneů in the beginning, eta should
be very high or whatever is necessary, but as the learning is over for the patterns and
the network has already learned, then eta should be 0. Why should eta be there? But
for back propagation, the eta is always fixed. This is important. What is the meaning of
adaptive learning rate? It has two important properties: eta is time varying and it has
a property that y tilde converges to 0, eta also becomes 0.
We will now talk about simulation results. This is your LF-I versus LF-II, the XOR function.
Let me first summarize. We proposed two theorems for weight updates in a feed-forward network
ľ it can be a multilayer network or it can be a radial basis function network. We said
that these weight update algorithms will ensure that y tilde converges to 0, that is, global
convergence provided two conditions ů. There is a condition for each theorem that has to
be satisfied. In one case, while the weights are sliding along the convergence trajectory,
they should not be stuck in the local minimum. This is the first theorem. The second theorem
says J minus D transpose y should also not be 0. Again, that solution also ensures us
to be in local minima if it becomes 0. Now, the problem is which is better ľ LF-I or
LF-II? We have some idea that LF-II can avoid local minimum. Let us do simulation and see
whether the theory we predicted is true or wrong.
See what we have done here for XOR function. feed-forward neural network, that is, a multilayer
neural network and using that neural network, what we did is we started from this different
initial condition W and
for every run, we wanted to see how many training epochs were necessary for each algorithm for
convergence. If you look at the top one, which is the broken line, that represents the number
of training epochs. One epoch is training about four patterns. In the XOR function,
I have ?1, ?1 is ?1; ?1, +1 is +1. This is my x1, this is my x2, this is my y. Then,
+1, ?1 is +1 and +1, + 1 is ?1. This is my XOR function and if I train this XOR function
in a multilayer network using my Lyapunov function-I algorithm, then from every different
initial condition, because initial weights are all random, for each initial condition
I get.ů You see that these are the number of patterns here and one epoch is these four
patterns.
I give these four patterns to my network, train it and again, I repeat it. Like that,
if I train almostů here, it is almost near 200 and here, it is around 120 or so, it is
varying; sometimes, it may also go to up to 300. You can easily see the number of training
epochs necessary to train an XOR function using LF-I is always a varying quantity and
there is a good deal of fluctuation ľ that is almost as 120 training epochs to 300 training
epochs, whereas if you look at LF-II, it is almost constant. This is a straight line except
these fluctuations here ľ this is one fluctuation, this is another fluctuation, but otherwise,
they are very fixed quantities. This is always below the training epochs that are necessary
using LF-I.
The observation is LF-II provides tangible improvement over LF-I both in terms of convergence
time and training epochs and the most important is that it is invariant to initial condition.
I have to recognize that training a neural network means you start from a new initial
condition. Each initial condition will lead to a different convergence time. If that is
the case, then the algorithm is not a good one. So, can I say that I have an algorithm
that ensures that whatever may be my initial weight vector, my convergence time is independent
of that? That is important. That was for XOR function.
This is for three-bit parity problem and in three-bit parity problem, you can easily seeů.
I hope that you can get all these functions ľ they are there in many textbooks. This
is XOR function three-bit parity. If you take a three-bit parity function, again you can
see that LF-I is again very much fluctuating between 750 epochs to almost 3,000 epochs.
It fluctuates, whereas in LF-II, the number of training epochs is fixed almost at the
500 level and some was fixed very rarely. Here, one fluctuation, second fluctuation
here, big fluctuation 3. Again, LF-II performs better than LF-I both in terms of computation
time and training epochs.
This is a 2D Gabor function. For this Gabor function, the LF-I and LF-II, if you look
at the RMS error and these are the training iterations, you can easily see that for LF-II,
the error is much less and it has a lesser value than the LF-I and so, it has a better
convergence performance.
We have a kind of a table format. We have compared the results for performance of back
propagation algorithm and LF-I and LF-II for an XOR function. You can easily see that the
number of epochs that are necessary for training the XOR function using eta equal to 0.5 is
5,620 and when eta equal to 0.95, it is 3,769, whereas in the case of LF-I, we take only
165 and LF-II only 120. The epoch simply gives the number of times I have to train, but what
is the total computation time?
Of course, one can argue that gradient descent is a very simple function and computationally
very simple but if you look at the total number of computation times in seconds that is necessary,
if you again see that also, in that case, the proposed algorithm is much better because
it is 0.0038 whereas for back propagation, it is 0.0354. It is pretty fast, at least
10 times faster than the back propagation algorithm in convergence time.
The three-bit parity problem. Again, you can easily see here that LF-II takes 738 number
of epochs and LF-I is 1,338 epochs, whereas back propagation takes a pretty low time of
5,941 with eta 0.95 and eta = 0.5 is 12,032. If you look at the computation time also,
for LF-II, it is 0.0676 seconds, whereas back propagation takes 0.483 seconds for training.
This is the three-bit parity problem we talked about.
This is the 8 by 3 encoder. You already know what an encoder is. If I use an 8 by 3 encoder
map using a feed-forward neural network, again if you look at the number of epochs that are
necessary ľ LF-II and LF-II using Lyapunov function, they are pretty small compared to
back propagation, and you can see the comparison time-wise also.
The 2D Gabor function. We took a multilayer network with 40 hidden units, 80 hidden units
and we tried using back propagation network as well as LF-I and LF-II only 40. You can
see the RMS error per run. We train the network using many initial weight conditions and average
rate over 100 such runs and then we computed the average error and you can easily see the
computation is in terms of RMS error. LF-I and LF-II are much better than the back propagation
algorithm.
Let us come to the final part, the global convergence of Lyapunov based learning algorithm.
It is possibleů. If we take the objective function V2, Lyapunov function is mu V1 plus
half sigma this square, where V1 is this quantity which is the Lyapunov function we have considered
for our first theorem. If I compute V2 dot, then this becomes equation 22. The objective
is to select a weight update law W dot such that the global minimum V1 equal to 0 and
del V1 by del W equal to 0 is simultaneously reached.
You see that if I have a cost function, then at this, del V1 by del W also 0 and V1, this
quantity, is also 0 here. In this case, this is 0 ľ del V1 by del W is 0, but V1 is not
0 here, V1 is 0 here. Let me now try to explain what the meaning of attaining global convergence
is. If I look at the Lyapunov function, which is now V2, you can easily see that if this
is my Lyapunov function, at this point, V1 is 0 and at this point, this quantity del
v1 by del W is also 0. V2 is 0 here in this point, but in this case, this one is 0 but
this is not 0 in this case; in this case, this one is 0, but this is not 0. If I can
take a Lyapunov function like this, some researchers have proved that we can attain global convergence.
But what is the problem if I take 22? This has a Lyapunov function, which is 21. Then,
I have to make sure that this V2 dot is negative definite.
For that, you see that this is my weight update law ľ researchers have shown it, but unfortunately,
this involves a Hessian matrix and its inversion. Computationally, it is impossible. Normally,
a neural network will have a number of weights, at least in 100s. In a tangible application,
this number is at least 100. In that case, you can easily see that this is a very huge
matrix whose inversion is very difficult. It is not difficult to compute the inversion,
but it is time-consuming. Obviously, they cannot be applied online, but the algorithm
that we presented today is applicable online.
The implementation of weight update algorithm becomes very difficult due to the inverse
presence of the inverse of the Hessian term. The above algorithm is of theoretical interest
only and also, this weight update algorithm is similar to BP learning algorithm with a
fixed learning rate, whereas the learning rate that we presented today is adaptive.
The Lyapunov function algorithms performs better thanů (I did not talk about today
what is extended Kalman filtering, it is not necessary) performs better in comparison to
back propagation algorithm in terms of speed and accuracy. LF-II, the second algorithm
that we proposed today, avoids local minima to a greater extent as compared to LF-I. It
is seen that by choosing proper network architecture, it is possible to reach global minimum. Both
LF-I and LF-II have an interesting parallel with conventional BP algorithm, where the fixed learning rate of BP is replaced
by adaptive learning rate.
The conclusion is that the adaptive learning rate can be comprehensively computed using
Lyapunov function approach. Here, we select a Lyapunov function V, then find V dot and
propose W dot such that V dot is negative definite.
By doing this, what we did is that we attained an adaptive learning rate. In gradient descent,
we have W t plus 1 is W t plus eta del E by del W. This is gradient descent. In Lyapunov
function, we got the same plus etaadaptive del E by del W. We derived the Lyapunov function
in adaptive eta ľ the structure is the same otherwise. I hope that this is clear for you
ľ how to compute comprehensively an adaptive learning rate such that our convergence is
better. Although the two algorithms that we proposed do not guarantee global convergence,
they give us an idea how we can certainly improve the convergence of such algorithm
for feed-forward networks. Thank you.