To extend the learning rule for a linear neuron to a learning rule we can use for
multilayer nets of nonlinear neurons, we need two steps.
First, we need to extend the learning rule to a single nonlinear neuron.
And we're going to use logistic neurons, although many other kinds of nonlinear
neurons could be used instead. We're now going to generalize the learning
rule for a linear neuron to a logistic neuron, which is a nonlinear neuron.
So, a logistic neuron computes its logit, z, which is its total input: its bias
plus the sum over all its input lines of the value on an input line, xi, times
the weight on that line, wi. It then gives an output y that's a smooth
nonlinear function of that logit. As shown in the graph here, that function
is approximately zero when z is big and negative, approximately one when z is big
and positive, and in between it changes smoothly and nonlinearly.
The fact that it changes continuously gives it nice derivatives, which make
learning easy.
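To make that concrete, here is a minimal sketch of the forward computation in Python (the function name and values are just for illustration, not from the lecture):

    import numpy as np

    def logistic_neuron(x, w, b):
        # logit: the bias plus the weighted sum of the inputs
        z = b + np.dot(w, x)
        # the output is a smooth, nonlinear squashing of the logit into (0, 1)
        y = 1.0 / (1.0 + np.exp(-z))
        return z, y

For example, logistic_neuron(np.array([1.0, 2.0]), np.array([0.5, -0.3]), 0.1) gives a logit of 0.0 and an output of 0.5.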
So, to get the derivatives of a logistic neuron's output with respect to the
weights, which is what we need for learning, we first need to compute the
derivative of the logit itself, that is, of the total input, with respect to a
weight. That's very simple. The logit is just the bias plus the sum over
all the input lines of the value on the input line times the weight.
So, when we differentiate with respect to wi, we just get xi.
So, the derivative of the logit with respect to wi is xi, and similarly, the
derivative of the logit with respect to xi is wi.
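A quick way to sanity-check this (an illustrative sketch, not part of the lecture): nudge one weight by a small amount and the logit should move by roughly that amount times xi.

    # finite-difference check that dz/dwi is xi (values are arbitrary)
    x, w, b = [1.0, 2.0], [0.5, -0.3], 0.1
    eps = 1e-6
    logit = lambda weights: b + sum(wi * xi for wi, xi in zip(weights, x))
    print((logit([w[0] + eps, w[1]]) - logit(w)) / eps)   # approximately x[0] = 1.0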
The derivative of the output with respect to the logit is also simple if you express
it in terms of the output. The output is y = 1 / (1 + e^-z), and dy
by dz is just y times (1 - y). That's not obvious.
For those of you who like to see the math, I've put it on the next slide.
The math is tedious but perfectly straightforward, so you can go through it yourself.
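If you don't have that slide to hand, the algebra amounts to this (written in LaTeX):

\[
y = \frac{1}{1+e^{-z}}, \qquad
\frac{dy}{dz} = \frac{e^{-z}}{(1+e^{-z})^{2}}
             = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}}
             = y\,(1-y),
\]

since \(1 - y = e^{-z}/(1+e^{-z})\).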
Now that we've got the derivative of the output with respect to the logit and
the derivative of the logit with respect to the weight, we can figure out the
derivative of the output with respect to the weight. We just use the chain rule again.
So, dy by dw is dz by dw times dy by dz. And dz by dw, as we just saw, is xi,
and dy by dz is y times (1 - y), so dy by dwi is xi times y times (1 - y).
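Written out in Python for a single training case, the chain rule looks like this (a sketch with illustrative names, not the lecture's code):

    import numpy as np

    def dy_dw(x, w, b):
        z = b + np.dot(w, x)                # the logit
        y = 1.0 / (1.0 + np.exp(-z))        # the logistic output
        # chain rule: dy/dwi = (dz/dwi) * (dy/dz) = xi * y * (1 - y)
        return x * y * (1.0 - y)            # one derivative per weight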
And so, we now have the learning rule for a logistic neuron.
We've got dy by dw, and all we need to do is use the chain rule once more, and
multiply it by dE by dy. And we get something that looks very like the delta rule.
So, the way the error changes as we change the weight, dE by dwi, is just the sum
over all the training cases, n, of the value on the input line, xin, times the
residual, that is, the difference between the target and the actual output of the
neuron. But it's got this extra term in it, which comes from the slope of the
logistic function, which is yn times (1 - yn).
So, a slight modification of the delta rule gives us the gradient descent learning
rule for training a logistic unit.
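Putting the whole thing together, here is a minimal sketch of one batch gradient descent step for a logistic unit with squared error; the names X (one training case per row), t (targets), and epsilon (learning rate) are assumptions for the example, not the lecture's notation:

    import numpy as np

    def gradient_descent_step(X, t, w, b, epsilon=0.1):
        z = b + X @ w                        # logits, one per training case
        y = 1.0 / (1.0 + np.exp(-z))         # logistic outputs
        # the delta-rule residual (t - y), scaled by the extra slope term y * (1 - y)
        delta = (t - y) * y * (1.0 - y)
        # dE/dwi = -sum over cases of xin * yn * (1 - yn) * (tn - yn), for squared error
        grad_w = -X.T @ delta
        grad_b = -np.sum(delta)
        # move the weights and bias a little way down the gradient
        return w - epsilon * grad_w, b - epsilon * grad_b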