In this video, we're going to look at the softmax output function.
This is a way of forcing the outputs of a neural network to sum to one, so they can represent a probability distribution across discrete, mutually exclusive alternatives.
Before we get back to the issue of how we learn feature vectors to represent words, we're going to have one more digression, this time a technical one.
So far I've talked about using a squared error measure for training a neural net, and for linear neurons that's a sensible thing to do.
But the squared error measure has some drawbacks.
drawbacks. If for example the design acuities are
one, so you have a target of one, and the actual output of a neuron is one
billionth, then there's almost no gradient to allow a logistic unit to change.
It's way out on a plateau where the slope is almost exactly horizantal.
And so, it will take a very, very long time to change its weights, even though
it's making almost as big an error as it's possible to make.
Also, if we're trying to assign probabilities to mutually exclusive class
labels, we know that the output should sum to one.
Any answer in which we say the probability that it's A is three quarters and the probability that it's B is also three quarters is just a crazy answer.
And we ought to tell the network that information, we shouldn't deprive it of
the knowledge that these are mutually exclusive answers.
So the question is, is there a different cost function that will work better?
Is there a way of telling it that these are mutually exclusive, and then using an appropriate cost function?
The answer, of course, is that there is.
What we need to do is force the outputs of the neural net to represent a probability
distribution across discrete alternatives, if that's what we plan to use them for.
The way we do this is by using something called a softmax.
It's a kind of soft, continuous version of the maximum function.
So the way the units in a soft-max group work is that they each receive some total
input they've accumulated from the layer below.
That's Zi for the i-th unit, and that's called the logit.
And then they give an output Yi that doesn't just depend on their own Zi.
It depends on the Zs accumulated by their rivals as well.
So we say that the output of the i-th neuron is e to the Zi, divided by the sum of that same quantity over all the neurons in the softmax group.
And because the bottom line of that equation is the sum of the top line over
all possibilities, we know that when you add over all possibilities you'll get one.
That is, the sum of all the Yi's must come to one.
What's more, the Yi's have to lie between zero and one.
So we force the Yi to represent a probability distribution over mutually exclusive alternatives just by using that softmax equation.
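To make that concrete, here's a minimal sketch of the softmax equation in plain Python (the function name and the example logits are my own illustration, not from the lecture):

```python
import math

def softmax(logits):
    """Turn logits z_i into outputs y_i = e^{z_i} / sum_j e^{z_j}."""
    # Subtracting the max logit first doesn't change the result
    # (it cancels between numerator and denominator) but avoids overflow.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

y = softmax([2.0, 1.0, 0.1])
print(y)        # each output lies strictly between 0 and 1
print(sum(y))   # and the outputs sum to one
```

Note that because every e^Zi is positive and the denominator is the sum of all of them, each output is automatically between zero and one, just as the lecture says.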
The softmax equation has a nice simple derivative.
If you ask how Yi changes as you change Zi, that obviously involves all the other Zs, because Yi itself depends on all the other Zs.
But it turns out that you get a nice simple form, just like you do for a logistic unit: the derivative of the output with respect to the input, for an individual neuron in a softmax group, is just Yi times one minus Yi.
It's not totally trivial to derive that.
When you differentiate the equation above, you must remember that Zi also turns up in the normalization term on the bottom row.
It's very easy to forget those terms and get the wrong answer.
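One way to convince yourself the formula is right is to compare it with a finite-difference estimate; here's a small sketch of that check (the particular logits and the epsilon are my own choices):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Check dYi/dZi = Yi * (1 - Yi) numerically for unit i.
z = [0.5, -1.2, 2.0]
i, eps = 0, 1e-6
y = softmax(z)

z_plus = list(z);  z_plus[i] += eps
z_minus = list(z); z_minus[i] -= eps
numeric = (softmax(z_plus)[i] - softmax(z_minus)[i]) / (2 * eps)
analytic = y[i] * (1 - y[i])
print(numeric, analytic)  # the two agree to many decimal places
```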
Now the question is: if we're using a softmax group for the outputs, what's the right cost function?
The answer, as usual, is that the most appropriate cost function is the negative log probability of the correct answer.
That is, we want to maximize the log probability of getting the answer right.
So if one of the target values is a one and the remaining ones are zero, then we simply sum over all possible answers, putting zeros in front of the wrong answers and a one in front of the right answer.
That gets us the negative log probability of the correct answer, as you can see in the equation.
That's called the cross-entropy cost function.
It has the nice property that it has a very big gradient when the target value is one and the output is almost zero.
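Here's a minimal sketch of that cost in plain Python, using the same one-in-a-million versus one-in-a-billion comparison as the lecture (the function name and the filler probabilities on the wrong answers are my own):

```python
import math

def cross_entropy(targets, outputs):
    # C = -sum_j t_j * log(y_j).  With one-of-N targets, only the
    # correct answer's term is nonzero, so C = -log(y_correct).
    return -sum(t * math.log(y) for t, y in zip(targets, outputs))

t = [1.0, 0.0, 0.0]  # the first answer is the correct one
# An output of one in a million on the correct answer is much better
# than one in a billion, even though they differ by under a millionth:
print(cross_entropy(t, [1e-6, 0.5, 0.5 - 1e-6]))  # about 13.8
print(cross_entropy(t, [1e-9, 0.5, 0.5 - 1e-9]))  # about 20.7
# dC/dy for the correct unit is -1/y: a gradient of roughly a million
# in the first case and roughly a billion in the second.
```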
You can see that by considering a couple of cases.
An output of one in a million is much better than an output of one in a billion, even though the two differ by less than a millionth.
So when you increase the output by less than one millionth, the value of C improves by a lot: there's a very, very steep gradient for C.
One way of seeing why one in a million is much better than one in a billion, when the correct answer is one, is this: if you believed the one in a million, you'd be willing to bet at odds of a million to one, and you'd lose a million dollars.
If you thought the answer was one in a billion, you'd lose a billion dollars making the same bet.
So we get the nice property that the cost function C has a very steep derivative when the answer is very wrong, and that exactly balances the fact that the rate at which the output Y changes as you change the input Z is very flat when the output is very wrong.
When you multiply the two together to get the derivative of the cross-entropy with respect to the logit Zi going into output unit i, you use the chain rule: that derivative is how fast the cost function changes as you change the output of a unit, times how fast the output of that unit changes as you change Zi.
And notice we need to add up across all the units j, because when you change Zi, the outputs of all the different units change.
The result is just the actual output minus the target output.
You can see that when the actual and target outputs are very different, that has a slope of one or minus one, and the slope is never bigger than one or minus one.
But the slope never gets small until the two are pretty much the same, in other words until you're getting pretty much the right answer.
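The whole chain-rule result can be checked numerically in a few lines; this sketch compares a finite-difference estimate of dC/dZi against the claimed actual-minus-target form (the particular logits and targets are my own illustration):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cost(z, t):
    # Cross-entropy of the softmax outputs against one-of-N targets t.
    y = softmax(z)
    return -sum(ti * math.log(yi) for ti, yi in zip(t, y))

z = [0.3, -0.8, 1.5]
t = [0.0, 1.0, 0.0]
y = softmax(z)

# Numerical dC/dZi versus the claimed result Yi - Ti.
eps = 1e-6
for i in range(len(z)):
    zp = list(z); zp[i] += eps
    zm = list(z); zm[i] -= eps
    numeric = (cost(zp, t) - cost(zm, t)) / (2 * eps)
    print(numeric, y[i] - t[i])  # the two columns agree
```

Notice how simple the analytic form is: each logit's gradient is just how much probability the unit gave, minus how much it should have given, and it lies between minus one and one as described above.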