In this video, I'm going to describe a new way of combining a very large number of
neural network models without having to separately train a very large number of
models. This is a method called dropout that's
recently been very successful in winning competitions.
For each training case, we randomly omit some of the hidden units.
So, we end up with a different architecture for each training case.
We can think of this as having a different model for every training case.
And then, the question is, how could we possibly train a model on only one
training case and how could we average all these models together efficiently at test
time? The answer is that we use a great deal of
weight sharing. I want to start by describing two
different ways of combining the outputs of multiple models.
In a mixture, we combine models by averaging their output probabilities.
So, if model A assigns probabilities of 0.3, 0.2, and 0.5 to three different answers, and model B assigns probabilities of 0.1, 0.8, and 0.1, the combined model simply assigns the averages of those probabilities.
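This arithmetic averaging can be written out directly; a minimal sketch in Python, using the numbers from the example:

```python
# Combining two models by averaging their output probabilities (a mixture).
model_a = [0.3, 0.2, 0.5]   # probabilities for three answers
model_b = [0.1, 0.8, 0.1]
mixture = [(pa + pb) / 2 for pa, pb in zip(model_a, model_b)]
# mixture is [0.2, 0.5, 0.3] (up to floating-point rounding)
```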
A different way of combining models is to use a product of the probabilities. Here,
we take a geometric mean of the same probabilities.
So, model A and model B again assign the same probabilities as they did before.
But now, what we do is we multiply each pair of probabilities together and then
take the square root. That's the geometric mean and the
geometric means will generally add up to less than one.
So, we have to divide by the sum of the geometric means to normalize the
distribution so that it adds up to one again.
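The product combination, with the same numbers, might be sketched like this, taking the geometric mean of each pair and then renormalizing:

```python
import math

model_a = [0.3, 0.2, 0.5]
model_b = [0.1, 0.8, 0.1]

# Geometric mean of each pair of probabilities: sqrt(pa * pb).
geo = [math.sqrt(pa * pb) for pa, pb in zip(model_a, model_b)]

# The geometric means sum to less than one, so renormalize
# to get a proper distribution.
product = [g / sum(geo) for g in geo]
```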
You'll notice that in a product, a small probability output by one model has veto power over the other models.
Now I want to describe an efficient way to
average a large number of neural nets that gives us an alternative to doing the
correct Bayesian thing. The alternative probably doesn't work
quite as well as doing the correct Bayesian thing, but it's much more
practical. So, consider the neural net with one
hidden layer, shown on the right.
Each time we present a training example to it, what we're going to do is randomly omit each hidden unit with a probability of 0.5.
So, we crossed out three of the hidden units here.
And we run the example through the net with those hidden units absent.
What this means is that we're randomly sampling from two to the h architectures, where h is the number of hidden units. That's a huge number of architectures.
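A sketch of this sampling, with made-up weights and a ReLU nonlinearity (the lecture doesn't fix a particular nonlinearity), might look like:

```python
import random

def dropout_mask(h, p_drop=0.5, rng=random):
    # Each hidden unit is dropped with probability 0.5, so each mask
    # picks one of the 2**h possible architectures.
    return [0 if rng.random() < p_drop else 1 for _ in range(h)]

def hidden_activations(inputs, weights, biases, mask):
    # The weights are shared: a unit that survives uses exactly the
    # same incoming weights as it does in every other architecture.
    acts = []
    for j, m in enumerate(mask):
        s = biases[j] + sum(w * x for w, x in zip(weights[j], inputs))
        acts.append(m * max(0.0, s))   # dropped units (m == 0) output 0
    return acts
```

For example, `hidden_activations([1.0, 2.0], [[0.5, 0.5], [1.0, -1.0]], [0.0, 0.0], [1, 0])` gives `[1.5, 0.0]`: the first unit fires normally, the second is omitted.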
Of course, all of these architectures share weights.
That is, whenever we use a hidden unit, it has the same weights as it has in the other architectures.
So, we can think of dropout as a form of
model averaging. We sample from these two to the h models.
Most of the models, in fact, will never be sampled.
And a model that is sampled only gets one training example.
That's a very extreme form of bagging. The training sets are very different for
the different models, but they're also very small.
The sharing of the weights between all the models means that each model is very
strongly regularized by the others. And this is a much better regularizer than
things like L2 or L1 penalties. Those penalties pull the weights toward
zero.
By sharing weights with the other models, a model gets regularized by something that tends to pull the weights towards the correct value.
The question still remains: what do we do at test time?
So, we could sample many of the
architectures, maybe a hundred, and take the geometric mean of the output
distributions. But that would be a lot of work.
There's something much simpler we can do. We use all of the hidden units, but we
halve their outgoing weights. So, they have the same expected effect as
they did when we were sampling.
It turns out that using all of the hidden units with half their outgoing weights exactly computes the geometric mean of the predictions that all two to the h models would have made, provided we're using a softmax output group.
If we have more than one hidden layer, we can simply use dropout of 0.5 in every layer.
At test time, we halve all the outgoing weights of hidden units,
And that gives us what I call the mean net.
So, we use a net that has all of the units but the weights are halved.
When we have multiple hidden layers, this is not exactly the same as averaging lots of separate dropout models, but it's a good approximation and it's fast.
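For a single hidden layer with a softmax output, the equivalence between the half-weight mean net and the normalized geometric mean over all two-to-the-h dropout nets is exact, and easy to check by brute force. A sketch with made-up activations and weights:

```python
import itertools
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

h = [1.0, -0.5, 2.0]                       # hidden activations (made up)
W = [[0.2, -0.1, 0.4],                     # outgoing weights for 2 classes
     [-0.3, 0.5, 0.1]]

# Normalized geometric mean over all 2**3 = 8 dropout architectures:
# average the log-probabilities, then renormalize with a softmax.
avg_log_p = [0.0, 0.0]
for mask in itertools.product([0, 1], repeat=3):
    z = [sum(w * m * a for w, m, a in zip(row, mask, h)) for row in W]
    p = softmax(z)
    avg_log_p = [s + math.log(pk) / 8 for s, pk in zip(avg_log_p, p)]
geo_mean = softmax(avg_log_p)

# The mean net: all hidden units present, outgoing weights halved.
mean_net = softmax([sum(0.5 * w * a for w, a in zip(row, h)) for row in W])
# geo_mean and mean_net agree to floating-point precision.
```

The identity holds because the softmax logits are linear in the mask, so averaging the logits over all masks is the same as halving the weights, and renormalizing a geometric mean of softmaxes is the same as taking a softmax of averaged logits.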
We could run lots of stochastic models with dropout, and then average across
those stochastic models. And that would have one advantage over the
mean net. It would give us an idea of the
uncertainty in the answer. What about the input layer?
Well, we can use the same trick there, too.
We use dropout on the inputs, but we use a higher probability of keeping an input.
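As a sketch, input dropout just uses a larger keep probability. The value 0.8 below is my assumption; the lecture only says it should be higher than for hidden units:

```python
import random

def dropout_inputs(x, keep_prob=0.8, rng=random):
    # Keep each input with probability keep_prob (assumed 0.8 here),
    # zeroing out the rest, just as for hidden units but gentler.
    return [xi if rng.random() < keep_prob else 0.0 for xi in x]
```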
This trick is already used in a system called denoising autoencoders, developed by Pascal Vincent, Hugo Larochelle, and Yoshua Bengio at the University of Montreal, and it works very well.
So, how well does dropout work?
Well, the record-breaking object recognition net developed by Alex Krizhevsky would have broken the record even without dropout, but it broke it by a lot more by using dropout.
In general, if you have a deep neural net and it's overfitting, dropout will typically reduce the number of errors by quite a lot.
I think any net that requires early stopping in order to prevent it from overfitting would do better by using dropout. It would, of course, take longer to train, and it might need more hidden units.
If you've got a deep neural net and it's not overfitting, you should probably be using a bigger one and using dropout, assuming you have enough computational power.
There's another way to think about dropout, which is how I originally arrived
at the idea. And you'll see it's a bit related to
mixtures of experts, and what goes wrong when all the experts cooperate: what's preventing specialization?
So, if a hidden unit knows which other
hidden units are present, it can co-adapt to the other hidden units on the training
data.
What that means is that the real signal training a hidden unit is: try to fix up the error that's left over when all the other hidden units have had their say.
That's what's being back propagated to
train the weights of each hidden unit. Now, that's going to cause complex
co-adaptations between the hidden units. And these are likely to go wrong when
there's a change in the data.
If you rely on complex co-adaptations to get things right on the training data, they're quite likely not to work nearly so well on new test data.
It's like the idea that a big, complex conspiracy involving lots of people is
almost certain to go wrong because there's always things you didn't think of.
And if there's a large number of people involved, one of them will behave in an
unexpected way. And then, the others will be doing the
wrong thing.
It's much better, if you want conspiracies, to have lots of little conspiracies.
Then, when unexpected things happen, many
of the little conspiracies will fail, but some of them will still succeed.
So, by using dropout, we force a hidden unit to work with
combinatorially many other sets of hidden units.
And that makes it much more likely to do something that's individually useful
rather than only useful because of the way particular other hidden units are
collaborating with it. But it is also going to tend to do
something that's individually useful and is different from what other hidden units
do. It needs to be something that's marginally
useful, given what its co-workers tend to achieve.
And I think this is what's giving nets with dropout their very good performance.