In this video, I'm going to talk about alternative pre-training methods for
learning deep neural nets. I introduced pre-training using
restricted Boltzmann machines trained with contrastive divergence.
But after that, people discovered there are many other ways to pre-train layers
of features. And indeed, if you initialize the weights
correctly, you may not need pre-training at all provided you have enough labeled
data. We've seen some of the neat things that
can be done with the codes produced by deep auto-encoders.
I now want to consider shallow auto-encoders that just have one hidden
layer. Restricted Boltzmann machines can be viewed as a kind of shallow auto-encoder, particularly if they're trained with contrastive divergence, because that training makes them try to make the reconstructions look like the data. Viewed as an auto-encoder, a restricted Boltzmann machine has very
strong regularization because the hidden units are only allowed to have binary
activities, and this restricts their capacity a lot.
If we train restricted Boltzmann machines with maximum likelihood, they're not at
all like auto-encoders. One way to see that is if you had a pixel
that was pure noise, an auto-encoder would try to reconstruct whatever noise
value it had. A restricted Boltzmann machine trained
with maximum likelihood would completely ignore that pixel and model it just using
the bias for that input. So, since we can view a restricted
Boltzmann machine as a kind of strongly regularized auto-encoder, maybe we can
replace the RBMs that we use for pre-training with a stack of
autoencoders. It turns out that if you do that,
pre-training is not as effective. At least, that's true if you use shallow
auto-encoders that are regularized just by penalizing the squared weights.
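To make that concrete, here is a minimal sketch of the objective for such a shallow auto-encoder with tied weights and a squared-weight penalty. This is an illustration rather than anything from the lecture, and the function and parameter names (weight_decay_autoencoder_loss, weight_cost) are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weight_decay_autoencoder_loss(x, W, b_hid, b_vis, weight_cost=1e-4):
    """Objective for a one-hidden-layer auto-encoder whose only
    regularizer is a penalty on the squared weights.
    x: (n_inputs,) data vector; W: (n_hidden, n_inputs) tied weights."""
    h = sigmoid(W @ x + b_hid)          # hidden activities (the code)
    x_hat = sigmoid(W.T @ h + b_vis)    # reconstruction from the code
    reconstruction_error = np.sum((x - x_hat) ** 2)
    weight_penalty = weight_cost * np.sum(W ** 2)
    return reconstruction_error + weight_penalty
```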
So, stacking these autoencoders doesn't work as well as stacking restricted
Boltzmann machines. However, there's a different kind of
auto-encoder that does work as well, and that's the denoising auto-encoder.
And, it's been studied extensively by the group in Montreal.
Denoising auto-encoders add noise to each input vector by setting many of its components to zero, with different components zeroed for different input vectors. This resembles dropout, but it's for the
inputs rather than the hidden units. The denoising auto-encoder is still
required to reconstruct the inputs that have been set to zero. And so, it can't
just copy its input. The danger with the shallow auto-encoder
is that if you give it enough hidden units, it might just copy each pixel to
one hidden unit, and then reconstruct that pixel from that hidden unit.
A denoising auto-encoder clearly can't do that, so it has to use hidden units that capture correlations between inputs, so that it can use the values of some inputs to help it reconstruct the inputs that have been zeroed out.
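As a rough sketch of how that corruption works (again an illustration, not from the lecture, assuming logistic units with tied weights; names like drop_prob are invented here), note that the reconstruction is scored against the clean input even though the encoder only sees the corrupted one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def denoising_autoencoder_loss(x, W, b_hid, b_vis, drop_prob=0.5, rng=None):
    """Zero out a random subset of the input's components, then score the
    reconstruction against the uncorrupted input, so the hidden units must
    use the surviving inputs to fill in the ones that were zeroed."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) > drop_prob   # different components each call
    x_corrupted = x * mask                   # many components set to zero
    h = sigmoid(W @ x_corrupted + b_hid)     # code computed from corrupted input
    x_hat = sigmoid(W.T @ h + b_vis)         # reconstruction
    return np.sum((x - x_hat) ** 2)          # compared with the clean input
```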
If we use a stack of denoising auto-encoders, pre-training is very
effective. There are some cases in which RBMs still
work better, but in most cases denoising auto-encoders are more effective.
It's also much simpler to evaluate the pre-training using a denoising
autoencoder because we can easily compute the value of the objective function.
When we pre-train a restricted Boltzmann machine with contrastive divergence, we
can't compute the value of the real objective function we're trying to
minimize. So, we often just use the squared
reconstruction error, which is not actually what's being minimized.
In a denoising auto-encoder, we can print out the value of the thing we're trying
to minimize, and that's very helpful. One disadvantage of the denoising
autoencoder is that it lacks the nice variational bound we get with
restricted Boltzmann machines. But that's only of theoretical interest
because it only applies if the restricted Boltzmann machine is trained with maximum
likelihood. Yet another kind of auto-encoder is the
contractive auto-encoder, which was also developed by the group in Montreal.
The way this works is that we try to make the hidden activities be as insensitive
as possible to the inputs. Of course, the hidden units can't just
ignore the inputs altogether because they have to be able to reconstruct them.
The way we achieve this insensitivity is by penalizing the squared gradient of
each hidden unit with respect to each input. That is, we try to train each hidden unit so that its activity won't change much if we change an input value.
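For logistic hidden units this penalty has a simple closed form, because the gradient of hidden activity h_j with respect to input x_i is just h_j(1 - h_j) times the weight connecting them. Here is a minimal sketch of computing it (an illustration, not from the lecture; the function name is made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b_hid):
    """Sum over hidden units and inputs of the squared gradient of each
    hidden activity with respect to each input, for logistic hidden units.
    For h_j = sigmoid(w_j . x + b_j), dh_j/dx_i = h_j * (1 - h_j) * W[j, i]."""
    h = sigmoid(W @ x + b_hid)
    jacobian = (h * (1.0 - h))[:, None] * W   # (n_hidden, n_inputs) matrix of dh_j/dx_i
    return np.sum(jacobian ** 2)
```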
Contractive auto-encoders also work very well for pre-training.
Their codes tend to have the property that only a small subset of the hidden
units are in their sensitive range. For different parts of the input space, it's a different subset, and so this active set acts like a sparse code. The other hidden units are saturated and insensitive.
RBMs actually have a very similar behavior.
After they've been trained, many of the hidden units will be saturated, and the
working set of the unsaturated ones will be different for different training
cases. I want to finish by summarizing my
current view of pre-training. There are now many different ways to do layer-by-layer pre-training that discover good features.
When our data set does not have a huge number of labels,
this way of discovering features before you ever use the labels is very helpful
for the subsequent discriminative fine tuning.
It discovers the features without using the information in the labels, and then the information in the labels is used for fine-tuning the decision boundaries between classes. It's especially useful if we have a lot of unlabeled data, so that the pre-training can do a very good job of discovering interesting features using a lot of data.
For very large labeled data sets, however, initializing the weights that are going
to be used for supervised learning by using unsupervised pre-training is not
necessary, even if the nets are deep. Pre-training was the first good way to
initialize the weights for deep nets, but now we have lots of other ways.
However, even if we have a lot of labels, if we make the nets much larger, we'll need pre-training again. So, an argument I often have with people from Google is that they say, we've got lots and lots of labeled data, so we don't need regularization methods. Our nets won't overfit anyway because we've got so much data. The counter-argument is, that's only
because you're using nets that are much too small.
You should use much, much bigger nets on much, much more powerful computers.
And then, you'll start overfitting again and you'll need these regularization
methods, like dropout and pre-training. If you ask which regime the brain is in,
the brain is clearly in the regime where it has a huge number of parameters compared with the amount of data it's got. And so to the brain, at least,
regularization methods are very important.