In this video, I'm going to talk about alternative pre-training methods for
learning deep neural nets. I introduced pre-training using
restricted Boltzmann machines trained with contrastive divergence.
But after that, people discovered there are many other ways to pre-train layers
of features. And indeed, if you initialize the weights
correctly, you may not need pre-training at all provided you have enough labeled
data. We've seen some of the neat things that
can be done with the codes produced by deep auto-encoders.
I now want to consider shallow auto-encoders that just have one hidden
layer. Restricted Boltzmann machines can be viewed as a kind of shallow auto-encoder, particularly if they're trained with contrastive divergence, because that training makes them try to make the reconstructions look like the data. Viewed as an auto-encoder, a restricted Boltzmann machine has very
strong regularization because the hidden units are only allowed to have binary
activities, and this restricts their capacity a lot.
If we train restricted Boltzmann machines with maximum likelihood, they're not at
all like auto-encoders. One way to see that is if you had a pixel
that was pure noise, an auto-encoder would try to reconstruct whatever noise
value it had. A restricted Boltzmann machine trained
with maximum likelihood would completely ignore that pixel and model it just using
the bias for that input. So, since we can view a restricted
Boltzmann machine as a kind of strongly regularized auto-encoder, maybe we can
replace the RBMs that we use for pre-training with a stack of
autoencoders. It turns out that if you do that,
pre-training is not as effective. At least, that's true if you use shallow
auto-encoders that are regularized just by penalizing the squared weights.
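To make that concrete, here is a minimal sketch of the objective for such a shallow auto-encoder with tied weights and a squared-weight penalty. This is an illustration rather than anything from the lecture, and the function and parameter names (weight_decay_autoencoder_loss, weight_cost) are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weight_decay_autoencoder_loss(x, W, b_hid, b_vis, weight_cost=1e-4):
    """Objective for a one-hidden-layer auto-encoder whose only
    regularizer is a penalty on the squared weights.
    x: (n_inputs,) data vector; W: (n_hidden, n_inputs) tied weights."""
    h = sigmoid(W @ x + b_hid)          # hidden activities (the code)
    x_hat = sigmoid(W.T @ h + b_vis)    # reconstruction from the code
    reconstruction_error = np.sum((x - x_hat) ** 2)
    weight_penalty = weight_cost * np.sum(W ** 2)
    return reconstruction_error + weight_penalty
```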
So, stacking these autoencoders doesn't work as well as stacking restricted
Boltzmann machines. However, there's a different kind of
auto-encoder that does work as well, and that's the denoising auto-encoder.
And, it's been studied extensively by the group in Montreal.
Denoising auto-encoders add noise to each input vector by setting many of its components to zero, with different components zeroed for different input vectors. This resembles dropout, but it's for the
inputs rather than the hidden units. The denoising auto-encoder is still
required to reconstruct the inputs that have been set to zero. And so, it can't
just copy its input. The danger with the shallow auto-encoder
is that if you give it enough hidden units, it might just copy each pixel to
one hidden unit, and then reconstruct that pixel from that hidden unit.
A denoising auto-encoder clearly can't do that, so it has to use hidden units that capture correlations between inputs, so that it can use the values of some inputs to help it reconstruct the inputs that have been zeroed out.
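As a rough sketch of how that corruption works (again an illustration, not from the lecture, assuming logistic units with tied weights; names like drop_prob are invented here), note that the reconstruction is scored against the clean input even though the encoder only sees the corrupted one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def denoising_autoencoder_loss(x, W, b_hid, b_vis, drop_prob=0.5, rng=None):
    """Zero out a random subset of the input's components, then score the
    reconstruction against the uncorrupted input, so the hidden units must
    use the surviving inputs to fill in the ones that were zeroed."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) > drop_prob   # different components each call
    x_corrupted = x * mask                   # many components set to zero
    h = sigmoid(W @ x_corrupted + b_hid)     # code computed from corrupted input
    x_hat = sigmoid(W.T @ h + b_vis)         # reconstruction
    return np.sum((x - x_hat) ** 2)          # compared with the clean input
```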
If we use a stack of denoising auto-encoders, pre-training is very
effective. There are some cases in which RBMs still
work better, but in most cases denoising auto-encoders are more effective.
It's also much simpler to evaluate the pre-training using a denoising
autoencoder because we can easily compute the value of the objective function.
When we pre-train a restricted Boltzmann machine with contrastive divergence, we
can't compute the value of the real objective function we're trying to
minimize. So, we often just use the squared
reconstruction error, which is not actually what's being minimized.
In a denoising auto-encoder, we can print out the value of the thing we're trying
to minimize, and that's very helpful. One disadvantage of the denoising
autoencoder is that it lacks the nice variational bound we get with
restricted Boltzmann machines. But that's only of theoretical interest
because it only applies if the restricted Boltzmann machine is trained with maximum
likelihood. Yet another kind of auto-encoder is the
contractive auto-encoder, which was also developed by the group in Montreal.
The way this works is that we try to make the hidden activities be as insensitive
as possible to the inputs. Of course, the hidden units can't just
ignore the inputs altogether because they have to be able to reconstruct them.
The way we achieve this insensitivity is by penalizing the squared gradient of
each hidden unit with respect to each input. That is, we try to train each hidden unit so that its activity won't change much if we change an input value.
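For logistic hidden units this penalty has a simple closed form, because the gradient of hidden activity h_j with respect to input x_i is just h_j(1 - h_j) times the weight connecting them. Here is a minimal sketch of computing it (an illustration, not from the lecture; the function name is made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b_hid):
    """Sum over hidden units and inputs of the squared gradient of each
    hidden activity with respect to each input, for logistic hidden units.
    For h_j = sigmoid(w_j . x + b_j), dh_j/dx_i = h_j * (1 - h_j) * W[j, i]."""
    h = sigmoid(W @ x + b_hid)
    jacobian = (h * (1.0 - h))[:, None] * W   # (n_hidden, n_inputs) matrix of dh_j/dx_i
    return np.sum(jacobian ** 2)
```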
Contractive auto-encoders also work very well for pre-training.
Their codes tend to have the property that only a small subset of the hidden
units are in their sensitive range. For different parts of the input space, it's a different subset, and so this active set acts like a sparse code. The other hidden units are saturated and insensitive.
RBMs actually have a very similar behavior.
After they've been trained, many of the hidden units will be saturated, and the
working set of the unsaturated ones will be different for different training
cases. I want to finish by summarizing my
current view of pre-training. There are now many different ways to do layer-by-layer pre-training that discover good features.
When our data set does not have a huge number of labels,
this way of discovering features before you ever use the labels is very helpful
for the subsequent discriminative fine tuning.
It discovers the features without using the information in the labels, and then the information in the labels is used for fine-tuning the decision boundaries between classes. It's especially useful if we have a lot of unlabeled data, so that the pre-training can do a very good job of discovering interesting features using a lot of data.
For very large labeled data sets, however, initializing the weights that are going
to be used for supervised learning by using unsupervised pre-training is not
necessary, even if the nets are deep. Pre-training was the first good way to
initialize the weights for deep nets, but now we have lots of other ways.
However, even if we have a lot of labels, if we make the nets much larger, we'll need pre-training again. So, an argument I often have with people from Google is that they say, we've got lots and lots of labeled data, so we don't need regularization methods. Our nets won't overfit anyway because we've got so much data. The counter-argument is, that's only
because you're using nets that are much too small.
You should use much, much bigger nets on much, much more powerful computers.
And then, you'll start overfitting again and you'll need these regularization
methods, like dropout and pre-training. If you ask which regime the brain is in,
the brain is clearly in the regime where it has a huge number of parameters compared with the amount of data it's got. And so to the brain, at least,
regularization methods are very important.