In this video, I'm going to describe a method developed by David MacKay in the
1990s for determining the weight penalties to use in a neural network without using a
validation set. It's based on the idea that we can
interpret weight penalties as doing MAP estimation, so that the magnitude of the
weight penalty is related to the tightness of the prior distribution over the
weights. MacKay showed we can empirically fit both
the weight penalties and the assumed noise in the output of the neural net to get a
method for fitting weight penalties that does not require a validation set and
therefore allows us to have different weight penalties for different subsets of
the connections in a neural network, something that would be very expensive to
do using validation sets. MacKay went on to win competitions using
this kind of method. I'm now going to describe a simple and
practical method developed by David MacKay for making use of the fact that we can
interpret weight penalties as the ratio of two variances.
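To make the ratio concrete, here is the standard MAP derivation written out as a sketch. The symbols are assumptions that do not appear in the talk itself: y_c and t_c are the network output and the target on training case c, w_i are the weights, sigma_D^2 is the variance of the Gaussian output noise, and sigma_W^2 is the variance of the zero-mean Gaussian prior on the weights.

```latex
\begin{align*}
-\log p(W \mid D) &= \frac{1}{2\sigma_D^2}\sum_c (y_c - t_c)^2
                   + \frac{1}{2\sigma_W^2}\sum_i w_i^2 + \text{const} \\
E(W) &= \sum_c (y_c - t_c)^2 + \frac{\sigma_D^2}{\sigma_W^2}\sum_i w_i^2
\end{align*}
```

The second line is the first multiplied through by 2 sigma_D^2 with the constant dropped, so the weight-penalty coefficient is the noise variance divided by the prior variance.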
After we've learned a model to minimize squared error, we can find the best value
for the output variance, and that best value is found by simply using the variance of
the residual errors. We can also estimate the variance in the
Gaussian prior on the weights. We have to start with some guess about
what this variance should be. Then, we do some learning, and then we use
a very dirty trick called empirical Bayes. We set the variance of our prior to be the
variance of the weights of the model learned because that's the variance that
will make those weights most likely. This really violates a lot of the
presuppositions of the Bayesian approach. We're using the data to decide what our
prior beliefs are. So, once we've learned the weights, we fit
a zero mean Gaussian to the one-dimensional distribution of the
learned weights. And then, we take the variance of that
Gaussian, and we use that for our prior. Now, one nice thing about that is that
for different subsets of weights, like in different layers, for example, we
could learn different variances.
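As a rough illustration of fitting a zero-mean Gaussian to each layer's weights, here is a minimal Python sketch. The function name, the layer names, and the randomly generated "learned" weights are all hypothetical stand-ins. It relies on the fact that the maximum-likelihood variance of a zero-mean Gaussian is just the mean of the squared values.

```python
import numpy as np

def zero_mean_gaussian_variance(weights):
    """Maximum-likelihood variance of a zero-mean Gaussian fitted to the weights.

    Because the mean is fixed at zero, this is the mean of the squared weights.
    """
    w = np.asarray(weights, dtype=float).ravel()
    return float(np.mean(w ** 2))

rng = np.random.default_rng(0)

# Hypothetical learned weights for two layers with different scales.
layers = {
    "hidden": rng.normal(0.0, 0.5, size=(20, 10)),
    "output": rng.normal(0.0, 2.0, size=(10, 1)),
}

# One prior variance per layer, so each layer can get its own weight penalty.
prior_variances = {
    name: zero_mean_gaussian_variance(w) for name, w in layers.items()
}
```

Each entry of prior_variances would then be divided into the noise variance to give that layer's own penalty coefficient.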
We don't need a validation set so we can use all of the non-test data for training.
And because we don't need validation sets to determine the weight penalties in
different layers, we can actually have many different weight penalties.
This would be very hard to do with validation sets.
So, here's MacKay's method. You start by guessing the noise variance
and the weight prior variance. Actually, all you have to really do is
guess their ratio. Then, you do some gradient descent
learning, trying to improve the weights. Then, you reset the noise variance to be
the variance of the residual errors and you reset the weight prior variance to be
the variance of the distribution of the actually learned weights.
And then, you go back around this loop again.
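The loop just described can be sketched in Python as follows. This is only an illustration under simplifying assumptions: it uses a linear model instead of a neural net, a single weight penalty for all weights, and plain batch gradient descent. The function name, the learning rate, and the numbers of rounds and steps are hypothetical choices, not something stated in the talk.

```python
import numpy as np

def mackay_style_loop(X, t, n_rounds=10, n_steps=200, lr=0.1):
    """Alternate gradient descent with re-estimating the two variances."""
    n, d = X.shape
    w = np.zeros(d)
    # Initial guesses; only their ratio affects the learning.
    noise_var, prior_var = 1.0, 1.0
    for _ in range(n_rounds):
        # The weight penalty is the ratio of the two variances.
        penalty = noise_var / max(prior_var, 1e-12)
        for _ in range(n_steps):
            residual = X @ w - t
            # Gradient of 0.5 * sum(residual**2) + 0.5 * penalty * sum(w**2),
            # divided by n to keep the step size well scaled.
            grad = (X.T @ residual + penalty * w) / n
            w -= lr * grad
        residual = X @ w - t
        # Reset the noise variance to the variance of the residual errors.
        noise_var = float(np.mean(residual ** 2))
        # Reset the prior variance to that of a zero-mean Gaussian
        # fitted to the learned weights.
        prior_var = float(np.mean(w ** 2))
    return w, noise_var, prior_var

# Synthetic data, made up purely for this example.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
t = X @ true_w + rng.normal(0.0, 0.5, size=200)

w, noise_var, prior_var = mackay_style_loop(X, t)
```

No validation set is used anywhere: both variances are re-estimated from the training data and the current weights on each pass around the loop.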
So, this actually works quite well in practice.
And MacKay won several competitions this way.