6. Mackays Quick And Dirty Method of Setting Weight Costs

In this video, I'm going to describe a method developed by David MacKay in the 1990s for determining the weight penalties to use in a neural network without using a validation set. It's based on the idea that we can interpret weight penalties as doing map estimation so that the magnitude of the weight penalty is related to the tightness of the prior distribution over the weights. Mackay showed we can empirically fit both the weight penalties and the assumed noise in the output of the neural net to get a method for fitting weight penalties that does not require a validation set and therefore, allows us to have different weight penalties for different subsets as the connections in a neural network, something that will be very expensive to do using validation sets. Mackay went on to win competitions using this kind of method. I'm now going to describe a simple and practical method developed by David MacKay for making use of the fact that we can interpret weight penalties as the ratio of two variances. After we've learned a model to minimize squared error we can find the best value for the output variance and the best value is found by simply using the variance of the residual errors. We can also estimate the variance in the Gaussian prior of the weight. We have to start with some guess about what this variance should be. Then, we do some learning, and then we use a very dirty trick called empirical Bayes. We set the variance of our prior to be the variance of the weights of the model learned because that's the variance that will make those weights most likely. This really violates a lot of the presuppositions of the Bayesian approach. We're using the data to decide what our prior beliefs are. So, once we've learned the weights, we fit a zero mean Gaussian to the one-dimensional distribution of the learned weights. And then, we take the variance of that Gaussian, and we use that for our prior. Now, one nice thing about that is, is the different subsets of weights. Like in different layers, for example, we could learn different variances for the different layers. We don't need a validation set so we can use all of the non-test data for training. And because we don't need validation sets to determine the weight penalties in different layers, we can actually have many different weight penalties. This will be very hard to do with validation sets. So, here's MacKay's method. You start by guessing the noise variance and the weight prior variance. Actually, all you have to really do is guess their ratio. Then, you do some gradient descent learning, trying to improve the weights. Then, you reset the noise variance to be the variance of the residual errors and you reset the weight prior variance to be the distribution of the actually learned weight. And then, you go back around this loop again. So, this actually works quite well in practice. And MacKay won several competitions this way.