In this video, I'll talk about why it's difficult to learn sigmoid belief nets.
And then in the following two videos, I'll describe two different methods we
discovered that allow us to do the learning.
The good news about learning in sigmoid belief nets is that unlike Boltzmann
machines, we don't need two different phases.
We just need what in a Boltzmann machine would be the positive phase.
That's because sigmoid belief nets are what is called locally normalized models.
So we don't have to deal with a partition function or its derivatives.
Another piece of good news about sigmoid belief nets is that if we could get
unbiased samples from the posterior distribution over the hidden units, given
the data vector, then learning would be easy.
That is, we could follow the gradient specified by maximum likelihood learning,
in a mini batch stochastic kind of way. The problem is that it's hard to get
unbiased samples, from the posterior distribution over the hidden units.
This is largely due to a phenomenon that Judea Pearl calls explaining away.
I'll explain explaining away in this video, and it's important to understand it.
Now, I'm going to talk about why it's difficult to learn sigmoid belief nets.
As we've seen, it's easy to generate an unbiased sample, once you've done the
learning. That is, once we've decided on the weights in the network, we can
easily see the kinds of things the network believes in by
generating samples from this model. This is done top down, one layer at a
time. It's easy, because it's a causal model.
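The top-down, one-layer-at-a-time sampling can be sketched in a few lines. This is a minimal NumPy sketch, with a single hidden layer and made-up layer sizes and weights (none of these numbers come from the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ancestral_sample(bias_h, w, bias_v, rng):
    # Top layer first: with no parents, each hidden cause is an
    # independent Bernoulli draw from its bias.
    h = (rng.random(bias_h.shape) < sigmoid(bias_h)).astype(float)
    # Then the visible layer, conditioned on the sampled parents.
    v = (rng.random(bias_v.shape) < sigmoid(h @ w + bias_v)).astype(float)
    return h, v

rng = np.random.default_rng(0)
bias_h = np.zeros(3)            # 3 hidden causes, prior prob 0.5 each
w = rng.normal(size=(3, 5))     # illustrative weights to 5 visible units
bias_v = np.zeros(5)
h, v = ancestral_sample(bias_h, w, bias_v, rng)
```

Because each unit is sampled only after its parents, a single top-down pass gives an unbiased sample from the model's distribution, which is exactly what makes generation easy in a causal model.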
However, even if we know the weights, it's hard to infer the posterior distribution
over hidden causes when we observe the visible effects.
The reason for this is that the number of possible patterns of hidden causes is
exponential in the number of hidden nodes. It's hard even to get a sample from the
posterior, which is what we need if we're going to do stochastic gradient descent.
So, given this difficulty in sampling from the posterior, it's hard to see how
we can learn sigmoid belief nets with millions of parameters, which is what
we'd like to do. This is a very different regime from the
one normally used with graphical models. There they have interpretable models, and
they're trying to learn dozens or maybe hundreds of parameters.
They're not typically trying to learn millions of parameters.
Now, before I go into ways in which we can try to get samples from the
posterior distribution, I just want to tell you what the learning rule is if
we could get those samples. So, if we can get an unbiased sample from
the posterior distribution of hidden states, given the observed data, then
learning is easy. So here's part of a sigmoid belief net,
and we're going to suppose that for every node we have a binary value.
So, for node J, that binary value is SJ. And that vector of binary values is a
global configuration for the net, which is a sample from the posterior distribution.
In order to do maximum likelihood learning, all we have to do is maximize
the log probability that the inferred binary state of unit I would be generated
from the inferred binary states of its parents.
So the learning rule is local and simple. The probability that the parents of I
would turn I on, is given by a logistic function that involves the binary states
of the parents. And what we need to do is make that
probability similar to the actually observed binary value of I. And although
I'm not going to derive it here, the maximum likelihood learning rule for
the weight WJI is simply to change it in proportion to the state of J times the
difference between the binary state of I and the probability that the binary
states of I's parents would turn it on. So to summarize: if we have an
assignment of binary states to all the hidden nodes, then it's easy to
do maximum likelihood learning in our typical stochastic way.
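Under the assumption that we already have sampled binary states for unit I and its parents J, that local rule can be sketched like this (the learning rate and sizes here are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ml_weight_update(s_parents, s_i, w, bias_i, lr=0.1):
    # p_i: probability that the sampled parents would turn unit i on,
    # given by a logistic function of the parents' binary states.
    p_i = sigmoid(s_parents @ w + bias_i)
    # delta w_ji is proportional to s_j * (s_i - p_i): local and simple.
    return lr * s_parents * (s_i - p_i)

s_parents = np.array([1.0, 0.0, 1.0])  # sampled states of i's parents
w = np.zeros(3)                        # weights into unit i
dw = ml_weight_update(s_parents, s_i=1.0, w=w, bias_i=0.0)
# With zero weights, p_i = 0.5, so each active parent gets 0.1 * 0.5 = 0.05.
```

Notice the update needs only the unit's own state, its parents' states, and the prediction from those parents: nothing global.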
That is, we sample from the posterior, then update the weights based on that
sample, and we average that update over a mini-batch of samples.
So, let's go back to the issue of why it's
hard to sample from the posterior. The reason it's hard to get an unbiased
sample from the posterior over the hidden nodes, given an observed data vector
at the leaf nodes, is a phenomenon called explaining away.
So if you look at this little sigmoid belief net here, it has two hidden causes
and one observed effect. And if you look at the biases, you'll see
that the observed effect of a house jumping is very unlikely to happen unless
one of those causes is true. But if one of those causes has happened,
the twenty cancels the minus twenty, and the house will jump with a
probability of a half. Each of the causes is itself rather
unlikely, but not nearly as unlikely as the house spontaneously jumping.
So if you see the house jump, one plausible explanation is that a truck hit
the house. A different plausible explanation, is that
it was an earthquake. And each of those has a probability of
about e to the minus ten, whereas the house jumping spontaneously
has a probability of about e to the minus twenty.
However, if you assume both hidden causes happened, that has a probability of
e to the minus twenty, so that's extremely unlikely, even if the house did jump.
So assuming there was an earthquake, reduces the probability that the house
jumped because the truck hit it. And we get an anti-correlation between the
two hidden causes when we've observed the house jumping.
Notice in the model itself, in the prior for the model, these two hidden causes are
quite independent. So if the house jumps, it's basically an even chance that it
was because of the truck or because of the earthquake.
The posterior actually looks something like this:
there are four possible patterns of hidden
causes, given that the house jumped. Two of them are extremely unlikely:
namely, that the truck hit the house and there was an earthquake,
or that neither of those things happened. The other two combinations are
equally probable, and you'll notice they form an exclusive or.
We have two likely patterns of causes which are just the opposites of each
other. That's explaining away.
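Brute-force enumeration makes that posterior concrete. Using the numbers from the example (biases of minus ten on each cause, minus twenty on the house, weights of twenty), the two exclusive-or patterns soak up essentially all the posterior mass:

```python
import numpy as np
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

B_CAUSE, B_HOUSE, W = -10.0, -20.0, 20.0  # numbers from the lecture's example

def joint(truck, quake, jump):
    # p(truck) * p(quake) * p(jump | truck, quake); the two causes
    # are independent in the prior.
    p_t = sigmoid(B_CAUSE) if truck else 1.0 - sigmoid(B_CAUSE)
    p_q = sigmoid(B_CAUSE) if quake else 1.0 - sigmoid(B_CAUSE)
    p_jump = sigmoid(B_HOUSE + W * (truck + quake))
    return p_t * p_q * (p_jump if jump else 1.0 - p_jump)

# Posterior over the four cause patterns, given that the house jumped.
patterns = list(product([0, 1], repeat=2))          # (truck, quake)
unnorm = np.array([joint(t, q, 1) for t, q in patterns])
posterior = unnorm / unnorm.sum()
# posterior is roughly [0, 0.5, 0.5, 0] over (0,0), (0,1), (1,0), (1,1):
# the two single-cause explanations split the mass, and both-causes is
# about as unlikely as neither.
```

The same enumeration also exposes the anti-correlation: conditioning on one cause being true drives the posterior probability of the other toward zero.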
Now that we've understood explaining away, let's go back to the issue of
learning a deep sigmoid belief net.
So we're going to have multiple layers of hidden variables.
They're going to give rise to some data in our causal model.
And we want to learn those weights, W, between the first layer of hidden
variables and the data. And let's see what it takes to learn W.
First of all, the posterior distribution over the first layer of hidden variables
is not going to be factorial. They're not independent in the posterior.
And that's because of explaining away. So, even if we just had that layer of
hidden variables, once we've seen the data, they wouldn't be independent of one
another. But because we have higher layers of
hidden variables, they're not even independent in the prior.
The hidden variables in the layers above create a prior, and that prior itself
will cause correlations between the hidden variables in the first layer.
To learn W, we need to learn the posterior in the first hidden layer, or at
least an approximation to it. And even if we're only approximating it,
we need to know all of the weights in the higher layers in order to compute
that prior term. In fact, it's even worse than that.
Because to compute that prior term, we need to integrate out all the hidden
variables in higher layers. That is, we need to consider all possible
patterns of activity in these higher layers.
And combine them all to compute the prior that the higher levels create for the
first hidden layer. Computing that prior is a very complicated
thing. So these three problems suggest that it's
gonna be extremely difficult to learn those weights W.
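To see why computing that prior is so expensive, here's a hypothetical tiny example with just two units in the layer above: even here, the prior on the first hidden layer means summing over every pattern of activity above, and that sum grows as two to the number of units (all sizes and weights below are made up for illustration):

```python
import numpy as np
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
w = rng.normal(size=(2, 2))   # weights from the layer above down to layer 1
b_top = np.zeros(2)           # biases of the layer above
b1 = np.zeros(2)              # biases of the first hidden layer

def prior_layer1(h1):
    # p(h1) = sum over all patterns h2 of the layer above of
    # p(h2) * p(h1 | h2) -- exponential in the number of units above.
    h1 = np.array(h1, dtype=float)
    total = 0.0
    for bits in product([0.0, 1.0], repeat=2):
        h2 = np.array(bits)
        p_h2 = np.prod(np.where(h2 == 1.0, sigmoid(b_top), 1.0 - sigmoid(b_top)))
        p_on = sigmoid(h2 @ w + b1)
        p_h1 = np.prod(np.where(h1 == 1.0, p_on, 1.0 - p_on))
        total += p_h2 * p_h1
    return total

# Sanity check: the prior over all patterns of the first layer sums to one.
total = sum(prior_layer1(h1) for h1 in product([0, 1], repeat=2))
```

With two units above, the inner loop has four terms; with a realistically sized layer of, say, five hundred units, it would have 2^500, which is why nothing like this direct computation is feasible.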
And in particular, we're not gonna be able to learn them without doing a lot of work
in the higher layers to compute the prior. So now we're gonna consider some methods
for learning deep belief nets. The first one is the Monte Carlo method
used by Radford Neal. And that Monte Carlo method basically does
all the work. That is, if we go back to the previous slide, it considers
patterns of activity over all of the hidden variables,
and it runs a Markov chain that takes a long time to settle down, given the
data vector. And once it's settled down to thermal
equilibrium, you get a sample from the posterior, but it's a lot of work.
So, in large, deep belief nets, this method is pretty slow.
In the 1990's people developed much faster methods for learning deep belief nets,
which we call variational methods. In fact, this is where variational methods
came from, at least in the artificial intelligence community.
The variational methods give up on getting unbiased samples from the posterior,
and they content themselves with just getting approximate samples, that is,
samples from some other distribution that approximates the posterior.
Now as we saw before, if we have samples from the posterior, maximum likelihood
learning is simple. If we have samples from some other
distribution, we could still use the maximum likelihood learning rule, but it's
not clear what will happen. On the face of it, crazy things might
happen if we're using the wrong distribution to get our samples.
There doesn't seem to be any guarantee that things will improve.
In fact there is a guarantee that something will improve.
It's not the log probability that the model would generate the data.
But it is related to that. In fact, it's a lower bound on that log
probability. And by pushing up the lower bound, we can
usually push up the log probability.
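The bound in question is the standard variational lower bound: for any approximating distribution q over the hidden configurations h, Jensen's inequality gives

```latex
\log p(v) \;=\; \log \sum_h p(v, h)
\;\ge\; \sum_h q(h) \log \frac{p(v, h)}{q(h)}
\;=\; \log p(v) \;-\; \mathrm{KL}\!\left(q(h) \,\|\, p(h \mid v)\right)
```

so the gap between the bound and the true log probability is exactly the KL divergence from the approximating distribution to the true posterior. Since that divergence is never negative, pushing the bound up must either raise the log probability or tighten the approximation.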