In this video, I'll talk about why it's difficult to learn sigmoid belief nets.
And then in the following two videos, I'll describe two different methods we
discovered that allow us to do the learning.
The good news about learning in sigmoid belief nets is that unlike Boltzmann
machines, we don't need two different phases.
We just need what in a Boltzmann machine would be the positive phase.
That's because sigmoid belief nets are what is called locally normalized models.
So we don't have to deal with a partition function or its derivatives.
Another piece of good news about sigmoid belief nets is that if we could get
unbiased samples from the posterior distribution over the hidden units, given
the data vector, then learning would be easy.
That is, we could follow the gradient specified by maximum likelihood learning,
in a mini batch stochastic kind of way. The problem is that it's hard to get
unbiased samples, from the posterior distribution over the hidden units.
This is largely due to a phenomenon that Judea Pearl calls explaining away.
I'll explain explaining away in this video, and it's important to understand it.
Now, I'm going to talk about why it's difficult to learn sigmoid belief nets.
As we've seen, it's easy to generate an unbiased sample, once you've done the
learning. That is, once we've decided on the weights in the network, we can
easily see the kinds of things the network believes in by
generating samples from this model. This is done top down, one layer at a
time. It's easy, because it's a causal model.
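The top-down, one-layer-at-a-time sampling can be sketched in a few lines. This is a minimal NumPy sketch, with a single hidden layer and made-up layer sizes and weights (none of these numbers come from the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ancestral_sample(bias_h, w, bias_v, rng):
    # Top layer first: with no parents, each hidden cause is an
    # independent Bernoulli draw from its bias.
    h = (rng.random(bias_h.shape) < sigmoid(bias_h)).astype(float)
    # Then the visible layer, conditioned on the sampled parents.
    v = (rng.random(bias_v.shape) < sigmoid(h @ w + bias_v)).astype(float)
    return h, v

rng = np.random.default_rng(0)
bias_h = np.zeros(3)            # 3 hidden causes, prior prob 0.5 each
w = rng.normal(size=(3, 5))     # illustrative weights to 5 visible units
bias_v = np.zeros(5)
h, v = ancestral_sample(bias_h, w, bias_v, rng)
```

Because each unit is sampled only after its parents, a single top-down pass gives an unbiased sample from the model's distribution, which is exactly what makes generation easy in a causal model.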
However, even if we know the weights, it's hard to infer the posterior distribution
over hidden causes when we observe the visible effects.
The reason for this is that the number of possible patterns of hidden causes is
exponential in the number of hidden nodes. It's hard even to get a sample from the
posterior, which is what we need if we're going to do stochastic gradient descent.
So, given this difficulty in sampling from the posterior, it's hard to see how
we can learn sigmoid belief nets with millions of parameters, which is what
we'd like to do. This is a very different regime from the
one normally used with graphical models. There they have interpretable models, and
they're trying to learn dozens or maybe hundreds of parameters.
They're not typically trying to learn millions of parameters.
Now, before I go into ways in which we can try to get samples from the
posterior distribution, I just want to tell you what the learning rule is if
we could get those samples. So, if we can get an unbiased sample from
the posterior distribution of hidden states, given the observed data, then
learning is easy. So here's part of a sigmoid belief net,
and we're going to suppose that for every node we have a binary value.
So, for node J, that binary value is SJ. And that vector of binary values is a
global configuration for the net, which is a sample from the posterior distribution.
In order to do maximum likelihood learning, all we have to do is maximize
the log probability that the inferred binary state of unit I would be generated
from the inferred binary states of its parents.
So the learning rule is local and simple. The probability that the parents of I
would turn I on, is given by a logistic function that involves the binary states
of the parents. And what we need to do is make that
probability similar to the actually observed binary value of I. And although
I'm not going to derive it here, the maximum likelihood learning rule for
the weight WJI is simply to change it in proportion to the state of J times the
difference between the binary state of I and the probability that the binary
states of I's parents would turn it on. So to summarize: if we have an
assignment of binary states to all the hidden nodes, then it's easy to
do maximum likelihood learning in our typical stochastic way.
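Under the assumption that we already have sampled binary states for unit I and its parents J, that local rule can be sketched like this (the learning rate and sizes here are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ml_weight_update(s_parents, s_i, w, bias_i, lr=0.1):
    # p_i: probability that the sampled parents would turn unit i on,
    # given by a logistic function of the parents' binary states.
    p_i = sigmoid(s_parents @ w + bias_i)
    # delta w_ji is proportional to s_j * (s_i - p_i): local and simple.
    return lr * s_parents * (s_i - p_i)

s_parents = np.array([1.0, 0.0, 1.0])  # sampled states of i's parents
w = np.zeros(3)                        # weights into unit i
dw = ml_weight_update(s_parents, s_i=1.0, w=w, bias_i=0.0)
# With zero weights, p_i = 0.5, so each active parent gets 0.1 * 0.5 = 0.05.
```

Notice the update needs only the unit's own state, its parents' states, and the prediction from those parents: nothing global.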
That is, we sample from the posterior, then update the weights based on that
sample, and we average that update over a mini-batch of samples.
So, let's go back to the issue of why it's
hard to sample from the posterior. The reason it's hard to get an unbiased
sample from the posterior over the hidden nodes, given an observed data vector
at the leaf nodes, is a phenomenon called explaining away.
So if you look at this little sigmoid belief net here, it has two hidden causes
and one observed effect. And if you look at the biases, you'll see
that the observed effect of a house jumping is very unlikely to happen unless
one of those causes is true. But if one of those causes has happened,
the twenty cancels the minus twenty, and the house will jump with a
probability of a half. Each of the causes is itself rather
unlikely, but not nearly as unlikely as the house spontaneously jumping.
So if you see the house jump, one plausible explanation is that a truck hit
the house. A different plausible explanation, is that
it was an earthquake. And each of those has a probability of
about e to the minus ten, whereas the house jumping spontaneously
has a probability of about e to the minus twenty.
However, if you assume both hidden causes happened, that has a probability of
e to the minus twenty, so that's extremely unlikely, even if the house did jump.
So assuming there was an earthquake, reduces the probability that the house
jumped because the truck hit it. And we get an anti-correlation between the
two hidden causes when we've observed the house jumping.
Notice in the model itself, in the prior for the model, these two hidden causes are
quite independent. So if the house jumps, it's basically an even chance that it
was because of the truck or because of the earthquake.
The posterior actually looks something like this:
there are four possible patterns of hidden
causes, given that the house jumped. Two of them are extremely unlikely:
namely, that the truck hit the house and there was an earthquake,
or that neither of those things happened. The other two combinations are
equally probable, and you'll notice they form an exclusive or.
We have two likely patterns of causes which are just the opposites of each
other. That's explaining away.
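Brute-force enumeration makes that posterior concrete. Using the numbers from the example (biases of minus ten on each cause, minus twenty on the house, weights of twenty), the two exclusive-or patterns soak up essentially all the posterior mass:

```python
import numpy as np
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

B_CAUSE, B_HOUSE, W = -10.0, -20.0, 20.0  # numbers from the lecture's example

def joint(truck, quake, jump):
    # p(truck) * p(quake) * p(jump | truck, quake); the two causes
    # are independent in the prior.
    p_t = sigmoid(B_CAUSE) if truck else 1.0 - sigmoid(B_CAUSE)
    p_q = sigmoid(B_CAUSE) if quake else 1.0 - sigmoid(B_CAUSE)
    p_jump = sigmoid(B_HOUSE + W * (truck + quake))
    return p_t * p_q * (p_jump if jump else 1.0 - p_jump)

# Posterior over the four cause patterns, given that the house jumped.
patterns = list(product([0, 1], repeat=2))          # (truck, quake)
unnorm = np.array([joint(t, q, 1) for t, q in patterns])
posterior = unnorm / unnorm.sum()
# posterior is roughly [0, 0.5, 0.5, 0] over (0,0), (0,1), (1,0), (1,1):
# the two single-cause explanations split the mass, and both-causes is
# about as unlikely as neither.
```

The same enumeration also exposes the anti-correlation: conditioning on one cause being true drives the posterior probability of the other toward zero.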
Now that we've understood explaining away, let's go back to the issue of
learning a deep sigmoid belief net.
So we're going to have multiple layers of hidden variables.
They're going to give rise to some data in our causal model.
And we want to learn those weights, W, between the first layer of hidden
variables and the data. And let's see what it takes to learn W.
First of all, the posterior distribution over the first layer of hidden variables
is not going to be factorial. They're not independent in the posterior.
And that's because of explaining away. So, even if we just had that layer of
hidden variables, once we've seen the data, they wouldn't be independent of one
another. But because we have higher layers of
hidden variables, they're not even independent in the prior.
The hidden variables in the layers above create a prior, and that prior itself
will cause correlations between the hidden variables in the first layer.
To learn W, we need to learn the posterior in the first hidden layer, or at
least an approximation to it. And even if we're only approximating it,
we need to know all of the weights in the higher layers in order to compute
that prior term. In fact, it's even worse than that.
Because to compute that prior term, we need to integrate out all the hidden
variables in higher layers. That is, we need to consider all possible
patterns of activity in these higher layers.
And combine them all to compute the prior that the higher levels create for the
first hidden layer. Computing that prior is a very complicated
thing. So these three problems suggest that it's
gonna be extremely difficult to learn those weights W.
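To see why computing that prior is so expensive, here's a hypothetical tiny example with just two units in the layer above: even here, the prior on the first hidden layer means summing over every pattern of activity above, and that sum grows as two to the number of units (all sizes and weights below are made up for illustration):

```python
import numpy as np
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
w = rng.normal(size=(2, 2))   # weights from the layer above down to layer 1
b_top = np.zeros(2)           # biases of the layer above
b1 = np.zeros(2)              # biases of the first hidden layer

def prior_layer1(h1):
    # p(h1) = sum over all patterns h2 of the layer above of
    # p(h2) * p(h1 | h2) -- exponential in the number of units above.
    h1 = np.array(h1, dtype=float)
    total = 0.0
    for bits in product([0.0, 1.0], repeat=2):
        h2 = np.array(bits)
        p_h2 = np.prod(np.where(h2 == 1.0, sigmoid(b_top), 1.0 - sigmoid(b_top)))
        p_on = sigmoid(h2 @ w + b1)
        p_h1 = np.prod(np.where(h1 == 1.0, p_on, 1.0 - p_on))
        total += p_h2 * p_h1
    return total

# Sanity check: the prior over all patterns of the first layer sums to one.
total = sum(prior_layer1(h1) for h1 in product([0, 1], repeat=2))
```

With two units above, the inner loop has four terms; with a realistically sized layer of, say, five hundred units, it would have 2^500, which is why nothing like this direct computation is feasible.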
And in particular, we're not gonna be able to learn them without doing a lot of work
in the higher layers to compute the prior. So now we're gonna consider some methods
for learning deep belief nets. The first one is the Monte Carlo method
used by Radford Neal. And that Monte Carlo method basically does
all the work. That is, if we go back to the previous slide, it considers
patterns of activity over all of the hidden variables,
and it runs a Markov chain that takes a long time to settle down, given the
data vector. And once it's settled down to thermal
equilibrium, you get a sample from the posterior, but it's a lot of work.
So, in large, deep belief nets, this method is pretty slow.
In the 1990's people developed much faster methods for learning deep belief nets,
which we call variational methods. In fact, this is where variational methods
came from, at least in the artificial intelligence community.
The variational methods give up on getting unbiased samples from the posterior,
and they content themselves with just getting approximate samples, that is,
samples from some other distribution that approximates the posterior.
Now as we saw before, if we have samples from the posterior, maximum likelihood
learning is simple. If we have samples from some other
distribution, we could still use the maximum likelihood learning rule, but it's
not clear what will happen. On the face of it, crazy things might
happen if we're using the wrong distribution to get our samples.
There doesn't seem to be any guarantee that things will improve.
In fact there is a guarantee that something will improve.
It's not the log probability that the model would generate the data.
But it is related to that. In fact, it's a lower bound on that log
probability. And by pushing up the lower bound, we can
usually push up the log probability.
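The bound in question is the standard variational lower bound: for any approximating distribution q over the hidden configurations h, Jensen's inequality gives

```latex
\log p(v) \;=\; \log \sum_h p(v, h)
\;\ge\; \sum_h q(h) \log \frac{p(v, h)}{q(h)}
\;=\; \log p(v) \;-\; \mathrm{KL}\!\left(q(h) \,\|\, p(h \mid v)\right)
```

so the gap between the bound and the true log probability is exactly the KL divergence from the approximating distribution to the true posterior. Since that divergence is never negative, pushing the bound up must either raise the log probability or tighten the approximation.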