MCMC: Metropolis-Hastings

This blog is a continuation of the previous discussion on MCMC, What are Monte Carlo Markov Chains (MCMC)? I introduce a well known MCMC algorithm, Metropolis-Hastings (MH).  I describe how MH behaves and prove how it holds detailed balance. Then I show an example of a bad proposal, and move on to show what generally people choose for good proposals. I close with a brief comparison of MCMC and (Self-Normalized) Importance Sampling.

Before we start, I’d like to thank Jan-Willem van de Meent for his lectures in his Advance Machine Learning Class for PhD students at Northeastern University. The images shown are from his lectures. In addition, I’d like to address that more details on MC algorithms can be found in An Introduction to Probabilistic Programming available [arXiv]. This book is intended as a graduate-level introduction to probabilistic programming languages and methods for inference in probabilistic programs.

It turns out that it’s not so easy to design a transition operator that guarantees to have detailed balance. But what if maybe we can use something that turns any proposal operator (which is something we can come up with) into something that satisfies detailed balance. This folks, is what Metropolis-Hastings does for us!

What is Metropolis-Hastings?

The intuition behind Metropolis-Hastings is that

  1. We know we need to have an operator that satisfies detailed balance
  2. We can propose something ( some sample x')
  3. We can check whether the proposal is good or not
  4. And sometimes we are going to keep that proposal, and sometimes we are not…. in such a way that keeping it vs not keeping it satisfies detailed balance.

Now I’ll introduce mathematically how MH works and then discuss why it works.

We begin with a current sample, x^s, which we will use to get a new sample, x^{'} using a proposal distribution, q(x' \mid x^s). The notation looks as follows: x' \sim q(x' \mid x^s).

Then we decide to keep the new sample, x^{s+1} = x' with probability:

\alpha = \min\left(1, \frac{\pi(x')q(x^s|x')}{\pi(x^s)q(x'|x^s)}\right)

To discuss what this probability actually means, let’s remind ourselves what the detailed balance relationship is. It is: \pi(x)p(x'|x) = \pi(x')p(x|x'). But since we are now working with proposals, let’s switch notation from p to q.

\pi(x)q(x'|x) = \pi(x')q(x|x')

Side Note: Sometimes transition probabilities such as q(x \mid x') can be referred to as a kernel. Instead of being specific with what is making the transition, we generalize by saying we are using a transition kernel. Common notation is \kappa(x \mid x'). They both represent getting one sample dependent on another sample.

Let’s suppose that q(\cdot) satisfies detailed balance, then the ratio here between \pi(x)q(x'|x) and \pi(x')q(x|x') is 1. So now let’s refer back to the probability:

\alpha = \min\left(1, \frac{\pi(x')q(x^s|x')}{\pi(x^s)q(x'|x^s)}\right)

Here we are saying that with probability \alpha, we are going to set the proposal to our new sample, x^{s+1} = x', and with probability (1 - \alpha) we are going to reject the proposal. When we reject the proposal, our next sample is set to the same sample as before, , x^{s+1} = x.

We see that if \alpha = 1, we always accept the proposal. Basically, if our transition kernel always satisfies detailed balance, we can always accept it!

However, if it does not satisfy detailed balance, then this means that somehow there’s an imbalance with the kinds of proposals we are getting. There are some kind of samples that we are proposing too often and other kinds that we are not proposing often enough.

Question: if the ratio \frac{\pi(x')q(x^s|x')}{\pi(x^s)q(x'|x^s)} is larger than 1, does this mean we are proposing some kind of samples too often or not often enough?

Let’s think about it! If we are not proposing the kind of samples we want more, then we should probably accept them with higher probability. However, if we are seeing a sample too often, then maybe we should only accept it sometimes. Therefore if \frac{\pi(x')q(x|x')}{\pi(x) q(x'|x)} > 1, it’s because we are seeing a rare sample we want more of (we are proposing not often enough). However, basic probability tells us that probabilities max out at 1. This is why we add the \text{min} to get the probability \alpha.

So what does Metropolis-Hastings do for us? It’s an algorithm that basically corrects our proposals in order to preserve detailed-balance! Neat, huh?

Proving Detailed Balance for Metropolis-Hastings

The idea behind Metropolis-Hasting is that we can get a new (transitioned/dependent) sample when we sample from a proposal, x' \sim q(x'|x^s), and get our next sample, x^{s+1} = x' with probability \alpha = \min\left(1, \frac{\pi(x')q(x|x')}{\pi(x)q(x'|x)}\right), otherwise we keep our previous sample, x^{s+1} = x^s.

When we have this particular sampling scheme, we an have an implied transition kernel that can tell us (when we have x and x'), what is the probability that we are going to see a transition from x to x'. In order to reason about this, we need to consider two cases. It could be that x' is not the same value as x, but they could be in fact the same value where x = x'. We formulate this by working backwards.

In the figure above, we show that in order to get a value x \neq x', we first need to sample an x' from q(x'|x) and then we need to decide to keep the sample with probability \alpha. Therefore for the case where x \neq x', we have \alpha * q(x'|x).

Note that there are two cases where x = x' and \alpha = 1. The first is if somehow we were able to sample the exact same sample x again (this can only happen in some cases, for example if we are sampling from a Bernoulli distribution, but if we were sampling from a Normal, then it would be very unlikely). This gives us \alpha * q(x|x) = q(x|x). The second possibility of getting the same sample again is by rejecting. Then we calculate the probability that any rejected sample will give us the same sample as before. We integrate over possible values we could have rejected, x'', and then we write the probability of (1 - \alpha'')q(x''|x).

We can prove detailed balance once we write out the kernel like this quite easily because it splits the problem into two small proofs, one for the case where x = x' and another for x \neq x'.

In the case where x = x, we can just say: \pi(x)\kappa(x|x) = \pi(x)\kappa(x|x)

Really in MH, the point is not to describe how often we stay at the sample (there are other properties that should help this from happening), but rather how often we transition from sample to sample. Therefore we care more about the next case.

In the case where x = x', we can say:

\pi(x)\kappa(x'|x) = \alpha q(x'|x)\pi(x) \qquad \text{replace } \alpha \text{ with } \text{min}\left(1, \frac{\pi(x')q(x|x')}{\pi(x)q(x'|x)}\right)

= \text{min}( \pi(x) q(x'|x), \pi(x')q(x|x'))

= \text{min}( \pi(x') q(x|x'), \pi(x)q(x'|x))

= \alpha q(x|x')\pi(x')

= \pi(x')\kappa(x|x')

\pi(x)\kappa(x'|x) = \pi(x')\kappa(x|x')

Unnormalized Densities

Now we know that Metropolis-Hastings preserves detailed-balance. It is also the case that once you have detailed-balance, you can say if you have a normalized density, \pi, then

\pi(x) = \frac{\gamma(x)}{Z} \text{ and } \pi(x') = \frac{\gamma(x')}{Z}
\frac{\pi(x')}{\pi(x)} = \frac{\gamma(x')}{\gamma(x)}

In practice, when doing MH, this is a nice property because we don’t always have the normalized density. we can calculate the acceptance probability using the unnormalized density, \gamma(x).

\alpha = \min(1, \frac{\pi(x')q(x|x')}{\pi(x)q(x'|x)})
= \text{min}(1, \frac{\gamma(x')q(x|x')}{\gamma(x) q(x'|x)})
\text{For a Bayesian Inference Example where } \gamma(x)  = p(y,x)
= \text{min}(1, \frac{p(y,x')q(x|x')}{p(y,x) q(x'|x)})

It’s important to note that MH is very general. We can design any transition proposal and if that proposal violates detailed-balance then we can use the Metropolis-Hastings’ acceptance ratio to clean up the mis-designed proposal. We have shown that we can always evaluate the ratio whenever we have an unnormalized density and don’t have the normalize density (which is always the case).

Example of a Bad Proposal

We can do likelihood weighting. Suppose that we have to sample x' | x. We can sample from the prior, which means that we can ignore was x was.

q(x'|x) = p(x') \;\;\;\;\; \text{(Independent from previous sample)}

This is what we call Independent Metropolis-Hastings because new samples are independent from the previous sample (which is valid). Now we can see what our new acceptance ratio is:

\alpha = \text{min}(1, \frac{p(y,x')q(x|x')}{p(y,x) q(x'|x)})
= \text{min}(1, \frac{p(y,x')p(x')p(x)}{p(y,x)p(x)p(x')})
= \text{min}(1, \frac{p(y,x')}{p(y,x)})

Because our probabilities cancel, we are left with the ratio of the likelihoods. But we have the same issue as the likelihood weighting where we have low acceptance probabilities. With the exception that this sampler won’t let go of a good x, this is no better than doing likelihood weighting because we still have to wait until we get a good sample again.

More Common Examples of Choosing Proposals

It’s much more common to setup a proposal where we sample a new sample, x', from a Normal distribution centered on x with some standard deviation, \delta^2. It’s important to note that \delta^2 is symmetric. This means that the probability going from x' to x should be the same as the probability as going from x to x', or q(x'|x) = q(x|x').

This makes our acceptance ratio a bit easier since q(x'|x) = q(x|x')

\alpha = \text{min}(1, \frac{\gamma(x')q(x|x')}{\gamma(x) q(x'|x)})
= \text{min}(1, \frac{\gamma(x')}{\gamma(x)})

What are the options for our step size, \delta^2? (1) We can take really small steps. This means we will then need a lot of steps to get somewhere (2) We can also take really large steps. However, we will probably jump around too much end up with samples we will more often reject.

So the question is then, How should we set our step size?

Based on people who have done some analysis, we should tune our acceptance rate. We keep adjusting it until we get a magic number, which is \alpha \approx 0.234. This is optimal under a set of assumptions. In practice, we want anything between 0.1 and 0.7 depending on what we are doing.

Summary – MCMC vs. Importance Sampling

We have looked at two MC samplers, MCMC and Importance Sampling.

For Importance Sampling we say: if we design some proposal, generate some samples, assign each sample a weight, we can now take the weighted-average over the samples. In particular, we know that the samples gives us an estimate of the marginal likelihood.

w^s = \frac{\gamma(x^s)}{q(x^s)}, \qquad x^s \sim q(x) \qquad E[w^s] = p(y)

For Metropolis-Hasting we do something different: We say, suppose we have a sample x^{s-1}. Now we are going to propose a new sample, and are going to pick a number between 0 and 1. We are going to calculate the acceptance ratio, and if my number is larger than my acceptance ratio then I keep the previous sample otherwise I keep the new sample.

MCMC has a transition probability/kernel x^s \sim \kappa(x\mid x^{s-1}) and satisfies detailed balance \pi(x^s)\kappa(x^{s-1}\mid x^s) = \pi(x^{s-1})\kappa(x^{s}\mid x^{s-1}).

The difference: is that IS gives us the estimate of our marginal, and MH allows us to do hill-climbing, but doesn’t give us an estimate for the marginal.

We’ll discuss why having an estimate for the marginal is important when we discuss Annealed Importance Sampling!

Thanks for reading!