## Annealing Importance Sampling

We discuss an algorithm that reaps of the benefits of both MCMC and IS, Annealing Importance Sampling. This blog builds on my previous blogs on Importance Sampling (IS) and Monte Carlo Markov Chain (MCMC). First, I’ll discuss why the marginal likelihood is relevant via example. Then I’ll introduce the two main ideas behind Annealing Importance Sampling and end with a discussion on understanding those two main ideas.

Before we start, I’d like to thank Jan-Willem van de Meent for his lectures in his Advance Machine Learning Class for PhD students at Northeastern University. The images shown are either inspired by or directly from his lectures. I’d like to add that more details on Annealing Importance Sampling can be found on this paper by Radford M. Neal.

As previously discuss, the benefit that MCMC has over Importance Sampling is its transition kernel that allows us to stay within the target density once we already have a sample from that target density. However, I mentioned in my previous blog, MCMC:Metropolis-Hastings, that Importance Sampling gives us something MCMC does not, an estimate of the marginal likelihood, $p(y)$. So now we answer the question, what is the benefit of having an estimate of the marginal likelihood? We discuss with an example.

#### Why is the Marginal Likelihood Relevant?

Suppose that we are doing clustering on the data points shown in the figure above, and we ask the question, “How many clusters K should I use?” In the first example, we try to fit 2 clusters to the data, in the second example, we try 3 clusters, and lastly, we try 10 clusters. If we look at the likelihood of the data given the parameters, then the last example has a really high likelihood, $p(y|\theta)$, but the first example has low likelihood. Even though the last example has high likelihood, we also know that the clusters are over-fitting. The question is, can we describe in math what over-fitting looks like?

The intuition is: If we randomly drew new clusters for the data, how likely is it that the clusters would be on top of the data? For the last example, the probability is next to zero. On the other hand, for the first example, some of the points would probably fall into the new clusters (shown in blue). We can formalize that by looking at the marginal likelihood, which is $p(y|k)$ (probability of y given the number of clusters k).

$K^* = \underset{{k \in {1, ... , K^{max}}}}{\text{argmax}} p(y|k) = \underset{k}{\text{argmax}} \int d\theta p(y|\theta)p(\theta |k)$

where $p(\theta |k)$, describes the probability of the positions of the clusters given the number of clusters. We are going to go from K=1 cluster to the max number of clusters, and for each of those, we are going to compute the integral. We can think of the marginal likelihood as the Best Average Fit for the data!

#### Comparing Importance Sampling with MCMC

So now we know why having the marginal likelihood is kind of a big deal. However, Importance Sampling has an annoying property that makes it hard to generate a bunch of good samples. Consider the example we just discussed. Importance Sampling would basically randomly guess where to put clusters, which wouldn’t be efficient. MCMC, on the other hand, has this nice property where if we have a bad sample, we have a transition kernel that can give us a new sample. If the new sample is better, then we keep it. So if we keep getting better and better samples, we are more likely to get more good samples than when we randomly guess (like in Importance Sampling).

So the question now is “Why not both MCMC and IS?

### Annealing Importance Sampling

#### Idea 1 (Importance Sampling)

Imagine that we have we have two tiny clusters (above on the left), and we decide to sample uniformly from the shaded square within the plot (right). It would actually take a very long time to get sufficient samples that lie within the two tiny clusters. So the question we ask here is, “Can we make this problem a little bit easier?” How can we make it easier?

Well, imagine that we take the small cluster and make it a little broader. Then, we take those broader clusters and make those broader, and we keep going until we get a large single cluster.

We end up with something that can easily sample from! But how can we make this happen… and how does really solve our initial problem? Well, what if we could get samples from the broadest cluster and then refine those samples to get samples from a less broad cluster… and keep going until our samples are so refined and come from the original two small clusters?

That seems like a good idea, right!?

Idea 1: Break the problem into simple problems. Specifically, sample via intermediates of distributions.

Let’s say that our end goal is get some joint probability, $p(y, \theta)$ of some data, $y$, and latent variables, $\theta$ (which can be model parameters). However, we also want a really easy distribution to start with. Well, the easiest probability distribution we can is use is the prior, $p(\theta)$. It’s easy because sampling from the prior doesn’t require any inference.

So now the question is, “what can we do so that we start at the easy distribution and smoothly transition to the final (and hard) joint distribution?”

Let’s think about this! The relationship between the prior and the joint probability is that the joint probability is equal to the likelihood times the prior:

$p(y, \theta) = p(\theta \mid y) p(\theta)$

So what really let’s us start from the prior and end up to the joint probability is the likelihood! We basically remove the likelihood from the joint probability and then gradually add it back in.

One way we can do this is by introducing a number, $\beta_n$, where we use it to get the $n^{th}$ intermediate distribution:

$p(y \mid \theta)^{\beta_n} p(\theta)$

If $\beta_n = 0$, what do have? We are left with just the prior. If $\beta_n =1$, we have the full joint distribution. So to get a sequence of intermediate distributions, we pick a sequence of values that smoothly go from 0 to 1.

#### Idea 2 (Transition Kernels)

Now that we know how to get intermediate distributions, let’s talk about how we can use them.

Before we move on, let’s think about the overall goal again. We started off with an example showing that we want to learn clusters for some data, $y$. If we are using Gaussian distributions to represent our clusters, then our parameters, $\theta$, would be the mean and standard deviation, $\theta = {\mu, \sigma^2 }$. Therefore we are trying to learn the cluster means and standard deviations to best represent the data.

Initialization

Let’s begin by saying that we will sample a first sample from some proposal distribution: $\theta_0^s \sim q(\cdot)$ (using a zero to signify the first sample we have). The sample is going to have an importance weight: $w_0^s = \frac{\gamma(x_0^s)} {q(x_0^s)}$, (just like in Importance Sampling).

Now image that we have two samples (as shown in the figure above in green and within the prior distribution, $\gamma_0(\cdot)$. So now what we want to do is evaluate those samples relative to the next (and new) density, $\gamma_{n+1}(\cdot)$.

To do that, we are doing to use some form of transition kernel (just like in MCMC) to move the samples within the new density. We can move it several times. This gives us a new sample within the new density.

$\theta_n^s \sim k_{n-1}(\theta_n \mid \theta_{n-1})$

However, this means that we have a new importance weight for these new samples. The new weight, $w_n^s$, is updated as follows (I will derive it in a moment):

$w_n^s = \frac{\gamma_n(\theta_n^s)}{\gamma_{n-1}(\theta_n^s)} w_{n-1}^s$

So you can imagine now doing that for all of the intermediate distributions until we get to the final one. The algorithm is rather straightforward from here: 1) sample from some initial proposal, 2) get an importance weight, 3) evaluate the sample at the next density which gives you a new importance weight, 4) move the sample around the new density, then 5) repeat to the following density.

#### Understanding Idea 1: Deriving New Importance Weights

So how do we end up with the equation of the new weight evaluated at the next density? To answer this, let’s begin with a valid importance weight. Let’s assume that we sampled from a normalized density, $x \sim \pi(x)_{n-1}$, and we have an unnormalized density at the next step, $\gamma_{n}(x)$. Then a reasonable importance weight would be:

$w_n = \frac{\gamma_n(x)}{\pi_{n-1}(x)} \qquad x \sim \pi_{n-1}(x)$

We know that the relationship between a normalized density and an unnormalized density is the normalization constant, $Z$, where $\pi_n = \frac{\gamma_n(x)}{Z_n}$. Then one way we can rewrite $w_n$ is:

$w_n = \frac{\gamma_n(x)}{\pi_{n-1}(x)} = \frac{\gamma_n(x)}{\gamma_{n-1}(x)} Z_{n-1} \qquad x \sim \pi_{n-1}(x)$

Now imagine instead of sampling from $\pi_{n-1}(x)$, we sample from $\pi_{n-2}(x)$. Then we’d have:

$w_n = \frac{\gamma_n(x)}{\pi_{n-2}(x)} \qquad x \sim \pi_{n-2}(x)$

Then if we use the multiplying by 1 trick on this, $1 = \frac{\gamma_{n-1}(x)}{\gamma_{n-1}(x)}$. We have

$w_n = \frac{\gamma_n(x)}{\pi_{n-2}(x)} = \frac{\gamma_n(x)}{\gamma_{n-1}(x)} \frac{\gamma_{n-1}(x)}{\pi_{n-2}(x)} \qquad x \sim \pi_{n-2}(x)$

Notice how $\frac{\gamma_{n-1}(x)}{\pi_{n-2}(x)} = w_{n-1}$! Therefore we can write:

$w_n = \frac{\gamma_n(x)}{\gamma_{n-1}(x)} w_{n-1} \qquad x \sim \pi_{n-2}(x)$

This is really nice for us because now all we have to do to get a new weight is relate it to the previous weight.

#### Understanding Idea 2: Transition Kernels

So now we talk about combining Importance Sampling with MCMC.

We start with an importance sampler which means we sample from a proposal and then we get an importance weight. Now, MCMC let us move the sample around using a transition kernel, $\kappa(x' \mid x)$ (assuming it satisfies detailed balance).

$w = \frac{\gamma(x)}{q(x)} \qquad x \sim q(x) \qquad x' \sim \kappa(x' \mid x)$

We know that we update the importance weight relative to a new density, but do we need to update our weight after moving our sample around in the new density? Well, let’s think a little bit.

$w' = \:?$

Let’s remember what an importance weight looks like: $w = \frac{\gamma(x)}{q(x)}$. So getting some $\gamma$ and $q$ for our $w'$ would mean considering both $x$ and $x'$. Let’s expand on that starting with the proposal.

The proposal mechanism here is really doing two things at once, getting $x$ and $x'$, $\tilde{q}(x, x')$. This is done as shown before, sampling from some proposal and then using a transition kernel.

$\tilde{q}(x,x') = q(x)\kappa(x'\mid x)$

The unnormalized density also considers both $x$ and $x'$, $\tilde{\gamma}(x, x')$.

$\tilde{\gamma}(x,x') = \gamma(x')\kappa(x\mid x')$

So then our new weight would look like:

$w' = \frac{\tilde{\gamma}(x,x')}{\tilde{q}(x,x')} = \frac{\gamma(x')\kappa(x\mid x')}{q(x)\kappa(x'\mid x)}$

If our kernel satisfies detailed balance, $\gamma(x)\kappa(x'\mid x) = \gamma(x')\kappa(x\mid x')$ … notice how if we rewrite our detailed balance equation to solve for $\gamma(x)$, we will have $\frac{\gamma(x')\kappa(x\mid x')}{\kappa(x'\mid x)} = \gamma(x)$. Our new weight now looks like:

$w' = \frac{\gamma(x')\kappa(x\mid x')}{q(x)\kappa(x'\mid x)} = \frac{\gamma(x)}{q(x)} = w$

which equals to the original importance weight we calculated when we evaluated the sample on the new density!

So back to the original question here: do we need to update our weight after moving our sample around in the new density? The answer is nope! This is neat because we use MCMC on each distribution while preserving the importance weights!

## MCMC: Metropolis-Hastings

This blog is a continuation of the previous discussion on MCMC, What are Monte Carlo Markov Chains (MCMC)? I introduce a well known MCMC algorithm, Metropolis-Hastings (MH).  I describe how MH behaves and prove how it holds detailed balance. Then I show an example of a bad proposal, and move on to show what generally people choose for good proposals. I close with a brief comparison of MCMC and (Self-Normalized) Importance Sampling.

Before we start, I’d like to thank Jan-Willem van de Meent for his lectures in his Advance Machine Learning Class for PhD students at Northeastern University. The images shown are from his lectures. In addition, I’d like to address that more details on MC algorithms can be found in An Introduction to Probabilistic Programming available [arXiv]. This book is intended as a graduate-level introduction to probabilistic programming languages and methods for inference in probabilistic programs.

It turns out that it’s not so easy to design a transition operator that guarantees to have detailed balance. But what if maybe we can use something that turns any proposal operator (which is something we can come up with) into something that satisfies detailed balance. This folks, is what Metropolis-Hastings does for us!

#### What is Metropolis-Hastings?

The intuition behind Metropolis-Hastings is that

1. We know we need to have an operator that satisfies detailed balance
2. We can propose something ( some sample $x'$)
3. We can check whether the proposal is good or not
4. And sometimes we are going to keep that proposal, and sometimes we are not…. in such a way that keeping it vs not keeping it satisfies detailed balance.

Now I’ll introduce mathematically how MH works and then discuss why it works.

We begin with a current sample, $x^s$, which we will use to get a new sample, $x^{'}$ using a proposal distribution, $q(x' \mid x^s)$. The notation looks as follows: $x' \sim q(x' \mid x^s)$.

Then we decide to keep the new sample, $x^{s+1} = x'$ with probability:

$\alpha = \min\left(1, \frac{\pi(x')q(x^s|x')}{\pi(x^s)q(x'|x^s)}\right)$

To discuss what this probability actually means, let’s remind ourselves what the detailed balance relationship is. It is: $\pi(x)p(x'|x) = \pi(x')p(x|x')$. But since we are now working with proposals, let’s switch notation from $p$ to $q$.

$\pi(x)q(x'|x) = \pi(x')q(x|x')$

Side Note: Sometimes transition probabilities such as $q(x \mid x')$ can be referred to as a kernel. Instead of being specific with what is making the transition, we generalize by saying we are using a transition kernel. Common notation is $\kappa(x \mid x')$. They both represent getting one sample dependent on another sample.

Let’s suppose that $q(\cdot)$ satisfies detailed balance, then the ratio here between $\pi(x)q(x'|x)$ and $\pi(x')q(x|x')$ is 1. So now let’s refer back to the probability:

$\alpha = \min\left(1, \frac{\pi(x')q(x^s|x')}{\pi(x^s)q(x'|x^s)}\right)$

Here we are saying that with probability $\alpha$, we are going to set the proposal to our new sample, $x^{s+1} = x'$, and with probability $(1 - \alpha)$ we are going to reject the proposal. When we reject the proposal, our next sample is set to the same sample as before, , $x^{s+1} = x$.

We see that if $\alpha = 1$, we always accept the proposal. Basically, if our transition kernel always satisfies detailed balance, we can always accept it!

However, if it does not satisfy detailed balance, then this means that somehow there’s an imbalance with the kinds of proposals we are getting. There are some kind of samples that we are proposing too often and other kinds that we are not proposing often enough.

Question: if the ratio $\frac{\pi(x')q(x^s|x')}{\pi(x^s)q(x'|x^s)}$ is larger than 1, does this mean we are proposing some kind of samples too often or not often enough?

Let’s think about it! If we are not proposing the kind of samples we want more, then we should probably accept them with higher probability. However, if we are seeing a sample too often, then maybe we should only accept it sometimes. Therefore if $\frac{\pi(x')q(x|x')}{\pi(x) q(x'|x)} > 1$, it’s because we are seeing a rare sample we want more of (we are proposing not often enough). However, basic probability tells us that probabilities max out at 1. This is why we add the $\text{min}$ to get the probability $\alpha$.

So what does Metropolis-Hastings do for us? It’s an algorithm that basically corrects our proposals in order to preserve detailed-balance! Neat, huh?

#### Proving Detailed Balance for Metropolis-Hastings

The idea behind Metropolis-Hasting is that we can get a new (transitioned/dependent) sample when we sample from a proposal, $x' \sim q(x'|x^s)$, and get our next sample, $x^{s+1} = x'$ with probability $\alpha = \min\left(1, \frac{\pi(x')q(x|x')}{\pi(x)q(x'|x)}\right)$, otherwise we keep our previous sample, $x^{s+1} = x^s$.

When we have this particular sampling scheme, we an have an implied transition kernel that can tell us (when we have $x$ and $x'$), what is the probability that we are going to see a transition from $x$ to $x'$. In order to reason about this, we need to consider two cases. It could be that $x'$ is not the same value as $x$, but they could be in fact the same value where $x = x'$. We formulate this by working backwards.

In the figure above, we show that in order to get a value $x \neq x'$, we first need to sample an $x'$ from $q(x'|x)$ and then we need to decide to keep the sample with probability $\alpha$. Therefore for the case where $x \neq x'$, we have $\alpha * q(x'|x)$.

Note that there are two cases where $x = x'$ and $\alpha = 1$. The first is if somehow we were able to sample the exact same sample $x$ again (this can only happen in some cases, for example if we are sampling from a Bernoulli distribution, but if we were sampling from a Normal, then it would be very unlikely). This gives us $\alpha * q(x|x) = q(x|x)$. The second possibility of getting the same sample again is by rejecting. Then we calculate the probability that any rejected sample will give us the same sample as before. We integrate over possible values we could have rejected, $x''$, and then we write the probability of $(1 - \alpha'')q(x''|x)$.

We can prove detailed balance once we write out the kernel like this quite easily because it splits the problem into two small proofs, one for the case where $x = x'$ and another for $x \neq x'$.

In the case where $x = x$, we can just say: $\pi(x)\kappa(x|x) = \pi(x)\kappa(x|x)$

Really in MH, the point is not to describe how often we stay at the sample (there are other properties that should help this from happening), but rather how often we transition from sample to sample. Therefore we care more about the next case.

In the case where $x = x'$, we can say:

$\pi(x)\kappa(x'|x) = \alpha q(x'|x)\pi(x) \qquad \text{replace } \alpha \text{ with } \text{min}\left(1, \frac{\pi(x')q(x|x')}{\pi(x)q(x'|x)}\right)$

$= \text{min}( \pi(x) q(x'|x), \pi(x')q(x|x'))$

$= \text{min}( \pi(x') q(x|x'), \pi(x)q(x'|x))$

$= \alpha q(x|x')\pi(x')$

$= \pi(x')\kappa(x|x')$

$\pi(x)\kappa(x'|x) = \pi(x')\kappa(x|x')$

#### Unnormalized Densities

Now we know that Metropolis-Hastings preserves detailed-balance. It is also the case that once you have detailed-balance, you can say if you have a normalized density, $\pi$, then

$\pi(x) = \frac{\gamma(x)}{Z} \text{ and } \pi(x') = \frac{\gamma(x')}{Z}$
$\frac{\pi(x')}{\pi(x)} = \frac{\gamma(x')}{\gamma(x)}$

In practice, when doing MH, this is a nice property because we don’t always have the normalized density. we can calculate the acceptance probability using the unnormalized density, $\gamma(x)$.

$\alpha = \min(1, \frac{\pi(x')q(x|x')}{\pi(x)q(x'|x)})$
$= \text{min}(1, \frac{\gamma(x')q(x|x')}{\gamma(x) q(x'|x)})$
$\text{For a Bayesian Inference Example where } \gamma(x) = p(y,x)$
$= \text{min}(1, \frac{p(y,x')q(x|x')}{p(y,x) q(x'|x)})$

It’s important to note that MH is very general. We can design any transition proposal and if that proposal violates detailed-balance then we can use the Metropolis-Hastings’ acceptance ratio to clean up the mis-designed proposal. We have shown that we can always evaluate the ratio whenever we have an unnormalized density and don’t have the normalize density (which is always the case).

#### Example of a Bad Proposal

We can do likelihood weighting. Suppose that we have to sample $x' | x$. We can sample from the prior, which means that we can ignore was $x$ was.

$q(x'|x) = p(x') \;\;\;\;\; \text{(Independent from previous sample)}$

This is what we call Independent Metropolis-Hastings because new samples are independent from the previous sample (which is valid). Now we can see what our new acceptance ratio is:

$\alpha = \text{min}(1, \frac{p(y,x')q(x|x')}{p(y,x) q(x'|x)})$
$= \text{min}(1, \frac{p(y,x')p(x')p(x)}{p(y,x)p(x)p(x')})$
$= \text{min}(1, \frac{p(y,x')}{p(y,x)})$

Because our probabilities cancel, we are left with the ratio of the likelihoods. But we have the same issue as the likelihood weighting where we have low acceptance probabilities. With the exception that this sampler won’t let go of a good $x$, this is no better than doing likelihood weighting because we still have to wait until we get a good sample again.

#### More Common Examples of Choosing Proposals

It’s much more common to setup a proposal where we sample a new sample, $x'$, from a Normal distribution centered on $x$ with some standard deviation, $\delta^2$. It’s important to note that $\delta^2$ is symmetric. This means that the probability going from $x'$ to $x$ should be the same as the probability as going from $x$ to $x'$, or $q(x'|x) = q(x|x')$.

This makes our acceptance ratio a bit easier since $q(x'|x) = q(x|x')$

$\alpha = \text{min}(1, \frac{\gamma(x')q(x|x')}{\gamma(x) q(x'|x)})$
$= \text{min}(1, \frac{\gamma(x')}{\gamma(x)})$

What are the options for our step size, $\delta^2$? (1) We can take really small steps. This means we will then need a lot of steps to get somewhere (2) We can also take really large steps. However, we will probably jump around too much end up with samples we will more often reject.

So the question is then, How should we set our step size?

Based on people who have done some analysis, we should tune our acceptance rate. We keep adjusting it until we get a magic number, which is $\alpha \approx 0.234$. This is optimal under a set of assumptions. In practice, we want anything between 0.1 and 0.7 depending on what we are doing.

#### Summary – MCMC vs. Importance Sampling

We have looked at two MC samplers, MCMC and Importance Sampling.

For Importance Sampling we say: if we design some proposal, generate some samples, assign each sample a weight, we can now take the weighted-average over the samples. In particular, we know that the samples gives us an estimate of the marginal likelihood.

$w^s = \frac{\gamma(x^s)}{q(x^s)}, \qquad x^s \sim q(x) \qquad E[w^s] = p(y)$

For Metropolis-Hasting we do something different: We say, suppose we have a sample $x^{s-1}$. Now we are going to propose a new sample, and are going to pick a number between 0 and 1. We are going to calculate the acceptance ratio, and if my number is larger than my acceptance ratio then I keep the previous sample otherwise I keep the new sample.

MCMC has a transition probability/kernel $x^s \sim \kappa(x\mid x^{s-1})$ and satisfies detailed balance $\pi(x^s)\kappa(x^{s-1}\mid x^s) = \pi(x^{s-1})\kappa(x^{s}\mid x^{s-1})$.

The difference: is that IS gives us the estimate of our marginal, and MH allows us to do hill-climbing, but doesn’t give us an estimate for the marginal.

We’ll discuss why having an estimate for the marginal is important when we discuss Annealed Importance Sampling!

## What are Monte Carlo Markov Chains (MCMC)?

In this blog, I introduce what a Monte Carlo Markov Chain (MCMC) is and discuss how MCMC is different from Importance Sampling. I discuss what a homogeneous Markov Chain is and what convergence means for MCMC. Lastly I discuss detailed balance and how good transition probabilities for Markov Chains satisfy this property.

Before we start, I’d like to thank Jan-Willem van de Meent for his lectures in his Advance Machine Learning Class for PhD students at Northeastern University. The images shown are inspired by or exactly from his lectures. In addition, I’d like to address that more details on MCMC algorithms can be found in An Introduction to Probabilistic Programming available [arXiv]. This book is intended as a graduate-level introduction to probabilistic programming languages and methods for inference in probabilistic programs.

#### Why Move On from Importance Sampling?

Imagine that we have a distribution like in the example drawn on the left (in purple). The distribution is bimodal, meaning that there are two areas of concentration.

If we were to use Self-Normalizing Importance Sampling (SNIS) to get samples from this distribution (see my last blog on Importance Sampling vs Self-Normalizing IS), we would draw samples from the proposal distribution and either keep it if the importance weight is high, or throw it away if the weight is low.

Imagine we have a Gaussian proposal distribution that covered the entire purple density. If the mean of the proposal distribution overlaid in between the two modes in the purple density, most samples drawn would have a low importance weight (keep in mind that samples from a Gaussian distribution center around its mean). In this case, we would probably throw away a lot of samples since we specifically want more samples that are within the two concentrated areas of the distribution.

Why is throwing away so many samples bad? Ideally, we want to get as many good samples as possible because the more we have, the better we can know what the posterior distribution looks like. However, realistically, we work with high-dimensional problems. The number of samples we need exponentially increases as the number of dimensions increase. Imagine getting 1000 samples and throwing away 990. 10 good samples is definitely not enough to describe a posterior distribution of high dimensionality.

The question is, how do we move away from the idea of throwing away a lot of samples? Instead, is there a way that we continue to get similar good samples once we have a good sample already? We can, using MCMC.

#### Monte Carlo Markov Chain (MCMC)

Monte Carlo Markov Chains move on from the idea of randomly sampling from the proposal. Instead MCMC introduces the idea of getting similar samples to already good samples, i.e. start with a sample we previously drew, $x^{s-1}$, now use it to propose a new sample nearby $x^{s}$.

Markov Chains

Having a chain means having a sequence, $X^1, ... , X^S$ where $S$ is the total number of samples. The Markov property says that instead of getting a new sample based on all of the previous samples, we can get a new sample just based off the previous (single) sample.

$X^s | X^{s-1} ... \perp\!\!\!\perp X^1, ... X^{s-2}$

where $\perp\!\!\!\perp$ means conditionally independent

$p(x^s|x^{1:1-s}) = p(x^s|x^{s-1})$

What is an example of something that is not a Markov Chain? Consider words in a sentence. Each word in a sentence is not a Markov Chain since each word does not depend on only the previously said word, but instead on several previous words, or even previous sentences.

Homogenous Markov Chains

Additionally we are going to talk about a specific kind of Markov Chain, a homogeneous Markov Chain. The best way to describe what a homogeneous chain is to start with what is not a homogeneous chain. Let’s consider a sentence with words again. Let’s say you’re generating a sentence, most of the time, this process is non-homogeneous. This is because the conditional distribution changes as you keep generating words.

In order for a chain to be homogeneous, we have to use the same conditional distribution for the next step of the sequence. We see this idea in hidden Markov models. At each time we are at a state, we decide whether to go to another state or not. If we decide to not to go another state, at the next step, we decide again. When we keep doing that, we get a sequence where we keep flipping back and forth between states. So the prior probability there at each time of flipping states, is exactly the same. That is an example of a homogeneous Markov chain.

A Markov chain is homogeneous when we have the same transition distribution for every sample $s$:

$p(X^s= x^s|X^{s-1}=x^{s-1}) = p(X^2=x^s|X^1=x^{s-1})$

This simply means we use the sample probability to move around!

Convergence

Now that we have talked about a homogeneous Markov chain, I’ll discuss convergence towards the target density (another term for posterior or true distribution). Ideally, we want to converge to this target density, which we will refer to as $\pi(x)$. We can basically start looking at a limit of more and more samples and the probability that sample $s$ takes on a particular value $x$. If we continue taking samples long enough, we want that probability to be exactly the same as the probability that some random variable under the target density takes on.

$\lim_{s\rightarrow \infty} p(X^s=x) = \pi(X=x)$

What’s the intuition behind convergence here?

Let’s refer to the figure above. Let again consider the purple distribution. This will be the target density we wish to converge to. Remember that there are two modes to this density; there are two regions of high probability. Let’s initialize a Markov chain with $x^1$ being our first sample. If we keep transitioning to new points, $x^2, ... x^S$, you can image that we are doing some form of random walk. If we run our random walk long enough, then we hope that all these points in this random walk will visit each region within the space proportional to the target density. So regions of high probability are places we want our random walk to visit more often than those of low probability.

Detailed Balance

Now, notice how I haven’t really said anything yet, other than if we have a Markov chain, we are picking where to go next given a previous step. Also, I have said that if we design a good Markov chain, then maybe we can converge towards the target density.

I will now introduce what sort of Markov chains might converge to this target density. I’ll introduce a transition operator where if we already have a sample from the target density then that means we are going to stay at the target density. Remember that this is the idea behind MCMC that makes it different from Importance Sampling.

NOTE: We are going to do some hand waving and say that if we leave the target density invariant, it also converges. I won’t prove that here because it’s a bit of a pain. We will simply gloss over that theoretical aspect here, but I will introduce the notion of leaving a target density invariant.

The Markov Chain we will talk about is one that satisfies detailed balance. A homogeneous Markov Chain satisfies detailed balance when

$\pi(x)p(x'|x) = \pi(x')p(x|x')$

This says, suppose we have the density at x, $\pi(x)$, and we define a probability of going to the variable $x'$ given that I was at $x$. We say that this quantity should be same as the quantity of starting at $x'$, $\pi(x')$ and defining the probability of going to $x$ from $x'$.

Another way to say this is for every pair $x$ and $x'$, our sampling mechanism $p(x'|x)$ leaves $\pi(x)$ invariant:

Since $\int dx' p(x'|x) = 1$, (multiply by 1 trick!) we can write

$\pi(x)\ = \int dx' \pi(x) p(x'|x)$

Then by definition ($\pi(x)p(x'|x) = \pi(x')p(x|x')$) we can rewrite $\pi(x)$ as

$\pi(x) = \int dx' \pi(x') p(x|x')$

So what does this really mean? It means that if I start with a distribution of $\pi(x)$ and look at all possible ways we can end up at a given point $x$, starting from some other point, $x'$, and we take the expectation of how often we start at some $x'$, then we also get $\pi(x)$.

So if we start with $x'$, $x' \sim \pi(x)$, and use it to get a new sample from the transition probability, $x \mid x' \sim p(x\mid x')$, then I have just sampled $x$ from $\pi(x)$, $x \sim \pi(x)$.

Side Note: Even though using the transition probability means $x\sim \pi(x)$ and $x' \sim \pi(x)$, using the transition probability means that $x$ and $x'$ are not independent samples.

Now we can say that good Markov Chains have transition probabilities that satisfy this property, detailed balance!