# What are Monte Carlo Markov Chains (MCMC)?

In this blog, I introduce what a Monte Carlo Markov Chain (MCMC) is and discuss how MCMC is different from Importance Sampling. I discuss what a homogeneous Markov Chain is and what convergence means for MCMC. Lastly I discuss detailed balance and how good transition probabilities for Markov Chains satisfy this property.

Before we start, I’d like to thank Jan-Willem van de Meent for his lectures in his Advance Machine Learning Class for PhD students at Northeastern University. The images shown are inspired by or exactly from his lectures. In addition, I’d like to address that more details on MCMC algorithms can be found in An Introduction to Probabilistic Programming available [arXiv]. This book is intended as a graduate-level introduction to probabilistic programming languages and methods for inference in probabilistic programs.

#### Why Move On from Importance Sampling?

Imagine that we have a distribution like in the example drawn on the left (in purple). The distribution is bimodal, meaning that there are two areas of concentration.

If we were to use Self-Normalizing Importance Sampling (SNIS) to get samples from this distribution (see my last blog on Importance Sampling vs Self-Normalizing IS), we would draw samples from the proposal distribution and either keep it if the importance weight is high, or throw it away if the weight is low.

Imagine we have a Gaussian proposal distribution that covered the entire purple density. If the mean of the proposal distribution overlaid in between the two modes in the purple density, most samples drawn would have a low importance weight (keep in mind that samples from a Gaussian distribution center around its mean). In this case, we would probably throw away a lot of samples since we specifically want more samples that are within the two concentrated areas of the distribution.

Why is throwing away so many samples bad? Ideally, we want to get as many good samples as possible because the more we have, the better we can know what the posterior distribution looks like. However, realistically, we work with high-dimensional problems. The number of samples we need exponentially increases as the number of dimensions increase. Imagine getting 1000 samples and throwing away 990. 10 good samples is definitely not enough to describe a posterior distribution of high dimensionality.

The question is, how do we move away from the idea of throwing away a lot of samples? Instead, is there a way that we continue to get similar good samples once we have a good sample already? We can, using MCMC.

#### Monte Carlo Markov Chain (MCMC)

Monte Carlo Markov Chains move on from the idea of randomly sampling from the proposal. Instead MCMC introduces the idea of getting similar samples to already good samples, i.e. start with a sample we previously drew, $x^{s-1}$, now use it to propose a new sample nearby $x^{s}$.

Markov Chains

Having a chain means having a sequence, $X^1, ... , X^S$ where $S$ is the total number of samples. The Markov property says that instead of getting a new sample based on all of the previous samples, we can get a new sample just based off the previous (single) sample.

$X^s | X^{s-1} ... \perp\!\!\!\perp X^1, ... X^{s-2}$

where $\perp\!\!\!\perp$ means conditionally independent

$p(x^s|x^{1:1-s}) = p(x^s|x^{s-1})$

What is an example of something that is not a Markov Chain? Consider words in a sentence. Each word in a sentence is not a Markov Chain since each word does not depend on only the previously said word, but instead on several previous words, or even previous sentences.

Homogenous Markov Chains

Additionally we are going to talk about a specific kind of Markov Chain, a homogeneous Markov Chain. The best way to describe what a homogeneous chain is to start with what is not a homogeneous chain. Let’s consider a sentence with words again. Let’s say you’re generating a sentence, most of the time, this process is non-homogeneous. This is because the conditional distribution changes as you keep generating words.

In order for a chain to be homogeneous, we have to use the same conditional distribution for the next step of the sequence. We see this idea in hidden Markov models. At each time we are at a state, we decide whether to go to another state or not. If we decide to not to go another state, at the next step, we decide again. When we keep doing that, we get a sequence where we keep flipping back and forth between states. So the prior probability there at each time of flipping states, is exactly the same. That is an example of a homogeneous Markov chain.

A Markov chain is homogeneous when we have the same transition distribution for every sample $s$:

$p(X^s= x^s|X^{s-1}=x^{s-1}) = p(X^2=x^s|X^1=x^{s-1})$

This simply means we use the sample probability to move around!

Convergence

Now that we have talked about a homogeneous Markov chain, I’ll discuss convergence towards the target density (another term for posterior or true distribution). Ideally, we want to converge to this target density, which we will refer to as $\pi(x)$. We can basically start looking at a limit of more and more samples and the probability that sample $s$ takes on a particular value $x$. If we continue taking samples long enough, we want that probability to be exactly the same as the probability that some random variable under the target density takes on.

$\lim_{s\rightarrow \infty} p(X^s=x) = \pi(X=x)$

What’s the intuition behind convergence here?

Let’s refer to the figure above. Let again consider the purple distribution. This will be the target density we wish to converge to. Remember that there are two modes to this density; there are two regions of high probability. Let’s initialize a Markov chain with $x^1$ being our first sample. If we keep transitioning to new points, $x^2, ... x^S$, you can image that we are doing some form of random walk. If we run our random walk long enough, then we hope that all these points in this random walk will visit each region within the space proportional to the target density. So regions of high probability are places we want our random walk to visit more often than those of low probability.

Detailed Balance

Now, notice how I haven’t really said anything yet, other than if we have a Markov chain, we are picking where to go next given a previous step. Also, I have said that if we design a good Markov chain, then maybe we can converge towards the target density.

I will now introduce what sort of Markov chains might converge to this target density. I’ll introduce a transition operator where if we already have a sample from the target density then that means we are going to stay at the target density. Remember that this is the idea behind MCMC that makes it different from Importance Sampling.

NOTE: We are going to do some hand waving and say that if we leave the target density invariant, it also converges. I won’t prove that here because it’s a bit of a pain. We will simply gloss over that theoretical aspect here, but I will introduce the notion of leaving a target density invariant.

The Markov Chain we will talk about is one that satisfies detailed balance. A homogeneous Markov Chain satisfies detailed balance when

$\pi(x)p(x'|x) = \pi(x')p(x|x')$

This says, suppose we have the density at x, $\pi(x)$, and we define a probability of going to the variable $x'$ given that I was at $x$. We say that this quantity should be same as the quantity of starting at $x'$, $\pi(x')$ and defining the probability of going to $x$ from $x'$.

Another way to say this is for every pair $x$ and $x'$, our sampling mechanism $p(x'|x)$ leaves $\pi(x)$ invariant:

Since $\int dx' p(x'|x) = 1$, (multiply by 1 trick!) we can write

$\pi(x)\ = \int dx' \pi(x) p(x'|x)$

Then by definition ($\pi(x)p(x'|x) = \pi(x')p(x|x')$) we can rewrite $\pi(x)$ as

$\pi(x) = \int dx' \pi(x') p(x|x')$

So what does this really mean? It means that if I start with a distribution of $\pi(x)$ and look at all possible ways we can end up at a given point $x$, starting from some other point, $x'$, and we take the expectation of how often we start at some $x'$, then we also get $\pi(x)$.

So if we start with $x'$, $x' \sim \pi(x)$, and use it to get a new sample from the transition probability, $x \mid x' \sim p(x\mid x')$, then I have just sampled $x$ from $\pi(x)$, $x \sim \pi(x)$.

Side Note: Even though using the transition probability means $x\sim \pi(x)$ and $x' \sim \pi(x)$, using the transition probability means that $x$ and $x'$ are not independent samples.

Now we can say that good Markov Chains have transition probabilities that satisfy this property, detailed balance!