## Sequential Monte Carlo (SMC)

We begin with a short recap of the previous algorithms that are prerequisites to Sequential Monte Carlo. Then I describe intuitively how SMC works. Afterwards, we dig into the math behind SMC: we first discuss how importance weights are calculated at each time step, and then how SMC maintains a notion of survival of the fittest among sample trajectories through resampling.

Before we start, I’d like to thank Jan-Willem van de Meent for his lectures in his Advanced Machine Learning class for PhD students at Northeastern University. The images shown are inspired by his lectures. In addition, I’d like to note that more details on SMC can be found in An Introduction to Probabilistic Programming, available on [arXiv]. This book is intended as a graduate-level introduction to probabilistic programming languages and methods for inference in probabilistic programs.

### Mini Review

Before we move on to other Monte Carlo estimators we might use in machine learning, let’s recap:

• We have discussed what a Monte Carlo Estimator is. It helps us estimate an expectation of some function via sampling.
• We introduced Importance Sampling as our first Monte Carlo Estimator. We learned that Importance Sampling requires us to evaluate samples using a normalized density, which is something we (almost always) cannot do.
• Then we described Self-Normalized Importance Sampling (SNIS). SNIS uses two Monte Carlo Estimators: the same one as in Importance Sampling, and a second one to estimate the normalization constant. This allows us to evaluate samples via an unnormalized density, which we do have access to.
• Importance Sampling methods give us marginal likelihoods, but their drawback is that they amount to random guessing.
• We then discussed Markov Chain Monte Carlo (MCMC). MCMC uses transition kernels that let us stay in the target density once we have a good sample (one from the target density). However, successful MCMC algorithms (almost always) need to satisfy detailed balance.
• We learned that Metropolis-Hastings (a fancy MCMC algorithm) lets us use any proposal, thanks to an acceptance probability that manages detailed balance for us.
• In the previous post, we described an algorithm that marries Importance Sampling and MCMC: Annealed Importance Sampling. This algorithm breaks a hard problem into a series of simpler problems. It provides a marginal likelihood and uses a transition kernel.

Now we will introduce another Monte Carlo Estimator that is used for a set of different problems in machine learning, Sequential Monte Carlo.

### Tick Tock: Introducing Time to the Problem!

All of the algorithms we have discussed so far are for problems for which we do not need to consider the element of time! What does that mean? Let’s consider an example of looking at a sequence of video frames of a basketball moving from frame to frame.

We record that at time $t_1$ (which can be any arbitrary time stamp), the ball is up high. We continue to watch the ball move at $t_2$ and $t_3$, steadily moving downward. The question we have is, “Where will the ball be in the future?” This time-dependent data is referred to as time-series data.

There are many combinations of future movements the ball can make. Several make sense by following the laws of physics; others can be absolute garbage (like the ball suddenly shooting upwards, or spelling out a word).

Now consider using any of the Monte Carlo estimators we have already discussed. If our data is time-dependent, then Importance Sampling would have a very difficult time picking up a pattern that correlates movements between frames. It would probably take ages to get a good set of samples from the posterior! Instead, we can use a more appropriate algorithm for time-dependent problems, such as Sequential Monte Carlo (SMC).

### Intuition behind Sequential Monte Carlo

SAMPLING BUDGET. Let’s start off by stating that every sampling scheme has this notion of a sampling budget. This sampling budget (or sample budget) is the maximum number of samples we can afford to use. The number in our budget may depend on the computer we use, the dimensionality of the problem, or even just the amount of time we can afford running our algorithm (because these algorithms do take time!). Usually we use the notation $S$ to represent the number of samples in our sample budget, $x^1, x^2, ... x^S$.

In time-series problems, we perform inference (using some form of Monte Carlo Estimator) at each time step. This means we use the same number of samples for each time step, i.e. the number of samples in our sampling budget, $S$.

NOTE: People have adopted the term particle to replace the term sample. Instead of saying we have a sampling budget of 10 samples… it’s common to say we have a sampling budget of 10 particles.

So let’s now consider an example where we have a sampling budget of four, $S = 4$, and we are working with a small problem with five time steps, $t_1, ..., t_5$. Imagine that each sample represents the position of the ball at its respective time frame. Below we show an example where we have 4 samples at $t_1$, represented by circles, $x^1, x^2, ..., x^4$. You can imagine that these samples represent 4 guesses of the position of the ball for the data from the video frame. Now imagine that each sample (or ball position) has an associated importance weight (just like we discussed in Importance Sampling), $(x^1, w^1), (x^2, w^2), ..., (x^4, w^4)$. The weight tells us how well the ball position matches the data in the video frame. In the figure below, I drew larger circles to represent higher importance weights and smaller circles to represent lower importance weights.

So what SMC does next is decide which samples we should keep and which ones to throw away. This is a way of saying that some guesses of the ball’s position are going to be good ones, and some will be not so good. We want to keep working with our already good guesses at the next time step, so that we can build on those guesses with other guesses. We do this by resampling. This means we will sample with replacement 4 new samples to build on in the next time step, $t_2$. We resample based on their weights (we’ll go into more detail in a bit).

In the figure above, it so happened that we resampled $x^1_{t_1}$ twice and $x^3_{t_1}$ twice as well. This means that those samples for the ball’s position were really good, and we want to keep working with them. Now at $t_2$, we sample new ball positions for that time step, $(x^s_{t_2}, w^s_{t_2})$, but take into account the ball positions we resampled from the past (red notation above, $x^{s}_{t_1}$).

If we move on to $t_3$, we start to notice how certain trajectories (such as the one in gray) start dying out. This means that some sequences of ball positions will NOT be used at future time steps.

We can continue this pattern until we reach the last time step, $t_5$. We will see that by then we still have 4 samples. However, these 4 samples represent full trajectories, $x^{1}_{t=1:5}, ... , x^4_{t=1:5}$; these samples connect back to previous time steps. Full trajectories represent entire sequences of ball movements. The ones that survived are the ones that best match our video frame data.

Now that we have described how SMC behaves, let’s be a bit more precise and dig into the math. Don’t worry, I’ll explain it as I use it.

### Math Behind Sequential Monte Carlo

Let’s break this discussion into 2 parts. We’ll first discuss the Importance Sampling bit of SMC, and then we’ll discuss Importance Resampling.

Idea 1: Sequential Importance Sampling

As we previously mentioned, our samples have an associated importance weight. In a Monte Carlo estimate, an importance weight of a sample is evaluated in the same pattern we have seen before, with some unnormalized density and the proposal, $w =\frac{\gamma(x)}{q(x)}.$

Since we are now working with time-series data, we are working with a sequence of samples. Remember that with any problem we have data we observe, $y$, and data we are trying to learn, $x$ (sometimes referred to as states, or hypotheses). In the example above, $y$ would be the video frame data and $x$ would be the positions of the ball that we are trying to learn/predict.

So in Sequential Monte Carlo, our importance weight really looks more like this:

$w_t = \frac{p(y_{1:t}, x_{1:t})}{q(x_{1:t})}$

where $p(y_{1:t}, x_{1:t})$ represents our unnormalized density and $q(x_{1:t})$ is our proposal. These two distributions take time into consideration; meaning that the densities account for all the samples at each time step.

NOTE: $p(y_{1:t}, x_{1:t})$ is equivalent to $p(y_1, y_2, ..., y_t, x_1, x_2, ..., x_t)$

Let’s expand our unnormalized density using the chain rule:

$p(y_{1:t}, x_{1:t}) = p(y_{t}, x_{t} \mid x_{1:t-1}) p(y_{1:t-1}, x_{1:t-1})$

This is a way of expressing the density that predicts the $t^{th}$ video frame and ball position given the previous $1 : (t-1)$ video frames and ball positions.

And let’s do the same for our proposal distribution:

$q(x_{1:t}) = q(x_{t} \mid x_{1:t-1}) q(x_{1:t-1})$

Now that we have these two expanded forms of the distribution, let’s see how they look as an importance weight.

$w_t = \frac {p(y_{t}, x_{t} \mid x_{1:t-1}) } { q(x_{t} \mid x_{1:t-1})} \frac{p(y_{1:t-1}, x_{1:t-1})}{ q(x_{1:t-1})}$

This looks kind of messy. Let’s clean it up a bit. Remember that using chain rule, we could expand our unnormalized density as $p(y_{1:t}, x_{1:t}) = p(y_{t}, x_{t} \mid x_{1:t-1}) p(y_{1:t-1}, x_{1:t-1})$. Notice how we can solve for $p(y_{t}, x_{t} \mid x_{1:t-1})$:

$p(y_t, x_t|x_{1:t-1}) = \frac{p(y_{1:t}, x_{1:t})}{p(y_{1:t-1}, x_{1:t-1})}$

If we replace $p(y_t, x_t|x_{1:t-1})$ in the importance weight with the fraction above:

$w_t = \frac {p(y_{1:t}, x_{1:t}) } { p(y_{1:t-1}, x_{1:t-1}) q(x_{t} \mid x_{1:t-1})} \frac{p(y_{1:t-1}, x_{1:t-1})}{ q(x_{1:t-1})}$

we can recognize that the second fraction is an importance weight for the previous time step:

$w_{t-1} = \frac{p(y_{1:t-1}, x_{1:t-1})}{ q(x_{1:t-1})}$

which means our sequential importance weights are updated from time step to time step based on previous weights:

$w_t = \frac {p(y_{1:t}, x_{1:t}) } { p(y_{1:t-1}, x_{1:t-1}) q(x_{t} \mid x_{1:t-1})} \frac{p(y_{1:t-1}, x_{1:t-1})}{ q(x_{1:t-1})}$
$= \frac {p(y_{1:t}, x_{1:t}) } { p(y_{1:t-1}, x_{1:t-1}) q(x_{t} \mid x_{1:t-1})} w_{t-1}$

We can generalize using unnormalized densities if we say:

$p(x_t, y_t|x_{1:t-1}) = \frac{p(x_{1:t}, y_{1:t})}{p(x_{1:t-1}, y_{1:t-1})} = \frac{\gamma_t(x_{1:t})}{\gamma_{t-1}(x_{1:t-1})}$

Which means we can formulate a generalized update of importance weights at each next time step:

$w_t = \frac {\gamma_t(x_{1:t})} { \gamma_{t-1}(x_{1:t-1})q(x_{t} \mid x_{1:t-1})} w_{t-1}$

And there you have it! This is how we do importance sampling in sequential problems. The importance weights of current samples depend on the importance weights of previous samples.
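To make the update concrete, here is a small sketch of sequential importance weighting (no resampling yet) for a toy model of my own, not from the post: a Gaussian random walk with Gaussian observations. With the common "bootstrap" choice $q(x_{t} \mid x_{1:t-1}) = p(x_t \mid x_{t-1})$, the transition terms cancel and the incremental factor $\frac{\gamma_t}{\gamma_{t-1} \, q}$ reduces to the likelihood $p(y_t \mid x_t)$; the model, noise scales, and observations below are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mean, std):
    # Density of N(mean, std^2) evaluated elementwise at x.
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

T, S = 5, 1000                             # time steps and sampling budget
y = np.array([0.1, 0.3, 0.2, 0.5, 0.4])   # made-up observations (video-frame data)

x = rng.normal(0.0, 1.0, size=S)           # x_1 ~ q(x_1) = p(x_1)
w = gaussian_pdf(y[0], x, 0.5)             # w_1 = p(y_1 | x_1)
for t in range(1, T):
    x = x + rng.normal(0.0, 1.0, size=S)   # x_t ~ q(x_t | x_{1:t-1}) = p(x_t | x_{t-1})
    # The incremental factor gamma_t / (gamma_{t-1} q) reduces to p(y_t | x_t)
    # here, so each weight is updated from the previous one.
    w = gaussian_pdf(y[t], x, 0.5) * w
```

Note that without resampling, most of these weights collapse toward zero over time, which is exactly the problem the next idea addresses.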

Idea 2: Importance Resampling

If we refer back to the intuition behind SMC, you’ll remember that we pick which sequence of samples we decide to reuse for future sampling.

When we decide that a sequence of samples no longer matches our observations well enough, or isn’t as strong an explanation of the data as other sequences, we stop using it (refer to the gray sequences in the figure above). These sequences die out because they were not chosen to continue, based on their importance weights. We can think of this process as natural selection. The best sequences will continue to the end, while others will die out on their journey.

Remember that at each time step, we work with a sample budget. We’re going to change notation a bit and now say that we have a sample budget of $K$ samples. So in SMC, at each time step, we sample $K$ particles, and compute an importance weight for each.

$w^k = \frac{\gamma(x^k)}{q(x^k)} \qquad x^k \sim q(x^k) \qquad k = 1, ..., K$

At this point, all of the particles’ importance weights are unnormalized. This means that they do not sum to 1. So in order to choose which samples are more fit than others (i.e. have proportionally higher weights), we normalize the importance weights.

$\bar{w}^k = \frac{w^k}{\sum_{k'=1}^K w^{k'}}$

Now we can use these normalized weights to choose which ones will continue on. We do this by sampling particle indices proportionally to their normalized weights by means of a discrete (commonly known as categorical) distribution.

$a^k \sim Discrete( \bar{w}^1, ..., \bar{w}^K ) \qquad k=1, ..., K$

$a^k$ represents a single number, 1 through $K$, that tells us which particle to pick. We do this $K$ times (our sample budget) and sample with replacement. This means that we can pick the same particle more than once (this usually happens with highly weighted particles).

So now that we have $K$ indices, each between 1 and $K$, we pick the corresponding particles out!

$\tilde{x}^{k} = x^{a^k} \qquad k=1, ..., K$

where $\tilde{x}^{k}$ represents a fit sample that we will use for the next time step.
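As a minimal sketch of the normalization and resampling steps (the particle values and weights below are made up for illustration), NumPy’s weighted `choice` plays the role of the Discrete distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([-1.0, 0.2, 0.9, 1.5])   # hypothetical particle values, K = 4
w = np.array([0.5, 2.0, 0.1, 1.4])    # their unnormalized weights (made up)
K = len(w)

w_bar = w / w.sum()                   # normalized weights, sum to 1
a = rng.choice(K, size=K, p=w_bar)    # a^k ~ Discrete(w_bar), with replacement
x_tilde = x[a]                        # the "fit" particles that survive
print(x_tilde)
```

Because sampling is with replacement, highly weighted particles (like the second one here) tend to appear several times in `x_tilde`, while low-weight particles tend to vanish.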

So far we have chosen which samples are fit, $\tilde{x}^1, ..., \tilde{x}^K$. Our next step is to reweight these fit samples. Since they were chosen according to their weights, their old weights no longer represent their value. So how do we fix this? We can preserve the average of the original weights, $w^1, ..., w^K$. Let’s say that $\hat{Z}$ represents the average:

$\hat{Z} = \frac{1}{K} \sum_{k=1}^K w^k$

If we decide to set all the new weights, $\tilde w^1, ..., \tilde{w}^K$, of all the fit samples, $\tilde{x}^1, ..., \tilde{x}^K$ to $\hat{Z}$, then the average is the same!

$\frac{1}{K} \sum_{k=1}^K \tilde{w}^k = \frac{1}{K} \sum_{k=1}^K \hat{Z} = \hat{Z}$
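We can verify this bookkeeping numerically (the weights below are made up for illustration):

```python
import numpy as np

w = np.array([0.5, 2.0, 0.1, 1.4])   # original (unnormalized) weights
Z_hat = w.mean()                      # average weight, here 1.0
w_tilde = np.full_like(w, Z_hat)      # every fit particle gets weight Z_hat
print(w_tilde.mean() == Z_hat)        # the average is preserved
```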

### Summary

So let’s talk about SMC some more with a sequence of question-answer pairs.

1. Why do we use SMC?

We use it when we are working with time-dependent data.
2. How is SMC like previous Monte Carlo Estimators we have talked about?

It’s Importance Sampling, except we update our weights with past weights, and there’s a notion of pruning out bad sequences of guesses through resampling.
3. Why do we update our weights after resampling?

After resampling, we have a new collection of fit guesses/samples/particles. We do it to preserve the average of the original weights.
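Tying the two ideas together, here is a compact sketch of a full SMC loop (a bootstrap-style particle filter) for a toy problem of my own. Everything in it is an assumption for illustration: a Gaussian random-walk model for the ball position, Gaussian observation noise of 0.5, and made-up observations.

```python
import numpy as np

rng = np.random.default_rng(0)

def likelihood(y_t, x, std=0.5):
    # p(y_t | x_t) under a Gaussian observation model (an assumed toy model).
    return np.exp(-0.5 * ((y_t - x) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def smc(y, K=500):
    """Bootstrap SMC: the proposal is the transition density (a random walk)."""
    log_Z = 0.0                       # running estimate of log p(y_{1:t})
    x = rng.normal(0.0, 1.0, size=K)  # particles at t = 1
    w = np.ones(K)
    for t, y_t in enumerate(y):
        if t > 0:
            a = rng.choice(K, size=K, p=w / w.sum())  # resample ancestors
            x = x[a] + rng.normal(0.0, 1.0, size=K)   # propose next positions
        w = likelihood(y_t, x)                         # weight against the data
        log_Z += np.log(w.mean())                      # accumulate marginal likelihood
    return x, log_Z

particles, log_Z = smc(np.array([0.1, 0.3, 0.2, 0.5, 0.4]))
print(particles.shape, log_Z)
```

The running product of average weights is the marginal likelihood estimate mentioned in the recap, which is one reason SMC is so useful in practice.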

Thanks for reading this blog! Check out past blogs that build up to SMC.

## Importance Sampling (IS) vs Self-Normalized IS

I begin by discussing why Monte Carlo Estimators are used. One Monte Carlo Estimator I introduce is Importance Sampling. Most of the time, however, Importance Sampling alone is not enough. Instead we resort to using Self-Normalized Importance Sampling. In this blog, I discuss what we really mean when we say Importance Sampling. We learn that, generally, when someone says they are using Importance Sampling, they really mean Self-Normalized Importance Sampling.

Before we start, I’d like to thank Jan-Willem van de Meent for his lectures in his Advanced Machine Learning class for PhD students at Northeastern University. The images shown are from his lectures. In addition, I’d like to note that more details on MC algorithms can be found in An Introduction to Probabilistic Programming, available on [arXiv]. This book is intended as a graduate-level introduction to probabilistic programming languages and methods for inference in probabilistic programs.

## Monte Carlo Estimators

I’d like to begin by stating that Importance Sampling is a Monte Carlo Estimator. Our goal for a Monte Carlo Estimator is to calculate the expectation of some function with respect to the normalized density $\pi(x)$ where

$\pi(x) = \gamma(x) / Z \\ = P(y, x) / P(y) \\ = P(x | y) \\$

$\gamma(x)$ is an unnormalized density which we can evaluate, and $Z$ is a normalizing constant:

$Z = \int dx \gamma(x) \\ = \int dx P(y, x) \\ = P(y)$

The expectation we want to estimate is

$F = E_{\pi(x)}[f(x)] \\ = \int dx\: \pi(x) f(x) \\ = E_{p(x|y)}[f(x)]$

Why do we use Monte Carlo Estimators?

We use Monte Carlo Estimators because we do not have access to the normalized density, $\pi(x)$. Having access to the normalized density requires us to compute the normalizing constant, $Z$, which means integrating over all possible values $\gamma(x)$ could evaluate to. This is what we (almost always) cannot do! The good news is that we do have access to the unnormalized density, $\gamma(x)$.

The key idea here is that we use samples, $x^s \sim \pi(x)$, to estimate the expectation of some function we are interested in. We assume here that these samples are independent and identically distributed (IID).

$\hat{F}^S = \frac{1}{S} \sum_{s=1}^{S}f(x^s) \qquad x^s \sim \pi(x)$

There are two main properties that Monte Carlo Estimators hold. The first is that the estimate is unbiased. The second is that it is consistent. I’ll briefly describe what each means.

The estimate is unbiased when the expected value of the estimator is the exact quantity we are trying to estimate.

$E[\hat{F}^S] = E \left[ \frac{1}{S} \sum_{s=1}^{S} f(x^s) \right] \\ = \frac{1}{S} \sum_{s=1}^{S} E[f(x^s)] \\ = E [f(x)] \\ = F$

Our estimator is consistent when the estimation becomes more accurate as we use more samples. This happens when the variance approaches zero as the number of samples approaches infinity:

$\lim\limits_{S \rightarrow \infty} E \left[(\hat{F}^S - F)^2\right] = 0$.
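As a quick sanity check of both properties, here is a toy example of my own (not from the post): estimating $E[x^2]$ for $x \sim \mathrm{Uniform}(0, 1)$, whose true value is $1/3$.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(S):
    # Monte Carlo estimate of E[x^2] with x ~ Uniform(0, 1); true value is 1/3.
    x = rng.uniform(0.0, 1.0, size=S)   # x^s ~ pi(x)
    return (x ** 2).mean()              # (1/S) * sum_s f(x^s)

small, large = mc_estimate(100), mc_estimate(1_000_000)
print(small, large)                     # the larger-S estimate sits closer to 1/3
```

Running this repeatedly with fresh seeds, the estimates scatter around $1/3$ (unbiasedness), and the scatter shrinks as $S$ grows (consistency).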

## Importance Sampling

When we use importance sampling, we try to find the best estimate we can by guessing! We ask, what is a distribution that can give me samples that might also be samples from the distribution we actually care about, $\pi(x)$?

The distribution we use to make guesses is what we call a proposal distribution, $q(x)$. If we look at the figure below, we see an example of what a normalized distribution (also called the true or posterior distribution) might look like drawn in purple, and what a guessing distribution looks like drawn in green (a Gaussian in this case).

We expect $\pi(x)$ to be a distribution that is difficult to describe, that is why it has so many wiggles in this example. As for the guessing distribution, the proposal distribution, $q(x)$, it is generally a very simple distribution we know how to sample from. That is why we commonly use Gaussian distributions as proposal distributions.

The idea behind importance sampling is that we can measure how well a sample from the proposal does with respect to the posterior distribution. We can measure using an importance weight, $w$.

To get an intuition of what high and low importance weights mean, let’s refer to the example again. Let’s say we get a sample, $x$, and it is evaluated by both the proposal, $q(x)$, and posterior, $\pi(x)$, distributions (I will now stop using the word distribution when referring to the proposal and posterior distributions).

Consider two cases.

1. The sample lands outside the proposal, but inside the posterior.
2. The sample lands inside the proposal, but outside the posterior.

In the first case, we get samples that are underrepresented by the proposal, thus producing higher importance weights!

In the second case, we get samples that are overrepresented by the proposal, which yields low importance weights!

The goal here is to get samples that better represent the posterior. The higher the importance weight, the better the sample approximates a sample from the posterior.

More High-Weighted Samples = Better Approximation of Posterior Distribution

NOTE: $q(x)$ must have the same or a larger support than $\pi(x)$.

Let’s now show this using math.

Remember that with a Monte Carlo Estimator, we want to approximate the expectation of some function: $E_{\pi(x)}[f(x)] = \int dx\: \pi(x) f(x)$

We can play around with this equation in order to make it more useful. We do this by using a trick we often use in machine learning, multiplying by 1! Our 1 we multiply here is $\frac{q(x)}{q(x)}$. So now we have:

$\int dx \pi(x) f(x) = \int dx q(x) \frac{\pi(x)}{q(x)} f(x)$

As part of being an estimator, we can approximate this integral using samples:

$E_{\pi(x)}[f(x)] \approx \frac{1}{S} \sum_{s=1}^{S} \frac{\pi(x^s)}{q(x^s)}f(x^s), \; x^s \sim q(x)$

$\frac{\pi(x^s)}{q(x^s)}$ is what we call the importance weight, $w$.

And there you have it, Importance Sampling!

To do Importance Sampling, all you need is a proposal distribution you can sample from, $x \sim q(x)$, in order to get an importance weight, $w = \frac{\pi(x)}{q(x)}$, that tells you how well it approximates being a sample from the posterior distribution, $\pi(x)$.
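To make this concrete with a toy example of my own (not from the post): pretend for a moment that we really can evaluate the posterior, with $\pi(x)$ a standard normal and $q(x)$ a wider normal. Importance sampling then recovers $E_{\pi}[x^2] = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, std):
    # Density of N(0, std^2) evaluated at x.
    return np.exp(-0.5 * (x / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

S = 200_000
x = rng.normal(0.0, 2.0, size=S)              # x^s ~ q(x) = N(0, 2^2)
w = normal_pdf(x, 1.0) / normal_pdf(x, 2.0)   # w = pi(x^s) / q(x^s)
estimate = (w * x ** 2).mean()                # estimates E_pi[x^2], true value 1
print(estimate)
```

The wider proposal also satisfies the support condition above: wherever $\pi(x)$ is nonzero, $q(x)$ is too.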

We’re done! … or are we?

If you’re thinking, “Wait! but I thought we couldn’t evaluate a sample using $\pi(x)$ because that requires us knowing the normalizing constant, $Z$! And if we can’t evaluate a sample at $\pi(x)$, we can’t get an importance weight!”

and you would be right!

Folks, this is where I now introduce Self-Normalized Importance Sampling.

## Self-Normalized Importance Sampling

As mentioned before, people usually refer to Self-Normalized Importance Sampling (SNIS) when they say Importance Sampling. Now, the question is, what makes SNIS better? The answer is, it’s more useful! Instead of relying on the normalized distribution, $\pi(x)$ (which we do not know), we use something we do know: the unnormalized distribution, $\gamma(x)$. But how? By using another Monte Carlo Estimator to estimate our normalizing constant, $Z$!

Remember that $Z = \int dx \gamma(x)$ and $\pi(x) = \gamma(x) / Z$

If we rewrite $\pi(x) = \gamma(x) / Z$ as $\gamma(x) = Z \cdot \pi(x)$

We can rewrite $Z = \int dx\: \gamma(x)$ as $Z = Z \int dx\: \pi(x)$

We can then do our multiply by 1 trick again, $\frac{q(x)}{q(x)}$!

$Z = Z \int dx\: q(x) \frac{\pi(x)}{q(x)}$

We can replace $\pi(x)$:

$Z = Z \int dx\: q(x) \frac{\gamma(x)}{q(x) Z}$

Notice how the $Z$s cancel on the right side!

$Z = \int dx\: q(x) \frac{\gamma(x)}{q(x)}$

This now looks like an expectation we can turn into a Monte Carlo Estimator:

$\hat{Z}^S = \frac{1}{S} \sum_{s=1}^{S} \frac{\gamma(x^s)}{q(x^s)} \qquad x^s \sim q(x)$
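As a numerical check (my own toy example): take $\gamma(x) = e^{-x^2/2}$, whose true normalizing constant is $Z = \sqrt{2\pi} \approx 2.507$, with a wider Gaussian proposal:

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_fn(x):
    # Unnormalized density: an unnormalized N(0, 1), so Z = sqrt(2 * pi).
    return np.exp(-0.5 * x ** 2)

def q_pdf(x, std=2.0):
    # Proposal density: N(0, std^2).
    return np.exp(-0.5 * (x / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

S = 200_000
x = rng.normal(0.0, 2.0, size=S)           # x^s ~ q(x)
Z_hat = (gamma_fn(x) / q_pdf(x)).mean()    # Monte Carlo estimate of Z
print(Z_hat)                               # should land near sqrt(2 * pi)
```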

Now to get Self-Normalized Importance Sampling, we combine the Importance Sampling estimator with the normalization constant estimator we just derived.

We update

$E_{\pi(x)}[f(x)] \approx \frac{1}{S} \sum_{s=1}^{S} \frac{\pi(x^s)}{q(x^s)}f(x^s), \; x^s \sim q(x)$

to

$E_{\pi(x)}[f(x)] \approx \frac{1}{\hat{Z}^S} \frac{1}{S} \sum_{s=1}^{S} \frac{\gamma(x^s)}{q(x^s)}f(x^s) \qquad x^s \sim q(x)$

We can write this more neatly. Remember that we can define our weight as $w^s = \frac{\gamma(x^s)}{q(x^s)}$.

We can then rewrite our normalization constant estimate using this weight, $\hat{Z}^S= \frac{1}{S} \sum_{s=1}^{S}w^s \approx p(y)$

We can then write SNIS as

$\hat{F}^S = \frac{1}{S \hat{Z}^S} \sum_{s=1}^{S} w^s f(x^s) = \sum_{s=1}^{S} \frac{w^s}{\sum_{s'=1}^{S} w^{s'}} f(x^s)$

As you probably noticed, the weights are now normalized, hence Self-Normalized Importance Sampling!
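Putting the pieces together (again a toy example of my own): SNIS with the same unnormalized target $\gamma(x) = e^{-x^2/2}$ and a deliberately shifted, wider Gaussian proposal recovers $E_{\pi}[x^2] = 1$ without ever computing $Z$:

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_fn(x):
    # Unnormalized target: an unnormalized N(0, 1).
    return np.exp(-0.5 * x ** 2)

def q_pdf(x, mean=0.5, std=2.0):
    # Proposal density: N(mean, std^2), deliberately shifted off the target.
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

S = 200_000
x = rng.normal(0.5, 2.0, size=S)          # x^s ~ q(x)
w = gamma_fn(x) / q_pdf(x)                # unnormalized weights w^s
estimate = (w * x ** 2).sum() / w.sum()   # self-normalized estimate of E[x^2]
print(estimate)                           # close to the true value, 1
```

Notice that dividing by `w.sum()` is exactly the normalization in the last equation: the weights only ever enter as ratios, so the unknown $Z$ never appears.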

Thank you for reading this blog!