# Sequential Monte Carlo (SMC)

We begin with a short recap of all the previous algorithms we discussed that are prerequisite to Sequential Monte Carlo. Then I intuitively describe how SMC works. After, we dig into the math behind SMC. We first discuss how importance weights are calculated at each time step, and then we discuss how SMC maintains this notion of survival of the fittest with sample trajectories through resampling.

Before we start, I’d like to thank Jan-Willem van de Meent for his lectures in his Advance Machine Learning Class for PhD students at Northeastern University. The images shown are inspired by his lectures. In addition, I’d like to address that more details on SMC can be found in An Introduction to Probabilistic Programming available [arXiv]. This book is intended as a graduate-level introduction to probabilistic programming languages and methods for inference in probabilistic programs.

### Mini Review

Before we move on to other Monte Carlo estimators we might use in machine learning, let’s recap:

• We have discussed what a Monte Carlo Estimator is. It helps us estimate an expectation of some function via sampling.
• We introduce Importance Sampling as our first Monte Carlo Estimator. We learn that Importance Sampling requires us to evaluate samples using a normalized density, which is something we cannot do (almost always).
• Then we describe Self-Normalized Importance Sampling (SNIS). SNIS uses two Monte Carlo Estimators; the same one in Importance Sampling and a second one to estimate the normalization constant. This allows us to evaluate samples via an unnormalized density, which we have access to.
• Importance Sampling methods give us marginal likelihoods, but the downfall is that they do random guessing.
• We then discuss Monte Carlo Markov Chains (MCMC). MCMC uses transition kernels that let us stay in the target density once we have a good sample (one from the target density). However, (almost always) successful MCMC algorithms need to satisfy detailed balance.
• We learn that Metropolis-Hastings (a fancy MCMC algorithm) let’s us use any proposal due to an acceptance probability it uses to manage detailed balance for us.
• In the previous post, we describe an algorithm that marries Importance Sampling and MCMC, Annealing Importance Sampling. This algorithm successfully breaks a hard problem into a series of simpler problems. It provides a marginal likelihood and uses a transition kernel.

Now we will introduce another Monte Carlo Estimator that is used for a set of different problems in machine learning, Sequential Monte Carlo.

### Tick Tock: Introducing Time to the Problem!

All of the algorithms we have discuss so far are for problems for which we do not need to consider the element of time! What does that mean? Let’s consider an example of looking at a sequence of video frames of a basketball moving from time frame to time frame.

We record that at time $t_1$ (which can be any arbitrary time stamp), the ball to be up high. We continue to watch the ball move at $t_2$ and $t_3$, steadily moving downward. The question we have is, “Where will the ball be in the future? This time-dependent data is referred to as time-series data.

There are many combinations of future movements the ball can make. Several can make sense by following the laws of physics, others can be absolutely garbage (like the ball suddenly shooting upwards, or spelling out a word).

Now consider using any of the Monte Carlo Estimating algorithms we have already discussed. If our data is time dependent, then Importance Sampling would have a very difficult time picking up a pattern that correlates movements between frames. It would probably take ages to get a good set of samples from the posterior! Instead, we can use a more appropriate algorithm for time-dependent problems, such as Sequential Monte Carlo (SMC).

### Intuition behind Sequential Monte Carlo

SAMPLING BUDGET. Let’s start off by stating that every sampling scheme has this notion of a sampling budget. This sampling budget (or sample budget) is the maximum number of samples we can afford to use. The number in our budget may depend on the computer we use, the dimensionality of the problem, or even just the amount of time we can afford running our algorithm (because these algorithms do take time!). Usually we use the notation $S$ to represent the number of samples in our sample budget, $x^1, x^2, ... x^S$.

In time-series problems, we perform inference (using some form of Monte Carlo Estimator) at each time step. This means we use the same number of samples for each time step, i.e. the number of samples in our sampling budget, $S$.

NOTE: People have adopted the term particle to replace the term sample. Instead of saying we have a sampling budget of 10 samples… it’s common to say we have a sampling budget of 10 particles.

So let’s now consider an example where we have a sampling budget of four, $S= 4$, and we are working with a small problem with five time steps, $t_1, ... t_5$. Imagine that each sample represents the position of the ball at its respective time frame. Below we show an example where we have 4 samples at $t_1$, represented by circles, $x^1, x^2, ..., x^4$. You can image that these samples represent 4 guesses of the position of the ball for the data from the video frame. Now imagine that each sample (or ball position) has an associated importance weight (just like we discussed in Importance Sampling), $(x^1, w^1), (x^2, w^2), ..., (x^4, w^4)$. The weight tells us how well the ball position matches the data in the video frame. In the figure below, I drew larger circles to represent higher importance weights and smaller circles to represent lower importance weights.

So what SMC does next is decide which samples we should keep and which ones which throw away. This is a way of saying that some guesses of the ball’s position are going to be good ones, and some will be not so good. We want to keep working with our already good guesses at the next time step, so that we can build on those guesses with other guesses. We do this by resampling. This means we will sample with replacement 4 new samples to build on in the next time step, $t_2$. We resample based by their weights (we’ll go more into detail in a bit).

In the figure above, it so happened that we resampled $x^1_{t_1}$ twice and $x^3_{t_1}$ again, twice. This means that those samples for the ball’s position were really good, and we want to keep working with those samples. Now at $t_2$, we sample new ball positions for that time step, $(x^s_{t_2}, w^s_{t_2})$, but keep into account the ball positions we resampled from the past (red notation above, $x^{s}_{t_1}$).

If we move on to $t_3$, we start to notice how certain trajectories (such as the one in gray) start dying out. This means that some sequences of ball positions will NOT be used for at future time steps.

We can continue this pattern until we reach the last time step, $t_5$. We will see that by then we still have 4 samples. However these 4 samples represent full trajectories, $x^{1}_{t=1:5}, ... , x^4_{t=1:5}$; these samples connect back to previous time steps. Full trajectories represent entire sequences of ball movements. The ones that survived are the ones that best match our video frame data.

Now that we have described how SMC behaves, let’s be a bit more precise and dig into the math. Don’t worry, I’ll explain it as I use it.

### Math Behind Sequential Monte Carlo

Let’s break this discussion into 2 parts. We’ll first discuss the Importance Sampling bit of SMC, and then we’ll discuss Importance Resampling.

Idea 1: Sequential Importance Sampling

As we previously mentioned, our samples have an associated importance weight. In a Monte Carlo estimate, an importance weight of a sample is evaluated in the same pattern we have seen before, with some unnormalized density and the proposal, $w =\frac{\gamma(x)}{q(x)}.$

Since we are now working with time-series data, we are working with a sequence of samples. Remember that with any problem we have data we observe, $y$, and data we are trying to learn, $x$ (sometimes referred to as states, or hypothesis). In the example above, $y$ would be the video frame data and $x$ would be the positions of the ball that we are trying to learn/predict.

So in Sequential Monte Carlo, our importance weight really looks more like this:

$w_t = \frac{p(y_{1:t}, x_{1:t})}{q(x_{1:t})}$

where $p(y_{1:t}, x_{1:t})$ represents our unnormalized density and $q(x_{1:t})$ is our proposal. These two distributions take time into consideration; meaning that the densities account for all the samples at each time step.

NOTE: $p(y_{1:t}, x_{1:t})$ is equivalent to $p(y_1, y_2, ...., y_t, x_1, x_2, ..., x_t)$

Let’s expand our unnormalized density using the chain rule:

$p(y_{1:t}, x_{1:t}) = p(y_{t}, x_{t} \mid x_{1:t-1}) p(y_{1:t-1}, x_{1:t-1})$

This is a way of expressing the density that predicts the $t^{th}$ video frame and ball position given the previous $1 : (t-1)$ video frames and ball positions.

And let’s do the same for our proposal distribution:

$q(x_{1:t}) = q(x_{t} \mid x_{1:t-1}) q(x_{1:t-1})$

Now that we have these two expanded forms of the distribution, let’s see how they look as an importance weight.

$w = \frac {p(y_{t}, x_{t} \mid x_{1:t-1}) } { q(x_{t} \mid x_{1:t-1})} \frac{p(y_{1:t-1}, x_{1:t-1})}{ q(x_{1:t-1})}$

This looks kind of messy. Let’s clean it up a bit. Remember that using chain rule, we could expand our unnormalized density as $p(y_{1:t}, x_{1:t}) = p(y_{t}, x_{t} \mid x_{1:t-1}) p(y_{1:t-1}, x_{1:t-1})$. Notice how we can solve for $p(y_{t}, x_{t} \mid x_{1:t-1})$:

$p(y_t, x_t|x_{1:t-1}) = \frac{p(y_{1:t}, x_{1:t})}{p(y_{1:t-1}, x_{1:t-1})}$

If we replace $p(y_t, x_t|x_{1:t-1})$ in the importance weight with the fraction above:

$w_t = \frac {p(y_{1:t}, x_{1:t}) } { p(y_{1:t-1}, x_{1:t-1}) q(x_{t} \mid x_{1:t-1})} \frac{p(y_{1:t-1}, x_{1:t-1})}{ q(x_{1:t-1})}$

we can recognized that the second fraction is an importance weight for the previous time step:

$w_{t-1} = \frac{p(y_{1:t-1}, x_{1:t-1})}{ q(x_{1:t-1})}$

which means our sequential importance weights are updated from time step to time step based on previous weights:

$w_t = \frac {p(y_{1:t}, x_{1:t}) } { p(y_{1:t-1}, x_{1:t-1}) q(x_{t} \mid x_{1:t-1})} \frac{p(y_{1:t-1}, x_{1:t-1})}{ q(x_{1:t-1})}$
$= \frac {p(y_{1:t}, x_{1:t}) } { p(y_{1:t-1}, x_{1:t-1}) q(x_{t} \mid x_{1:t-1})} w_{t-1}$

We can generalize using unnormalized densities if we say:

$p(x_t, y_t|x_{1:t-1}) = \frac{p(x_{1:t}, y_{1:t})}{p(x_{1:t-1}, y_{1:t-1})} = \frac{\gamma_t(x_{1:t})}{\gamma_{t-1}(x_{1:t-1})}$

Which means we can formulate a generalized update of importance weights at each next time step:

$w_t = \frac {\gamma_t(x_{1:t})} { \gamma_{t-1}(x_{1:t-1})q(x_{t} \mid x_{1:t-1})} w_{t-1}$

And there you have it! This is how we do the importance sampling in sequential problems. The importance weight of current samples depend on the importance weights of previous samples.

Idea 2: Importance Resampling

If we refer back to the intuition behind SMC, you’ll remember that we pick which sequence of samples we decide to reuse for future sampling.

When we decide that a sequence of samples no longer matches our observations well enough, or aren’t strong enough explanations of the data as other sequences of samples, we stop using them (refer to grey sequences in the figure above). These sequences die out because they were not chosen to continue based on their importance weights. We can think of this process as natural selection. The best sequences will continue to the end, while others will die out on their journey to the end.

Remember that at each time step, we work with a sample budget. We’re going to change notation a bit and now say that we have a sample budget of $K$ samples. So in SMC, at each time step, we sample $K$ particles, and compute an importance weight for each.

$w^k = \frac{\gamma(x^k)}{q(x^k)} \qquad x^k \sim q(x^k) \qquad k = 1, ..., K$

At this point, all of the particles’ importance weights are unnormalized. This means that they do not sum to 1. So in order to choose which samples are more fit than others (i.e. have proportionally higher weights), we normalize the importance weights.

$\bar{w^k} = \frac{w^k}{\sum_{k'=1}^K w^{k'}}$

Now we can use these normalized weights in order to choose which ones will continue on. We do this by sampling particle indexes proportionally to their normalized weight by means of a discrete (commonly known as categorical) distribution.

$a^k \sim Discrete( \bar{w^1}, ..., \bar{w^k} ) \qquad k=1, ..., K$

$a^k$ represents a single number, 1 through $K$, that is tells us which particle to pick. We do this K times (sample budget), and sample with replacement. This means that we can pick the same particle more than once (this usually happens with high weighted particles).

So now that we have $K$ number of numbers that are between 1 through $K$, we pick them out!

$\tilde{x}^{k'} = x^{a^k} \qquad k'=1, ..., K \quad k=1, ..., K$

where $\tilde{x}^{k'}$ represents a fit sample that we will use for the next time step.

So far we have chosen which samples are fit, $\tilde{x}^1, ..., \tilde{x}^K$. Our next step is to reweigh these fit samples. Since they were all chosen according to their weight, they do not sum up to 1 anymore. In fact, their weights do not fully represent their value anymore. So how do we fix this? Well, we can do this by preserving the average of the original weights, $w^1, ..., w^K$. Let’s say that $\hat{Z}$ represents the average:

$\hat{Z} = \frac{1}{K} \sum_{k=1}^K w^k$

If we decide to set all the new weights, $\tilde w^1, ..., \tilde{w}^K$, of all the fit samples, $\tilde{x}^1, ..., \tilde{x}^K$ to $\hat{Z}$, then the average is the same!

$\hat{Z} = \frac{1}{K} \sum_{k=1}^K \tilde{w}^k = \frac{1}{K} \sum_{k=1}^K \hat{Z}^k$

### Summary

So let’s talk about SMC some more with a sequence of question-answer pairs.

1. Why do we use SMC?

We use it when we are working with time-dependent data.
2. How is SMC like previous Monte Carlo Estimators we have talked about?

It’s Importance Sampling, except we update our weights with past weights, and there’s a notion of pruning out bad sequences of guesses through resampling.
3. Why do we update our weights after resampling?

After resampling, we have a new collection of fit guesses/samples/particles. We do it to preserve the average of the original weights.

Thanks for reading this blog! Check out past blogs that build up to SMC.