# Importance Sampling (IS) vs Self-Normalized IS

I begin by discussing why Monte Carlo Estimators are used. One Monte Carlo Estimator I introduce is Importance Sampling. Most of the time, however, Importance Sampling alone is not enough. Instead we resort to using Self-Normalized Importance Sampling. In this blog, I discuss what we really mean when we say Importance Sampling. We learn that, generally, when someone says they are using Importance Sampling, they are really mean Self-Normalized Importance Sampling.

Before we start, I’d like to thank Jan-Willem van de Meent for his lectures in his Advance Machine Learning Class for PhD students at Northeastern University. The images shown are from his lectures. In addition, I’d like to address that more details on MC algorithms can be found in An Introduction to Probabilistic Programming available [arXiv]. This book is intended as a graduate-level introduction to probabilistic programming languages and methods for inference in probabilistic programs.

## Monte Carlo Estimators

I’d like to begin by stating that Importance Sampling is a Monte Carlo Estimator. Our goal for a Monte Carlo Estimator is to calculate the expectation of some function with respect to the normalized density $\pi(x)$ where

$\pi(x) = \gamma(x) / Z \\ = P(y, x) / P(y) \\ = P(x | y) \\$

$\gamma(x)$ is an unnormalized density which we can sample from and $Z$ is a normalizing constant

$Z = \int dx \gamma(x) \\ = \int dx P(y, x) \\ = P(y)$

$F = E_{\pi(x)}[f(x)] \\ = \int dx \pi(x) f(x) \\ = E_{p(x|y)}[f(x)]$

Why do we use Monte Carlo Estimators?

We use Monte Carlo Estimators because we do not have access to the normalized density, $\pi(x)$. Having access to the normalized density requires us to compute the normalizing constant, $Z$, which means integrating over all possible values $\gamma(x)$ could evaluate to. This is what we (almost always) cannot do! The good news is that we do have access to the unnormalized density, $\gamma(x)$.

The key idea here is that we use samples from the unnormalized density, $\gamma(x)$ to estimate some function we are interested in learning. We assume here that samples from $\gamma(x)$ are independent and identically distributed, (IID).

$\hat{F}^S = \frac{1}{S} \sum_{s=1}^{S}f(x^s) \ x^s \sim \pi(x)$

There are two main properties that Monte Carlo Estimators hold. The first is that the estimate is unbiased. The second is that it is consistent. I’ll briefly describe what each mean.

The estimate is unbiased when the expected value of the estimator is the exact quantity we are trying to estimate.

$E[\hat{F}^S] = E[ \frac{1}{S} \sum_{s=1}^{S} f(x^S)] \\ = \frac{1}{S} \sum_{s=1}^{S} E[f(x^S)] \\ = E [f(x^S)] \\ = F$

Our estimator is consistent when the estimation becomes more accurate as we use more samples. This happens when the variance approaches zero as the number of samples approaches infinity:

$\lim\limits_{S \rightarrow \infty} E \left[(\hat{F}^S - F)^2\right] = 0$.

## Importance Sampling

When we use importance sampling, we try to find the best estimate we can by guessing! We ask, what is a distribution that can give me samples that might also be samples from the distribution we actually care about, $\pi(x)$?

The distribution we use to make guesses is what we call a proposal distribution, $q(x)$. If we look at the figure below, we see an example of what a normalized distribution (also called the true or posterior distribution) might look like drawn in purple, and what a guessing distribution looks like drawn in green (a gaussian in this case).

We expect $\pi(x)$ to be a distribution that is difficult to describe, that is why it has so many wiggles in this example. As for the guessing distribution, the proposal distribution, $q(x)$, it is generally a very simple distribution we know how to sample from. That is why we commonly use Gaussian distributions as proposal distributions.

The idea behind importance sampling is that we can measure how well a sample from the proposal does with respect to the posterior distribution. We can measure using an importance weight, $w$.

To get an intuition of what high and low importance weights mean, let’s refer to the example again. Let’s say we get a sample, x, and it is evaluated by both the proposal, $q(x)$, and posterior, $\pi(x)$, distributions ( I will now stop using the word distribution when referring to the proposal and posterior distributions).

Consider two cases.

1. The sample lands outside the proposal, but inside the posterior.
2. The sample lands inside the proposal, but outside the posterior.

In the first case, we get samples that are under represented by the proposal, thus producing higher importance weights!

In the second case, we get samples that are overrepresented by the proposal, which yields low importance weights!

The goal here is get samples that better represent the posterior. The higher the importance weight, the better it approximates a sample from the posterior.

More High-Weighted Samples = Better Approximation of Posterior Distribution

NOTE: $q(x)$ must have the same or a larger support than $\pi(x)$.

Let’s now show this using math.

Remember that with a Monte Carlo Estimator, we want to approximate some function. $E_{\pi(x)}[f(x)] = \int dx \pi(x) f(x)$

We can play around with this equation in order to make it more useful. We do this by using a trick we often use in machine learning, multiplying by 1! Our 1 we multiply here is $\frac{q(x)}{q(x)}$. So now we have:

$\int dx \pi(x) f(x) = \int dx q(x) \frac{\pi(x)}{q(x)} f(x)$

As part of being an estimator, we can approximate this integral using samples:

$E_{q(x)}[f(x)] \approx \frac{1}{S} \sum_{s=1}^{S} \frac{\pi(x^S)}{q(x^S)}f(x^S), \; x^S \sim q(x)$

$\frac{\pi(x^S)}{q(x^S)}$ is what we call the importance weight, $w$.

And there you have it, Importance Sampling!

To do Importance Sampling, all you need is a proposal distribution you can sample from, $x \sim q(x)$, in order to get an importance weight, $w = \frac{\pi(x)}{q(x)}$, that tells you how well it approximates being a sample from the posterior distribution, $\pi(x)$.

We’re done! … or are we?

If you’re thinking, “Wait! but I thought we couldn’t evaluate a sample using $\pi(x)$ because that requires us knowing the normalizing constant, $Z$! And if we can’t evaluate a sample at $\pi(x)$, we can’t get an importance weight!”

and you would be right!

Folks, this is where I now introduce Self-Normalized Importance Sampling.

## Self-Normalized Importance Sampling

As mentioned before, people usually refer to Self-Normalized Importance Sampling (SNIS) when they say Importance Sampling. Now, the question is, what makes SNIS better? The answer is, it’s more useful! Instead of relying on the normalized distribution, $\pi(x)$ (which we do not know), we use something we do know: the unnormalized distribution, $\gamma(x)$. But how? By using another Monte Carlo Estimator to estimate our normalizing constant, $Z$!

Remember that $Z = \int dx \gamma(x)$ and $\pi(x) = \gamma(x) / Z$

If we rewrite $\pi(x) = \gamma(x) / Z$ as $\gamma(x) = Z * \pi(x)$

We can rewrite $Z = \int dx\: \gamma(x)$ as $Z = Z \int dx\: \pi(x)$

We can then do our multiply by 1 trick again, $\frac{q(x)}{q(x)}$!

$Z = Z \int dx\: q(x) \frac{\pi(x)}{q(x)}$
We can replace $\pi(x)$
$Z = Z \int dx q(x) \frac{\gamma(x)}{q(x) Z}$
Notice how the $Z$s cancel above on the right side!
$Z = \int dx q(x) \frac{\gamma(x)}{q(x)}$
This now looks like an expectation
we can turn into a Monte Carlo Estimator:
$\hat{Z}^S = \frac{1}{S} \sum_{s=1}^{S} \frac{\gamma(x^S)}{q(x^S)} \:\: x^S \sim q(x)$

Now to get Self-Normalized Importance Sampling, we combine the Importance Sampling estimator with the normalization constant estimator we just derived.

We update

$E_{q(x)}[f(x)] \approx \frac{1}{S} \sum_{s=1}^{S} \frac{\pi(x^S)}{q(x^S)}f(x^S), \; x^S \sim q(x)$

to

$E_{q(x)}[f(x)] \approx \frac{1}{\hat{Z}^S} \frac{1}{S} \sum_{s=1}^{S} \frac{\gamma(x^S)}{q(x^S)}f(x^S) \:\: x^S \sim q(x)$

We can write this more neatly. Remember that we can define our weight as $w^s = \frac{\gamma(x^S)}{q(x^S)}$.

We can then rewrite our normalization constant estimate using this weight, $\hat{Z}^S= \frac{1}{S} \sum_{s=1}^{S}w^S \approx p(y)$

We can then write SNIS as

$\hat{F}^S = \frac{1}{S \hat{Z}^S} \sum_{s=1}^{S} w^S f(x^S) = \sum_{s=1}^{S} \frac{w^S}{\sum_{s=1}^{S} W^S} f(x^S)$

As you probably noticed, the weights are now normalized, hence Self-Normalized Importance Sampling!

Thank you for reading this blog!