Graphical Models in ML and How Easy it is to Break Conditional Independence

In this blog, I briefly introduce properties that graphical models have and the kinds we see in machine learning (ML). I introduce a simple problem with its graphical model and then work up to a more realistic example. I then demonstrate how easy it is for us to break conditional independence and discuss what that means for us.

Before we start, I’d like to thank Jan-Willem van de Meent for his lectures in his Advance Machine Learning Class for PhD students at Northeastern University. I use drawings from his lectures. In addition, I’d like to address that more details on graphical models can be found on David Barber’s book, Bayesian Reasoning and Machine Learning.

What Kinds of Graphs Do We Work With?

Let’s start off by briefly discussing three different properties graphs can have. This is important because graphs are used to define the problem we are going to work with. By introducing different kinds of graphs, we are exposed to the kinds of graphs most machine learning people work with. In most cases, designing a graph cleverly can make the difference between trying to solve a hard problem and a simple one.

Property 1: Directed vs. Undirected Graphs

These are graphs where the nodes are random variables. In the direct graph (left), we have directed edges whereas the undirected graph does not (right). It’s important to note that in either case, you can define the same density, $p(a,b,c,d,e)$. It is common for us to care about how certain variables influence other variables (we’ll see a more concrete example of this in the examples below).

Property 2: Cyclic vs. Acyclic Graphs

How variables influence others is important. Our problem becomes more difficult if our graph contains cycles.

On the left, we can follow nodes A to B, B to C, C to E, and E back to B, this what called a cycle. On the right graph, we have an example of a graph that is acyclic. A graph that is acyclic and directed is known as a directed acyclic graph, or DAG. Most of the probabilistic models that we will discuss in machine learning will be DAGs. It should not be possible to visit a node twice by following the edges along the graph in a DAG.

Property 3: Connected Graphs

On the right, there are two paths to get to node f from node a, and on the left graph, there is only one path to get to f from node a. In most cases, we talk about graphs that are multiply connected and are acyclic. It’s common to work with problems where a certain variable can affect another variable via different variables in-between.

Bayesian Network Examples

Now that we have discussed these basic properties, let’s make these ideas more tangible by discussing two examples of Bayesian Networks. The first will be a simple example taken from David Barber’s book mentioned above. The second will be a more realistic example using Naive Bayes.

Simple Example

Above we show an example of a Bayesian Network (sometimes called a Belief Network and Probabilistic Generative Model). Generally when referred as a Belief Network, it is a case where all of the random variables are binary. Each letter in the nodes (circles) have a meaning. Together they describe a situation, or a scenario. The graph above shows the nodes:

B Sherlock Holmes’ apartment was Burgled
A The Alarm went off
W Watson heard the alarm
G Mrs.Gibbon heard the alarm

where the edges show which variable influences other variables. For example, if the apartment was burgled, then that can cause the alarm to go off which can then be heard. This graph shows a way of breaking down the joint distribution into conditional distributions.

For example: Whether Mrs.Gibbon hears the alarm only depends on whether the alarm went off. It does not depend on whether Watson heard the alarm. So this is the distribution of B given A, p(B|A). The graph shows p(G, W, A, B) = p(G|A)p(W|A)p(A|B)p(B).

NOTE: What this graph does NOT say is what kind of random variables they are. It also does not say which type of distributions defines these conditional distributions. It only says these conditional distributions exist. So it’s not a full specification of a model, but it is a full specification of the structure of a model.

Given a graph like above, you can reason whether the apartment was burgled given that Watson and Mrs. Gibbson heard the alarm, this would give us a posterior

$p(B|W, G) = \frac{p(G, W, B)}{p(W, G)} = \frac{\sum_{A} p(G,W,A,B)}{\sum_{A,B} p(G,W,A,B)}$

where in order to find the posterior, we need to find p(G,W,B) which describes the marginal over A and p(W,G), the marginal over A, B.

Realistic Example with Naive Bayes

The real power of Bayesian Networks is that if you know things about the world, such as which variables can causally influence other variables then you can structure your model in such a way that can reflect these causal relationships. Below is an example with Naive Bayes.

In Naive Bayes we make some assumptions. Let’s say we have YiBernoulli(π), i=1, …,N, where Yi=1 indicates whether an email is spam and Yi=0 indicates the message is not spam. So this probability will roughly be the fraction of the email that is spam. Then we model whether certain words occur in the email or not,Xij|Yi=c∼Bernoullijc) ,i=1, …,N, j=1,…,D.

Example: Let’s say that a general spam word is Viagra. So then the probability that the word Viagra occurs in the email given that the email is spam is Xij=1, (word j occurs in message i), and the probability that Viagra occurs in the email given that the email is not spam is Xij=0.

NOTE: This notation is almost like a script that can tell us how think about the problem in terms of code. Basically it says run a for loop, i = 1 to N, where for each i, create a random variable that you sample from a Bernoulli and run a second for loop, j = 1 to D, and for each j, sample whether the particular term will occur in the email or not. Essentially, this graphical model is a compressed way of taking this code and turning it into a picture.

we show a variable π that influences the variable Yi which influences the variable Xij where there are N variables of Yi and there are D different j. The overlap between the two boxes means that there is an X for every word and document combination. Finally, for all the different classes, C, (we only have 2) we have the probabilities of the word appearing in the document.

If you’re clever about the structure of the graphical model, then hopefully what you can do is structure it in such a way that there are conditional independencies and you can exploit them. For example, we can write the joint probability as

$p(X,Y) = \prod_{i=1}^{N} p(Y_i | \prod_{i=1}^{D} p(X_{ij} |Y_i) = \prod_{i=1}^{N} p(Y_i, X_i)$

where the documents are independent. Additionally we can write, given a word, the probability of a message being spam is

$p(Y_i|X_i) = \frac{p(X_i|Y_i) p(Y_i)}{p(X_i)}$

$p(X_i) = \sum_{k={\mathrm{spam, !spam}}} p(X_i|Y_i = k) p(Y_i=k)$

which is analytically tractable.

Easy to Break Conditional Independence

Suddenly if we change our model, i.e. make it more Bayesian, we’ll notice that it is really easy to break conditional independence. Imagine that instead of having a constant parameter of π, (instead of saying we know the fraction of email that is spam) we now say we do not know. Instead want to additionally infer what that probability is, Π. As well, we want to infer which words are occurring in spam emails and not spam emails, Θ. Suddenly, our graphical looks more like this graph to the left.

This changes our model to be

$Y_i| \Pi=\pi \sim \mathrm{Bernoulli}(\Pi) \\ X_{ij}| Y_i = c, \Theta_{jc}=\theta_{jc} \sim \mathrm{Bernoulli}(\Theta_{jc})$

and naturally we define our prior distributions for Π and Θjc as Beta distributions since they are between 0 and 1.

$\Pi \sim \mathrm{Beta}(\alpha^{\pi}, \beta^{\pi}) \\ \Theta \sim \mathrm{Beta}(\alpha^{\theta}, \beta^{\theta})$

Now we ask the question,“Can I still write my probability over spam classes or not spam classes and words as a product over individual documents?” The answer is no.

Our distribution changes and now we have

$p(X,Y) = p(X|Y) p(Y)$

$p(Y) = \int p(\pi) \prod_{i=1}^N p(Y_i|\pi) d\pi$

$p(X|Y) = \int \prod_{j=1}^{D} p(\theta_j) \prod_{i=1}^N p(X_{ij}|Y_j, \theta_j) d\theta$

Since we have to do all these integrals over these parameters, it is no longer true that this is just a product over independent terms. Because now we have a shared parent, these variables are no longer independent. This makes things a lot harder.

There will be many cases where many of the solutions are analytically intractable, and this is when we turn to approximate inference algorithms such as Monte Carlo methods, Self-Normalized Importance Sampling, Variational Inference, etc.