Variational Inference: A Brief Introduction
The purpose of this page is to provide a very brief introduction to variational inference. It is closely modeled after a series of five YouTube videos on the subject.
What we will need from information theory
This section collects some commonly used equations from information theory that are encountered when studying variational inference.
First, the information associated with an event $x$ is

$$ I(x) = -\log p(x). $$
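As a quick concrete check (a minimal sketch; the event probabilities are made up), the self-information $I(x) = -\log p(x)$, here in nats, is larger for rarer events:

```python
import math

def information(p: float) -> float:
    """Self-information of an event with probability p, in nats."""
    return -math.log(p)

# A rare event carries more information than a common one.
print(information(0.5))   # about 0.693 nats
print(information(0.01))  # about 4.605 nats
```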
Now, if we have several events, we can calculate the weighted average of the information over those events,

$$ H = -\sum_x p(x) \log p(x), $$

which is commonly referred to as the entropy of the system.
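For instance (a small sketch with arbitrary distributions), entropy is just the expected self-information, and it is largest when the outcomes are equally likely:

```python
import math

def entropy(probs) -> float:
    """Shannon entropy H = -sum_x p(x) log p(x), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain over 4 outcomes
skewed  = [0.97, 0.01, 0.01, 0.01]  # nearly deterministic

print(entropy(uniform))  # log(4) ≈ 1.386 nats, the maximum for 4 outcomes
print(entropy(skewed))   # much smaller
```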
Finally, when we move from the discrete domain to a continuous one, we speak of differential entropy, which is simply the modification of the above equation to the continuous domain,

$$ h = -\int p(x) \log p(x) \, dx. $$
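As a sanity check (a sketch; the Gaussian and the integration grid are my own choices), the differential entropy of a Gaussian has the well-known closed form $\tfrac{1}{2}\log(2\pi e \sigma^2)$, which a crude Riemann sum over the density recovers:

```python
import math

sigma = 2.0

def gauss_pdf(x: float) -> float:
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Riemann-sum approximation of h = -integral p(x) log p(x) dx over +/- 10 sigma
dx = 1e-3
grid = [i * dx for i in range(int(-10 * sigma / dx), int(10 * sigma / dx))]
h_numeric = -sum(gauss_pdf(x) * math.log(gauss_pdf(x)) * dx for x in grid)

h_closed = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
print(h_numeric, h_closed)  # the two agree to several decimal places
```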
An introduction to KL-Divergence
KL-Divergence is a measure of how different two probability distributions are. If we have two distributions, $q$ and $p$, then the KL-Divergence between them is written as

$$ D_{KL}(q \,\|\, p) = \sum_x q(x) \log \frac{q(x)}{p(x)}, $$

where it should be noted that the KL-Divergence is not symmetric: in general, $D_{KL}(q \,\|\, p) \neq D_{KL}(p \,\|\, q)$, so it is not a true distance metric.
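A quick numerical check of the asymmetry (a sketch; the two distributions are arbitrary), using the discrete form $D_{KL}(q\|p) = \sum_x q(x)\log\frac{q(x)}{p(x)}$:

```python
import math

def kl(q, p) -> float:
    """D_KL(q || p) for discrete distributions given as probability lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.1, 0.4, 0.5]
p = [0.8, 0.15, 0.05]

print(kl(q, p), kl(p, q))  # the two directions generally differ
print(kl(q, q))            # zero when the distributions are identical
```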
One useful way to think about KL-Divergence is as a measure of relative entropy between two distributions, i.e., the cross-entropy of $q$ with respect to $p$ minus the entropy of $q$,

$$ D_{KL}(q \,\|\, p) = -\sum_x q(x) \log p(x) + \sum_x q(x) \log q(x), $$

which can be manipulated into the form generally presented,

$$ D_{KL}(q \,\|\, p) = \sum_x q(x) \log \frac{q(x)}{p(x)}. $$
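The relative-entropy reading can be verified directly (a sketch reusing arbitrary distributions): cross-entropy minus entropy equals the usual KL form, term for term:

```python
import math

q = [0.1, 0.4, 0.5]
p = [0.8, 0.15, 0.05]

cross_entropy = -sum(qi * math.log(pi) for qi, pi in zip(q, p))  # H(q, p)
entropy_q = -sum(qi * math.log(qi) for qi in q)                  # H(q)
kl_direct = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))  # D_KL(q||p)

# D_KL(q || p) = H(q, p) - H(q)
print(cross_entropy - entropy_q, kl_direct)  # identical up to rounding
```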
Why use KL-Divergence
Let’s say that we have an unknown distribution, $p(z|x)$, that we would like to estimate. We can use a new distribution, $q(z)$, to approximate the desired distribution, with KL-Divergence serving as the measure of closeness:

$$ D_{KL}(q(z) \,\|\, p(z|x)) = \sum_z q(z) \log \frac{q(z)}{p(z|x)}. $$
Now, we can note that

$$ p(z|x) = \frac{p(x,z)}{p(x)}, $$

so

$$ D_{KL}(q(z) \,\|\, p(z|x)) = \sum_z q(z) \log \frac{q(z)\, p(x)}{p(x,z)}. $$

Now, manipulating the equation (pulling $\log p(x)$ out of the sum, since $\sum_z q(z) = 1$), we get

$$ D_{KL}(q(z) \,\|\, p(z|x)) + \sum_z q(z) \log \frac{p(x,z)}{q(z)} = \log p(x). $$
Now, let’s call the second term on the left-hand side of the equation above $\mathcal{L}$, for lower bound. Then, we can write the equation in its final form as

$$ D_{KL}(q(z) \,\|\, p(z|x)) + \mathcal{L} = \log p(x). $$
The equation above gives us the key insight provided by variational inference: since $\log p(x)$ does not depend on $q$, maximizing the lower bound $\mathcal{L}$ yields the same result as minimizing the KL-Divergence when approximating the conditional probability. This is beneficial because the KL-Divergence contains the conditional probability $p(z|x)$, which can be intractable, while the lower bound contains the joint distribution $p(x,z)$, which we can easily calculate.
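The decomposition $\log p(x) = \mathcal{L} + D_{KL}(q(z)\,\|\,p(z|x))$ can be checked on a tiny discrete model (a sketch; the joint table and the choice of $q$ are made up):

```python
import math

# Joint p(x, z) over x in {0, 1}, z in {0, 1} (arbitrary but valid)
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

x = 1  # observed value
log_px = math.log(sum(joint[(x, z)] for z in (0, 1)))  # log evidence

q = {0: 0.35, 1: 0.65}  # an arbitrary variational distribution over z

# Lower bound L = sum_z q(z) log( p(x,z) / q(z) )
elbo = sum(q[z] * math.log(joint[(x, z)] / q[z]) for z in (0, 1))

# Exact posterior p(z|x) and D_KL(q(z) || p(z|x))
post = {z: joint[(x, z)] / math.exp(log_px) for z in (0, 1)}
kl = sum(q[z] * math.log(q[z] / post[z]) for z in (0, 1))

print(elbo + kl, log_px)  # equal up to rounding: log p(x) = L + KL
```

Note that the ELBO and KL terms only ever touch $p(x,z)$ and $q$; the posterior is computed here purely to verify the identity.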
Goal:

We want to find $q(z)$ such that $$ \mathcal{L} = \sum_z q(z) \log \frac{p(x,z)}{q(z)} $$ is maximized.
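To see this goal concretely (a self-contained sketch; the joint values are made up): sweeping $q(z)$ over a grid for a binary latent, the lower bound peaks where $q$ matches the exact posterior $p(z|x)$, and the peak value equals $\log p(x)$:

```python
import math

# Tiny model: binary z, fixed observed x, joint values p(x, z) (made up)
p_xz = {0: 0.1, 1: 0.4}                           # p(x, z) for z in {0, 1}
px = sum(p_xz.values())                           # evidence p(x)
posterior = {z: v / px for z, v in p_xz.items()}  # exact p(z|x)

def elbo(q1: float) -> float:
    """Lower bound L for the variational choice q(z=1) = q1."""
    q = {0: 1.0 - q1, 1: q1}
    return sum(q[z] * math.log(p_xz[z] / q[z]) for z in (0, 1) if q[z] > 0)

# Sweep q over a grid and find the maximizer
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=elbo)

print(best, posterior[1])          # maximizer matches p(z=1|x) = 0.8
print(elbo(best), math.log(px))    # peak ELBO equals log p(x)
```

This also illustrates why $\mathcal{L}$ is a *lower* bound: it only reaches $\log p(x)$ when the KL term vanishes, i.e., when $q(z) = p(z|x)$.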