Dirichlet Process Mixture Model
Dirichlet Process Gaussian Mixture Model
This tutorial provides a brief overview of the Dirichlet Process Gaussian Mixture Model (DP-GMM) and the prerequisite information needed to understand it.
Prerequisite Information
Before diving into the DP-GMM, we will first cover some prerequisite information. We will start by defining the multivariate Gaussian and its likelihood function. Next, a brief overview of conjugate priors will be provided. Finally, we will discuss two commonly used distributions: the inverse-Wishart and the Dirichlet distribution.
Multivariate Gaussian
The multivariate Gaussian is simply a generalization of the univariate Gaussian to higher dimensions. To define this slightly more formally, a vector of real-valued random variables, $X = [X_1, X_2, \ldots, X_n]$, has a Gaussian distribution with mean $\mu \in \mathcal{R}^n$ and covariance $\Sigma \in P^n_{++}$ (where $P^n_{++}$ is the manifold of all symmetric positive definite $n \times n$ matrices) if its distribution can be characterized by

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right).$$
Multivariate Gaussian Likelihood
Given a mean $\mu$ and covariance $\Sigma$, we can calculate the likelihood of a set of observed vectors $\mathcal{X} = \{x_1, x_2, \ldots, x_N\}$ (assumed i.i.d.) by

$$p(\mathcal{X} \mid \mu, \Sigma) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid \mu, \Sigma),$$

which can be represented as

$$p(\mathcal{X} \mid \mu, \Sigma) = (2\pi)^{-\frac{Nn}{2}} |\Sigma|^{-\frac{N}{2}} \exp\left( -\frac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1} S\right) - \frac{N}{2} (\bar{x} - \mu)^\top \Sigma^{-1} (\bar{x} - \mu) \right),$$

where

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \quad \text{and} \quad S = \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^\top.$$
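As a quick sanity check of these expressions, the sketch below (an illustrative example, not part of the original tutorial; the parameter values are arbitrary) evaluates the log-likelihood both by summing per-point Gaussian log-densities with scipy and by using the sufficient-statistic form above; the two results should agree.

```python
# Illustrative sketch: evaluate the multivariate Gaussian log-likelihood two ways.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n, N = 2, 500                              # dimension and number of samples
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=N)

# Direct form: sum of per-point log-densities.
ll_direct = multivariate_normal.logpdf(X, mean=mu, cov=Sigma).sum()

# Sufficient-statistic form: sample mean and scatter matrix.
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar)
Sigma_inv = np.linalg.inv(Sigma)
ll_stats = (
    -0.5 * N * n * np.log(2 * np.pi)
    - 0.5 * N * np.log(np.linalg.det(Sigma))
    - 0.5 * np.trace(Sigma_inv @ S)
    - 0.5 * N * (x_bar - mu) @ Sigma_inv @ (x_bar - mu)
)

print(ll_direct, ll_stats)                 # the two values should match
```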
Conjugate Prior
Conjugate priors are widely used because they simplify computation (i.e., they allow us to analytically integrate out model parameters and only sample the variables of interest, for example through collapsed Gibbs sampling, which will be discussed in greater detail later).
Def: A family $\mathcal{F}$ of prior distributions, $p(\theta)$, is conjugate to a likelihood, $p(\mathcal{D} \mid \theta)$, if the posterior, $p(\theta \mid \mathcal{D})$, is also in $\mathcal{F}$.
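For example, consider a Bernoulli likelihood with a Beta prior (a standard illustration of conjugacy, separate from the Gaussian case considered below). If $p(\theta) = \text{Beta}(\theta \mid a, b)$ and $\mathcal{D}$ contains $k$ successes in $N$ trials, so that $p(\mathcal{D} \mid \theta) = \theta^{k}(1-\theta)^{N-k}$, then

$$p(\theta \mid \mathcal{D}) \propto \theta^{a+k-1}(1-\theta)^{b+N-k-1} = \text{Beta}(\theta \mid a+k, b+N-k),$$

which is again a Beta distribution, so the Beta family is conjugate to the Bernoulli likelihood.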
With that in mind, we will next define the Gaussian-inverse-Wishart and the Dirichlet distribution, which are two commonly used conjugate priors for the multivariate Gaussian and the categorical distributions, respectively.
Gaussian Inverse Wishart
For the mean $\mu$ and covariance matrix $\Sigma$ of a multivariate Gaussian, the Gaussian-inverse-Wishart (GIW) prior is fully conjugate,

$$p(\mu, \Sigma) = \text{GIW}(\mu, \Sigma \mid m_o, \kappa_o, \nu_o, S_o),$$

which can be represented as,

$$\text{GIW}(\mu, \Sigma \mid m_o, \kappa_o, \nu_o, S_o) = \mathcal{N}\!\left(\mu \,\middle|\, m_o, \tfrac{1}{\kappa_o}\Sigma\right)\, \mathcal{IW}(\Sigma \mid S_o, \nu_o),$$

where

$$\mathcal{IW}(\Sigma \mid S_o, \nu_o) \propto |\Sigma|^{-\frac{\nu_o + n + 1}{2}} \exp\left( -\frac{1}{2}\operatorname{tr}\!\left(S_o \Sigma^{-1}\right) \right).$$
| Parameter | Definition |
| --- | --- |
| $m_o$ | Prior mean for $\mu$ |
| $\kappa_o$ | Strength of belief in $m_o$ |
| $S_o$ | Prior scale matrix for $\Sigma$ |
| $\nu_o$ | Strength of belief in $S_o$ (degrees of freedom) |
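Because the GIW prior is conjugate, the posterior after observing data is again a GIW with updated hyper-parameters. The sketch below (an illustrative implementation of the standard update equations, not code from the original tutorial) computes those updates:

```python
# Illustrative sketch: GIW posterior hyper-parameter updates for data X (shape N x n).
import numpy as np

def giw_posterior(X, m_o, kappa_o, nu_o, S_o):
    """Return the posterior GIW hyper-parameters (m_N, kappa_N, nu_N, S_N)."""
    X = np.atleast_2d(X)
    N = X.shape[0]
    x_bar = X.mean(axis=0)
    scatter = (X - x_bar).T @ (X - x_bar)          # sum of squared deviations

    kappa_N = kappa_o + N
    nu_N = nu_o + N
    m_N = (kappa_o * m_o + N * x_bar) / kappa_N
    diff = (x_bar - m_o).reshape(-1, 1)
    S_N = S_o + scatter + (kappa_o * N / kappa_N) * (diff @ diff.T)
    return m_N, kappa_N, nu_N, S_N

# Example usage with made-up data and a weak prior.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
m_N, kappa_N, nu_N, S_N = giw_posterior(
    X, m_o=np.zeros(2), kappa_o=1.0, nu_o=4.0, S_o=np.eye(2)
)
print(m_N, kappa_N, nu_N)
```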
Dirichlet Distribution
To define a conjugate prior for the multinomial distribution, the Dirichlet distribution is commonly utilized. The Dirichlet distribution is a distribution over possible parameter vectors for a multinomial distribution (e.g., the Beta distribution is a special case of the Dirichlet distribution when n = 2). The Dirichlet distribution is defined as

$$\text{Dir}(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{n} \alpha_i\right)}{\prod_{i=1}^{n} \Gamma(\alpha_i)} \prod_{i=1}^{n} \theta_i^{\alpha_i - 1}, \qquad \theta \in S,$$

where $\theta = (\theta_1, \ldots, \theta_n)$, $\alpha = (\alpha_1, \ldots, \alpha_n)$ s.t. $\alpha_i > 0$, and $S$ is the probability simplex. A simplex is simply a generalization of the triangle to n-dimensional space (i.e., $S = \lbrace \theta \in \mathcal{R}^n : \theta_i \geq 0, \sum_{i=1}^n \theta_i = 1 \rbrace$).
With that brief description of the Dirichlet distribution, it is useful to visualize an example case. To do this, a simple Python script that plots the Dirichlet density when n = 3 (i.e., a situation in which three clusters are present in the dataset) has been provided. The figure below shows the density when $\alpha = $ [1,1,1], [5,5,5], and [2,2,8]. As can be seen, when $\alpha = $ [1,1,1] there is a uniform likelihood for each cluster (i.e., there is no prior knowledge); however, as the $\alpha$'s vary, the density can be seen to shift around the simplex.
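A minimal sketch of such a plotting script is shown below (an illustrative reconstruction, not the original script); it shades the Dirichlet density over an equilateral triangle representing the 2-simplex:

```python
# Illustrative sketch: plot the Dirichlet density on the 2-simplex (n = 3).
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import gammaln

def dirichlet_log_pdf(theta, alpha):
    """Log density of Dir(alpha) evaluated at theta (list of component arrays)."""
    alpha = np.asarray(alpha, dtype=float)
    log_norm = gammaln(alpha.sum()) - gammaln(alpha).sum()
    return log_norm + sum((a - 1.0) * np.log(t) for a, t in zip(alpha, theta))

def plot_dirichlet(alpha, ax, resolution=300):
    """Shade the density over an equilateral triangle representing the simplex."""
    xs = np.linspace(0.0, 1.0, resolution)
    ys = np.linspace(0.0, np.sqrt(3) / 2, resolution)
    X, Y = np.meshgrid(xs, ys)
    # Invert the affine map from barycentric (t1, t2, t3) to Cartesian (x, y).
    t3 = Y / (np.sqrt(3) / 2)
    t2 = X - 0.5 * t3
    t1 = 1.0 - t2 - t3
    density = np.full(X.shape, np.nan)
    inside = (t1 > 1e-6) & (t2 > 1e-6) & (t3 > 1e-6)
    density[inside] = np.exp(
        dirichlet_log_pdf([t1[inside], t2[inside], t3[inside]], alpha)
    )
    ax.contourf(X, Y, density, levels=50, cmap="viridis")
    ax.set_title(f"alpha = {alpha}")
    ax.set_aspect("equal")
    ax.axis("off")

if __name__ == "__main__":
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for alpha, ax in zip([[1, 1, 1], [5, 5, 5], [2, 2, 8]], axes):
        plot_dirichlet(alpha, ax)
    plt.tight_layout()
    plt.show()
```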
Fig 1 :: Dirichlet density
Mixture Models
With the overview provided above, we can now move to mixture models for data clustering. This section will start with a finite mixture model and then move to an infinite mixture model to overcome the inherent difficulty of selecting the number of data partitions.
Finite Mixture Model
The model that will be utilized for the finite Bayesian mixture model is shown in the figure below. For each observed data vector $x_i$, we have a latent variable $z_i \in \lbrace 1, 2, \ldots, K \rbrace$ indicating which of the $K$ components $x_i$ belongs to. $\pi_k = P(z_i = k)$ is the prior probability that $x_i$ belongs to component $k$. Given $z_i = k$, $x_i$ is generated by the $k^{th}$ Gaussian mixture component with mean vector $\mu_k$ and covariance matrix $\Sigma_k$.
Fig 2 :: Finite Mixture Model
We can provide a prior for our mixing weights, $\pi$, using the Dirichlet distribution (i.e., $p(\pi \mid \alpha) = \text{Dir}(\pi \mid \alpha)$). We can also provide a prior for our mixture components using the Gaussian-inverse-Wishart distribution (i.e., $p(\mu_k, \Sigma_k \mid m_o, \kappa_o, \nu_o, S_o) = \text{GIW}(\mu_k, \Sigma_k \mid m_o, \kappa_o, \nu_o, S_o)$).
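To make the generative process concrete, the sketch below (illustrative only; the hyper-parameter values are arbitrary assumptions) samples a synthetic dataset from this finite Bayesian mixture model:

```python
# Illustrative sketch: draw a synthetic dataset from the finite Bayesian GMM.
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(2)
K, N, n = 3, 500, 2                               # components, samples, dimension
alpha = np.full(K, 1.0)                           # symmetric Dirichlet concentration
m_o, kappa_o, nu_o, S_o = np.zeros(n), 0.05, 5.0, np.eye(n)

# pi ~ Dir(alpha)
pi = rng.dirichlet(alpha)

# (mu_k, Sigma_k) ~ GIW(m_o, kappa_o, nu_o, S_o) for each component
Sigmas = [invwishart.rvs(df=nu_o, scale=S_o, random_state=rng) for _ in range(K)]
mus = [rng.multivariate_normal(m_o, Sigma / kappa_o) for Sigma in Sigmas]

# z_i ~ Cat(pi), then x_i ~ N(mu_{z_i}, Sigma_{z_i})
z = rng.choice(K, size=N, p=pi)
X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])

print(X.shape, np.bincount(z, minlength=K))
```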
Collapsed Gibbs Sampling
Because we selected $p(\pi \mid \alpha)$ and $p(\mu_k, \Sigma_k \mid \beta)$ (where we define the hyper-parameter $\beta = (m_o, \kappa_o, \nu_o, S_o)$) to be conjugate, we can analytically integrate out the parameters $\pi$, $\mu_k$, and $\Sigma_k$. This allows us to dramatically reduce our sampling space to the component assignments $z$. This is represented by,

$$p(z_i = k \mid z_{-i}, \mathcal{X}, \alpha, \beta) \propto p(z_i = k \mid z_{-i}, \alpha)\; p(x_i \mid \mathcal{X}_{-i,k}, \beta),$$

where, assuming a symmetric Dirichlet prior with concentration $\alpha / K$ on each component,

$$p(z_i = k \mid z_{-i}, \alpha) = \frac{N_{-i,k} + \alpha / K}{N + \alpha - 1}$$

($N_{-i,k}$ is the number of points, excluding $x_i$, currently assigned to component $k$), and

$$p(x_i \mid \mathcal{X}_{-i,k}, \beta)$$

is the posterior predictive of component $k$ computed from the points assigned to it (excluding $x_i$), which for the GIW prior is a multivariate Student-t distribution.
Pseudocode for collapsed Gibbs sampling for a finite GMM is given below.
Fig 3 :: Finite Mixture Algorithm
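To complement the pseudocode, here is a minimal Python sketch of collapsed Gibbs sampling for the finite GMM (an illustrative implementation of the standard algorithm, not the original code; it assumes the GIW posterior predictive is the multivariate Student-t described above):

```python
# Illustrative sketch: collapsed Gibbs sampling for a finite Bayesian GMM with a GIW prior.
import numpy as np
from scipy.special import gammaln

def student_t_logpdf(x, mu, Sigma, df):
    """Log density of a multivariate Student-t distribution."""
    d = len(x)
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)
    return (
        gammaln((df + d) / 2) - gammaln(df / 2)
        - 0.5 * d * np.log(df * np.pi)
        - 0.5 * np.linalg.slogdet(Sigma)[1]
        - 0.5 * (df + d) * np.log1p(maha / df)
    )

def log_predictive(x, X_k, m_o, kappa_o, nu_o, S_o):
    """GIW posterior predictive p(x | X_k), a multivariate Student-t."""
    d = len(x)
    N_k = len(X_k)
    if N_k == 0:
        m_N, kappa_N, nu_N, S_N = m_o, kappa_o, nu_o, S_o
    else:
        X_k = np.asarray(X_k)
        x_bar = X_k.mean(axis=0)
        scatter = (X_k - x_bar).T @ (X_k - x_bar)
        kappa_N = kappa_o + N_k
        nu_N = nu_o + N_k
        m_N = (kappa_o * m_o + N_k * x_bar) / kappa_N
        diff = (x_bar - m_o).reshape(-1, 1)
        S_N = S_o + scatter + (kappa_o * N_k / kappa_N) * (diff @ diff.T)
    df = nu_N - d + 1
    scale = S_N * (kappa_N + 1) / (kappa_N * df)
    return student_t_logpdf(x, m_N, scale, df)

def collapsed_gibbs_finite(X, K, alpha, m_o, kappa_o, nu_o, S_o, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    z = rng.integers(K, size=N)                        # random initial assignments
    for _ in range(n_iter):
        for i in range(N):
            log_p = np.empty(K)
            for k in range(K):
                # Points currently in component k, excluding x_i.
                members = X[(z == k) & (np.arange(N) != i)]
                prior = np.log(len(members) + alpha / K) - np.log(N + alpha - 1)
                log_p[k] = prior + log_predictive(X[i], members, m_o, kappa_o, nu_o, S_o)
            p = np.exp(log_p - log_p.max())
            z[i] = rng.choice(K, p=p / p.sum())
    return z

# Example usage on toy two-cluster data.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([-3, 0], 1, size=(50, 2)), rng.normal([3, 0], 1, size=(50, 2))])
z = collapsed_gibbs_finite(X, K=2, alpha=1.0, m_o=np.zeros(2),
                           kappa_o=0.1, nu_o=4.0, S_o=np.eye(2), n_iter=20)
print(np.bincount(z, minlength=2))
```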
Infinite Mixture Model
Chinese Restaurant Process
Before we dive into the details of the infinite mixture model, we will first describe the Dirichlet process through the Chinese restaurant process (CRP). This construct is commonly used in statistics. It is described as customers seating themselves in a restaurant with an infinite number of tables. This process has a rich-get-richer property; that is, tables with more people have a higher probability of attracting more people. The rich-get-richer property can be more formally defined as

$$P(z_i = k \mid z_{1:i-1}) = \frac{N_k}{i - 1 + \alpha} \quad \text{(existing table } k\text{)}, \qquad P(z_i = k^{new} \mid z_{1:i-1}) = \frac{\alpha}{i - 1 + \alpha} \quad \text{(new table)},$$

where $N_k$ is the number of customers already seated at table $k$ and $\alpha$ is the concentration parameter.
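A short simulation makes the rich-get-richer behavior easy to see. The sketch below (illustrative only) draws table assignments for N customers from a CRP with concentration alpha:

```python
# Illustrative sketch: simulate table assignments from a Chinese restaurant process.
import numpy as np

def sample_crp(N, alpha, seed=0):
    """Return table assignments and table sizes for N customers under CRP(alpha)."""
    rng = np.random.default_rng(seed)
    assignments = []
    counts = []                                   # number of customers at each table
    for i in range(N):
        # Existing tables are chosen proportional to their counts; a new table
        # is chosen proportional to alpha.
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()                      # denominator is i + alpha
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)                      # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts

assignments, counts = sample_crp(N=100, alpha=1.0)
print("number of tables:", len(counts), "table sizes:", counts)
```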
Now, we will move to the infinite mixture model. This model is closely related to the finite mixture model, with the exception that the Dirichlet process is now used to define the mixing-weight priors. This allows us to circumvent the issue of defining the number of partitions by allowing the model to select from an infinite number of partitions.
Fig 4 :: Infinite Mixture Model
Collapsed Gibbs Sampling
Again, because we selected $p(\pi \mid \alpha)$ and $p(\mu_k, \Sigma_k \mid \beta)$ (where we define the hyper-parameter $\beta = (m_o, \kappa_o, \nu_o, S_o)$) to be conjugate, we can analytically integrate out the parameters $\pi$, $\mu_k$, and $\Sigma_k$. This is represented by,

$$p(z_i = k \mid z_{-i}, \mathcal{X}, \alpha, \beta) \propto p(z_i = k \mid z_{-i}, \alpha)\; p(x_i \mid \mathcal{X}_{-i,k}, \beta),$$

where, for a currently occupied component $k$,

$$p(z_i = k \mid z_{-i}, \alpha) = \frac{N_{-i,k}}{N + \alpha - 1},$$

and, for a new component,

$$p(z_i = k^{new} \mid z_{-i}, \alpha) = \frac{\alpha}{N + \alpha - 1},$$

in which case the predictive term $p(x_i \mid \beta)$ is computed from the GIW prior alone (since no data has yet been assigned to the new component).
Pseudocode for collapsed Gibbs sampling for an infinite GMM is given below.
Fig 5 :: Infinite Mixture Algorithm
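The only change relative to the finite sampler is the assignment step: existing components are weighted by their counts, and one extra "new component" option is weighted by alpha and scored under the prior predictive. A minimal sketch of that step is shown below (illustrative only; the `log_predictive` callable stands in for the GIW Student-t predictive from the finite-GMM sketch above, and the demo uses a fixed Gaussian as a placeholder just to show the mechanics):

```python
# Illustrative sketch: the per-point assignment step of collapsed Gibbs for the infinite GMM.
import numpy as np
from scipy.stats import multivariate_normal

def sample_assignment(i, X, z, alpha, log_predictive, rng):
    """Resample z[i] given all other assignments; may open a new component."""
    N = X.shape[0]
    others = np.delete(np.arange(N), i)
    labels = np.unique(z[others])                       # currently occupied components
    log_p = []
    for k in labels:
        members = X[others][z[others] == k]
        log_prior = np.log(len(members)) - np.log(N + alpha - 1)   # N_{-i,k}/(N + alpha - 1)
        log_p.append(log_prior + log_predictive(X[i], members))
    # New component: weighted by alpha and scored with no members assigned (prior predictive).
    log_p.append(np.log(alpha) - np.log(N + alpha - 1)
                 + log_predictive(X[i], X[:0]))
    log_p = np.array(log_p)
    p = np.exp(log_p - log_p.max())
    choice = rng.choice(len(p), p=p / p.sum())
    return int(labels[choice]) if choice < len(labels) else int(z.max()) + 1

# Demo with a placeholder predictive (a fixed standard Gaussian).
rng = np.random.default_rng(4)
X = rng.normal(size=(20, 2))
z = np.zeros(20, dtype=int)
dummy_predictive = lambda x, members: multivariate_normal.logpdf(x, mean=np.zeros(2))
z[5] = sample_assignment(5, X, z, alpha=1.0, log_predictive=dummy_predictive, rng=rng)
print(z[5])
```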