Ref: Song, Mingzhou, and Hongbin Wang. “Highly efficient incremental estimation of Gaussian mixture models for online data stream clustering.”

Implementation

To make our robust optimization scheme incremental, we need a clustering algorithm that is efficient in both computation and storage. Ideally, we only have to keep a small batch of the most recent observations in memory.

As an initial step to implement our incremental clustering approach, we need the ability to test the equivalence of mean vectors and covariance matrices of our current batch of observations against our prior mixture model. One method for testing this is presented below.

Covariance Test

To begin, let's assume that we have a set of observations $\{ x_i \}_{i=1}^{n} \subset \mathbf{R}^d$, and we want to check whether this set has the same covariance as a hypothesized covariance matrix (i.e., we want to see if $\Sigma_x = \Sigma_0$, where $\Sigma_x = \text{cov}(x)$ and $\Sigma_0$ is our hypothesis).

To do this, we must first transform our original data set using the Cholesky factor of our hypothesis covariance (the covariance test is formulated against the identity covariance, so the data have to be whitened first), as shown below.

$\{ y_i \} = \{L_0^{-1} x_i \}, \ i=1,\cdots, n \quad \text{where} \quad \Sigma_0 = L_0 L_0^T$

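Using the transformed data set, we can construct the $W$-statistic shown below, where $S_y$ is the sample covariance of $\{ y_i \}$ and $I$ is the $d \times d$ identity matrix. Under the null hypothesis, $nWd/2$ is known to have an asymptotic $\chi^2$ distribution with $d(d+1)/2$ degrees of freedom.

$\frac{nWd}{2} \sim \chi^2_{d(d+1)/2} \qquad \text{where} \qquad W = \frac{1}{d} \text{Tr}\left[(S_y - I)^2\right] - \frac{d}{n}\left[\frac{1}{d}\text{Tr}(S_y)\right]^2 + \frac{d}{n}$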
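The whitening step and the $W$-statistic can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration rather than the implementation used for this post: the function name `covariance_test` is hypothetical, and it assumes that $S_y$ is the usual unbiased sample covariance (normalized by $n-1$, which is what `np.cov` computes by default).

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.stats import chi2


def covariance_test(x, sigma0):
    """Return (W, p_value) for the null hypothesis cov(x) == sigma0.

    x      : (n, d) array, one observation per row
    sigma0 : (d, d) symmetric positive-definite hypothesis covariance
    """
    n, d = x.shape
    # Lower-triangular Cholesky factor, sigma0 = L0 @ L0.T
    L0 = cholesky(sigma0, lower=True)
    # Whiten: y_i = L0^{-1} x_i, solved without forming the inverse
    y = solve_triangular(L0, x.T, lower=True).T
    # Sample covariance of the transformed data (assumption: n - 1 normalization)
    S_y = np.atleast_2d(np.cov(y, rowvar=False))
    diff = S_y - np.eye(d)
    W = (np.trace(diff @ diff) / d
         - (d / n) * (np.trace(S_y) / d) ** 2
         + d / n)
    # n * W * d / 2 is compared against chi^2 with d(d+1)/2 degrees of freedom
    p_value = chi2.sf(n * W * d / 2.0, df=d * (d + 1) // 2)
    return W, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sigma0 = np.array([[2.0, 0.5], [0.5, 1.0]])  # made-up hypothesis covariance
    batch = rng.multivariate_normal(np.zeros(2), sigma0, size=1000)
    print(covariance_test(batch, sigma0))        # batch drawn under the null
    print(covariance_test(batch, 4.0 * sigma0))  # deliberately wrong hypothesis
```

A small p-value is evidence against $\Sigma_x = \Sigma_0$. Using a triangular solve instead of an explicit inverse of $L_0$ is a numerical-stability choice, not something the test itself requires.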

Mean Test

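To test the equivalence of mean vectors, we can construct Hotelling's $T^2$ statistic shown below, where $\bar{x}$ is the sample mean, $S$ is the sample covariance of $\{ x_i \}$, and $\mu_0$ is the hypothesized mean. For Gaussian data, the scaled statistic follows an $F$ distribution with $d$ and $n-d$ degrees of freedom under the null hypothesis.

$\frac{n-d}{d(n-1)}T^2 \sim F_{d,n-d} \qquad \text{where} \qquad T^2 = n(\bar{x} - \mu_0)^T S^{-1} (\bar{x} - \mu_0)$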
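The mean test translates to code just as directly. The sketch below is illustrative only: the function name `mean_test` is hypothetical, $S$ is assumed to be the unbiased sample covariance, and the batch is assumed to contain more observations than dimensions so that $S$ is invertible.

```python
import numpy as np
from scipy.stats import f


def mean_test(x, mu0):
    """Return (T2, p_value) for the null hypothesis mean(x) == mu0.

    x   : (n, d) array, one observation per row
    mu0 : (d,) hypothesized mean vector
    """
    n, d = x.shape
    if n <= d:
        raise ValueError("need more observations than dimensions (n > d)")
    x_bar = x.mean(axis=0)
    # Unbiased sample covariance (assumption: n - 1 normalization)
    S = np.atleast_2d(np.cov(x, rowvar=False))
    delta = x_bar - mu0
    # Solve S z = delta rather than inverting S explicitly
    T2 = n * delta @ np.linalg.solve(S, delta)
    # Scale T^2 so it can be compared against F with (d, n - d) degrees of freedom
    stat = (n - d) / (d * (n - 1)) * T2
    p_value = f.sf(stat, d, n - d)
    return T2, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu0 = np.array([1.0, -2.0, 0.5])  # made-up hypothesis mean
    batch = rng.normal(loc=mu0, scale=1.0, size=(500, 3))
    print(mean_test(batch, mu0))        # batch drawn under the null
    print(mean_test(batch, mu0 + 1.0))  # deliberately wrong hypothesis
```

As with the covariance test, a small p-value is evidence against $\bar{x}$ and $\mu_0$ describing the same component.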

Test

As a simple test to validate our implementation, we used a simulated data set composed of several Gaussian components. The initial test is presented in the video below. From this simple test, we can see that we are able to distinguish when components in our streaming mixture model match components in our global mixture model.

Testing mean and covariance equivalence on a simulated data set.
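A rough, self-contained version of this kind of check is sketched below. It is not the simulation shown in the video: the two-component mixture, the helper name `matches`, and the significance level `alpha` are all made up for illustration. A batch is declared to match a component only if neither the mean test nor the covariance test rejects.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.stats import chi2, f


def matches(x, mu0, sigma0, alpha=0.01):
    """True if neither the mean test nor the covariance test rejects at level alpha."""
    n, d = x.shape
    # Mean test: scaled Hotelling T^2 against F(d, n - d)
    S = np.atleast_2d(np.cov(x, rowvar=False))
    delta = x.mean(axis=0) - mu0
    T2 = n * delta @ np.linalg.solve(S, delta)
    p_mean = f.sf((n - d) / (d * (n - 1)) * T2, d, n - d)
    # Covariance test: W-statistic on the whitened batch against chi^2
    L0 = cholesky(sigma0, lower=True)
    y = solve_triangular(L0, x.T, lower=True).T
    S_y = np.atleast_2d(np.cov(y, rowvar=False))
    diff = S_y - np.eye(d)
    W = np.trace(diff @ diff) / d - (d / n) * (np.trace(S_y) / d) ** 2 + d / n
    p_cov = chi2.sf(n * W * d / 2.0, d * (d + 1) // 2)
    return bool(p_mean > alpha and p_cov > alpha)


rng = np.random.default_rng(0)
# Hypothetical "global" mixture with two components (values are made up)
components = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])),
    (np.array([5.0, 5.0]), np.array([[2.0, -0.5], [-0.5, 1.0]])),
]
# Simulated batch of recent observations drawn from the second component
batch = rng.multivariate_normal(*components[1], size=500)
for k, (mu, sigma) in enumerate(components):
    print(f"component {k}: match = {matches(batch, mu, sigma)}")
```

Because both tests are run on every comparison, the chance of falsely rejecting a true match is somewhat higher than `alpha`; a stricter level or a multiple-testing correction may be appropriate when many components are compared.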
