Initial Testing

As one method to robustly optimize when confronted with erroneous data, we are testing clustering algorithms to learn the true residual distribution, which should allow us to properly de-weight faulty observables. All of the code used for this initial testing is housed HERE.

Data Generation

To begin testing the Gaussian mixture model with a Dirichlet process prior, where the outliers form their own cluster, a simple 2D data-set was generated. This data-set can be seen in Figure 1.


Fig 1 :: Generated data-set
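A minimal sketch of how such a data-set could be generated: a compact inlier Gaussian plus a much broader outlier Gaussian. The means, covariances, and seed below are illustrative assumptions, not the values used for Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inliers, n_outliers = 100, 100

# Inliers: compact Gaussian blob (assumed parameters)
inliers = rng.multivariate_normal(mean=[0.0, 0.0],
                                  cov=[[0.5, 0.0], [0.0, 0.5]],
                                  size=n_inliers)

# Outliers: much broader Gaussian, so faulty points scatter widely
outliers = rng.multivariate_normal(mean=[0.0, 0.0],
                                   cov=[[25.0, 0.0], [0.0, 25.0]],
                                   size=n_outliers)

data = np.vstack([inliers, outliers])
labels = np.concatenate([np.zeros(n_inliers), np.ones(n_outliers)])
print(data.shape)  # (200, 2)
```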


Testing

For an initial test, both the inlier and outlier distributions were sampled evenly (i.e., 100 data points were drawn from each distribution). The clustering results are shown below in Figure 2. What is interesting to note is that the number of clusters was not specified; nevertheless, the iterative algorithm correctly classified both distributions without adding additional partitions to the data-set.


Fig 2 :: DP GMM Initial Test
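For reference, scikit-learn's `BayesianGaussianMixture` exposes the same non-parametric behavior, although it uses variational inference rather than the collapsed Gibbs sampler tested here: the number of components is only an upper bound, and unused components receive near-zero weight. The synthetic data and the 0.05 weight threshold below are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0, 0], np.eye(2) * 0.5, 100),   # inliers
    rng.multivariate_normal([0, 0], np.eye(2) * 25.0, 100),  # outliers
])

dpgmm = BayesianGaussianMixture(
    n_components=10,  # upper bound only; extra components are pruned
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
).fit(X)

# Components with non-negligible weight are the "discovered" clusters
active = int(np.sum(dpgmm.weights_ > 0.05))
print(active)
```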


We can also look at the run-time of the collapsed Gibbs sampler. For the case where our data-set is composed of 200 data points, the result is shown below in Figure 3. This shows that the collapsed Gibbs sampling is fairly consistent, with respect to time, across all iterations.


Fig 3 :: Collapsed Gibbs Sampling Initial Time Test
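The per-iteration timing can be collected with a simple wall-clock loop around the sampler. The `gibbs_step` function below is a hypothetical stand-in for one sweep of the collapsed Gibbs sampler, not the actual implementation.

```python
import time

def gibbs_step(assignments):
    # Placeholder for one collapsed-Gibbs sweep over all points
    # (the real sweep would resample each point's cluster assignment)
    return assignments

assignments = [0] * 200  # 200-point data-set, as in the test above
iter_times = []
for _ in range(50):
    t0 = time.perf_counter()
    assignments = gibbs_step(assignments)
    iter_times.append(time.perf_counter() - t0)

mean_time = sum(iter_times) / len(iter_times)
print(f"mean iteration time: {mean_time:.6f} s")
```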


However, we are not necessarily interested in the run-time of this implementation, because it will need to be re-written later. Something more beneficial to look at is the mean iteration time as the size of the data-set grows. This is depicted in Figure 4, where a clear linear trend is shown between the data-set size and the iteration time.


Fig 4 :: Mean Iteration Time of Gibbs Sampling
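The linear trend is what one would expect: in a collapsed Gibbs sweep, each of the N points is reassigned by scoring it against the K active clusters, so one iteration costs O(N * K). With K small and roughly constant, iteration time grows linearly in N. The sweep below is a hypothetical stand-in that mimics that cost structure, not the actual sampler.

```python
import time

def sweep(n_points, n_clusters=2):
    # O(N * K) work: score every point against every cluster
    total = 0.0
    for i in range(n_points):
        for k in range(n_clusters):
            total += (i * 0.001 - k) ** 2  # dummy "log-likelihood" term
    return total

# Mean iteration time for growing data-set sizes
for n in (200, 400, 800):
    t0 = time.perf_counter()
    for _ in range(5):
        sweep(n)
    print(n, (time.perf_counter() - t0) / 5)
```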