Initial Testing

As one method to robustly optimize when confronted with erroneous data, we are testing clustering algorithms to learn the true residual distribution, which should allow us to properly de-weight faulty observables. All of the code used for this initial testing is housed HERE.

Data Generation

To begin testing the Gaussian mixture model with a Dirichlet process prior, where the outliers form their own cluster, a simple 2D data-set was generated. This data-set can be seen in Figure 1.


Fig 1 :: Generated data-set
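A minimal sketch of how such a data-set could be generated: a compact inlier Gaussian plus a much broader outlier Gaussian. The means, covariances, and seed below are illustrative assumptions, not the values used for Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inliers, n_outliers = 100, 100

# Inliers: compact Gaussian blob (assumed parameters)
inliers = rng.multivariate_normal(mean=[0.0, 0.0],
                                  cov=[[0.5, 0.0], [0.0, 0.5]],
                                  size=n_inliers)

# Outliers: much broader Gaussian, so faulty points scatter widely
outliers = rng.multivariate_normal(mean=[0.0, 0.0],
                                   cov=[[25.0, 0.0], [0.0, 25.0]],
                                   size=n_outliers)

data = np.vstack([inliers, outliers])
labels = np.concatenate([np.zeros(n_inliers), np.ones(n_outliers)])
print(data.shape)  # (200, 2)
```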


Testing

For an initial test, both the inlier and outlier distributions were sampled evenly (i.e., 100 data points were drawn from each distribution). The clustering results are shown below in Figure 2. What is interesting to note is that the number of clusters was not specified; nevertheless, the iterative algorithm correctly classified both distributions without adding additional partitions to the data-set.


Fig 2 :: DP GMM Initial Test
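For reference, scikit-learn's `BayesianGaussianMixture` exposes the same non-parametric behavior, although it uses variational inference rather than the collapsed Gibbs sampler tested here: the number of components is only an upper bound, and unused components receive near-zero weight. The synthetic data and the 0.05 weight threshold below are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0, 0], np.eye(2) * 0.5, 100),   # inliers
    rng.multivariate_normal([0, 0], np.eye(2) * 25.0, 100),  # outliers
])

dpgmm = BayesianGaussianMixture(
    n_components=10,  # upper bound only; extra components are pruned
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
).fit(X)

# Components with non-negligible weight are the "discovered" clusters
active = int(np.sum(dpgmm.weights_ > 0.05))
print(active)
```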


We can also look at the run-time of the collapsed Gibbs sampler. For the case where our data-set is composed of 200 data points, the result is shown below in Figure 3. This shows that the collapsed Gibbs sampling is fairly consistent, with respect to time, across all iterations.


Fig 3 :: Collapsed Gibbs Sampling Initial Time Test
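The per-iteration timing can be collected with a simple wall-clock loop around the sampler. The `gibbs_step` function below is a hypothetical stand-in for one sweep of the collapsed Gibbs sampler, not the actual implementation.

```python
import time

def gibbs_step(assignments):
    # Placeholder for one collapsed-Gibbs sweep over all points
    # (the real sweep would resample each point's cluster assignment)
    return assignments

assignments = [0] * 200  # 200-point data-set, as in the test above
iter_times = []
for _ in range(50):
    t0 = time.perf_counter()
    assignments = gibbs_step(assignments)
    iter_times.append(time.perf_counter() - t0)

mean_time = sum(iter_times) / len(iter_times)
print(f"mean iteration time: {mean_time:.6f} s")
```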


However, we are not necessarily interested in the run-time of this implementation, because it will need to be re-written later. Something more beneficial to look at is the mean iteration time as the size of the data-set grows. This is depicted in Figure 4, where a clear linear trend is shown between the data-set size and the iteration time.


Fig 4 :: Mean Iteration Time of Gibbs Sampling
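The linear trend is what one would expect: in a collapsed Gibbs sweep, each of the N points is reassigned by scoring it against the K active clusters, so one iteration costs O(N * K). With K small and roughly constant, iteration time grows linearly in N. The sweep below is a hypothetical stand-in that mimics that cost structure, not the actual sampler.

```python
import time

def sweep(n_points, n_clusters=2):
    # O(N * K) work: score every point against every cluster
    total = 0.0
    for i in range(n_points):
        for k in range(n_clusters):
            total += (i * 0.001 - k) ** 2  # dummy "log-likelihood" term
    return total

# Mean iteration time for growing data-set sizes
for n in (200, 400, 800):
    t0 = time.perf_counter()
    for _ in range(5):
        sweep(n)
    print(n, (time.perf_counter() - t0) / 5)
```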