References:

1) Alvarez, Ignacio, Jarad Niemi, and Matt Simpson. “Bayesian inference for a covariance matrix.” arXiv preprint arXiv:1408.4050 (2014).

Problem setting

When utilizing Bayesian inference, a core interest is the calculation of the posterior distribution. This calculation is generally made difficult, if not intractable, by the requirement to compute the marginal likelihood, i.e., the denominator of Bayes' theorem:

$$ p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta)\, p(\theta)\, d\theta}. $$

To make this calculation tractable (in fact, analytical), we need to select our priors intelligently.

The Inverse Wishart Prior

The inverse Wishart (IW) density is defined as

$$ p(\Sigma \mid \nu, \Lambda) \propto |\Sigma|^{-(\nu + d + 1)/2} \exp\!\left( -\tfrac{1}{2} \operatorname{tr}\!\left( \Lambda \Sigma^{-1} \right) \right), $$

where $\Lambda$ is a positive-definite $d$-dimensional scale matrix and $\nu$ is the degrees of freedom. Under this prior, each variance term marginally follows a scaled inverse chi-square distribution. The distribution can easily be sampled using the Bartlett decomposition, as implemented below.


    """ Bartlett decomposition based sampling """
    def wishartrand(self):
        dim = self.inv_psi.shape[0]
        chol = np.linalg.cholesky(self.inv_psi)
        foo = np.zeros((dim, dim))

        for i in range(dim):
            for j in range(i + 1):
                if i == j:
                    foo[i, j] = np.sqrt(chi2.rvs(self.nu - (i + 1) + 1))
                else:
                    foo[i, j] = np.random.normal(0, 1)
        return np.dot(chol, np.dot(foo, np.dot(foo.T, chol.T)))


Utilizing the implementation presented above, we can sample from the prior distribution. For this evaluation, we sampled the inverse Wishart prior 10,000 times with hyperparameters set as $\nu = d+1$ and $\Lambda = I$. The figure shows the first and second variance terms, with the color corresponding to the magnitude of correlation between the two terms. From this figure, we can see a clear dependence between the variance and correlation terms.
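This experiment can also be sketched directly with SciPy's built-in inverse Wishart sampler, equivalent to the Bartlett-based approach above (here `d = 2` and the variable names are illustrative, not from the original implementation):

```python
import numpy as np
from scipy.stats import invwishart

d = 2                        # illustrative dimension
nu, Lam = d + 1, np.eye(d)   # hyperparameters: nu = d + 1, Lambda = I
draws = invwishart.rvs(df=nu, scale=Lam, size=10_000)  # shape (10000, d, d)

# First and second variance terms, and the correlation between them
var1, var2 = draws[:, 0, 0], draws[:, 1, 1]
corr = draws[:, 0, 1] / np.sqrt(var1 * var2)
```

Plotting `var1` against `var2` colored by `|corr|` reproduces the dependency shown in Fig 1: draws with large variances also tend to carry large correlation magnitudes.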


Fig 1 :: Sampling from Inverse Wishart Prior


The Scaled Inverse Wishart Prior

One possible covariance prior that may allow us to overcome the dependency between the variance and correlation terms is the scaled inverse Wishart (SIW). This prior has the added benefit of allowing prior information about the individual standard deviation components to be incorporated.

We can define the scaled inverse Wishart as $\Sigma := \Delta Q \Delta$, where $\Delta$ is diagonal with $\Delta_{ii} = \delta_i$. The priors on $Q$ and $\Delta$ are

$$ Q \sim \mathrm{IW}(\nu, \Lambda), \qquad \log(\delta_i) \sim \mathrm{N}(b_i, \zeta_i^2). $$

The scaled inverse Wishart can be sampled using the code presented below.


    """ Scaled Inverse Wishart """
    def siw(self):
      Q = invwishart.rvs(self.dim, self.scale)
      D = norm.rvs(self.mean, self.var, self.dim)

      siw = np.zeros(shape=(self.dim, self.dim))
      for j in range(0, self.dim):
          for k in range(0, self.dim):
              if j is k:
                  siw[j, k] = np.exp(D[j]) * np.sqrt(Q[j, j])
              else:
                  siw[j, k] = D[j] * D[k] * Q[j, k]
      return siw


Again, utilizing the implementation presented above, we can sample from the prior distribution. For this evaluation, we sampled the scaled inverse Wishart prior 10,000 times with hyperparameters set as $\nu = d+1$, $\Lambda = 0.8I$, $b = 0$, and $\zeta = 1$. The figure shows the first and second variance terms, with the color corresponding to the magnitude of correlation between the two terms. From this figure, we can see markedly less correlation between the variance and correlation terms than under the inverse Wishart.
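The same evaluation can be sketched in vectorized form with SciPy (hyperparameter values taken from the text; the variable names are illustrative):

```python
import numpy as np
from scipy.stats import invwishart, norm

N, d = 10_000, 2
nu, Lam = d + 1, 0.8 * np.eye(d)   # nu = d + 1, Lambda = 0.8 I
b, zeta = 0.0, 1.0                 # log-normal location and scale

Q = invwishart.rvs(df=nu, scale=Lam, size=N)        # shape (N, d, d)
delta = np.exp(norm.rvs(b, zeta, size=(N, d)))      # delta_i = exp(N(b, zeta^2))
Sigma = delta[:, :, None] * Q * delta[:, None, :]   # Sigma = Delta Q Delta
```

The broadcasting in the last line applies the diagonal scaling on both sides of each draw of `Q` at once, avoiding the per-element loop.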


Fig 2 :: Sampling from Scaled Inverse Wishart Prior


Hierarchical Half-t Prior

Finally, we can test the hierarchical half-t prior, whose density we define through the hierarchy

$$ \Sigma \mid \Lambda \sim \mathrm{IW}\!\left(\nu + d - 1,\ 2\nu\Lambda\right), $$

where $\Lambda$ is a diagonal matrix with $\Lambda_{ii} = \lambda_i$ such that

$$ \lambda_i \sim \mathrm{Ga}\!\left(\tfrac{1}{2},\ \zeta_i^2\right). $$


    """ Hierarchical Half-t Prior """
    def hiw(self):
      n = self.dim
      d = self.dof
      g = self.scale

      Lambda = np.diag(gamma.rvs(1.0 / 2.0, loc=0, scale=g, size=n))
      return invwishart.rvs(n + d - 1, 2 * d * Lambda)


Utilizing the implementation presented above, we can sample from the prior distribution. For this evaluation, we sampled the hierarchical half-t prior 10,000 times with hyperparameters set as $\nu = 2$ and $\zeta = 1.04$. The figure shows the first and second variance terms, with the color corresponding to the magnitude of correlation between the two terms.
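A sketch of this evaluation with SciPy is shown below (hyperparameter values from the text; taking the Gamma scale to be $\zeta^2$ is an assumption about the parametrization, and a loop is used because the inverse Wishart scale matrix changes with every draw):

```python
import numpy as np
from scipy.stats import gamma, invwishart

N, d = 10_000, 2
nu, zeta = 2.0, 1.04

Sigma = np.empty((N, d, d))
for i in range(N):
    # lambda_i ~ Ga(1/2, zeta^2); assumed shape-scale parametrization
    lam = gamma.rvs(0.5, scale=zeta**2, size=d)
    # Sigma | lambda ~ IW(nu + d - 1, 2 * nu * diag(lambda))
    Sigma[i] = invwishart.rvs(df=nu + d - 1, scale=2 * nu * np.diag(lam))
```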


Fig 3 :: Sampling from Hierarchical Half-t Prior


To Do:

  • Need to test the prior models in a collapsed Gibbs sampling implementation to see their effect.
  • Need to test separation strategy methods.