References:

1) Alvarez, Ignacio, Jarad Niemi, and Matt Simpson. “Bayesian inference for a covariance matrix.” arXiv preprint arXiv:1408.4050 (2014).

Problem setting

In Bayesian inference, the central task is computing the posterior distribution. This is generally difficult, if not intractable, because it requires the marginal likelihood (i.e., the denominator of Bayes' theorem, shown below).

$p(\theta | x) = \frac{p(x|\theta)p(\theta)}{p(x)} \quad \text{where} \quad p(x) = \int p(x | \theta)p(\theta) d\theta$

To make this calculation tractable (in fact, analytical), we can use conjugate priors.

Definition: Let $\mathcal{A}$ be a family of distributions for the likelihood $p(x|\theta)$ and $\mathcal{B}$ a family of prior distributions for $p(\theta)$. Then $\mathcal{B}$ is conjugate to $\mathcal{A}$ if the posterior stays in the prior family, i.e., $p(\theta|x) \propto p(x|\theta)p(\theta) \in \mathcal{B} \quad \forall \quad p(x|\theta) \in \mathcal{A}, p(\theta) \in \mathcal{B}$.
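A standard instance of conjugacy is a Beta prior with a Bernoulli likelihood, where the posterior is again a Beta distribution. The sketch below is only an illustration and is not taken from the reference. The function names and the example counts are hypothetical. It compares the closed-form update with a brute-force normalization of $p(x|\theta)p(\theta)$ on a grid.

```python
import math


def beta_posterior(a, b, successes, failures):
    """Conjugate update: Beta(a, b) prior + Bernoulli data -> Beta posterior."""
    return a + successes, b + failures


def beta_pdf(t, a, b):
    """Density of Beta(a, b) at t in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))


def grid_posterior(t, a, b, successes, failures, n_grid=20000):
    """Posterior density at t, with p(x) approximated by the midpoint rule."""
    def unnorm(u):
        # likelihood * prior, i.e. the numerator of Bayes' theorem
        return u ** successes * (1 - u) ** failures * beta_pdf(u, a, b)

    h = 1.0 / n_grid
    evidence = sum(unnorm((k + 0.5) * h) for k in range(n_grid)) * h
    return unnorm(t) / evidence


if __name__ == "__main__":
    a_post, b_post = beta_posterior(2, 3, successes=7, failures=3)
    print("closed form:", beta_pdf(0.6, a_post, b_post))
    print("numerical  :", grid_posterior(0.6, 2, 3, 7, 3))
```

The two printed values should agree closely. The closed-form route never evaluates the integral for $p(x)$, which is the practical benefit of conjugacy.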

Conjugate Priors for P.D. Matrices

For this work, we are interested in estimating covariance matrices. Below, we review several conjugate priors for positive definite (P.D.) matrices.

The Inverse Wishart Prior

The inverse Wishart (IW) density is defined as

$\Sigma \sim \mathcal{W}^{-1}(\nu, \Lambda) \quad \text{if} \quad p(\Sigma) \propto |\Sigma|^{-(\nu+d+1)/2} e^{-\frac{1}{2}tr(\Lambda \Sigma^{-1})},$

where $\Lambda$ is a $d \times d$ P.D. matrix and $\nu$ is the degrees of freedom. This prior implies that each variance term follows a scaled inverse chi-square distribution.
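As a rough illustration of the IW density, the NumPy sketch below evaluates the unnormalized log density and draws samples by inverting Wishart draws. It assumes an integer $\nu \geq d$, so that a Wishart draw can be built from $\nu$ Gaussian vectors. The function names are hypothetical, and this is not the sampler used in the reference.

```python
import numpy as np


def iw_log_kernel(Sigma, nu, Lambda):
    """Unnormalized log density: -(nu+d+1)/2 * log|Sigma| - tr(Lambda Sigma^{-1})/2."""
    d = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    # solve(Sigma, Lambda) = Sigma^{-1} Lambda, which has the same trace
    # as Lambda Sigma^{-1}.
    trace_term = np.trace(np.linalg.solve(Sigma, Lambda))
    return -0.5 * (nu + d + 1) * logdet - 0.5 * trace_term


def sample_inverse_wishart(nu, Lambda, size, rng):
    """Draw `size` matrices from IW(nu, Lambda); assumes integer nu >= d."""
    d = Lambda.shape[0]
    # Rows of Z are N(0, Lambda^{-1}), so Z^T Z ~ Wishart(nu, Lambda^{-1}).
    L = np.linalg.cholesky(np.linalg.inv(Lambda))
    Z = rng.standard_normal((size, nu, d)) @ L.T
    W = np.swapaxes(Z, 1, 2) @ Z
    # The inverse of a Wishart(nu, Lambda^{-1}) draw is an IW(nu, Lambda) draw.
    return np.linalg.inv(W)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Lambda = np.array([[2.0, 0.5], [0.5, 1.0]])
    draws = sample_inverse_wishart(10, Lambda, 20000, rng)
    print("Monte Carlo mean:\n", draws.mean(axis=0))
    print("Lambda / (nu - d - 1):\n", Lambda / (10 - 2 - 1))
```

For $\nu > d + 1$ the IW mean is $\Lambda / (\nu - d - 1)$, so the two printed matrices should be close. This gives a quick sanity check on the sampler.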

This prior has two major issues:

Uncertainty for all variance parameters is linked by a single hyperparameter ($\nu$) – (i.e., there is no way to include prior information on individual variance components)
Dependency between variance and correlation terms – (i.e., large variances tend to come with correlations near $\pm 1$, while small variances tend to come with correlations near zero)
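The second issue can be checked by simulation. The sketch below assumes $d = 2$, $\nu = d + 1$, and $\Lambda = I$; these values are chosen only for illustration. It draws from the IW prior and compares the average absolute correlation among draws with small and large first variance.

```python
import numpy as np

rng = np.random.default_rng(0)
nu, d, n = 3, 2, 20000  # illustrative choices, not from the reference

# With Lambda = I, inverting Wishart(nu, I) draws gives IW(nu, I) draws.
Z = rng.standard_normal((n, nu, d))
Sigma = np.linalg.inv(np.swapaxes(Z, 1, 2) @ Z)

var1 = Sigma[:, 0, 0]
rho = Sigma[:, 0, 1] / np.sqrt(Sigma[:, 0, 0] * Sigma[:, 1, 1])

# Compare |rho| in the bottom and top quartiles of the first variance.
lo, hi = np.quantile(var1, [0.25, 0.75])
abs_rho_small_var = np.abs(rho[var1 <= lo]).mean()
abs_rho_large_var = np.abs(rho[var1 >= hi]).mean()

print("mean |rho|, small variance:", abs_rho_small_var)
print("mean |rho|, large variance:", abs_rho_large_var)
```

The large-variance group should show a noticeably higher average $|\rho|$. The prior therefore couples the two quantities even before any data are observed.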

The Scaled Inverse Wishart Prior

For the scaled inverse Wishart (SIW) prior, we define the covariance as $\Sigma := \Delta Q \Delta$, where $\Delta$ is a diagonal matrix with $\Delta_{ii} = \delta_i$. The densities for $Q$ and $\delta_i$ are given below.

$Q \sim \mathcal{W}^{-1}(\nu, \Lambda) \quad \text{and} \quad \log(\delta_i) \sim \mathcal{N}(b_i, \zeta_i^2)$

This prior is recommended over the IW because prior information about the individual standard deviation components can be incorporated through $b_i$ and $\zeta_i$.
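A draw from the SIW prior can be sketched by combining an IW draw for $Q$ with log-normal scales $\delta_i$. The code below is a minimal NumPy sketch that assumes an integer $\nu \geq d$. The function names and hyperparameter values are hypothetical.

```python
import numpy as np


def sample_inverse_wishart(nu, Lambda, rng):
    """One draw from IW(nu, Lambda); assumes integer nu >= d."""
    d = Lambda.shape[0]
    L = np.linalg.cholesky(np.linalg.inv(Lambda))
    Z = rng.standard_normal((nu, d)) @ L.T  # rows ~ N(0, Lambda^{-1})
    return np.linalg.inv(Z.T @ Z)


def sample_siw(nu, Lambda, b, zeta, rng):
    """One draw of Sigma = Delta Q Delta under the scaled inverse Wishart prior."""
    Q = sample_inverse_wishart(nu, Lambda, rng)
    # log(delta_i) ~ N(b_i, zeta_i^2); rng.normal takes the standard deviation.
    delta = np.exp(rng.normal(b, zeta))
    # Delta Q Delta with diagonal Delta: scale row i and column j.
    Sigma = delta[:, None] * Q * delta[None, :]
    return Sigma, Q, delta


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Sigma, Q, delta = sample_siw(5, np.eye(3), np.zeros(3), 0.5 * np.ones(3), rng)
    print("delta:", delta)
    print("Sigma:\n", Sigma)
```

Because $\Delta$ is diagonal, $\Sigma$ has the same correlation matrix as $Q$. Only the standard deviations are rescaled, and this is where the per-component prior information enters.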

Hierarchical Half-t Prior

For the hierarchical half-t prior, we define the density as

$\Sigma \sim \mathcal{W}^{-1}(\nu+d-1, 2\nu\Lambda),$

where $\Lambda$ is a diagonal matrix with $\Lambda_{i,i} = \lambda_i$ such that

$\lambda_i \sim Ga(\frac{1}{2}, \frac{1}{\zeta_i^2}).$
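The two-stage structure above can be sketched as a sampler that first draws $\lambda_i$ and then draws $\Sigma$ given $\Lambda$. The code below assumes that the second argument of $Ga$ is a rate (so NumPy's `scale` is $\zeta_i^2$) and that $\nu$ is an integer. Both are assumptions of this sketch, and the function name is hypothetical.

```python
import numpy as np


def sample_hierarchical_half_t(nu, zeta, size, rng):
    """Draw `size` covariance matrices from the hierarchical half-t prior."""
    zeta = np.asarray(zeta, dtype=float)
    d = zeta.shape[0]
    df = nu + d - 1  # IW degrees of freedom; integer because nu is assumed integer
    # Assumption: Ga(1/2, 1/zeta_i^2) is shape/rate, i.e. scale = zeta_i^2.
    lam = rng.gamma(shape=0.5, scale=zeta ** 2, size=(size, d))
    # Sigma0 ~ IW(df, I), obtained by inverting Wishart(df, I) draws.
    Z = rng.standard_normal((size, df, d))
    Sigma0 = np.linalg.inv(np.swapaxes(Z, 1, 2) @ Z)
    # If Sigma0 ~ IW(df, I), then D Sigma0 D ~ IW(df, D D); here D D = 2 nu Lambda.
    D = np.sqrt(2.0 * nu * lam)
    return D[:, :, None] * Sigma0 * D[:, None, :]


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    Sigma = sample_hierarchical_half_t(nu=2, zeta=[1.0, 1.0], size=20000, rng=rng)
    sd = np.sqrt(Sigma[:, [0, 1], [0, 1]])
    rho = Sigma[:, 0, 1] / (sd[:, 0] * sd[:, 1])
    print("median standard deviations:", np.median(sd, axis=0))
    print("mean |rho|:", np.abs(rho).mean())
```

Rescaling an identity-scale draw avoids inverting badly scaled matrices when some $\lambda_i$ happen to be very small.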

To Do:

Test all of the prior models defined above in our collapsed Gibbs sampling implementation to see their effect.
Look into separation strategy methods – (i.e., model standard deviation and correlation coefficients independently and then combine to form a prior.)