Expectation Maximization
Note Overview
For this tutorial I was asked to provide a brief overview of Expectation Maximization (E.M.) and model selection. However, I will need to begin with Maximum Likelihood Estimation (M.L.E.). With that in mind, this set of notes is organized as follows. First, I will discuss M.L.E. and provide a short example that links it back to equations we have seen before. Then, I will discuss the E.M. algorithm and give a few applications. Finally, I will briefly talk about model selection and give an applicable example.
Maximum Likelihood Estimation
Assume that we are given a dataset $Y= \lbrace y_1, \ldots, y_n \rbrace $, where $y_i \in \mathbf{R}^d$ and $Y$'s probability distribution depends on some unknown parameter $\theta \in \Theta$. Then, the likelihood of observing this data given a value for $\theta$ is,
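$$ \mathcal{L}(\theta \, ; Y) = p(Y \,|\, \theta) = \prod_{i=1}^{n} p(y_i \,|\, \theta), $$

where the product form assumes the samples are independent and identically distributed,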
and the value of $\theta$ that makes our observed data the most probable is known as the M.L.E.
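That is,

$$ \hat{\theta} = \underset{\theta \in \Theta}{\text{argmax}} \ p(Y \,|\, \theta). $$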
Example – Batch Estimation
As we have assumed before, let's suppose that we have a measurement model,
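$$ \widetilde{y} = H x + \nu, \qquad \nu \sim \mathcal{N}(0, R), $$

where $x$ is the state vector, $H$ is the measurement mapping matrix, and $R$ is the measurement noise covariance (the linear, additive-Gaussian form is assumed for the remainder of this example).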
Because we have defined $\nu$ to be Gaussian, $\widetilde{y}$ is also Gaussian. With that knowledge, we may describe the PDF of the measurements as,
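$$ p(\widetilde{y}) = \frac{1}{\sqrt{ \det(2 \pi R) }} \ \text{exp} \left( -\frac{1}{2} \big( \widetilde{y} - E[\widetilde{y}] \big)^T R^{-1} \big( \widetilde{y} - E[\widetilde{y}] \big) \right), $$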
where we will define the normalization constant as $c$ for the rest of this example.
Now, we can find the expectation of $\widetilde{y}$ as,
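$$ E[\widetilde{y}] = E[Hx + \nu] = Hx + E[\nu] = Hx. $$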
With this information we can represent the conditional probability of our measurements given our states,
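$$ p( \widetilde{y} \,|\, x ) = c \ \text{exp} \left( -\frac{1}{2} ( \widetilde{y} - Hx )^T R^{-1} ( \widetilde{y} - Hx ) \right). $$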
As stated above, the goal of M.L.E. is to find the state estimate, $\hat{x}$, that maximizes $p(\widetilde{y} | x)$, which is the likelihood that $x$ resulted in the measurement $\widetilde{y}$. This means that our objective function is
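$$ \hat{x} = \underset{x}{\text{argmax}} \ p( \widetilde{y} \,|\, x ). $$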
Now, we note that maximizing $ p( \widetilde{y} | x ) $ will result in the same solution as maximizing $ \text{ln}( p( \widetilde{y} | x ) ) $. Therefore, we modify the objective function to
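$$ \hat{x} = \underset{x}{\text{argmax}} \ \text{ln} \big( p( \widetilde{y} \,|\, x ) \big) = \underset{x}{\text{argmax}} \left[ \text{ln}(c) - \frac{1}{2} ( \widetilde{y} - Hx )^T R^{-1} ( \widetilde{y} - Hx ) \right]. $$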
We can convert this to a minimization problem by negating the equation above,
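$$ \hat{x} = \underset{x}{\text{argmin}} \ J(x), \qquad J(x) = \frac{1}{2} ( \widetilde{y} - Hx )^T R^{-1} ( \widetilde{y} - Hx ), $$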
which expands to yield,
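$$ J(x) = \frac{1}{2} \left( \widetilde{y}^T R^{-1} \widetilde{y} - 2 \, x^T H^T R^{-1} \widetilde{y} + x^T H^T R^{-1} H x \right). $$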
Now, we can apply the first differential condition,
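$$ \frac{\partial J}{\partial x} \bigg|_{\hat{x}} = - H^T R^{-1} \widetilde{y} + H^T R^{-1} H \hat{x} = 0 \quad \Longrightarrow \quad \hat{x} = \left( H^T R^{-1} H \right)^{-1} H^T R^{-1} \widetilde{y}. $$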
If we are also interested in the covariance of our estimate, we can find that easily too. We first note that the covariance of a state estimate is given by $P = E[(\hat{x}-\bar{x})(\hat{x}-\bar{x})^{T}].$
We can start by finding the expression for $\bar{x}$, which is given by,
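$$ \bar{x} = E[\hat{x}] = \left( H^T R^{-1} H \right)^{-1} H^T R^{-1} E[\widetilde{y}] = \left( H^T R^{-1} H \right)^{-1} H^T R^{-1} H x = x. $$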
Now, we can find an expression for $(\hat{x} - \bar{x}), $ which is
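$$ \hat{x} - \bar{x} = \left( H^T R^{-1} H \right)^{-1} H^T R^{-1} ( Hx + \nu ) - x = \left( H^T R^{-1} H \right)^{-1} H^T R^{-1} \nu. $$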
Finally, plugging the above equation into the covariance equation yields,
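$$ P = \left( H^T R^{-1} H \right)^{-1} H^T R^{-1} E[\nu \nu^T] R^{-1} H \left( H^T R^{-1} H \right)^{-1} = \left( H^T R^{-1} H \right)^{-1}, $$

where the last equality uses $E[\nu \nu^T] = R$.

For anyone who wants to sanity-check these expressions numerically, here is a small sketch (not part of the original notes) that simulates the assumed linear measurement model and evaluates the weighted-least-squares M.L.E. and its covariance; the dimensions and noise level are arbitrary illustrative choices.

```python
import numpy as np

# Hypothetical linear measurement model: y_tilde = H x + v, with v ~ N(0, R)
rng = np.random.default_rng(0)
x_true = np.array([1.0, -2.0])                    # true state (illustrative)
H = rng.normal(size=(20, 2))                      # measurement mapping matrix
R = 0.1 * np.eye(20)                              # measurement noise covariance
y_tilde = H @ x_true + rng.multivariate_normal(np.zeros(20), R)

# M.L.E. (weighted least squares): x_hat = (H^T R^-1 H)^-1 H^T R^-1 y_tilde
R_inv = np.linalg.inv(R)
P = np.linalg.inv(H.T @ R_inv @ H)                # covariance of the estimate
x_hat = P @ H.T @ R_inv @ y_tilde

print("estimate:", x_hat)
print("estimate covariance:\n", P)
```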
Additional Links
Expectation Maximization
Now, we can move on to Expectation Maximization, where it should be noted that all this algorithm is doing is calculating an M.L.E. for data when some variables are not observed.
Motivating Examples
As shown above, if we are given a full dataset, we can fairly simply calculate an M.L.E. However, there are several applications where we are not provided all of the information needed to fully characterize our problem. This issue segues us into the primary reason for utilizing E.M., which is to optimize in the presence of missing information. To provide a little motivation, I have provided two example applications.
Precise GNSS Data Processing
The first application is GNSS carrier-phase processing. When precise positioning of a platform is required and GNSS data is being utilized, the carrier-phase observable must be incorporated into your filtering algorithm. This carrier-phase observable can be thought of as a sinusoidal wave being propagated from the satellite to the user; however, you do not know the exact number of wavelengths between you and the satellite. If this integer ambiguity can be resolved, then cm-level positioning is obtained; here, the unknown ambiguity plays the role of the missing information discussed above.
Fig 1 :: Factor Graph With Only Pseudorange Observables
Fig 2 :: Factor Graph With Pseudorange and Carrier-Phase Observables
Data Clustering
The second example is a form of soft data clustering. In this example, we have a dataset with an unknown number of clusters. We can utilize E.M. to perform soft-clustering of each data point.
Fig 3 :: Soft Clustering Using E.M.
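As a concrete illustration of the idea pictured above (this sketch is not from the original notes), one can fit a Gaussian mixture with E.M. and read off each point's membership probabilities as the soft cluster assignments; the synthetic data, the choice of three components, and the use of scikit-learn here are all illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data with three loose groups (a stand-in for the plotted dataset)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(center, 0.6, size=(100, 2))
               for center in [(0.0, 0.0), (3.0, 3.0), (0.0, 4.0)]])

# Fit a 3-component Gaussian mixture (GaussianMixture runs E.M. internally)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft clustering: each row gives the probability of membership in each component
soft_labels = gmm.predict_proba(X)
print(soft_labels[:5].round(3))
```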
Brief Intuitive Explanation
Expectation Maximization is an iterative optimization method. This method is incredibly similar to M.L.E., in that its aim is to estimate the unknown parameters $\Theta$, given the set of measurements, Y; however, now we have missing information N. So, with E.M., we want to maximize the posterior probability of the parameters $\Theta$ given the data Y, marginalizing over N:
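$$ \Theta^* = \underset{\Theta}{\text{argmax}} \ P( \Theta \,|\, Y ) = \underset{\Theta}{\text{argmax}} \ \sum_N P( \Theta, N \,|\, Y ). $$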
In the next section, a slightly more informative discussion of E.M. is provided. Specifically, we will discuss expectation maximization in the context of lower bound maximization. That is, we can think of the expectation step as calculating a lower bound to the posterior distribution. Then, the maximization step optimizes that bound, which improves the estimate of the unknowns.
Fig 4 :: EM as Lower Bound
E.M. as Lower Bound Optimization
Again, the premise of E.M. is to start with a guess $\Theta^t$ for the parameters $\Theta$, compute an easily evaluated lower bound $B(\Theta, \Theta^t)$ to the function $ \text{log} \ P( \Theta | Y) $, and maximize that bound instead. If iterated, this procedure converges to a local maximizer $\Theta^*$ of the objective function.
So, our goal is to maximize the likelihood of our parameter vector $\theta$ given the data, Y, and the unknown parameters, N.
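$$ \text{log} \ P( \Theta \,|\, Y ) = \text{log} \sum_N P( \Theta, N \,|\, Y ). $$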
Currently, this equation is not easily solvable because it includes the logarithm of a summation. So, let's manipulate it slightly to see if we can find a tractable bound:
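$$ \text{log} \sum_N P( \Theta, N \,|\, Y ) = \text{log} \sum_N f^t(N) \ \frac{ P( \Theta, N \,|\, Y ) }{ f^t(N) }, $$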
where, for now, we can just say that $f^t(N)$ is an arbitrary probability distribution over the space of N.
Now, by applying Jensen’s inequality, we have
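$$ \text{log} \sum_N f^t(N) \ \frac{ P( \Theta, N \,|\, Y ) }{ f^t(N) } \ \geq \ \sum_N f^t(N) \ \text{log} \left( \frac{ P( \Theta, N \,|\, Y ) }{ f^t(N) } \right) \triangleq B( \Theta, \Theta^t ). $$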
Jensen's Inequality for Log ::
If $$ \lambda_i \geq 0 \quad \text{and} \quad \sum_i \lambda_i = 1, $$ then, $$ \text{ln} \sum_i \lambda_i Q_i \geq \sum_i \lambda_i \ \text{ln} \ Q_i. $$
How do we find an optimal bound?
The E.M. algorithm not only finds a lower bound, it finds the optimal one ( i.e., it finds the lower bound that touches the likelihood curve at the current estimate $\Theta^t$. )
Now, this becomes a constrained optimization problem because $\sum_N f^t(N) = 1.$ So, to solve for the optimal bound, Lagrange Multipliers must be introduced.
That is, we must find a new cost function that takes the form of
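$$ \Lambda = \phi + \lambda \, \psi, $$

where $\lambda$ is the Lagrange multiplier.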
So, let's define $\phi$ and $\psi$ as
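$$ \phi = \sum_N f^t(N) \ \text{log} \left( \frac{ P( \Theta, N \,|\, Y ) }{ f^t(N) } \right) $$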
and
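$$ \psi = 1 - \sum_N f^t(N). $$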
Now, we can define our new cost function as,
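$$ \Lambda = \sum_N f^t(N) \ \text{log} \left( \frac{ P( \Theta, N \,|\, Y ) }{ f^t(N) } \right) + \lambda \left( 1 - \sum_N f^t(N) \right). $$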
Take the derivative,
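$$ \frac{ \partial \Lambda }{ \partial f^t(N) } = \text{log} \ P( \Theta, N \,|\, Y ) - \text{log} \ f^t(N) - 1 - \lambda = 0, $$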
and solve for $f^t(N)$
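$$ f^t(N) = \frac{ P( \Theta^t, N \,|\, Y ) }{ e^{\, 1 + \lambda} } = \frac{ P( \Theta^t, N \,|\, Y ) }{ \sum_N P( \Theta^t, N \,|\, Y ) } = P( N \,|\, Y, \Theta^t ), $$

where the bound is constructed at the current guess $\Theta^t$ and the constraint $\sum_N f^t(N) = 1$ fixes $e^{\, 1 + \lambda}$ to be the normalizing sum.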
Now, plugging this expression back into the first equation in this section, we get
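$$ B( \Theta, \Theta^t ) = \sum_N P( N \,|\, Y, \Theta^t ) \ \text{log} \left( \frac{ P( \Theta, N \,|\, Y ) }{ P( N \,|\, Y, \Theta^t ) } \right). $$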
How do we optimize this bound?
First, let’s start with the bound,
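$$ B( \Theta, \Theta^t ) = \sum_N f^t(N) \ \text{log} \ P( \Theta, N \,|\, Y ) - \sum_N f^t(N) \ \text{log} \ f^t(N). $$

The second summation does not depend on $\Theta$, and Bayes' rule gives $P( \Theta, N \,|\, Y ) \propto P( Y, N \,|\, \Theta ) \, P( \Theta )$, where the omitted factor $1/P(Y)$ is also independent of $\Theta$. Dropping every term that is constant with respect to $\Theta$ and defining

$$ Q^t( \Theta ) \triangleq \sum_N P( N \,|\, Y, \Theta^t ) \ \text{log} \ P( Y, N \,|\, \Theta ), $$

we see that maximizing the bound over $\Theta$ is equivalent to maximizing $Q^t(\Theta) + \text{log} \ P(\Theta)$.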
So, after molding the equation into this form, we can state the final algorithm.
Final Algorithm
Every iteration of the EM algorithm starts by finding a lower bound $B(\Theta,\Theta^t)$ at the current guess $\Theta^t$. Then, the lower bound is maximized to obtain an improved estimate $\Theta^{t+1}$. Thus, the EM algorithm can be written compactly as,
- Expectation –> $f^t (N) = P( N | Y, \Theta^t) $
- Maximization –> $ \Theta^{t+1} = \underset{\Theta}{\text{argmax}} \left[ Q^{t}(\Theta) + \text{log} \ P(\Theta) \right] $
Finally, it is important to know that $Q^t(\Theta)$ is calculated in the expectation step by using the current estimate $\Theta^t$. However, the M-Step optimizes $Q^{t}$ w.r.t. the free parameter $\Theta$ to generate a new estimate, $\Theta^{t+1}$.
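To make the two steps concrete, below is a minimal sketch (not from the original notes) of E.M. for a one-dimensional, two-component Gaussian mixture, where the hidden variable N is the unobserved component label of each sample and $\Theta$ collects the mixture weights, means, and variances; the names and synthetic data are illustrative assumptions, and a flat prior $P(\Theta)$ is used so the M-step reduces to maximizing $Q^t(\Theta)$.

```python
import numpy as np

def em_gmm_1d(y, n_iters=50, seed=0):
    """E.M. for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    w = np.array([0.5, 0.5])                    # mixing weights
    mu = rng.choice(y, size=2, replace=False)   # initial component means
    var = np.array([np.var(y), np.var(y)])      # initial component variances

    for _ in range(n_iters):
        # E-step: responsibilities f^t(N) = P(N | Y, Theta^t)
        lik = np.stack(
            [w[k] * np.exp(-0.5 * (y - mu[k]) ** 2 / var[k]) / np.sqrt(2 * np.pi * var[k])
             for k in range(2)], axis=1)
        resp = lik / lik.sum(axis=1, keepdims=True)

        # M-step: closed-form maximization of Q^t(Theta)
        Nk = resp.sum(axis=0)
        w = Nk / len(y)
        mu = (resp * y[:, None]).sum(axis=0) / Nk
        var = (resp * (y[:, None] - mu) ** 2).sum(axis=0) / Nk

    return w, mu, var

# Usage on synthetic data drawn from two Gaussians
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(y))
```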
Additional Links
Model Selection –> Complexity Selection
The main idea of this section was summed up by George Box when he said “essentially, all models are wrong, but some are useful.” I’m not going to go into much detail on this topic because we have already covered most of the important points in the Bias-Variance discussion. However, I will provide an example of model selection that I conducted a few months ago.
Model Selection Example
Recently, there has been quite a bit of discussion around GNSS spoofing. The primary concern is that the signal structure is public knowledge, so anyone with the technical background could replicate the signal and broadcast it to a user. An example of a spoofed receiver is shown in the figure below.
Fig 5 :: Spoofing Example
I wanted to see how hard it is to detect spoofing. To conduct this study, I was provided a very large dataset from UT Austin ( $\sim$ 2,000,000 labeled data points ), and I trained a few learning algorithms; however, it seemed like no matter how complex the algorithm, I was stuck at a ceiling of 99.6% classification accuracy. For example, below is a fairly complex binary classification tree.
Fig 6 :: Decision Tree
So, I turned my attention to trying to minimize the complexity of the system while maintaining the integrity of the classifier. To do this, I used a simple ensemble of trees, where the key concept is that many weak learners can be aggregated into one high-fidelity estimator.
To reduce the complexity, we first looked at the number of trees required in our ensemble. To measure the accuracy of our tree ensemble, we looked at the out-of-bag (OOB) error. That is, each tree is trained on a subset of the data ( i.e., a bag ) and tested on the remaining data. The OOB error as a function of the number of trees in the ensemble is shown below.
Fig 7 :: Number of Required Trees
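For anyone who wants to run this kind of complexity sweep themselves, here is a rough sketch (not the tooling or data used in the original study) that uses scikit-learn's random forest on synthetic data to record the out-of-bag error as the number of trees grows; the dataset, the ensemble sizes, and the `max_leaf_nodes` cap (a rough stand-in for limiting a tree to about 25 splits) are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the labeled spoofing dataset
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

# Sweep the ensemble size and record the out-of-bag (OOB) error for each
for n_trees in (25, 50, 100, 200):
    clf = RandomForestClassifier(
        n_estimators=n_trees,
        oob_score=True,        # score each tree on the samples left out of its bag
        bootstrap=True,
        max_leaf_nodes=26,     # roughly caps each tree near 25 splits
        random_state=0,
    ).fit(X, y)
    print(n_trees, "trees -> OOB error:", round(1.0 - clf.oob_score_, 4))
```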
Next, we looked again at the OOB error, but this time we wanted to minimize the number of splits in a tree. For this, it can be seen in the image below that the error reaches a minimum at roughly 25 splits.
Fig 8 :: Minimum Number of Splits
So, a new learner was trained with 20 trees in the ensemble and a maximum of 25 splits in a single tree. The classification results for our simple ensemble are shown in Figure 9. As can be seen in the figure, the classification integrity of the learner was maintained while the complexity was decreased.
Fig 9 :: Confusion Matrix for Ensemble