
EM algorithm

The EM algorithm was originally presented by Dempster et al. in 1977 [13] as a method for maximum likelihood estimation from incomplete data. Their work formalised and generalised several older ideas; for example, Baum et al. [3] had used very similar methods with hidden Markov models several years earlier.

The term ``incomplete data'' refers to a case where the complete data consists of two parts, only one of which is observed. The missing part can only be inferred from how it affects the observed part. A typical example would be the generative model in Section 3.1.1, where the $ \mathbf{x}(t)$ are the observed data $ \boldsymbol{X}$ and the sources $ \mathbf{s}(t)$ are the missing data $ \boldsymbol {S}$.

The same algorithm can easily be interpreted in the Bayesian framework. Let us assume that we are trying to estimate the posterior probability $ p(\boldsymbol{\theta}\vert \boldsymbol{X})$ of some model parameters $ \boldsymbol {\theta }$. In such a case, the Bayesian version of the algorithm gives a point estimate at the posterior maximum of $ \boldsymbol {\theta }$.

The algorithm starts at some initial estimate $ \widehat{\boldsymbol{\theta}}_0$. The estimate is improved iteratively by first computing the conditional probability distribution $ p(\boldsymbol{S}\vert \widehat{\boldsymbol{\theta}}_i, \boldsymbol{X})$ of the missing data given the current estimate of the parameters. After this, a new estimate $ \widehat{\boldsymbol{\theta}}_{i+1}$ for the parameters is computed by maximising the expectation of $ \log p(\boldsymbol{\theta}\vert \boldsymbol{S}, \boldsymbol{X})$ over the previously calculated $ p(\boldsymbol{S}\vert \widehat{\boldsymbol{\theta}}_i, \boldsymbol{X})$. The first step is called the expectation step (E-step) and the latter the maximisation step (M-step) [16,57].
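
As an illustration, the following sketch applies the two steps to a simple two-component Gaussian mixture of one-dimensional observations, where the missing data $ \boldsymbol {S}$ are the component assignments of the observations. The model, the function name em_gaussian_mixture and the initialisation are illustrative assumptions and not the generative model of Section 3.1.1.

import numpy as np

def em_gaussian_mixture(x, n_iter=50):
    # Initial estimate theta_0: mixture weight, component means and variances.
    w = 0.5
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: compute p(S | theta_i, X), the conditional distribution of
        # the component assignments given the current parameter estimate.
        prior = np.array([w, 1.0 - w])
        lik = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = prior * lik
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximise the expectation of the complete-data log-likelihood
        # over the distribution computed in the E-step; for this model the
        # maximiser is available in closed form.
        n_k = resp.sum(axis=0)
        w = n_k[0] / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n_k
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
    return w, mu, var

For this simple model the M-step has a closed-form solution; in general it may require numerical optimisation. The Bayesian version described above would instead maximise the expectation of $ \log p(\boldsymbol{\theta}\vert \boldsymbol{S}, \boldsymbol{X})$, which adds the log-prior of $ \boldsymbol {\theta }$ to the maximised quantity.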

An important feature of the EM algorithm is that it retains the complete probability distribution of the missing data and uses point estimates only for the model parameters. It should therefore give better results than simple MAP estimation.

Neal and Hinton showed in [45] that the EM algorithm can be interpreted as minimisation of the cost function given by

$\displaystyle C = E_{q(\boldsymbol{S})} \left[ \log \frac{q(\boldsymbol{S})}{p(\boldsymbol{S}\vert \boldsymbol{X}, \boldsymbol{\theta})} \right] - \log p(\boldsymbol{X}\vert \boldsymbol{\theta})$ (3.10)

where at iteration $ i$, $ q(\boldsymbol{S}) = p(\boldsymbol{S}\vert \widehat{\boldsymbol{\theta}}_{i-1}, \boldsymbol{X})$. The notation $ E_{q(\boldsymbol{S})} [ \cdots ]$ means that the expectation is taken over the distribution $ q(\boldsymbol{S})$. This interpretation can be used to justify many interesting variants of the algorithm.
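
The first term of Equation (3.10) is the Kullback-Leibler divergence between $ q(\boldsymbol{S})$ and the posterior of the missing data. Using Bayes' rule, the whole cost can equivalently be written as

$\displaystyle C = E_{q(\boldsymbol{S})} \left[ \log q(\boldsymbol{S}) \right] - E_{q(\boldsymbol{S})} \left[ \log p(\boldsymbol{S}, \boldsymbol{X}\vert \boldsymbol{\theta}) \right].$

The E-step sets $ q(\boldsymbol{S})$ to the posterior, which makes the divergence in Equation (3.10) vanish, while the M-step minimises the form above over $ \boldsymbol {\theta }$ with $ q(\boldsymbol{S})$ fixed. Neither step can therefore increase $ C$, so the likelihood $ p(\boldsymbol{X}\vert \widehat{\boldsymbol{\theta}}_i)$ cannot decrease as the iteration proceeds.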

