The expectation-maximisation (EM) algorithm (Neal and Hinton, 1999; MacKay, 2003; Dempster et al., 1977)
can be seen as a hybrid of point estimates and an accurate Bayesian treatment. Recall the division of unknown variables into parameters and latent variables in Section 2.3. EM uses point estimates for the parameters and distributions for the latent variables. The idea is that updating is easy when the two are updated alternately. In the E-step, given the current values of the parameters, the posterior distribution of the latent variables can be solved. In the M-step, the parameters are updated to maximise the likelihood or posterior density, given the fixed distribution over the latent variables.
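In a generic notation (writing $\boldsymbol{\theta}$ for the parameters, $\mathbf{S}$ for the latent variables and $\mathbf{X}$ for the data, which need not coincide with the notation of Section 2.3), the two steps at iteration $t$ can be summarised as
\begin{align}
  \text{E-step:}\quad & q(\mathbf{S}) \leftarrow p(\mathbf{S} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}), \\
  \text{M-step:}\quad & \boldsymbol{\theta}^{(t+1)} \leftarrow \arg\max_{\boldsymbol{\theta}} \; \mathrm{E}_{q(\mathbf{S})}\!\left[ \ln p(\mathbf{X}, \mathbf{S} \mid \boldsymbol{\theta}) \right] + \ln p(\boldsymbol{\theta}),
\end{align}
where the term $\ln p(\boldsymbol{\theta})$ is included for MAP estimation and omitted for maximum likelihood estimation.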
Originally, the EM algorithm was used for maximum likelihood estimation in the presence of missing data. Later it was noted that latent variables can be treated as missing data and that priors can be included to obtain a MAP estimate.
Note that EM not only yields a posterior approximation, but also gives practical rules for iterative updates. In a sense, EM is a recipe or meta-algorithm that is used to devise particular algorithms with model-specific E and M steps.
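As a concrete illustration of this recipe, the following sketch applies EM to a toy two-component Gaussian mixture in one dimension; the model, the NumPy implementation, and the variable names are illustrative assumptions rather than any of the models considered in this thesis.

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic one-dimensional data drawn from two Gaussians (assumed ground truth).
    x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

    # Point estimates for the parameters: mixing weight, means and variances.
    w, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

    def normal_pdf(x, m, v):
        # Density of a univariate Gaussian with mean m and variance v.
        return np.exp(-(x - m) ** 2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)

    for _ in range(100):
        # E-step: posterior distribution of the latent component indicators,
        # given the current point estimates of the parameters.
        p1 = w * normal_pdf(x, mu[0], var[0])
        p2 = (1.0 - w) * normal_pdf(x, mu[1], var[1])
        r = p1 / (p1 + p2)  # responsibility of component 1 for each data point

        # M-step: re-estimate the parameters by maximising the expected
        # complete-data log-likelihood under the fixed responsibilities.
        w = r.mean()
        mu = np.array([np.sum(r * x) / np.sum(r),
                       np.sum((1.0 - r) * x) / np.sum(1.0 - r)])
        var = np.array([np.sum(r * (x - mu[0]) ** 2) / np.sum(r),
                        np.sum((1.0 - r) * (x - mu[1]) ** 2) / np.sum(1.0 - r)])

    print("weight:", w, "means:", mu, "variances:", var)

Each pass through the loop is one EM iteration; only the contents of the E-step and M-step would change for a different model.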
EM is popular for its simplicity, but speeding up the EM algorithm has been a topic of interest since its formulation (Meng and van Dyk, 1995). Petersen et al. (2005) analyse its slowness in the limit of low noise (see also Raiko, 2001, Figure 6.1). It can be more effective to update both the parameters and the latent variables at the same time, for instance by using gradient-based algorithms (for a comparison, see Kersting and Landwehr, 2004), hybrids of EM and gradient methods (Salakhutdinov et al., 2003), or by complementing the alternating updates with a line search (Honkela et al., 2003).
Also, choosing a posterior approximation that is a distribution rather than a single set of values brings many benefits, such as robustness against overfitting and well-founded model selection.
The EM algorithm is adapted for parameter estimation of logical hidden Markov models in Publication VII.