The expectation-maximisation (EM) algorithm (Neal and Hinton, 1999; MacKay, 2003; Dempster et al., 1977)
can be seen as a hybrid of point estimates and an accurate Bayesian treatment. Recall the division of unknown variables into parameters and latent variables in Section 2.3. EM uses point estimates for the parameters and distributions for the latent variables. The idea is that updating is easy when the two are updated alternately. In the E-step, given the current values of the parameters, the posterior distribution of the latent variables can be solved. In the M-step, the parameters are updated to maximise the likelihood or posterior density, given the fixed distribution over the latent variables.
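In a generic notation (writing $\boldsymbol{\theta}$ for the parameters, $\mathbf{S}$ for the latent variables and $\mathbf{X}$ for the data, which need not coincide with the notation of Section 2.3), the two steps at iteration $t$ can be summarised as
\begin{align}
  \text{E-step:}\quad & q(\mathbf{S}) \leftarrow p(\mathbf{S} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}), \\
  \text{M-step:}\quad & \boldsymbol{\theta}^{(t+1)} \leftarrow \arg\max_{\boldsymbol{\theta}} \; \mathrm{E}_{q(\mathbf{S})}\!\left[ \ln p(\mathbf{X}, \mathbf{S} \mid \boldsymbol{\theta}) \right] + \ln p(\boldsymbol{\theta}),
\end{align}
where the term $\ln p(\boldsymbol{\theta})$ is included for MAP estimation and omitted for maximum likelihood estimation.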
Originally, the EM algorithm was used for maximum likelihood estimation in the presence of missing data. Later it was noted that latent variables can be treated as missing data and that priors can be included to obtain a MAP estimate.
Note that EM not only yields a posterior approximation, but also gives practical rules for iterative updates. In a sense, EM is a recipe or meta-algorithm that is used to devise particular algorithms with model-specific E and M steps.
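As a concrete illustration of this recipe, the following sketch applies EM to a toy two-component Gaussian mixture in one dimension; the model, the NumPy implementation, and the variable names are illustrative assumptions rather than any of the models considered in this thesis.

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic one-dimensional data drawn from two Gaussians (assumed ground truth).
    x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

    # Point estimates for the parameters: mixing weight, means and variances.
    w, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

    def normal_pdf(x, m, v):
        # Density of a univariate Gaussian with mean m and variance v.
        return np.exp(-(x - m) ** 2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)

    for _ in range(100):
        # E-step: posterior distribution of the latent component indicators,
        # given the current point estimates of the parameters.
        p1 = w * normal_pdf(x, mu[0], var[0])
        p2 = (1.0 - w) * normal_pdf(x, mu[1], var[1])
        r = p1 / (p1 + p2)  # responsibility of component 1 for each data point

        # M-step: re-estimate the parameters by maximising the expected
        # complete-data log-likelihood under the fixed responsibilities.
        w = r.mean()
        mu = np.array([np.sum(r * x) / np.sum(r),
                       np.sum((1.0 - r) * x) / np.sum(1.0 - r)])
        var = np.array([np.sum(r * (x - mu[0]) ** 2) / np.sum(r),
                        np.sum((1.0 - r) * (x - mu[1]) ** 2) / np.sum(1.0 - r)])

    print("weight:", w, "means:", mu, "variances:", var)

Each pass through the loop is one EM iteration; only the contents of the E-step and M-step would change for a different model.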
EM is popular for its simplicity, but speeding up the EM algorithm has been a topic of interest since its formulation (Meng and van Dyk, 1995). Petersen et al. (2005) analyse its slowness in the limit of low noise (see also Raiko, 2001, Figure 6.1). It can be more effective to update both the parameters and the latent variables at the same time, for instance by using gradient-based algorithms (for a comparison, see Kersting and Landwehr, 2004), hybrids of EM and gradient methods (Salakhutdinov et al., 2003), or by complementing the alternating updates with a line search (Honkela et al., 2003).
Also, choosing a posterior approximation that is a distribution rather than a single set of values brings many benefits, such as robustness against overfitting and well-founded model selection.
The EM algorithm is adapted for parameter estimation of logical hidden Markov models in Publication VII.