EM and MAP

The expectation maximisation (EM) algorithm can be seen as a special case of ensemble learning. The set-up in EM is the following: suppose we have a probability model $p(x, y \vert \theta)$, where $x$ is observed but $y$ remains hidden. We would like to estimate $\theta$ by maximum likelihood, i.e., maximise $p(x \vert \theta)$ with respect to $\theta$, but the structure of the model is such that marginalising $y$ out of $p(x, y \vert \theta)$ is difficult, i.e., it is hard to evaluate $p(x \vert \theta) = \int p(x, y \vert \theta) \, dy$.
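
The connection can be made explicit by decomposing the cost function. Assuming $C_y(x \vert \theta)$ denotes the ensemble learning cost introduced in the preceding sections, it can be written as

    $C_y(x \vert \theta) = \int q(y \vert x, \theta) \ln \frac{q(y \vert x, \theta)}{p(x, y \vert \theta)} \, dy = D(q(y \vert x, \theta) \,\Vert\, p(y \vert x, \theta)) - \ln p(x \vert \theta).$

Since the Kullback-Leibler divergence $D(\cdot \Vert \cdot)$ is nonnegative and vanishes only when its arguments are equal, minimising $C_y(x \vert \theta)$ over a free-form $q$ yields the exact posterior, and the cost then equals $-\ln p(x \vert \theta)$.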

What we do is take the cost function $C_y(x \vert \theta)$ and minimise it alternately with respect to $\theta$ and $q(y \vert x, \theta)$. The ordinary EM algorithm results when $q(y \vert x, \theta)$ has a free form, in which case $q(y \vert x, \theta)$ is updated to be $p(y \vert x, \hat{\theta})$, where $\hat{\theta}$ is the current estimate of $\theta$. The method is useful when the expectation of $\ln p(x, y \vert \theta)$ over $q(y \vert x, \theta)$ is easy to compute, which is often the case. This interpretation of EM was given by [4].

The EM algorithm can suffer from overfitting because only point estimates of the parameters $\theta$ are used. Even worse is the maximum a posteriori (MAP) estimator, where one finds the $\theta$ and $y$ which maximise $p(y, \theta \vert x)$. Unlike maximum likelihood estimation, MAP estimation is not invariant under reparametrisations of the model. This is because MAP estimation is sensitive to the probability density, which changes nonuniformly if the parameter space is transformed nonlinearly.
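
To see why, consider a one-dimensional illustration of our own: under an invertible reparametrisation $\phi = g(\theta)$, the posterior density picks up a Jacobian factor,

    $p(\phi \vert x) = p(\theta \vert x) \left\vert \frac{d\theta}{d\phi} \right\vert,$

so the maximum of $p(\phi \vert x)$ is in general not at $g$ applied to the maximum of $p(\theta \vert x)$ unless $g$ is linear. The likelihood $p(x \vert \theta)$, in contrast, is a function of $\theta$ rather than a density over it, so its maximum simply maps to $g(\hat{\theta})$.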

MAP estimation can be interpreted in the ensemble learning framework as minimising $C_{y, \theta}(x)$ with a delta distribution as $q(y, \theta \vert x)$. This makes the integral $\int q(y, \theta \vert x) \ln q(y, \theta \vert x) \, dy \, d\theta$ infinite. The infinity can be neglected when estimating $\theta$ and $y$, because it is constant with respect to $\hat{y}$ and $\hat{\theta}$, but the fact that the cost function is infinite shows that a delta distribution, i.e., a point estimate, is a bad approximation of a posterior density.
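
The divergence is easy to exhibit. If the delta distribution is smoothed into a Gaussian of standard deviation $\sigma$ (a smoothing of our own, introduced only for this illustration), the term above becomes the negative differential entropy of the Gaussian,

    $\int q \ln q \, dy \, d\theta = -\frac{1}{2} \ln (2 \pi e \sigma^2)$

per dimension, which grows without bound as $\sigma \rightarrow 0$. The ensemble learning cost of a point estimate is therefore infinite.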

