

Ensemble learning

Ensemble learning is a relatively new concept, suggested by Hinton and van Camp in 1993 [23]. It allows the true posterior distribution to be approximated with a tractable distribution that is fitted directly to the actual probability mass, with no intermediate point estimates.

The posterior distribution of the model parameters $ \boldsymbol {\theta }$, $ p(\boldsymbol{\theta}\vert \boldsymbol{X}, \mathcal{H})$, is approximated with another distribution, the approximating ensemble $ q(\boldsymbol {\theta })$. The objective function chosen to measure the quality of the approximation is essentially the same cost function as the one used for the EM algorithm in Equation (3.10) [38]

$\displaystyle C(\boldsymbol{\theta}) = \operatorname{E}_{q(\boldsymbol{\theta})} \left\{ \log \frac{q(\boldsymbol{\theta})}{p(\boldsymbol{\theta}, \boldsymbol{X} \vert \mathcal{H})} \right\} = \int q(\boldsymbol{\theta}) \log \frac{q(\boldsymbol{\theta})}{p(\boldsymbol{\theta}, \boldsymbol{X} \vert \mathcal{H})} \,d\boldsymbol{\theta}.$ (3.11)
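As a concrete illustration of how this cost function can be evaluated, the following Python sketch computes $ C(\boldsymbol{\theta})$ in closed form for a hypothetical toy model: observations drawn from a Gaussian with known variance, a Gaussian prior on the mean $ \theta$ and a Gaussian approximating ensemble $ q(\theta)$. The model, the data and all parameter values are illustrative assumptions, not something taken from the text.

# A minimal sketch of evaluating the cost function (3.11) for a toy
# conjugate Gaussian model; sigma, m0, s0, m, s and the data are all
# illustrative assumptions.
import numpy as np

sigma, m0, s0 = 1.0, 0.0, 2.0          # known noise std, prior mean and std
x = np.array([0.8, 1.3, 0.5, 1.1])     # toy observations

def cost(m, s):
    """C(q) = E_q[log q(theta) - log p(theta, x)] for q(theta) = N(m, s^2)."""
    # E_q[log q - log p(theta)] = KL(q || prior), closed form for Gaussians
    kl_prior = np.log(s0 / s) + (s**2 + (m - m0)**2) / (2 * s0**2) - 0.5
    # E_q[log p(x | theta)] uses E_q[(x_i - theta)^2] = (x_i - m)^2 + s^2
    exp_loglik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                        - ((x - m)**2 + s**2) / (2 * sigma**2))
    return kl_prior - exp_loglik

print(cost(0.9, 0.5))                  # cost of one particular ensemble

Because both expectations are available in closed form, no sampling or numerical integration is needed in this toy case; the same property is what makes the cost function tractable for many larger models.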

Ensemble learning is based on finding an optimal function to approximate another function. Such optimisation methods are called variational methods and therefore ensemble learning is sometimes also called variational learning [30].

A closer look at the cost function $ C(\boldsymbol{\theta})$ shows that it can be represented as a sum of two simple terms

\begin{displaymath}\begin{split} C(\boldsymbol{\theta}) &= \int q(\boldsymbol{\theta}) \log \frac{q(\boldsymbol{\theta})}{p(\boldsymbol{\theta}, \boldsymbol{X} \vert \mathcal{H})} \,d\boldsymbol{\theta} = \int q(\boldsymbol{\theta}) \log \frac{q(\boldsymbol{\theta})}{p(\boldsymbol{\theta} \vert \boldsymbol{X}, \mathcal{H})\, p(\boldsymbol{X} \vert \mathcal{H})} \,d\boldsymbol{\theta} \\ &= \int q(\boldsymbol{\theta}) \log \frac{q(\boldsymbol{\theta})}{p(\boldsymbol{\theta} \vert \boldsymbol{X}, \mathcal{H})} \,d\boldsymbol{\theta} - \log p(\boldsymbol{X} \vert \mathcal{H}). \end{split}\end{displaymath} (3.12)

The first term in Equation (3.12) is the Kullback-Leibler divergence between the approximate posterior $ q(\boldsymbol {\theta })$ and the true posterior $ p(\boldsymbol{\theta}\vert \boldsymbol{X}, \mathcal{H})$. A simple application of Jensen's inequality [52] shows that the Kullback-Leibler divergence $ D(q\vert\vert p)$ between two distributions $ q(\boldsymbol {\theta })$ and $ p(\boldsymbol{\theta})$ is always nonnegative:

\begin{displaymath}\begin{split} - D(q(\boldsymbol{\theta}) \vert\vert p(\boldsymbol{\theta})) &= - \int q(\boldsymbol{\theta}) \log \frac{q(\boldsymbol{\theta})}{p(\boldsymbol{\theta})} \,d\boldsymbol{\theta} = \int q(\boldsymbol{\theta}) \log \frac{p(\boldsymbol{\theta})}{q(\boldsymbol{\theta})} \,d\boldsymbol{\theta} \\ &\le \log \int q(\boldsymbol{\theta}) \frac{p(\boldsymbol{\theta})}{q(\boldsymbol{\theta})} \,d\boldsymbol{\theta} = \log \int p(\boldsymbol{\theta}) \,d\boldsymbol{\theta} = \log 1 = 0. \end{split}\end{displaymath} (3.13)

Since the logarithm is a strictly concave function, equality holds if and only if $ p(\boldsymbol{\theta}) / q(\boldsymbol{\theta})$ is constant, and because both distributions integrate to one this means $ p(\boldsymbol{\theta}) = q(\boldsymbol{\theta})$.

The Kullback-Leibler divergence is not symmetric and it does not obey the triangle inequality, so it is not a metric. Nevertheless, it can be considered a kind of distance measure between probability distributions [12].
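Both properties are easy to check numerically. The following Python sketch uses the closed-form Kullback-Leibler divergence between two univariate Gaussians; the parameter values are arbitrary examples.

# Closed-form KL divergence between univariate Gaussians N(m_q, s_q^2)
# and N(m_p, s_p^2); all parameter values below are arbitrary examples.
import numpy as np

def kl_gauss(m_q, s_q, m_p, s_p):
    return np.log(s_p / s_q) + (s_q**2 + (m_q - m_p)**2) / (2 * s_p**2) - 0.5

print(kl_gauss(0.0, 1.0, 1.0, 2.0))   # nonnegative, > 0 for distinct distributions
print(kl_gauss(1.0, 2.0, 0.0, 1.0))   # differs from the previous value: not symmetric
print(kl_gauss(0.0, 1.0, 0.0, 1.0))   # exactly 0 when the distributions coincide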

Using the inequality in Equation (3.13) we find that the cost function $ C(\boldsymbol{\theta})$ is bounded from below by the negative logarithm of the evidence

$\displaystyle C(\boldsymbol{\theta}) = D(q(\boldsymbol{\theta}) \vert\vert p(\boldsymbol{\theta} \vert \boldsymbol{X}, \mathcal{H})) - \log p(\boldsymbol{X} \vert \mathcal{H}) \ge - \log p(\boldsymbol{X} \vert \mathcal{H})$ (3.14)

and there is equality if and only if $ q(\boldsymbol{\theta}) = p(\boldsymbol{\theta}\vert \boldsymbol{X}, \mathcal{H})$.
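In other words, since $ -\log p(\boldsymbol{X}\vert \mathcal{H})$ does not depend on the approximation, minimising the cost function with respect to $ q(\boldsymbol {\theta })$ is equivalent to minimising the Kullback-Leibler divergence to the true posterior:

$\displaystyle \arg\min_{q} C(\boldsymbol{\theta}) = \arg\min_{q} D(q(\boldsymbol{\theta}) \vert\vert p(\boldsymbol{\theta} \vert \boldsymbol{X}, \mathcal{H})).$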

Looking at this the other way round, the cost function gives a lower bound on the model evidence with

$\displaystyle p(\boldsymbol{X}\vert \mathcal{H}) \ge e^{-C(\boldsymbol{\theta})}.$ (3.15)

The error of this estimate is governed by the divergence $ D(q(\boldsymbol{\theta}) \vert\vert p(\boldsymbol{\theta} \vert \boldsymbol{X}, \mathcal{H}))$. Assuming the distribution $ q(\boldsymbol {\theta })$ has been optimised to fit the true posterior well, the error should be rather small. Therefore it is possible to approximate the evidence by $ p(\boldsymbol{X} \vert \mathcal{H}) \approx e^{-C(\boldsymbol{\theta})}$. This allows using the values of the cost function for model selection, as presented in Section 3.2.1 [35].
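The bound in Equation (3.15) can be checked directly on the toy Gaussian model sketched after Equation (3.11), since its exact posterior and evidence are available in closed form. The following Python snippet repeats the setup so that it runs on its own; again, everything in it is an illustrative assumption rather than part of the text.

# Continues the toy conjugate Gaussian model from the sketch after (3.11);
# the setup is repeated here so the snippet runs on its own.
import numpy as np

sigma, m0, s0 = 1.0, 0.0, 2.0
x = np.array([0.8, 1.3, 0.5, 1.1])

def cost(m, s):
    """Cost (3.11) for the Gaussian ensemble q(theta) = N(m, s^2)."""
    kl_prior = np.log(s0 / s) + (s**2 + (m - m0)**2) / (2 * s0**2) - 0.5
    exp_loglik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                        - ((x - m)**2 + s**2) / (2 * sigma**2))
    return kl_prior - exp_loglik

# Exact posterior N(post_m, post_s^2) of the toy model
post_prec = 1 / s0**2 + len(x) / sigma**2
post_s = post_prec ** -0.5
post_m = (m0 / s0**2 + x.sum() / sigma**2) / post_prec

def log_gauss(t, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (t - m)**2 / (2 * s**2)

# Bayes' rule gives log p(X) = log p(X|t) + log p(t) - log p(t|X) for any t
log_evidence = (np.sum(log_gauss(x, post_m, sigma))
                + log_gauss(post_m, m0, s0)
                - log_gauss(post_m, post_m, post_s))

print(np.exp(-cost(0.9, 0.5)), np.exp(log_evidence))        # exp(-C) <= p(X|H)
print(np.exp(-cost(post_m, post_s)), np.exp(log_evidence))  # equal when q is the posterior

For a poorly fitted ensemble the bound is strict, and it becomes tight exactly when $ q(\boldsymbol {\theta })$ equals the true posterior, in line with Equation (3.14).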

An important feature for the practical use of ensemble learning is that the cost function and its derivatives with respect to the parameters of the approximating distribution can be evaluated easily for many models. Hinton and van Camp [23] used a separable Gaussian approximating distribution for an MLP network with a single hidden layer. Since then, many authors have applied the method to a variety of problems.
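One reason why such separable (factorial) approximations are convenient can be seen by writing $ p(\boldsymbol{\theta}, \boldsymbol{X} \vert \mathcal{H}) = p(\boldsymbol{X} \vert \boldsymbol{\theta}, \mathcal{H})\, p(\boldsymbol{\theta} \vert \mathcal{H})$ in Equation (3.11). If, purely for illustration, both the approximation and the prior are assumed to factorise over the individual parameters, $ q(\boldsymbol{\theta}) = \prod_i q_i(\theta_i)$ and $ p(\boldsymbol{\theta} \vert \mathcal{H}) = \prod_i p_i(\theta_i)$, the part of the cost function that compares the approximation with the prior splits into a sum of univariate terms

$\displaystyle \int q(\boldsymbol{\theta}) \log \frac{q(\boldsymbol{\theta})}{p(\boldsymbol{\theta} \vert \mathcal{H})} \,d\boldsymbol{\theta} = \sum_i \int q_i(\theta_i) \log \frac{q_i(\theta_i)}{p_i(\theta_i)} \,d\theta_i,$

each of which has a simple closed form when both $ q_i$ and $ p_i$ are Gaussian.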


