Ensemble learning [1] is a recently introduced method for the parametric approximation of posterior distributions, in which the Kullback-Leibler information [2], [5] is used to measure the misfit between the actual posterior distribution and its approximation. Let us denote the observed variables by x and the unknown variables by y. The true posterior distribution p(y | x) is approximated with the distribution q(y | x) by minimising the Kullback-Leibler information:
$$
\begin{aligned}
D\bigl(q(y\,|\,x)\,\|\,p(y\,|\,x)\bigr)
&= \int q(y\,|\,x)\,\ln\frac{q(y\,|\,x)}{p(y\,|\,x)}\,dy \\
&= \int q(y\,|\,x)\,\ln\frac{q(y\,|\,x)\,p(x)}{p(x,y)}\,dy \\
&= \int q(y\,|\,x)\,\ln\frac{q(y\,|\,x)}{p(x,y)}\,dy + \ln p(x). \qquad (4)
\end{aligned}
$$
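As a point of reference (a textbook special case, not a result from this paper), if both the true posterior and its approximation were univariate Gaussians, the misfit above takes the familiar closed form

$$
D\bigl(\mathcal{N}(m_q,\sigma_q^2)\,\|\,\mathcal{N}(m_p,\sigma_p^2)\bigr)
= \ln\frac{\sigma_p}{\sigma_q}
+ \frac{\sigma_q^2 + (m_q - m_p)^2}{2\sigma_p^2}
- \frac{1}{2},
$$

which vanishes only when the two distributions coincide.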
If we note that the term p(x) is a constant over all the models, we can define a cost function C_y(x) which we minimise to obtain the optimum approximating distribution:
$$
\begin{aligned}
C_y(x) &= D\bigl(q(y\,|\,x)\,\|\,p(y\,|\,x)\bigr) - \ln p(x) \qquad (5) \\
&= \int q(y\,|\,x)\,\ln\frac{q(y\,|\,x)}{p(x,y)}\,dy. \qquad (6)
\end{aligned}
$$
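As a sanity check of (5) and (6) (a toy illustration, not part of the original derivation; the model and all names below are chosen here for concreteness), the cost can be estimated by Monte Carlo for a model with Gaussian prior p(y) = N(0, 1) and likelihood p(x | y) = N(x; y, 1), whose exact posterior is N(x/2, 1/2) and whose evidence is p(x) = N(x; 0, 2):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(z, mean, var):
    """Log-density of a univariate Gaussian N(mean, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

x = 1.3                 # observed value (arbitrary choice)
m, v = x / 2, 0.5       # q(y|x) = N(m, v): here the exact posterior

# Monte Carlo estimate of the cost of eq. (6):
#   C_y(x) = E_q[ ln q(y|x) - ln p(x, y) ]
y = rng.normal(m, np.sqrt(v), size=200_000)
cost = np.mean(log_gauss(y, m, v)          # ln q(y|x)
               - log_gauss(y, 0.0, 1.0)    # - ln p(y), prior N(0, 1)
               - log_gauss(x, y, 1.0))     # - ln p(x|y), likelihood N(x; y, 1)

# With q equal to the exact posterior, the KL term in eq. (5) vanishes,
# so the cost should approach -ln p(x) with evidence p(x) = N(x; 0, 2).
print(cost, -log_gauss(x, 0.0, 2.0))
```

With any other choice of q, the Kullback-Leibler term in (5) is positive and the estimated cost lies above -ln p(x).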
Ensemble learning is practical if the terms p(x, y) and q(y | x) of the cost function C_y(x) can be factorised into simple terms. If this is the case, the logarithms in the cost function split into sums of many simple terms. By virtue of the definition of the models, the likelihood and priors are typically products of simpler distributions, so p(x, y) usually factorises into simple terms. In order to simplify the approximating ensemble, q is also modelled as a product of simple terms.
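To make the splitting explicit (a generic statement, with the index notation chosen here rather than taken from the paper): if q(y | x) = ∏_i q_i(y_i | x) and p(x, y) = ∏_j p_j, then

$$
C_y(x) \;=\; \sum_i \int q_i(y_i\,|\,x)\,\ln q_i(y_i\,|\,x)\,dy_i
\;-\; \sum_j \bigl\langle \ln p_j \bigr\rangle_{q(y\,|\,x)},
$$

so every term of the sum involves only a few simple factors.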
The Kullback-Leibler information is a global measure, provided that the approximating distribution is a global distribution. Therefore the measure is sensitive to the probability mass in the true posterior distribution rather than to the absolute value of the distribution itself.
Training the approximating ensemble can be done by assuming a fixed parametric form for the ensemble (for instance, a product of Gaussians). The parameters of the distributions can then be set so as to minimise the cost function.
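A minimal sketch of this fixed-form approach, assuming the same kind of toy model as above (Gaussian prior N(0, 1), Gaussian likelihood N(x; y, 1)) and a single Gaussian ensemble q(y | x) = N(m, v); the closed-form cost and the optimiser call are illustrative, not the procedure used in the paper:

```python
import numpy as np
from scipy.optimize import minimize

x = 1.3  # observed value (arbitrary choice)

def cost(params):
    """Closed-form C_y(x) for q(y|x) = N(m, v) under the toy model
    p(y) = N(0, 1), p(x|y) = N(x; y, 1)."""
    m, log_v = params
    v = np.exp(log_v)
    e_log_q = -0.5 * (np.log(2 * np.pi * v) + 1.0)                 # E_q[ln q(y|x)]
    e_log_prior = -0.5 * (np.log(2 * np.pi) + m ** 2 + v)          # E_q[ln p(y)]
    e_log_lik = -0.5 * (np.log(2 * np.pi) + (x - m) ** 2 + v)      # E_q[ln p(x|y)]
    return e_log_q - e_log_prior - e_log_lik

result = minimize(cost, x0=np.zeros(2))
m_opt, v_opt = result.x[0], np.exp(result.x[1])
print(m_opt, v_opt)   # approaches the exact posterior mean x/2 and variance 1/2
```

Because the assumed Gaussian family contains the true posterior in this toy case, the minimiser recovers it exactly; in general the fixed form only reaches the best approximation within the chosen family.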
An alternative method is to assume only a separable form for the approximating ensemble. The distributions themselves can then be found by performing a functional minimisation of the cost function with respect to each distribution in the ensemble. While this method must always give ensembles with misfits equal to or lower than those obtained by assuming a parametric form, the distributions that are obtained are not always tractable, so the fixed-form method may be more useful.
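For reference, this functional minimisation leads, for a fully factorised ensemble, to the standard fixed-point condition (stated here in generic notation, not quoted from the paper)

$$
q_i(y_i\,|\,x) \;\propto\; \exp\Bigl(\bigl\langle \ln p(x,y)\bigr\rangle_{\prod_{j\neq i} q_j(y_j\,|\,x)}\Bigr),
$$

and the resulting distribution is useful in practice only when this expectation can be evaluated and normalised in closed form.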