Ensemble learning

*Ensemble learning* is a relatively new concept suggested by
Hinton and van Camp in 1993 [23]. It approximates the true posterior
distribution with a tractable distribution, which is fitted to the
actual probability mass with no intermediate point estimates.

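The posterior distribution of the model parameters
$\theta$ given the data $X$ and the model $\mathcal{H}$,
$p(\theta \mid X, \mathcal{H})$, is approximated with another distribution or
*approximating ensemble* $q(\theta)$. The objective function chosen
to measure the quality of the approximation is essentially the same
cost function as the one for the EM algorithm in
Equation (3.10) [38]:

$$
C = \mathrm{E}_{q(\theta)} \left[ \log \frac{q(\theta)}{p(X, \theta \mid \mathcal{H})} \right] = \int q(\theta) \log \frac{q(\theta)}{p(X, \theta \mid \mathcal{H})} \, d\theta \tag{3.11}
$$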

Ensemble learning is based on finding an optimal function to
approximate another function. Such optimisation methods are called
*variational methods* and therefore ensemble learning is
sometimes also called *variational learning* [30].

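A closer look at the cost function $C$ shows that it can be represented as a sum of two simple terms:

$$
C = \int q(\theta) \log \frac{q(\theta)}{p(X, \theta \mid \mathcal{H})} \, d\theta = \int q(\theta) \log \frac{q(\theta)}{p(\theta \mid X, \mathcal{H})} \, d\theta - \log p(X \mid \mathcal{H}) \tag{3.12}
$$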

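The first term in Equation (3.12) is the
*Kullback-Leibler divergence* between the approximate posterior
$q(\theta)$ and the true posterior $p(\theta \mid X, \mathcal{H})$. A simple
application of Jensen's inequality [52] shows that the
Kullback-Leibler divergence $D(q \| p)$ between two distributions
$q(\theta)$ and $p(\theta)$ is always nonnegative:

$$
-D(q \| p) = \int q(\theta) \log \frac{p(\theta)}{q(\theta)} \, d\theta \le \log \int q(\theta) \frac{p(\theta)}{q(\theta)} \, d\theta = \log \int p(\theta) \, d\theta = \log 1 = 0 \tag{3.13}
$$

Since the logarithm is a strictly concave function, the equality holds if and only if $p(\theta) / q(\theta)$ is constant, i.e. $p(\theta) = q(\theta)$.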

The Kullback-Leibler divergence is not symmetric and does not obey the triangle inequality, so it is not a metric. Nevertheless, it can be considered a kind of distance measure between probability distributions [12].
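The nonnegativity and asymmetry of the Kullback-Leibler divergence can be checked numerically. The following is a minimal sketch for discrete distributions, where the integral becomes a sum; the function name `kl_divergence` and the two example distributions are illustrative assumptions, not part of the original text.

```python
import math

def kl_divergence(q, p):
    """D(q || p) = sum_i q_i * log(q_i / p_i) for discrete distributions.

    Assumes p_i > 0 wherever q_i > 0. Terms with q_i == 0 are skipped,
    following the convention 0 * log 0 = 0.
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Two arbitrary example distributions over three outcomes (assumed values).
q = [0.7, 0.2, 0.1]
p = [0.4, 0.4, 0.2]

d_qp = kl_divergence(q, p)  # divergence from p as seen by q
d_pq = kl_divergence(p, q)  # arguments swapped: generally a different value
d_qq = kl_divergence(q, q)  # identical distributions: zero

print(f"D(q||p) = {d_qp:.4f}")
print(f"D(p||q) = {d_pq:.4f}")
print(f"D(q||q) = {d_qq:.4f}")
```

Both divergences are positive but differ from each other, and the divergence of a distribution from itself vanishes, in line with the equality condition above.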

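Using the inequality in Equation (3.13) we find that the cost function $C$ is bounded from below by the negative logarithm of the evidence:

$$
C = D(q(\theta) \| p(\theta \mid X, \mathcal{H})) - \log p(X \mid \mathcal{H}) \ge -\log p(X \mid \mathcal{H})
$$

and there is equality if and only if $q(\theta) = p(\theta \mid X, \mathcal{H})$.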

Looking at this the other way round, the cost function gives a lower bound on the model evidence with

$$
p(X \mid \mathcal{H}) \ge e^{-C}
$$

The error of this estimate is governed by $D(q(\theta) \| p(\theta \mid X, \mathcal{H}))$. Assuming the distribution $q(\theta)$ has been optimised to fit well to the true posterior, the error should be rather small. Therefore it is possible to approximate the evidence by $p(X \mid \mathcal{H}) \approx e^{-C}$. This allows using the values of the cost function for model selection as presented in Section 3.2.1 [35].
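The relation between the cost function and the evidence can be illustrated with a toy model in which everything is computable exactly. The sketch below assumes a parameter that takes only three discrete values, so the integrals reduce to sums; the prior and likelihood values are made up for illustration.

```python
import math

# Hypothetical toy model: theta takes one of three values.
prior = [0.5, 0.3, 0.2]          # p(theta | H), assumed values
likelihood = [0.10, 0.40, 0.25]  # p(X | theta, H) for the observed X, assumed

joint = [pr * li for pr, li in zip(prior, likelihood)]  # p(X, theta | H)
evidence = sum(joint)                                   # p(X | H)
posterior = [j / evidence for j in joint]               # p(theta | X, H)

def cost(q):
    """C = sum_theta q(theta) * log(q(theta) / p(X, theta | H))."""
    return sum(qi * math.log(qi / ji) for qi, ji in zip(q, joint) if qi > 0)

def kl_divergence(q, p):
    """D(q || p) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q_rough = [1 / 3, 1 / 3, 1 / 3]  # a poor approximation of the posterior
c_rough = cost(q_rough)
c_exact = cost(posterior)        # q equal to the true posterior

# Decomposition: C = D(q || posterior) - log evidence.
gap = kl_divergence(q_rough, posterior)
print(f"C (rough q)      = {c_rough:.6f}")
print(f"KL - log p(X|H)  = {gap - math.log(evidence):.6f}")

# exp(-C) never exceeds the evidence and matches it when q is the posterior.
print(f"evidence         = {evidence:.6f}")
print(f"exp(-C), rough q = {math.exp(-c_rough):.6f}")
print(f"exp(-C), exact q = {math.exp(-c_exact):.6f}")
```

The gap between `exp(-c_rough)` and `evidence` is controlled entirely by the Kullback-Leibler term, which is why a well-fitted approximating distribution yields a usable evidence estimate.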

An important feature for practical use of ensemble learning is that the cost function and its derivatives with respect to the parameters of the approximating distribution can be easily evaluated for many models. Hinton and van Camp [23] used a separable Gaussian approximating distribution for a single hidden layer MLP network. Since then, many authors have used the method in different applications.
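To make the claim about easily evaluated derivatives concrete, the following sketch uses a much simpler model than the MLP network of Hinton and van Camp. It assumes observations $x_i \sim N(\theta, 1)$ with prior $\theta \sim N(0, 1)$ and a Gaussian approximating distribution $q(\theta) = N(m, v)$. The model, the data values, the step size and the plain gradient descent loop are all illustrative assumptions. For this model the cost function and its gradient are available in closed form, and the exact posterior and evidence are known, so the result can be verified.

```python
import math

# Hypothetical model: x_i ~ N(theta, 1), prior theta ~ N(0, 1).
# Approximating distribution: q(theta) = N(m, v).
data = [0.8, 1.5, 1.1, 0.3]  # made-up observations
n = len(data)
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)

def cost(m, v):
    """C(m, v) = E_q[log q(theta)] - E_q[log p(X, theta)] in closed form."""
    neg_entropy = -0.5 * math.log(2 * math.pi * math.e * v)
    prior_term = 0.5 * math.log(2 * math.pi) + 0.5 * (m * m + v)
    lik_term = 0.5 * n * math.log(2 * math.pi) + 0.5 * (
        sum_x2 - 2 * m * sum_x + n * (m * m + v))
    return neg_entropy + prior_term + lik_term

def gradient(m, v):
    """Partial derivatives of C with respect to m and v."""
    d_m = (n + 1) * m - sum_x
    d_v = 0.5 * (n + 1 - 1.0 / v)
    return d_m, d_v

# Plain gradient descent from a deliberately poor starting point.
m, v = 0.0, 1.0
initial_cost = cost(m, v)
step = 0.05
for _ in range(200):
    d_m, d_v = gradient(m, v)
    m -= step * d_m
    v -= step * d_v

# Exact results for this conjugate model, used only for comparison.
post_mean = sum_x / (n + 1)
post_var = 1.0 / (n + 1)
log_evidence = (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(n + 1)
                - 0.5 * (sum_x2 - sum_x ** 2 / (n + 1)))

print(f"fitted q:  mean = {m:.6f}, variance = {v:.6f}")
print(f"posterior: mean = {post_mean:.6f}, variance = {post_var:.6f}")
print(f"C at optimum = {cost(m, v):.6f}, -log evidence = {-log_evidence:.6f}")
```

Because the true posterior here is itself Gaussian, the optimised $q(\theta)$ coincides with it and the cost reaches its lower bound. In models where the posterior lies outside the chosen family, the minimum of the cost stays strictly above the negative log evidence.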