Ensemble learning is a technique for parametric approximation of the posterior probability, where the parametric approximation is fitted to the actual posterior probability by minimising their misfit. The misfit is measured with the Kullback-Leibler information [70], also known as relative entropy. It is a measure suited for comparing probability distributions and, more importantly, it can be computed efficiently in practice if the approximation is chosen to be simple enough.
The Kullback-Leibler information between two probability density functions q(x) and p(x) is

D(q(x) \,\|\, p(x)) = \int q(x) \ln \frac{q(x)}{p(x)} \, dx .   (14)
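As a minimal sketch of Eq. (14) in the discrete case (the function name and the example distributions below are illustrative, not from the text), the integral becomes a sum over states:

```python
import numpy as np

def kl_divergence(q, p):
    """Discrete analogue of Eq. (14): sum_x q(x) ln(q(x)/p(x))."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    # Terms with q(x) = 0 contribute nothing, by the convention 0 ln 0 = 0.
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]
print(kl_divergence(q, p))   # positive whenever q differs from p
print(kl_divergence(p, q))   # note the asymmetry: D(q||p) != D(p||q) in general
print(kl_divergence(q, q))   # zero when the distributions coincide
```

The asymmetry matters for what follows: ensemble learning minimises D(q || p) with q the approximation, which is precisely what makes the fit mass-seeking rather than density-seeking.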
Regarding the approximation of the posterior probability, the most important benefit of ensemble learning is that the Kullback-Leibler information is sensitive to probability mass, and therefore the search for good models focuses on models which have large probability mass as opposed to large probability density. The drawback is that in order for ensemble learning to be computationally efficient, the approximation of the posterior needs to have a simple factorial structure, which means that most dependencies between the various parameters cannot be estimated. On the other hand, it should be possible to use ensemble learning instead of MAP estimation as the first stage in Laplace's method.
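The loss of dependencies under a factorial structure can be sketched for a Gaussian target, where the KL-optimal factorial (diagonal) Gaussian approximation is known in closed form: its variances are the inverses of the diagonal precisions of the target. The numbers below are an illustrative example, not from the text:

```python
import numpy as np

# Hypothetical correlated 2-D Gaussian posterior with covariance C.
C = np.array([[1.0, 0.8],
              [0.8, 1.0]])

# For a Gaussian target, minimising D(q || p) over fully factorial
# Gaussians q gives variances 1/Lambda_ii, where Lambda is the
# precision matrix of the target.
Lam = np.linalg.inv(C)
q_var = 1.0 / np.diag(Lam)

print(np.diag(C))   # true marginal variances of the target
print(q_var)        # factorial q: correlation is lost and spread is underestimated
```

The factorial q drops the correlation entirely and, because D(q || p) penalises placing q's mass where p has little, it also underestimates the marginal variances.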
In its present form, the method was first presented by Hinton and van Camp [44], and the name ensemble learning was given by MacKay in [82]. Ensemble learning can also be seen as a variational method [60], and it has a connection to the EM algorithm [93].