In general, there are infinitely many possible explanations of varying complexity for the observed data. All of them should be taken into account, weighted according to their posterior probabilities. This approach, known as Bayesian learning, optimally resolves the tradeoff between under- and overfitting.
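In symbols, a prediction for new data is obtained by averaging the predictions of all explanations $\boldsymbol{\theta}$, weighted by their posterior probabilities (generic notation, not specific to the model considered here):
$$
p(x_{\text{new}} \mid X) = \int p(x_{\text{new}} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid X) \, d\boldsymbol{\theta}.
$$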
In practice, exact treatment of the posterior pdfs of the models is intractable, so a suitable approximation method must be used. Ensemble learning [4,1,7,9], a form of variational learning, approximates the posterior pdf with a parametric distribution. The basic idea is to minimise the misfit between the exact posterior pdf and its parametric approximation.
Let $P(S, \boldsymbol{\theta} \mid X)$ denote the exact posterior pdf of the factors $S$ and the parameters $\boldsymbol{\theta}$ given the data $X$, and let $Q(S, \boldsymbol{\theta})$ denote its parametric approximation. The misfit is measured with the Kullback-Leibler (KL) divergence $C_{\mathrm{KL}}$ between $P$ and $Q$, defined by the cost function
$$
C_{\mathrm{KL}} = \int Q(S, \boldsymbol{\theta}) \ln \frac{Q(S, \boldsymbol{\theta})}{P(S, \boldsymbol{\theta} \mid X)} \, dS \, d\boldsymbol{\theta}.
$$
Because the KL divergence involves an expectation over a distribution, it is sensitive to probability mass rather than to probability density. By Bayes' rule, $P(S, \boldsymbol{\theta} \mid X) = P(X \mid S, \boldsymbol{\theta}) P(S, \boldsymbol{\theta}) / P(X)$, and the evidence term $P(X)$ does not depend on the parameters or the factors and can be neglected. The quantity actually minimised is therefore $C = C_{\mathrm{KL}} - \ln P(X) = \int Q(S, \boldsymbol{\theta}) \ln \frac{Q(S, \boldsymbol{\theta})}{P(X, S, \boldsymbol{\theta})} \, dS \, d\boldsymbol{\theta}$, which requires only the joint density of the data, factors, and parameters.
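To make the cost function concrete, the following Python/NumPy sketch estimates $C = E_Q\{\ln Q - \ln P(X, S, \boldsymbol{\theta})\}$ by Monte Carlo for a hypothetical toy model: a single observation generated as $x = w s$ plus Gaussian noise, with unit-variance Gaussian priors on $w$ and $s$ and a fully factorial Gaussian $Q$. The model, variable names, and values are illustrative assumptions, not the nonlinear factor analysis model of the paper.

import numpy as np

# Hypothetical toy model (illustrative, not the model of the paper):
# one observation x = w * s + noise, unit-variance Gaussian priors on
# w and s, observation noise variance sigma_x**2.  Q(w, s) is a
# factorial Gaussian parameterised by its means and variances.

rng = np.random.default_rng(0)
x = 1.3          # one observed value (assumed for illustration)
sigma_x = 0.1    # assumed observation noise standard deviation

def log_gauss(z, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (z - mean) ** 2 / var)

def cost(q_params, n_samples=100_000):
    """Monte Carlo estimate of C = E_Q[ln Q - ln P(x, w, s)], i.e. the
    KL divergence from Q to the posterior up to the constant -ln P(x)."""
    mw, vw, ms, vs = q_params
    w = rng.normal(mw, np.sqrt(vw), n_samples)    # samples from Q(w)
    s = rng.normal(ms, np.sqrt(vs), n_samples)    # samples from Q(s)
    ln_Q = log_gauss(w, mw, vw) + log_gauss(s, ms, vs)
    ln_P = (log_gauss(x, w * s, sigma_x ** 2)     # likelihood P(x | w, s)
            + log_gauss(w, 0.0, 1.0)              # prior P(w)
            + log_gauss(s, 0.0, 1.0))             # prior P(s)
    return np.mean(ln_Q - ln_P)

# A plausible explanation of x yields a low cost, an implausible one a high cost.
print(cost((1.0, 0.05, 1.3, 0.05)))
print(cost((5.0, 0.05, 0.26, 0.05)))

Because $\ln P(x, w, s)$ factorises into the likelihood and the priors, the evidence $P(x)$ never has to be evaluated; minimising this cost over the means and variances of $Q$ gives the parametric approximation.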
The learning procedure resembles the expectation-maximisation (EM) algorithm: the factors are adjusted while the mapping is kept constant, and the mapping is adjusted while the factors are kept constant, each step minimising the cost function. All parameters and factors are modelled with Gaussian distributions rather than point estimates.
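A minimal sketch of this alternating scheme, for the same hypothetical toy model as above (all names and values are illustrative): $Q(w) = N(m_w, v_w)$ plays the role of the mapping posterior and $Q(s) = N(m_s, v_s)$ the factor posterior, and each step minimises the cost with the other distribution held fixed. In this linear toy case the minimising Gaussian has a closed form; for a nonlinear mapping no such closed form exists and the means and variances would have to be adjusted numerically, each adjustment still decreasing the cost.

# Alternating updates for the toy model x = w * s + noise with unit
# Gaussian priors (illustrative assumptions).  Q(w) = N(mw, vw) is the
# "mapping" posterior, Q(s) = N(ms, vs) the "factor" posterior.

x, sigma_x = 1.3, 0.1
mw, vw = 1.0, 1.0    # initial posterior over the mapping w
ms, vs = 0.5, 1.0    # initial posterior over the factor s

for _ in range(50):
    # Update Q(s) with Q(w) fixed: posterior precision = 1 + E[w^2] / sigma_x^2.
    Ew2 = mw ** 2 + vw
    vs = 1.0 / (1.0 + Ew2 / sigma_x ** 2)
    ms = vs * x * mw / sigma_x ** 2
    # Update Q(w) with Q(s) fixed (the toy model is symmetric in w and s).
    Es2 = ms ** 2 + vs
    vw = 1.0 / (1.0 + Es2 / sigma_x ** 2)
    mw = vw * x * ms / sigma_x ** 2

print(f"Q(w) = N({mw:.3f}, {vw:.5f}),  Q(s) = N({ms:.3f}, {vs:.5f})")

The variances retained in $Q$ express the remaining uncertainty in the mapping and the factors, which is exactly the information that point estimates would discard.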
A more detailed account of the unsupervised ensemble learning method used for nonlinear factor analysis, together with a discussion of potential problems, can be found in [6].