Approximation of the posterior

The result of learning is an approximation of the posterior probability of all the unknown variables given the observations. The unknown variables are the factors s(t), the parameters of the mapping f, the variance parameters of the factors and of the observation noise, and the parameters of the hierarchical prior. The approximation is needed because the joint posterior probability of all the unknown variables is very complex, owing to the large number of unknowns and the complex structure of the model.

In most publications of this thesis, the approximation of the posterior is assumed to have a maximally factorial form, that is, all the unknown variables are assumed to be independent given the observations. This can be seen as a minimal extension of point estimates that is sensitive to posterior probability mass instead of probability density. Publication VIII shows how some of the most important posterior correlations between the variables can be included in the approximation without compromising computational efficiency. Notice that although the variables are assumed to be independent a priori, they are dependent in the true posterior because the observations induce dependences between them; the factorial form is therefore an approximation.

Publication I presents the methods in the minimum message length framework and therefore uses a uniform distribution as the approximation of the posterior. The other publications use the Gaussian distribution, which is in general a better approximation of posterior densities. It is also often possible to choose a parameterisation, such as the logarithmic parameterisation of the variance or the "softmax" parameterisation of the mixture coefficients [84], which makes the Gaussian approximation more accurate.
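The effect of the logarithmic parameterisation can be illustrated with a small numerical sketch. The setup is hypothetical and not taken from the publications: n zero-mean observations with sum of squares `sum_sq`, an unknown variance, and a flat prior on the logarithm of the variance. The code compares the skewness of the posterior of the variance with the skewness of the posterior of its logarithm; a Gaussian has zero skewness, so a smaller value suggests a better Gaussian fit.

```python
import numpy as np


def skewness(values, density):
    """Skewness of `values` under an unnormalised density on a uniform grid."""
    weights = density / np.sum(density)  # uniform spacing, so plain sums suffice
    mean = np.sum(weights * values)
    var = np.sum(weights * (values - mean) ** 2)
    return np.sum(weights * (values - mean) ** 3) / var ** 1.5


# Hypothetical data summary (assumption): 10 zero-mean samples, sum of squares 10.
n, sum_sq = 10, 10.0

# Grid over u = ln(variance). With a flat prior on u, the posterior of u is
# proportional to exp(-n*u/2 - sum_sq*exp(-u)/2).
u = np.linspace(-6.0, 25.0, 200001)
log_density = -0.5 * n * u - 0.5 * sum_sq * np.exp(-u)
density = np.exp(log_density - log_density.max())

# Same posterior, viewed in the two parameterisations.
skew_log_variance = skewness(u, density)
skew_variance = skewness(np.exp(u), density)

print("skewness of ln(variance):", skew_log_variance)
print("skewness of variance:    ", skew_variance)
```

In this toy case the posterior of the variance is strongly skewed, whereas the posterior of the log-variance is much closer to symmetric, which is the sense in which the reparameterisation helps the Gaussian approximation.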

The cost function in ensemble learning is the Kullback-Leibler information between the posterior probability and its approximation. Due to the simple factorial form of the approximation, the cost function and its derivatives can be computed efficiently. The required computations closely resemble those of the standard backpropagation algorithm for estimating the unknown variables of the model. The most notable difference is that scalar values are replaced by probability distributions over the values.
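In symbols, the cost can be sketched as follows. The notation is an assumption made here for illustration and does not come from this section: $\theta$ collects all the unknown variables, $X$ denotes the observations, and $q$ is the approximation. The divergence is written from $q$ to the posterior, which is the direction commonly used in ensemble learning.

$$
C(q) = \int q(\theta) \ln \frac{q(\theta)}{p(\theta \mid X)} \, d\theta,
\qquad
q(\theta) = \prod_i q_i(\theta_i)
$$

With a factorial $q$, the term $\int q \ln q \, d\theta$ separates into a sum over the individual variables,

$$
C(q) = \sum_i \int q_i(\theta_i) \ln q_i(\theta_i) \, d\theta_i - \int q(\theta) \ln p(\theta \mid X) \, d\theta ,
$$

so only the expectation of $\ln p(\theta \mid X)$ couples the variables to one another.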

During learning, the approximation of the posterior is adapted by a modification of gradient descent which utilises the structure of the problem, as explained in publication V. The difference from ordinary point estimation is that the weights and factors are characterised by their mean and variance. This is important because the algorithm is then sensitive to the probability mass of the posterior rather than to its probability density. Figure 7 illustrates how a posterior probability density, instead of a point estimate, is adapted for an unknown variable of the model.
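The idea of adapting a mean and a variance rather than a single value can be sketched for one scalar variable. This is a minimal illustration under stated assumptions, not the algorithm of publication V: it uses plain gradient descent with finite-difference gradients, a logarithmic parameterisation of the variance, and a toy log-posterior. All function names (`kl_cost`, `fit_gaussian`, `example_log_post`) are hypothetical.

```python
import numpy as np

# Gauss-Hermite nodes and weights, used to average a function over a Gaussian.
NODES, WEIGHTS = np.polynomial.hermite.hermgauss(40)


def kl_cost(mean, log_var, log_post):
    """KL divergence from q = N(mean, exp(log_var)) to the density exp(log_post)."""
    std = np.exp(0.5 * log_var)
    theta = mean + np.sqrt(2.0) * std * NODES
    expected_log_post = np.sum(WEIGHTS * log_post(theta)) / np.sqrt(np.pi)
    expected_log_q = -0.5 * (np.log(2.0 * np.pi * np.e) + log_var)
    return expected_log_q - expected_log_post


def fit_gaussian(log_post, mean=0.0, log_var=0.0, step=0.1, iters=500, eps=1e-5):
    """Plain gradient descent on the cost, with central-difference gradients."""
    for _ in range(iters):
        grad_mean = (kl_cost(mean + eps, log_var, log_post)
                     - kl_cost(mean - eps, log_var, log_post)) / (2.0 * eps)
        grad_log_var = (kl_cost(mean, log_var + eps, log_post)
                        - kl_cost(mean, log_var - eps, log_post)) / (2.0 * eps)
        mean -= step * grad_mean
        log_var -= step * grad_log_var
    return mean, np.exp(log_var)


def example_log_post(theta):
    """Toy normalised log-posterior (assumption): Gaussian, mean 2, variance 0.25."""
    return -0.5 * (theta - 2.0) ** 2 / 0.25 - 0.5 * np.log(2.0 * np.pi * 0.25)


fitted_mean, fitted_var = fit_gaussian(example_log_post)
print(fitted_mean, fitted_var)
```

Because the toy posterior is itself Gaussian, the fitted mean and variance should approach 2 and 0.25 and the cost should approach zero. With a non-Gaussian `log_post` the same loop would settle on the Gaussian that covers the posterior mass best in the sense of the cost, which need not sit at the density peak.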

**Figure 7:** During adaptation, the approximation of the posterior is fitted to the true posterior distribution. The dashed line shows schematically how the approximation could change in one iteration step. The mean and variance of the approximation are adapted to fit the true posterior better.