Why simple methods fail

The standard backpropagation algorithm has long been used successfully for estimating the parameters of an MLP network (see, e.g., [39,8]). Some explanation is therefore needed as to why it cannot be used in this case. The basic reason is that in supervised learning only the weights of the MLP network are unknown, whereas in unsupervised learning the inputs of the MLP network are unknown as well.
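
As a minimal numerical sketch of this difference (all names and dimensions below are hypothetical, chosen only for illustration), consider a toy two-layer model x(t) = B tanh(A s(t)) + noise. Plain gradient descent on the squared reconstruction error is possible in principle, but the sources S, one column per observation, must be treated as unknowns alongside the weights:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical dimensions: observations, sources, hidden units, samples.
    dim_x, dim_s, dim_h, T = 10, 2, 5, 200

    # Synthetic data from a random "true" model x(t) = B tanh(A s(t)) + noise.
    A_true = rng.normal(size=(dim_h, dim_s))
    B_true = rng.normal(size=(dim_x, dim_h))
    X = B_true @ np.tanh(A_true @ rng.normal(size=(dim_s, T))) \
        + 0.1 * rng.normal(size=(dim_x, T))

    # Unknowns: not only the weights A and B, but also the sources S.
    # In supervised learning the network inputs S would be given.
    A = 0.1 * rng.normal(size=(dim_h, dim_s))
    B = 0.1 * rng.normal(size=(dim_x, dim_h))
    S = 0.1 * rng.normal(size=(dim_s, T))

    lr = 0.02
    for step in range(3000):
        H = np.tanh(A @ S)                  # hidden-layer activations
        E = B @ H - X                       # reconstruction error
        dB = E @ H.T / T                    # gradient w.r.t. output weights
        dH = (B.T @ E) * (1.0 - H ** 2)     # backpropagated error
        dA = dH @ S.T / T                   # gradient w.r.t. input weights
        dS = A.T @ dH / T                   # gradient w.r.t. the sources
        A -= lr * dA
        B -= lr * dB
        S -= lr * dS

    print("mean squared reconstruction error:", np.mean(E ** 2))

Note that the number of unknowns in S grows linearly with the number of observations T, and every one of them receives a point estimate; this is exactly the situation analyzed in the following paragraphs.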

Posterior probability mass is proportional to the volume of the posterior peak, which in turn is proportional to the posterior uncertainty of the unknown variables. With the simple linear models used in factor analysis and independent factor analysis, it is possible to constrain the linear mapping A so that the posterior uncertainty of the factors is roughly constant or bounded from below. Even then, simple linear independent factor analysis algorithms suffer from overfitting, as shown in [56]. With nonlinear models it is far more difficult to ensure that the posterior uncertainty of the factors stays constant without imposing very restrictive conditions on the allowed nonlinearities (see [20] for an example, however), and the models are in any case more vulnerable to overfitting because they have more parameters to be estimated.
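
As a schematic one-dimensional illustration (an added sketch, not part of the original argument): if the posterior of a single factor is approximately Gaussian around its peak $\hat{s}$ with width $\sigma_s$, then for a model $x = f(s) + n$ with noise standard deviation $\sigma_n$, linearizing $f$ around the peak gives
\[
  \mathrm{mass} \;\approx\; p(\hat{s} \mid x)\,\sqrt{2\pi}\,\sigma_s,
  \qquad
  \sigma_s \;\approx\; \frac{\sigma_n}{\lvert f'(\hat{s}) \rvert} ,
\]
so comparing peaks by their posterior density $p(\hat{s} \mid x)$ alone favours regions where the slope $\lvert f' \rvert$ is large and the peak is narrow, even though such peaks carry little probability mass. With a linear mapping the slope is constant and can be constrained; with a general nonlinearity it cannot.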

The severity of overfitting is roughly proportional to the number of unknown variables in the model that are given point estimates, and inversely proportional to the number of observations. In supervised learning, a sufficiently large number of observations can therefore reduce overfitting, assuming the model structure is fixed in advance. In unsupervised learning, increasing the number of observations cannot push this ratio below the dimension of the latent space divided by the dimension of the observations, because for each observation x(t) the corresponding values of the latent variables s(t) need to be estimated separately.
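
In symbols: if $N_w$ denotes the number of weights, $T$ the number of observations, and $\dim s$ and $\dim x$ the dimensions of the latent and observation spaces, the ratio of point-estimated unknowns to observed values is
\[
  \frac{N_w + T \dim s}{T \dim x}
  \;\longrightarrow\; \frac{\dim s}{\dim x}
  \quad \text{as } T \to \infty ,
\]
so unlike in supervised learning, collecting more data cannot drive the ratio to zero.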

The EM algorithm can be used for learning nonlinear latent variable models, as shown in [31]. In EM, the factors are assigned posterior distributions rather than point estimates, so the number of point-estimated variables is comparable to that in supervised learning. Most of the computational cost goes into computing the distribution of the factors, so the extra cost of assigning posterior distributions to the rest of the parameters as well is not very high.
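
The structure of such an algorithm can be illustrated with the linear case, where the E-step is available in closed form (a minimal sketch; the nonlinear algorithm of [31] replaces this E-step with an approximation, and all names and dimensions below are hypothetical):

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic data from a linear factor analysis model x = L s + n.
    dim_x, dim_s, T = 8, 3, 500
    L_true = rng.normal(size=(dim_x, dim_s))
    X = L_true @ rng.normal(size=(dim_s, T)) + 0.3 * rng.normal(size=(dim_x, T))

    L = rng.normal(size=(dim_x, dim_s))   # mapping (point estimate)
    psi = np.ones(dim_x)                  # diagonal noise variances

    for it in range(100):
        # E-step: full posterior of the factors, q(s(t)) = N(m(t), Sigma).
        iPsiL = L / psi[:, None]                          # Psi^{-1} L
        Sigma = np.linalg.inv(np.eye(dim_s) + L.T @ iPsiL)
        M = Sigma @ iPsiL.T @ X                           # posterior means
        Ess = T * Sigma + M @ M.T                         # sum_t E[s s^T]

        # M-step: point estimates of the mapping and the noise variances.
        L = (X @ M.T) @ np.linalg.inv(Ess)
        psi = np.mean(X * X, axis=1) - np.mean((L @ M) * X, axis=1)

    print("estimated noise variances:", np.round(psi, 3))

The point-estimated unknowns are only L and psi, whose number does not grow with T; the factors receive full posterior distributions in the E-step.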

