** Next:** Approximation of the posterior
** Up:** BAYESIAN NONLINEAR FACTOR ANALYSIS
** Previous:** Model

## Why simple methods fail

The standard backpropagation algorithm has long been used successfully for
estimating the parameters of an MLP network (see, *e.g.*, [39,8]), so it
requires some explanation why it cannot be used here. The basic reason is
that in supervised learning only the weights of the MLP network are
unknown, whereas in unsupervised learning the inputs to the MLP network
are unknown as well.

Posterior probability mass is proportional to the volume of the
posterior peak, which in turn is proportional to the posterior
uncertainty of the unknown variables. With the simple linear models used
in factor analysis and independent factor analysis, it is possible to
constrain the linear mapping
**A** so that the posterior
uncertainty of the factors is roughly constant or bounded from below.
Even then, simple linear independent factor analysis algorithms
suffer from overfitting, as shown in [56]. With
nonlinear models it is far more difficult to ensure that the posterior
uncertainty of the factors is constant without imposing very restrictive
conditions on the allowed nonlinearities (see, however, [20] for
an example), and the models are in any case more vulnerable to
overfitting because they have more parameters to estimate.

The severity of overfitting is roughly proportional to the number of
unknown variables in the model that are given point estimates, and
inversely proportional to the number of observations. In
supervised learning, a sufficiently large number of observations can
therefore reduce overfitting, assuming the model structure is fixed in
advance. In unsupervised learning, increasing the number of
observations cannot push this ratio below the dimension of the latent
space divided by the dimension of the observations. This is because for
each observation
**x**(*t*), the corresponding values of the latent
variables
**s**(*t*) must be estimated separately.
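This counting argument can be made concrete with a minimal sketch. The dimensions below (number of weights, latent dimension, observation dimension) are hypothetical examples, not values from the text; the point is only that the ratio of point-estimated unknowns to observed values vanishes with more data in the supervised case but is bounded below by dim(**s**)/dim(**x**) in the unsupervised case.

```python
def unknowns_per_observed_value(n_weights, latent_dim, obs_dim, n_obs,
                                unsupervised):
    """Ratio of point-estimated unknowns to observed scalar values.

    In supervised learning only the MLP weights are unknown; in
    unsupervised learning the latent values s(t) for every observation
    must be point-estimated as well.
    """
    unknowns = n_weights + (latent_dim * n_obs if unsupervised else 0)
    return unknowns / (obs_dim * n_obs)

# Hypothetical dimensions for illustration only.
n_weights = 500   # MLP weights
latent_dim = 4    # dimension of s(t)
obs_dim = 10      # dimension of x(t)

for T in (100, 10_000, 1_000_000):
    sup = unknowns_per_observed_value(n_weights, latent_dim, obs_dim, T, False)
    unsup = unknowns_per_observed_value(n_weights, latent_dim, obs_dim, T, True)
    print(f"T={T}: supervised {sup:.5f}, unsupervised {unsup:.5f}")
```

As T grows, the supervised ratio tends to zero while the unsupervised ratio approaches latent_dim/obs_dim (here 0.4), matching the bound stated above.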

The EM algorithm can be used for learning nonlinear latent variable
models, as shown in [31]. The number of variables
assigned point estimates is then comparable to supervised
learning. Most of the computational cost goes into computing the
distribution of the factors, so the extra computational cost of
assigning posterior distributions to the rest of the parameters as
well is not very high.

*Harri Valpola*

*2000-10-31*