MLP networks belong to the standard toolbox of modern neural network research. They are almost always used for supervised learning, however, to the extent that they are often thought to be suitable only for supervised learning.

Since the signal transformation (ST) approach has been successfully applied to the derivation of efficient, practical algorithms for linear PCA and ICA, it is natural to try the same approach for nonlinear models as well. ST approaches have been proposed for nonlinear independent component analysis using MLP networks in [13,86,134,125,135,126,87,45]. The flexibility of the MLP network makes overfitting a serious problem when point estimates are used. In order to alleviate overfitting, many of the ST approaches need to restrict the structure of the MLP network.

ML estimation has been applied to a nonlinear factor analysis model containing an MLP network in [94]. The use of a point estimate causes the same overfitting problem as with ST approaches. In [85], stochastic sampling has been used for learning a nonlinear factor analysis model using an MLP network. Due to the large number of unknown variables in the model, learning is extremely slow. The dynamic nonlinear models proposed in [31,12] are discussed in section 5.4. Section 6 summarises the nonlinear factor analysis algorithm developed in this thesis.

Self-organising maps have been found to be useful, computationally efficient tools for visualising the structure of data sets because they find a two-dimensional representation for high-dimensional observations. However, their parameterisation, which is based on specifying points in the observation space, is not well suited for factor analysis when the dimension of the latent space is even moderately high. This is because the number of neurons in the grid grows exponentially as a function of the dimension of the latent space. For linear models and MLP networks, the number of parameters grows linearly as a function of the dimension of the latent space, and they are therefore better suited for learning models with a large number of latent variables.
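The difference in growth rates can be made concrete with a small counting sketch. The grid resolution (10 units per latent axis), the hidden layer size (30) and the observation dimension (20) are illustrative assumptions, not values taken from this thesis, and the MLP is assumed to have a single hidden layer.

```python
def som_units(latent_dim: int, units_per_axis: int) -> int:
    """Number of units in a regular SOM grid: units_per_axis ** latent_dim."""
    return units_per_axis ** latent_dim


def som_params(latent_dim: int, units_per_axis: int, obs_dim: int) -> int:
    """Each grid unit stores one point (model vector) in observation space."""
    return som_units(latent_dim, units_per_axis) * obs_dim


def mlp_params(latent_dim: int, hidden: int, obs_dim: int) -> int:
    """Weights and biases of a one-hidden-layer MLP from latent to observation space."""
    return (latent_dim * hidden + hidden) + (hidden * obs_dim + obs_dim)


# Illustrative sizes only (assumptions): 10 units per axis, 30 hidden, 20 outputs.
for d in (1, 2, 5, 10):
    print(d, som_params(d, 10, 20), mlp_params(d, 30, 20))
```

Under these assumptions, each additional latent dimension multiplies the SOM parameter count by the grid resolution, whereas it adds only one weight per hidden neuron to the MLP.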

The signal transformation approach aims at estimating the recognition mapping from observations to factors while generative learning models the mapping from factors to observations. The third alternative, auto-associative learning, estimates both at once. The basic idea is to find a mapping, defined by an MLP network for instance, from observations to themselves through an information bottleneck. This forces the network to find a compact coding for the observations.
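As a minimal sketch of the bottleneck idea, the following NumPy code trains a small auto-associative network whose middle layer has fewer units than the observations have dimensions. The toy data, the layer sizes, the learning rate and the single tanh bottleneck layer are all assumptions made for illustration; this is not the architecture of any of the cited works.

```python
import numpy as np


def train_autoassociative(X, n_code=2, steps=2000, lr=0.1, seed=0):
    """Fit x -> code -> x by full-batch gradient descent on squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = 0.1 * rng.normal(size=(d, n_code))  # recognition part: x -> code
    b1 = np.zeros(n_code)
    W2 = 0.1 * rng.normal(size=(n_code, d))  # generative part: code -> x
    b2 = np.zeros(d)
    losses = []  # reconstruction error measured before each update
    for _ in range(steps):
        code = np.tanh(X @ W1 + b1)  # bottleneck activations
        Y = code @ W2 + b2           # reconstruction of the input
        err = Y - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared reconstruction error.
        dY = 2.0 * err / err.size
        dW2 = code.T @ dY
        db2 = dY.sum(axis=0)
        dz = (dY @ W2.T) * (1.0 - code ** 2)
        dW1 = X.T @ dz
        db1 = dz.sum(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return (W1, b1, W2, b2), losses


# Toy data (assumption): 5-dimensional observations driven by 2 factors.
rng = np.random.default_rng(1)
S = rng.normal(size=(200, 2))
A = 0.5 * rng.normal(size=(2, 5))
X = S @ A + 0.05 * rng.normal(size=(200, 5))

params, losses = train_autoassociative(X)
print(f"reconstruction MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The first half of the network plays the role of the recognition mapping and the second half that of the generative mapping; both are fitted at once from the same reconstruction error.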

Learning is supervised in the sense that both the inputs and the outputs of an MLP network are specified. This also means that learning is exponentially slow in the number of hidden layers. It is, however, possible to use the same strategy as in unsupervised learning: gradually tightening bottlenecks can be added in the middle of the network as the learning proceeds. In any case, learning is slower than with generative models because the recognition mapping and the generative mapping need to be learned separately. In generative learning, the recognition mapping can be computed from the generative mapping by Bayes' rule.
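Written out, with x denoting an observation and s the factors (symbols chosen here for illustration), Bayes' rule gives the recognition mapping as a posterior distribution over the factors:

```latex
p(s \mid x) = \frac{p(x \mid s)\, p(s)}{p(x)},
\qquad
p(x) = \int p(x \mid s)\, p(s)\, ds
```

Here p(x | s) is specified by the generative mapping together with a noise model, and p(s) is the prior distribution of the factors, so no separate parameters are needed for the recognition direction.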

The information bottleneck is usually implemented simply by restricting the number of hidden neurons in the middle layer of the network. However, this alone does not restrict the information content of a real number because the amount of information depends on the accuracy of coding. Using minimum message length inference or the bits-back argument, which leads to ensemble learning, it would be possible to measure the information content accurately, but the resulting algorithm would no longer gain anything from restricting the recognition mapping to the one implemented by the MLP network. The experiments using an MLP network for the recognition mapping in [128] show that gradient descent based inversion of the generative mapping for new observations can profit from the initial guess provided by an MLP network estimating the recognition mapping, but the MLP network by itself performs worse than gradient descent alone.
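Gradient descent based inversion of a generative mapping can be sketched as follows. The generative MLP below is a small network with random, fixed weights, and the initial guess is simply a zero vector standing in for the output of a recognition network; both are assumptions for illustration and do not reproduce the setup of [128]. A simple step-halving rule is added so that the cost never increases.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical fixed generative MLP: 2 factors -> 10 hidden units -> 5 observations.
A = 0.5 * rng.normal(size=(10, 2))
a = 0.1 * rng.normal(size=10)
B = 0.3 * rng.normal(size=(5, 10))
b = 0.1 * rng.normal(size=5)


def generate(s):
    """Generative mapping from factors s to observations."""
    return B @ np.tanh(A @ s + a) + b


def cost(s, x):
    """Half the squared reconstruction error of observation x."""
    r = x - generate(s)
    return 0.5 * float(r @ r)


def invert(x, s_init, steps=200, lr=0.05):
    """Estimate the factors of x by gradient descent started from s_init."""
    s = np.array(s_init, dtype=float)
    for _ in range(steps):
        h = np.tanh(A @ s + a)
        J = B @ (A * (1.0 - h ** 2)[:, None])  # Jacobian of generate at s
        grad = -J.T @ (x - generate(s))
        step = lr
        # Halve the step until the cost does not increase.
        while step > 1e-12 and cost(s - step * grad, x) > cost(s, x):
            step *= 0.5
        if step > 1e-12:
            s = s - step * grad
    return s


s_true = np.array([0.8, -0.5])
x_new = generate(s_true)  # a new, noise-free observation
s_init = np.zeros(2)      # stand-in for the guess of a recognition MLP
s_hat = invert(x_new, s_init)
print(cost(s_init, x_new), cost(s_hat, x_new))
```

In this sketch the recognition network would only supply `s_init`; the refinement itself uses the generative mapping alone.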

Examples of the use of auto-associative MLP networks can be found in [40,48,47]. From the point of view of this thesis, references [48,47] are particularly relevant because they use flat minimum search [46], a method which bears resemblance to minimum message length inference, for measuring the complexity of the MLP network. However, the complexity of the factors is not measured.