Learning nonlinear ICA can be based on several different criteria, but they all aim at finding models which describe as large a part of the observations as possible with as compact a description of the sources as possible. The nonlinear ICA algorithms presented in the literature can be roughly divided into two classes: generative approaches, which estimate the generative model, and signal transformation (ST) approaches, which estimate the recognition model, that is, the inverse of the generative model.

Since the generative model is not directly estimated in the ST approach, it is difficult to measure how large a part of the observations can be described with the sources, except when there are as many sources as there are observations. In that case the observations can be perfectly reconstructed from the sources as long as the recognition mapping is invertible. To the best of our knowledge, all existing ST approaches are restricted to this case. The problem then reduces to transforming the observations into sources which are statistically as independent as possible. For an account of ST approaches for nonlinear ICA, see for instance [9,5,8] and references therein.

In the ST approaches, the problem of model indeterminacy inherent in nonlinear ICA has usually been solved by restricting the model structure. In [9], the number of hidden neurons is the same as the number of observations. In [5], the number of hidden neurons controls the complexity of the reconstruction model, and [8] is restricted to post-nonlinear mixtures. A principled way of making the trade-off between the complexity of the mapping and that of the sources has not been presented in the general case for the ST approach.

In the generative approaches it is easy to measure how large a part of the observations is explained by the sources, and consequently easy to assess the quality of the model. It might seem that the estimation of the sources would be a problem, but this is not the case, as shown here. During learning, small changes in the generative model result in small changes in the optimal values of the sources, and it is therefore easy to track the source values by gradient descent.
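The tracking argument can be illustrated with a minimal sketch. It assumes a one-hidden-layer MLP generative mapping f(s) = B tanh(A s + a) + b and a plain squared reconstruction error; the parameter names, sizes, step size and iteration counts are all hypothetical choices for illustration and are not taken from the algorithm described in this paper.

```python
import numpy as np

def mlp(s, A, a, B, b):
    """Assumed generative mapping f(s) = B tanh(A s + a) + b."""
    return B @ np.tanh(A @ s + a) + b

def recon_error(s, x, A, a, B, b):
    """Squared reconstruction error 0.5 * ||x - f(s)||^2."""
    return 0.5 * np.sum((x - mlp(s, A, a, B, b)) ** 2)

def update_sources(s, x, A, a, B, b, lr=0.01, steps=50):
    """Gradient descent on the reconstruction error with respect to s only."""
    for _ in range(steps):
        h = np.tanh(A @ s + a)
        r = B @ h + b - x                           # reconstruction residual
        grad = A.T @ ((B.T @ r) * (1.0 - h ** 2))   # chain rule through tanh
        s = s - lr * grad
    return s

rng = np.random.default_rng(0)
n_src, n_hid, n_obs = 2, 8, 5                       # hypothetical sizes
A = 0.5 * rng.normal(size=(n_hid, n_src))
a = 0.1 * rng.normal(size=n_hid)
B = 0.5 * rng.normal(size=(n_obs, n_hid))
b = 0.1 * rng.normal(size=n_obs)
x = mlp(rng.normal(size=n_src), A, a, B, b)         # one synthetic observation

# Fit the sources for the current model, starting from zero.
s_init = np.zeros(n_src)
err_init = recon_error(s_init, x, A, a, B, b)
s_fit = update_sources(s_init, x, A, a, B, b, steps=1000)
err_fit = recon_error(s_fit, x, A, a, B, b)

# Perturb the model slightly, as one learning step would, and re-track the
# sources with a few gradient steps started from the previous values.
B_new = B + 0.01 * rng.normal(size=B.shape)
err_before = recon_error(s_fit, x, A, a, B_new, b)
s_tracked = update_sources(s_fit, x, A, a, B_new, b, steps=50)
err_after = recon_error(s_tracked, x, A, a, B_new, b)

print(err_init, err_fit, err_before, err_after)
```

Because the sources are initialised from their previous values, only a few inexpensive steps are needed after each small change of the mapping.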

Although it is possible to measure the complexity of the mapping and the sources in generative approaches, no algorithms which would do this for nonlinear ICA have been proposed apart from our algorithm. Most often the maximum a posteriori (MAP) or the maximum likelihood (ML) estimate is used, at least for some of the unknown variables. In coding terms, a point estimate means that it is impossible to measure the description length, because the accuracy of the description of the variable is neglected. In nonlinear ICA it is necessary to use better estimates for the posterior density of the unknown variables; otherwise there will be problems with overfitting, which can be overcome only by restricting the model structure.
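The coding argument can be made concrete with a standard discretisation sketch. The symbols below (a scalar parameter \theta with prior density p(\theta), encoded to accuracy \epsilon) are introduced here for illustration only and are not the paper's own notation.

```latex
% Code length of a continuous parameter encoded to accuracy \epsilon:
% the probability mass of the quantisation cell is approximately p(\theta)\,\epsilon.
L(\theta) \approx -\log_2 \bigl( p(\theta)\,\epsilon \bigr)
          = -\log_2 p(\theta) - \log_2 \epsilon .
% A point estimate specifies no accuracy, which corresponds to \epsilon \to 0,
% and then the second term, and hence L(\theta), grows without bound.
```

A finite description length therefore requires some statement of the accuracy of each variable, which is the information a posterior density estimate supplies and a point estimate lacks.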

Self-organising maps (SOM) and generative topographic mapping (GTM) have been used for nonlinear ICA. In [7], GTM was used for modelling the nonlinear mapping from sources to observations. In both SOM and GTM, the number of parameters grows exponentially with the number of sources, which makes these mappings unsuitable for larger problems. An ML estimate was used for the parameters of the mapping.
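The exponential growth is easy to see with a small count. The sketch below assumes a regular grid with a fixed number of points per source dimension and one reference vector in observation space per grid node, which is the SOM-style count; in GTM the basis functions are typically placed on a similar grid, so the scaling is of the same kind. The function names and the figures of 10 grid points and 5 observations are hypothetical.

```python
def grid_nodes(points_per_dim, n_sources):
    """Number of nodes in a regular grid over the source space."""
    return points_per_dim ** n_sources

def mapping_params(points_per_dim, n_sources, n_obs):
    """One n_obs-dimensional reference vector per grid node (assumed count)."""
    return grid_nodes(points_per_dim, n_sources) * n_obs

# Parameter count for 10 grid points per dimension and 5 observed signals.
for n_sources in (1, 2, 4, 8):
    print(n_sources, mapping_params(10, n_sources, 5))
```

Under these assumptions each additional source multiplies the parameter count by the number of grid points per dimension.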

MLP networks have been used as generative models in [4,6,1]. In [4], the model for the sources is Gaussian and a computationally expensive stochastic approximation is used for estimating the distribution of the unknown parameters. Only a very simple network with the structure 2-16-2 was tested. An ML estimate for the sources and the parameters of the MLP was used in [6], while in [1], the posterior distribution of the parameters of an auto-associative MLP network was approximated. The distribution of the sources was not modelled in either paper.

Although MLP networks are universal models, which means that any nonlinear mapping can be approximated with arbitrary accuracy given enough hidden neurons, some mappings are difficult to approximate with MLP networks. This problem cannot be completely escaped by any model, since for each model there are mappings which are difficult to represent. MLP networks are in wide use because they have been found to be good models for many naturally occurring processes. There are also modifications to the basic MLP network structure which further increase its representational power. For simplicity we have used the standard MLP structure, but many of these extensions could be used in our algorithm as well. It is evident that, for instance, the signals of the pulp process in Fig. 10 have strong time dependencies, and taking them into account will be an important extension.