The learning algorithm is a gradient-based second-order method [2,3]. As will be shown later, it is able to efficiently prune away superfluous parts of the network. This ability is linked to the robustness of the learning algorithm against overfitting, which is necessary when fitting a flexible nonlinear model such as an MLP network to observations. The pruning capability can, however, also be harmful in the beginning of learning: when the network has not yet found a good representation of the observations, it can prematurely prune away parts which are not yet useful in explaining the observations. These parts could be useful later on when the network refines its representation.

This problem can be avoided by making sure there is something reasonable to learn. In the beginning the sources are initialised to the values given by the principal components of the observations. The sources are then kept fixed for 50 sweeps through the data, and only the parameters of the MLP are updated. After the MLP has learned a mapping from the PCA sources to the observations, the sources are adapted as well. After another 50 sweeps both the sources and the parameters of the MLP have reasonable values, and from then on the noise level of each observation channel and the distribution of the sources are also updated.
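The PCA initialisation of the sources can be illustrated with a minimal NumPy sketch. The function name `pca_init_sources`, the array layout (one observation per row) and the choice to leave the projections unscaled are assumptions made for illustration; the text above does not specify whether the principal components are normalised before being used as sources.

```python
import numpy as np


def pca_init_sources(observations, n_sources):
    """Project centred observations onto their leading principal directions.

    observations: array of shape (n_samples, n_channels), one observation per row.
    Returns an array of shape (n_samples, n_sources) with the initial sources.
    """
    observations = np.asarray(observations, dtype=float)
    # Remove the mean of each channel so that the SVD yields principal directions.
    centred = observations - observations.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    # Assumption: the projections are used as they are, without whitening.
    return centred @ vt[:n_sources].T
```

The returned columns are mutually uncorrelated and ordered by decreasing variance, which gives the MLP a fixed, reasonable target representation during the first sweeps.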

The distribution of the sources is modelled by a mixture of Gaussians. In the beginning, however, when the MLP has not yet found the correct nonlinear subspace where the observations lie, a complex model for the distribution of the sources is unnecessary, and therefore only one Gaussian is used in the mixture for the first 2000 sweeps. After that the sources are rotated by a linear ICA algorithm in order to find independent sources. The source distributions are thereafter modelled by mixtures of Gaussians. A total of 7500 sweeps through the data was used in all simulations.
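The staged schedule described above can be summarised as a small lookup from sweep index to the set of active updates. This is only a sketch: the names `Phase` and `phase_for_sweep` are hypothetical, sweeps are assumed to be counted from zero, and the ICA rotation is assumed to be applied once, immediately before the first sweep that uses the full mixture of Gaussians. The sweep counts (50, 100, 2000, 7500) are those given in the text.

```python
from dataclasses import dataclass

SOURCES_START = 50          # sources are kept fixed for the first 50 sweeps
NOISE_AND_DIST_START = 100  # noise levels and source distribution join after 50 more
SINGLE_GAUSSIAN_END = 2000  # only one Gaussian is used for the first 2000 sweeps
TOTAL_SWEEPS = 7500         # total number of sweeps used in the simulations


@dataclass(frozen=True)
class Phase:
    update_mlp: bool
    update_sources: bool
    update_noise_and_source_dist: bool
    single_gaussian: bool
    rotate_with_linear_ica: bool


def phase_for_sweep(sweep):
    """Return which parts of the model are active on a given 0-based sweep."""
    if not 0 <= sweep < TOTAL_SWEEPS:
        raise ValueError(f"sweep must be in [0, {TOTAL_SWEEPS}), got {sweep}")
    return Phase(
        update_mlp=True,  # the MLP parameters are updated on every sweep
        update_sources=sweep >= SOURCES_START,
        update_noise_and_source_dist=sweep >= NOISE_AND_DIST_START,
        single_gaussian=sweep < SINGLE_GAUSSIAN_END,
        # Assumption: the rotation happens exactly once, at the switch-over.
        rotate_with_linear_ica=sweep == SINGLE_GAUSSIAN_END,
    )
```

A training loop would call `phase_for_sweep(t)` for `t` in `range(TOTAL_SWEEPS)` and skip the updates whose flags are false; the actual update rules are those of the learning algorithm in [2,3] and are not reproduced here.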

This procedure can be seen as first using nonlinear PCA to estimate a nonlinear subspace and then using nonlinear ICA to refine the model. This is analogous to the linear case, where linear PCA is often used to estimate a linear subspace for linear ICA. Since the algorithm estimates the noise level on each channel separately, it is more appropriately called nonlinear independent factor analysis (IFA) or, when using only one Gaussian, nonlinear factor analysis (FA).