In nonlinear factor analysis (NFA), the linear mixing matrix used in FA is replaced by a general nonlinear mixing model

x(t) = f(s(t)) + n(t),

where f is a nonlinear mapping from the factors s(t) to the observations x(t) and n(t) is additive noise.
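As a concrete illustration, data could be generated from this model as follows; the particular toy nonlinearity, the dimensions, and the noise level are assumptions made for the sketch, not the forms used in the references:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_factors, n_obs = 1000, 3, 8  # illustrative sizes

# Gaussian factors s(t), as in basic FA
S = rng.standard_normal((T, n_factors))

# A toy nonlinear mixing f: a random linear mix squashed by tanh
M = rng.standard_normal((n_factors, n_obs))

def f(s):
    return np.tanh(s @ M)

# Observations x(t) = f(s(t)) + n(t) with additive Gaussian noise
X = f(S) + 0.1 * rng.standard_normal((T, n_obs))
```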
The assumption that the factors have a Gaussian distribution, as in basic FA, is no longer restrictive, since the nonlinear mapping can compensate for it. The computational complexity is typically polynomial in the number of factors, which means that data with an intrinsic dimensionality of the order of ten can be handled. It should be noted, however, that if the data is clustered, the corresponding mapping f would have to be highly nonlinear and therefore difficult to learn; other methods are better suited for clustered data.
If the mapping f is modelled with, for example, an MLP network with one hidden layer, an analogy to nonlinear component analysis (NCA) emerges. There is a nonlinear mapping between the data space and the higher-dimensional feature space or hidden layer, while the mapping between the underlying factors and the feature space or hidden layer is linear. The main difference is that in NCA the mapping to the feature space is predefined, whereas in NFA the mapping from the hidden layer to the data is learned as well. Also, the feature space in NCA is typically of much higher dimensionality than the hidden layer in NFA.
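One way to read the analogy in code, assuming the hidden-layer pre-activations play the role of the feature space (the sizes and the tanh nonlinearity are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, n_hidden, n_obs = 3, 30, 8

# Both stages are learned in NFA; here they are just random for illustration
A = rng.standard_normal((n_factors, n_hidden))
a = rng.standard_normal(n_hidden)
B = rng.standard_normal((n_hidden, n_obs))
b = rng.standard_normal(n_obs)

def f(s):
    z = s @ A + a               # linear map from factors to the hidden layer
    return np.tanh(z) @ B + b   # nonlinear map between hidden layer and data;
                                # in NCA this stage would be predefined

x = f(rng.standard_normal(n_factors))
```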
The solutions of NFA are highly nonunique [31]. Compared to supervised learning, where only the mapping f can be adjusted, in NFA many mappings are reasonably good, since the factors s(t) can be adjusted accordingly; NCA is comparable to NFA with part of the mapping fixed. This freedom has a downside: simple learning criteria lead to severe overfitting, as will be discussed in Section . The problem can be overcome with, for example, Bayesian learning. The nonuniqueness itself can be demonstrated numerically, as in the sketch below.
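A minimal sketch of the nonuniqueness (the toy mapping and the transformation R are assumptions for illustration): composing f with any invertible transformation of the factor space, while transforming the factors correspondingly, yields exactly the same observations.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((2, 5))

def f(s):
    # toy nonlinear mapping from 2 factors to 5 observations
    return np.tanh(s @ W)

R = np.array([[2.0, 1.0],
              [0.0, 0.5]])      # any invertible transformation of factor space
R_inv = np.linalg.inv(R)

def f2(s):
    return f(s @ R_inv)         # equivalent mapping that undoes R first

S = rng.standard_normal((100, 2))  # one choice of factors
S2 = S @ R                         # transformed factors

# Both (f, S) and (f2, S2) produce exactly the same observations
assert np.allclose(f(S), f2(S2))
```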
Valpola et al. [43,64] modelled the mapping of NFA with an MLP network and used ensemble learning to find the parameters. Experiments showed that, compared to linear FA, fewer factors were required to reach the same reconstruction accuracy. Further experiments with a modification by the present author and Valpola [57] showed that NFA reconstructed missing values in the observations better than FA.
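As a sketch of the missing-value setting (a generic least-squares illustration, not the ensemble-learning procedure of [57]): given a learned mapping f and a partially observed vector x, the factors can be fitted to the observed components and the missing ones read off from the reconstruction.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 10))
B = rng.standard_normal((10, 6))

def f(s):
    # stands in for a mapping already learned from data
    return np.tanh(s @ A) @ B

s_true = rng.standard_normal(2)
x = f(s_true)
observed = np.array([True, True, True, False, False, True])

def loss(s):
    # fit the factors to the observed components only
    return np.sum((f(s)[observed] - x[observed]) ** 2)

s_hat = minimize(loss, np.zeros(2)).x

# predict the missing components from the fitted factors
x_filled = f(s_hat)
print("true missing values: ", x[~observed])
print("reconstructed values:", x_filled[~observed])
```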