A typical machine learning task is to estimate a probability distribution in the data space that best corresponds to the set of real-valued data vectors $\{\mathbf{x}(t)\}$ [3]. This probabilistic model is said to be generative: it can be used to generate data. Instead of finding the distributions directly, one can assume that sources $\mathbf{s}(t)$ have generated the observations $\mathbf{x}(t)$ through a (possibly) nonlinear mapping $\mathbf{f}$:
$$\mathbf{x}(t) = \mathbf{f}(\mathbf{s}(t)) + \mathbf{n}(t), \tag{1}$$
where $\mathbf{n}(t)$ is additive noise.
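To make model (1) concrete, the following sketch (all names and dimensions are hypothetical, chosen to match the 10-to-30-dimensional example below) draws Gaussian sources and pushes them through a random tanh network standing in for the nonlinear mapping $\mathbf{f}$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, S_DIM, H_DIM, X_DIM = 1000, 10, 20, 30   # samples; source, hidden, data dims

# Sources s(t): zero-mean, unit-variance Gaussian vectors.
S = rng.standard_normal((T, S_DIM))

# A random two-layer tanh network stands in for the nonlinear mapping f.
W1 = rng.standard_normal((S_DIM, H_DIM)) / np.sqrt(S_DIM)
W2 = rng.standard_normal((H_DIM, X_DIM)) / np.sqrt(H_DIM)
f = lambda s: np.tanh(s @ W1) @ W2

# Observations x(t) = f(s(t)) + n(t): a 10-dimensional manifold embedded,
# with additive noise, in a 30-dimensional data space.
X = f(S) + 0.1 * rng.standard_normal((T, X_DIM))
```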
It is difficult to visualise the situation if, for instance, a 10-dimensional source space is mapped to form a nonlinear manifold in a 30-dimensional data space. Therefore, some indirect measures for studying the situation are useful. We use real-world data to make the experimental setting realistic and mark parts of the data as missing for the purpose of a controlled comparison. By varying the configuration of the missing values and comparing the quality of their reconstructions, we measure different properties of the algorithms.
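One way to realise such a controlled comparison, sketched here under assumed array shapes rather than as the exact protocol of the experiments, is to hide a random fraction of the entries and score each method by the root-mean-square error of its reconstructions on exactly those entries:

```python
import numpy as np

def mask_randomly(X, frac_missing, rng):
    """Mark a random fraction of the entries of X as missing (NaN)."""
    M = rng.random(X.shape) < frac_missing      # True where a value is hidden
    X_obs = X.copy()
    X_obs[M] = np.nan
    return X_obs, M

def reconstruction_rmse(X_true, X_hat, M):
    """RMSE computed only over the entries that were marked missing."""
    return np.sqrt(np.mean((X_true[M] - X_hat[M]) ** 2))

# Example: hide 20% of the values (X as generated in the previous sketch),
# then compare methods on the hidden part only.
X_obs, M = mask_randomly(X, 0.2, np.random.default_rng(1))
```

Other missing-value configurations, e.g. hiding whole dimensions or contiguous blocks instead of scattered entries, probe different properties of the algorithms.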
Generative models handle missing values in an easy and natural way. Whenever a model is found, reconstructions of the missing values are also obtained. Other methods for handling missing data are discussed in [4]. Reconstructions are used here to demonstrate the properties of hierarchical nonlinear factor analysis (HNFA) [1] by comparing it to nonlinear factor analysis (NFA) [5], linear factor analysis (FA) [6] and to the self-organising map (SOM) [7]. Similar experiments using only the latter three methods were presented in [2].
FA is similar to principal component analysis (PCA) but it has an explicit noise model. It is a basic tool that works well when nonlinear effects are not important. The mapping $\mathbf{f}$ is linear and the sources $\mathbf{s}(t)$ have a diagonal Gaussian distribution. High dimensionality is not a problem. The SOM can be presented in terms of (1), although that is not the standard way. The source vector $\mathbf{s}(t)$ contains discrete map coordinates which select the active map unit. The SOM captures nonlinearities and clusters, but has difficulties with data of high intrinsic dimensionality and with generalisation.
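For FA in particular, the reconstructions have a closed form. With a linear mapping, $\mathbf{x}(t) = \mathbf{A}\mathbf{s}(t) + \mathbf{n}(t)$, the data vectors are jointly Gaussian, so the missing entries of a vector can be filled in by Gaussian conditioning on the observed ones. A minimal sketch, assuming the FA parameters (mixing matrix `A`, diagonal noise variances `psi` and mean `mu`) have already been estimated by some fitting procedure:

```python
import numpy as np

def fa_reconstruct(x, A, psi, mu):
    """Conditional-mean reconstruction of the NaN entries of one data vector
    under the FA model x = A s + n, with s ~ N(0, I) and n ~ N(0, diag(psi)).
    """
    Sigma = A @ A.T + np.diag(psi)      # marginal covariance of x
    m = np.isnan(x)                     # missing entries
    o = ~m                              # observed entries
    x_hat = x.copy()
    # Gaussian conditioning: E[x_m | x_o] = mu_m + S_mo S_oo^{-1} (x_o - mu_o)
    x_hat[m] = mu[m] + Sigma[np.ix_(m, o)] @ np.linalg.solve(
        Sigma[np.ix_(o, o)], x[o] - mu[o])
    return x_hat
```

The nonlinear models obtain reconstructions in the same spirit, by propagating the estimated sources through the learned mapping, though without such a closed form.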