
   
Learning Criteria

Problems involving complex models cannot usually be solved analytically. Instead, one can formulate a learning criterion, typically in the form of a cost function. Learning is then done by adjusting the parameters iteratively so that the cost function is minimised. Once the criterion has been chosen, one can apply methods of optimisation theory, such as gradient descent.
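As a minimal sketch of the idea (not part of the original text), the following Python fragment minimises a simple quadratic cost by plain gradient descent with a fixed step size; the cost function and its gradient are illustrative assumptions.

\begin{verbatim}
import numpy as np

def gradient_descent(cost_grad, theta0, step=0.1, n_iter=100):
    # Adjust the parameters iteratively in the direction of the
    # negative gradient, so that the cost function decreases.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        theta = theta - step * cost_grad(theta)
    return theta

# Toy cost C(theta) = ||theta - a||^2 with gradient 2 (theta - a);
# the minimum is at theta = a.
a = np.array([1.0, -2.0])
theta_opt = gradient_descent(lambda th: 2.0 * (th - a), np.zeros(2))
\end{verbatim}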

The first candidate for a cost function could be the mean square reconstruction error

\begin{displaymath}
C_{ms} = \frac{1}{T}\sum_{t=1}^{T}\left\Vert\mathbf{x}(t)-\mathbf{f}(\mathbf{s}(t))\right\Vert^2 .
\end{displaymath} (2.18)

It is easy to see [5] that minimising it is equivalent to maximising the likelihood of the data $p(\mathbf{x}\vert\mathbf{s},\mathbf{f}(\cdot))$, assuming Gaussian noise with equal variance on each component of x(t): the log-likelihood of the noise term depends linearly on the squared reconstruction error. This approach works well in supervised learning tasks, where the number of free parameters is relatively small.
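To spell out the connection, assume the generative model x(t) = f(s(t)) + n(t), where the noise n(t) is Gaussian with variance $\sigma^2$ on each of the $n$ components of x(t) (this noise model is the one implied above; the notation is ours). Then

\begin{displaymath}
-\log p(\mathbf{x}\vert\mathbf{s},\mathbf{f}(\cdot)) =
\frac{1}{2\sigma^2}\sum_{t=1}^{T}\left\Vert\mathbf{x}(t)-\mathbf{f}(\mathbf{s}(t))\right\Vert^2
+ \frac{nT}{2}\log 2\pi\sigma^2 ,
\end{displaymath}

so with $\sigma$ fixed, the last term is constant and maximising the likelihood amounts to minimising $C_{ms}$.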

In basic ICA, the dimensionality of the model is the same as that of the data, so the reconstructions are perfect and the learning criterion cannot be based on the reconstruction error. Instead, one can maximise the nongaussianity of the components. Similarly, when a nonlinear model is allowed enough curvature, the data can be reconstructed perfectly, but this does not guarantee that the model is meaningful or generalises to new samples. This problem is called overfitting, and a better criterion is needed for learning something meaningful.
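As a hedged illustration of a nongaussianity criterion (excess kurtosis is one classical measure used for this purpose; the code and the example distributions are not from the original text):

\begin{verbatim}
import numpy as np

def excess_kurtosis(y):
    # One classical nongaussianity measure; zero for Gaussian data.
    y = y - y.mean()
    return np.mean(y**4) / np.mean(y**2)**2 - 3.0

rng = np.random.default_rng(0)
print(excess_kurtosis(rng.normal(size=100000)))   # close to 0 (Gaussian)
print(excess_kurtosis(rng.laplace(size=100000)))  # close to 3 (super-Gaussian)
\end{verbatim}

A component extracted under such a criterion is one whose measure is as far from the Gaussian value as possible.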

With simple criteria, the amount of data required for avoiding overfitting is directly proportional to the number of free parameters [24]. In supervised learning, increasing the number of data points will therefore always overcome the problem. In unsupervised learning, however, the number of unknown variables grows with the number of data points, since the factors corresponding to each observation are also unknown. This suggests that the simple methods work only when the number of factors is small enough compared to the dimensionality of the data. Even that is not enough if one also tries to estimate the variance of the noise, as will be demonstrated in Section [*]. Bayesian inference solves the problem of overfitting; it is addressed in Chapter [*].
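To make the counting argument concrete (a worked example of ours, assuming a linear model with an $n \times k$ mixing matrix and $k$ factors per observation): the unknowns consist of the $nk$ mixing weights and the $kT$ factor values, while the data provides $nT$ numbers, so

\begin{displaymath}
\frac{\mbox{unknowns}}{\mbox{data values}} = \frac{nk + kT}{nT}
\longrightarrow \frac{k}{n} \quad \mbox{as } T \rightarrow \infty ,
\end{displaymath}

and the data outweighs the unknowns only when $k$ is clearly smaller than $n$, in line with the remark above.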

