A trained PCA model can be used for reconstructing missing values by
projecting the partially observed data vectors onto the estimated
principal subspace. Although PCA performs a
linear transformation of data, overfitting is a serious problem for
large-scale datasets with lots of missing values. This happens when
the cost value (3) is small for training
data but the quality of prediction
is poor for new
data. This effect is illustrated using the following toy example.
Assume that the observation space is two-dimensional, and most of
the data are only partly observed, that is, either the first or the
second coordinate is unknown for most data vectors. These
observations are represented by
triangles placed on the two axes in Fig. 1. There are
only two fully observed data points, which are shown on the
plot by circles. A solution which minimizes the cost function
(3) to zero is defined by a line that
passes through the two fully observed data points (see the left
subfigure). The missing values are then reconstructed by points lying
on the line.
[Figure 1. Left: the unregularized solution, a line through the two fully observed points. Right: a regularized solution.]
In this example, the solution is defined by only two points and the model is clearly overfitted: there is very little evidence in the data that there exists a significant correlation between the two dimensions. The overfitting problem is even more severe in high-dimensional problems because it is likely that there exist many such pairs of directions in which the evidence of correlation is represented by only a few samples. The right subfigure of Fig. 1 shows a regularized solution that is not overfitted. The correlation is not trusted and the missing values are reconstructed close to the row-wise means. Note that in regularized PCA, the reconstructions are no longer projections to the underlying subspace.
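The contrast between the two subfigures can be sketched numerically. The data values, the ridge-style shrinkage used as the regularizer, and the strength `lam` below are all illustrative assumptions, not the regularization schemes discussed elsewhere in the paper:

```python
import numpy as np

# Two fully observed 2-D points (the circles in Fig. 1); values are made up.
full = np.array([[ 1.0,  0.8],
                 [-1.0, -0.9]])
# Points observed only in the first coordinate (triangles on axis 1).
y1_only = np.array([0.5, -0.3, 0.2])

m1, m2 = full.mean(axis=0)                  # row-wise means
d1, d2 = full[:, 0] - m1, full[:, 1] - m2   # centered coordinates
sxx, sxy = d1 @ d1, d1 @ d2

# Unregularized fit: the line through the two complete points drives the
# cost (3) to zero, and missing second coordinates are read off that line.
a = sxy / sxx
y2_line = m2 + a * (y1_only - m1)

# Regularized fit (simple ridge-style shrinkage as a stand-in): the slope
# is damped, so reconstructions move toward the row-wise mean m2.
lam = 10.0                                  # assumed regularization strength
a_reg = sxy / (sxx + lam)
y2_reg = m2 + a_reg * (y1_only - m1)
```

With a large `lam` the reconstructions collapse to the row-wise mean `m2`, matching the right subfigure; with `lam = 0` they fall back on the overfitted line of the left subfigure.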
Another way to examine overfitting is to compare the number of model
parameters to the number of observed values in data. A rule of thumb
is that the latter should be at least tenfold the former to avoid overfitting.
Consider the subproblem of finding the $j$th column vector of
$\mathbf{X}$ given the $j$th column vector of $\mathbf{Y}$, while
regarding $\mathbf{W}$ a constant. Here, $c$ parameters are
determined by the observed values of the $j$th column vector of
$\mathbf{Y}$. If the column has fewer than $10c$ observations, it is
likely to suffer from overfitting, and if it has fewer than $c$
observations, the subproblem is underdetermined.
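This per-column check is straightforward to automate. The sketch below assumes missing entries are marked with NaN and that each column determines some number c of free parameters (the number of principal components in the subproblem); the function name and the toy data are illustrative:

```python
import numpy as np

def column_status(Y, c):
    """Flag columns of Y by the tenfold rule; np.nan marks missing values.

    c is the number of parameters determined by one column of the data.
    """
    n_obs = np.sum(~np.isnan(Y), axis=0)      # observed values per column
    return ["underdetermined" if n < c        # fewer than c observations
            else "overfitting likely" if n < 10 * c
            else "ok"
            for n in n_obs]

# Illustrative data: 25 samples, 3 columns, c = 2 parameters per column.
rng = np.random.default_rng(0)
Y = rng.standard_normal((25, 3))
Y[1:, 0] = np.nan   # 1 observation:  fewer than c  -> underdetermined
Y[5:, 1] = np.nan   # 5 observations: fewer than 10c -> overfitting likely
                    # column 2 keeps all 25 >= 10c   -> ok
print(column_status(Y, c=2))
# ['underdetermined', 'overfitting likely', 'ok']
```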