
Overfitting in PCA

A trained PCA model can be used to reconstruct missing values as $ \hat{x}_{ij} = \sum_{k=1}^c a_{ik} s_{kj}$. Although PCA performs only a linear transformation of the data, overfitting is a serious problem for large-scale datasets with many missing values. Overfitting occurs when the value of the cost function (3) is small for the training data but the quality of the predictions $ \hat{x}_{ij}$ is poor for new data. The effect is illustrated with the following toy example.
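In matrix form the reconstruction is simply $ \hat{\mathbf{X}} = \mathbf{A}\mathbf{S}$. The following minimal sketch (in Python with NumPy; the helper names are ours, not from the paper) computes the reconstruction and the training/held-out error comparison by which overfitting would show up:

import numpy as np

# Reconstruction from a trained model: x_hat[i, j] = sum_k a[i, k] * s[k, j],
# i.e. the matrix product A S, where A is d x c and S is c x n.
def reconstruct(A, S):
    return A @ S

# Overfitting appears as a gap between the error on the observed
# (training) entries and the error on held-out entries.
def masked_rmse(X, X_hat, mask):
    resid = (X - X_hat)[mask]
    return np.sqrt(np.mean(resid ** 2))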

Assume that the observation space is $ d=2$-dimensional and that most of the data are only partly observed, that is, either $ x_{1j}$ or $ x_{2j}$ is unknown for most indices $ j$. These partial observations are represented by triangles placed on the two axes in Fig. 1. There are only two complete observations $ (x_{1j},x_{2j})$, shown as circles. A solution that reduces the cost function (3) to zero is given by the line passing through the two fully observed data points (see the left subfigure). The missing values are then reconstructed by points lying on this line.
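This construction can be spelled out numerically. In the sketch below, the coordinates of the two complete observations are invented for illustration; the line through them attains zero cost, and a sample with a missing component is reconstructed by the point on that line matching its observed component:

import numpy as np

# Invented coordinates for the two complete observations; every other
# sample has one of its two components missing.
p1 = np.array([1.0, 0.8])
p2 = np.array([-1.2, -1.0])

# The zero-cost solution: the line through the two complete points.
direction = (p2 - p1) / np.linalg.norm(p2 - p1)

# A sample with x2 missing is reconstructed by the point on the line
# whose first coordinate equals the observed x1.
def reconstruct_x2(x1):
    t = (x1 - p1[0]) / direction[0]
    return p1[1] + t * direction[1]

print(reconstruct_x2(2.0))  # extrapolates a correlation supported by only 2 points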

Figure 1: An artificial example where all but two data points have one of the two components missing. On the left, the correlation between the components is determined by these two samples, giving a badly overfitted solution. On the right, the desired solution where the correlation is not trusted as much (the reconstruction is obtained using the VB algorithm explained in Section 4).
[Figure 1: two panels, overfit.eps (left, overfitted solution) and ofitreg.eps (right, regularized solution)]

In this example, the solution is determined by only two points, and the model is clearly overfitted: there is very little evidence in the data that a significant correlation exists between the two dimensions. Overfitting is even more severe in high-dimensional problems, where there are likely to be many pairs of directions in which the evidence of correlation rests on only a few samples. The right subfigure of Fig. 1 shows a regularized solution that is not overfitted. The correlation is not trusted, and the missing values are reconstructed close to the row-wise means. Note that in regularized PCA, the reconstructions are no longer projections onto the underlying subspace.
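The regularized solution shown in the figure comes from the VB algorithm of Section 4. As a stand-in only, the sketch below uses a simple ridge-style penalty on the latent coordinate, which exhibits the same qualitative behavior: the weaker the evidence, the closer the reconstruction stays to the mean, and for $ \lambda>0$ the reconstruction is no longer an exact projection onto the subspace:

import numpy as np

# Ridge-style stand-in for regularization (not the VB algorithm of
# Section 4): minimizing (x_i - a_i s)^2 + lam * s^2 over the latent
# coordinate s gives s = a_i (x_i - m_i) / (a_i^2 + lam), where m
# holds the row-wise means.
def regularized_reconstruction(a, x_obs, i_obs, m, lam):
    s = a[i_obs] * (x_obs - m[i_obs]) / (a[i_obs] ** 2 + lam)
    return m + a * s

a = np.array([0.6, 0.8])  # invented principal direction
m = np.zeros(2)           # row-wise means
print(regularized_reconstruction(a, 2.0, 0, m, lam=0.0))  # exact projection
print(regularized_reconstruction(a, 2.0, 0, m, lam=1.0))  # shrunk toward m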

Another way to examine overfitting is to compare the number of model parameters to the number of observed values in the data. A rule of thumb is that the number of observed values should be at least ten times the number of parameters to avoid overfitting. Consider the subproblem of finding the $ j$th column vector of $ \mathbf{S}$ given the $ j$th column vector of $ \mathbf{X}$, while regarding $ \mathbf{A}$ as a constant. Here, $ c$ parameters are determined by the observed values of the $ j$th column of $ \mathbf{X}$. If the column has fewer than $ 10c$ observed values, it is likely to suffer from overfitting, and if it has fewer than $ c$, the subproblem is underdetermined.
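This rule of thumb translates directly into a per-column diagnostic. A minimal sketch (the thresholds are the ones stated above; the function name is ours):

import numpy as np

def column_diagnosis(observed_mask, c):
    # observed_mask: boolean (d, n) array marking observed entries of X;
    # c: number of components, i.e. parameters per column of S.
    counts = observed_mask.sum(axis=0)  # observed values per column
    status = np.where(counts < c, "underdetermined",
             np.where(counts < 10 * c, "overfitting risk", "ok"))
    return counts, status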


