PCA for Large Scale Problems with Lots of Missing Values
Tapani Raiko - Alexander Ilin - Juha Karhunen
Abstract:
Principal component analysis (PCA) is a well-known classical data
analysis technique. There are a number of algorithms for solving the
problem, some scaling better than others to problems with high
dimensionality. They also differ in their ability to handle missing
values in the data. We study a case where the data are
high-dimensional and a majority of the values are missing. With
very sparse data, overfitting becomes a severe problem even in
simple linear models such as PCA. We propose an algorithm based on
speeding up a simple principal subspace rule, and extend it to use
regularization and variational Bayesian (VB) learning. The
experiments with Netflix data confirm that the proposed algorithm is
much faster than any of the compared methods, and that the VB-PCA method
provides more accurate predictions for new data than traditional PCA
or regularized PCA.
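
To make the setting concrete, the following is a minimal sketch of fitting a low-rank linear model to a data matrix using only its observed entries, with an L2 penalty to limit overfitting. It uses plain gradient descent as a stand-in; it is not the sped-up subspace rule or the VB-PCA method summarized above. The names masked_pca_gd and observed_rmse, the factor names A and S, and all hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np


def masked_pca_gd(X, mask, n_components=2, lr=0.005, reg=0.1,
                  n_iters=1000, seed=0):
    """Fit X ~ A @ S using only the entries where mask is True.

    Illustrative sketch: plain gradient descent on the squared
    reconstruction error over observed entries, plus an L2 penalty
    on both factors. Hyperparameters are arbitrary assumptions.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    A = 0.1 * rng.standard_normal((d, n_components))
    S = 0.1 * rng.standard_normal((n_components, n))
    for _ in range(n_iters):
        # Residuals on observed entries; missing entries contribute zero.
        E = np.where(mask, X - A @ S, 0.0)
        # Gradients of 0.5*||E||^2 + 0.5*reg*(||A||^2 + ||S||^2).
        grad_A = -E @ S.T + reg * A
        grad_S = -A.T @ E + reg * S
        A -= lr * grad_A
        S -= lr * grad_S
    return A, S


def observed_rmse(X, mask, A, S):
    """Root-mean-square reconstruction error over observed entries."""
    return float(np.sqrt(np.mean((X - A @ S)[mask] ** 2)))
```

Missing entries of X may hold any placeholder (including NaN), since they are masked out before they enter the gradients. Predictions for missing values are read from the corresponding entries of A @ S.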
Tapani Raiko
2007-07-16