Principal component analysis (PCA) is a classical statistical method. This linear transform has been widely used in data analysis and compression. The following presentation is adapted from [9]. Other texts on the subject include [30] and [31]. Principal component analysis is based on the statistical representation of a random variable. Suppose we have a random vector population $\mathbf{x}$, where

$\mathbf{x} = (x_1, \ldots, x_n)^T,$
and the mean of that population is denoted by

$\mu_x = E\{\mathbf{x}\},$

and the covariance matrix of the same data set is

$C_x = E\{(\mathbf{x} - \mu_x)(\mathbf{x} - \mu_x)^T\}.$
The components of $C_x$, denoted by $c_{ij}$, represent the covariances between the random variable components $x_i$ and $x_j$. The component $c_{ii}$ is the variance of the component $x_i$. The variance of a component indicates the spread of the component values around its mean value. If two components $x_i$ and $x_j$ of the data are uncorrelated, their covariance is zero ($c_{ij} = c_{ji} = 0$).
The covariance matrix is, by definition, always symmetric.
From a sample of vectors $\mathbf{x}_1, \ldots, \mathbf{x}_M$, we can calculate the sample mean and the sample covariance matrix as the estimates of the mean and the covariance matrix.
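For concreteness, a minimal sketch of this estimation step in Python with NumPy is given below; the variable names (samples, mean, cov) and the randomly generated data are illustrative choices, not part of the original presentation.

    import numpy as np

    # Each row of `samples` is one observed data vector x_1, ..., x_M.
    rng = np.random.default_rng(0)
    samples = rng.normal(size=(500, 3))          # M = 500 vectors with n = 3 components

    # Sample mean: average of the data vectors.
    mean = samples.mean(axis=0)

    # Sample covariance matrix: average of the outer products of the
    # mean-removed vectors (unbiased 1/(M-1) normalization).
    centered = samples - mean
    cov = centered.T @ centered / (len(samples) - 1)

    # np.cov(samples, rowvar=False) gives the same estimate.
    assert np.allclose(cov, np.cov(samples, rowvar=False))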
From a symmetric matrix such as the covariance matrix, we can calculate an orthogonal basis by finding its eigenvalues and eigenvectors. The eigenvectors $\mathbf{e}_i$ and the corresponding eigenvalues $\lambda_i$ are the solutions of the equation

$C_x \mathbf{e}_i = \lambda_i \mathbf{e}_i, \quad i = 1, \ldots, n.$
For simplicity we assume that the $\lambda_i$ are distinct. These values can be found, for example, by finding the solutions of the characteristic equation

$|C_x - \lambda I| = 0,$

where $I$ is the identity matrix having the same order as $C_x$ and $|\cdot|$ denotes the determinant of the matrix. If the data vector has $n$ components, the characteristic equation is of order $n$. This is easy to solve only if $n$ is small. Solving eigenvalues and corresponding eigenvectors is a non-trivial task, and many methods exist. One way to solve the eigenvalue problem is to use a neural solution [30]: the data is fed as the input, and the network converges to the desired solution.
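For a small matrix, the characteristic-equation route can be sketched numerically as follows; np.poly returns the coefficients of $\det(C_x - \lambda I)$ as a polynomial in $\lambda$, and the example matrix is an arbitrary illustration, not data from the text.

    import numpy as np

    cov = np.array([[2.0, 0.8],
                    [0.8, 1.0]])

    # Coefficients of the characteristic polynomial det(cov - lambda * I).
    coeffs = np.poly(cov)
    lambdas = np.roots(coeffs)                   # solutions of the characteristic equation

    # They agree with a standard symmetric eigensolver.
    print(np.sort(lambdas), np.sort(np.linalg.eigvalsh(cov)))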
By ordering the eigenvectors by descending eigenvalue (largest first), one can create an ordered orthogonal basis in which the first eigenvector has the direction of the largest variance of the data. In this way, we can find the directions along which the data set has the most significant amounts of energy.
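A sketch of this ordering step, assuming NumPy and a covariance matrix estimated as above (eigh is used because the covariance matrix is symmetric; all names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    samples = rng.normal(size=(500, 3))
    cov = np.cov(samples, rowvar=False)          # symmetric sample covariance matrix

    # eigh is meant for symmetric matrices; it returns eigenvalues in
    # ascending order with the corresponding eigenvectors as columns.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Reorder to descending eigenvalues so that the first eigenvector
    # points in the direction of the largest variance of the data.
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]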
Suppose one has a data set for which the sample mean and the covariance matrix have been calculated. Let $A$ be a matrix consisting of the eigenvectors of the covariance matrix as the row vectors. By transforming a data vector $\mathbf{x}$, we get

$\mathbf{y} = A(\mathbf{x} - \mu_x),$

which is a point in the orthogonal coordinate system defined by the eigenvectors. The components of $\mathbf{y}$ can be seen as the coordinates in the orthogonal basis. We can reconstruct the original data vector $\mathbf{x}$ from $\mathbf{y}$ by

$\mathbf{x} = A^T \mathbf{y} + \mu_x,$

using the property of an orthogonal matrix, $A^{-1} = A^T$, where $A^T$ is the transpose of the matrix $A$. The original vector $\mathbf{x}$ was projected on the coordinate axes defined by the orthogonal basis and then reconstructed by a linear combination of the orthogonal basis vectors.
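This transformation and its inverse can be sketched as follows, with the eigenvectors stored as the rows of a matrix A as in the text; the data and variable names are again only illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    samples = rng.normal(size=(500, 3))
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)

    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    A = eigvecs[:, order].T                      # eigenvectors as row vectors

    x = samples[0]
    y = A @ (x - mean)                           # coordinates in the eigenvector basis
    x_back = A.T @ y + mean                      # exact reconstruction, since A^{-1} = A^T

    assert np.allclose(x, x_back)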
Instead of using all the eigenvectors of the covariance matrix, we may represent the data in terms of only a few basis vectors of the orthogonal basis. If we denote the matrix having the $K$ first eigenvectors as rows by $A_K$, we can create a similar transformation as seen above:

$\mathbf{y} = A_K(\mathbf{x} - \mu_x)$

and

$\hat{\mathbf{x}} = A_K^T \mathbf{y} + \mu_x.$

This means that we project the original data vector on the coordinate axes spanning a subspace of dimension $K$ and transform the vector back by a linear combination of the basis vectors. This minimizes the mean-square error between the data and this representation for a given number of eigenvectors.
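A sketch of the truncated transformation with the $K$ first eigenvectors; the synthetic, strongly correlated data and the choice $K = 1$ are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    # Correlated data so that most of the variance lies in a lower-dimensional subspace.
    latent = rng.normal(size=(500, 1))
    samples = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(500, 3))
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)

    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    A = eigvecs[:, order].T

    K = 1
    A_K = A[:K]                                  # K first eigenvectors as rows

    x = samples[0]
    y = A_K @ (x - mean)                         # K-dimensional representation
    x_hat = A_K.T @ y + mean                     # approximate reconstruction

    print("squared reconstruction error:", np.sum((x - x_hat) ** 2))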
If the data is concentrated in a linear subspace, this provides a way to compress the data without losing much information, while simplifying the representation. By picking the eigenvectors having the largest eigenvalues, we lose as little information as possible in the mean-square sense. One can, for example, choose a fixed number of eigenvectors and their respective eigenvalues and get a consistent representation, or abstraction, of the data. This preserves a varying amount of the energy of the original data. Alternatively, we can fix the amount of preserved energy and let the number of eigenvectors vary. This gives an approximately consistent amount of information at the expense of representations whose subspace dimension varies.
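The second strategy can be sketched by picking the smallest $K$ whose leading eigenvalues cover a chosen fraction of the total energy; the 0.95 threshold and the example eigenvalues below are arbitrary illustrations.

    import numpy as np

    def choose_k(eigvals_desc, energy_fraction=0.95):
        """Smallest K whose leading eigenvalues cover the requested energy."""
        cumulative = np.cumsum(eigvals_desc) / np.sum(eigvals_desc)
        return int(np.searchsorted(cumulative, energy_fraction)) + 1

    # Example: eigenvalues already sorted in descending order.
    eigvals_desc = np.array([5.0, 1.0, 0.2, 0.1])
    print(choose_k(eigvals_desc))                # -> 2, since (5 + 1) / 6.3 >= 0.95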
We are here faced with contradictory goals: on one hand, we should simplify the problem by reducing the dimension of the representation; on the other hand, we want to preserve as much of the original information content as possible. PCA offers a convenient way to control the trade-off between losing information and simplifying the problem at hand.
As will be noted later, it may be possible to create piecewise linear models by dividing the input data into smaller regions and fitting linear models locally to the data.
Now, consider a small example showing the characteristics of the eigenvectors. Some artificial data has been generated, which is illustrated in Figure 3.1. The small dots are the points in the data set.
Figure 3.1: Eigenvectors of the artificially created data
The sample mean and the sample covariance matrix can easily be calculated from the data, and the eigenvectors and eigenvalues from the covariance matrix. The directions of the eigenvectors are drawn in the figure as lines. The first eigenvector, having the largest eigenvalue, points in the direction of the largest variance (right and upwards), whereas the second eigenvector is orthogonal to the first one (pointing left and upwards). In this example the first eigenvalue, corresponding to the first eigenvector, is much larger than the second. By comparing the eigenvalues to the total sum of the eigenvalues, we can get an idea of how much of the energy is concentrated along a particular eigenvector. In this case, the first eigenvector contains almost all the energy, so the data could be well approximated with a one-dimensional representation.
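A sketch that reproduces this kind of two-dimensional experiment with synthetic data (the generated points are not the data of Figure 3.1; the spread along the chosen direction and the noise level are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    # Elongated 2-D point cloud: large spread along one direction, small along the other.
    direction = np.array([1.0, 1.0]) / np.sqrt(2.0)
    t = rng.normal(scale=3.0, size=500)
    noise = rng.normal(scale=0.3, size=(500, 2))
    data = t[:, None] * direction + noise

    cov = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Fraction of the total energy along each eigenvector.
    print(eigvals / eigvals.sum())               # first component close to 1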
Sometimes it is desirable to investigate the behavior of a system under small changes. Assume that the system, or phenomenon, is constrained to an $n$-dimensional manifold that can be approximated with a linear manifold. Suppose one has a small change $\Delta\mathbf{x}$ along one of the coordinate axes in the original coordinate system. If the data from the phenomenon is concentrated in a subspace, we can project this small change onto the approximative subspace built with PCA by projecting it on all the basis vectors of the linear subspace by

$\Delta\mathbf{y} = A_K \Delta\mathbf{x},$

where the matrix $A_K$ has the $K$ first eigenvectors as rows; the subspace then has dimension $K$. $\Delta\mathbf{y}$ represents the change caused by the original small change. This can be transformed back with a change of basis by taking a linear combination of the basis vectors by

$\Delta\hat{\mathbf{x}} = A_K^T \Delta\mathbf{y}.$

Then we get the typical change in the real-world coordinate system caused by a small change, assuming that the phenomenon constrains the system to have values in the limited subspace only.
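A sketch of this propagation of a small change, with $A_K$ built as before; the mixing matrix, the subspace dimension $K = 2$, and the perturbation along the first axis are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    # Data effectively confined to a two-dimensional subspace of a 3-D space.
    latent = rng.normal(size=(500, 2))
    mixing = np.array([[2.0, 0.5, 1.0], [0.0, 1.5, -0.5]])
    samples = latent @ mixing + 0.05 * rng.normal(size=(500, 3))

    cov = np.cov(samples, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    A = eigvecs[:, np.argsort(eigvals)[::-1]].T

    K = 2
    A_K = A[:K]

    # Small change along the first original coordinate axis.
    dx = np.array([0.1, 0.0, 0.0])
    dy = A_K @ dx                                # change expressed in the subspace
    dx_hat = A_K.T @ dy                          # typical change in the original coordinates

    print(dx_hat)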