Principal component analysis (PCA) is a powerful statistical method for multivariate data analysis. It is efficient and simple to use for noise reduction, feature selection, compression, decorrelation and whitening, and it is used widely throughout signal processing. See Jolliffe (2002) and Haykin (1998) for more information on principal component analysis and algorithmic implementations.
The eigenvalues $\lambda_i$ and corresponding eigenvectors $\mathbf{v}_i$ of an $n \times n$ square matrix $\mathbf{A}$ are defined as the solutions of:
\[
\mathbf{A}\mathbf{v}_i = \lambda_i \mathbf{v}_i, \qquad i = 1, \ldots, n.
\]
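For illustration, the defining relation can be verified numerically, for example with NumPy (the example matrix and all variable names below are arbitrary choices for this sketch, not part of the derivation):

\begin{verbatim}
import numpy as np

# Arbitrary 2 x 2 example matrix.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

# np.linalg.eig returns the eigenvalues and the eigenvectors
# (as columns) of a general square matrix; for the real symmetric
# matrices used below, np.linalg.eigh is the better choice.
eigvals, eigvecs = np.linalg.eig(A)

# Check the defining relation A v_i = lambda_i v_i for each pair.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)
\end{verbatim}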
The principal components of multivariate data $\mathbf{X}$ are defined as the orthonormal basis vectors that contain the maximum variance of the data. Each row of the data matrix $\mathbf{X}$ contains a different observation, with the samples on the columns. One way of finding the principal components is to use the EVD of the estimated covariance matrix of the data. First of all, the covariance matrix $\mathbf{C}_x = E\{\mathbf{x}\mathbf{x}^T\}$, where $\mathbf{x}$ is a column of $\mathbf{X}$ and the expectation $E\{\cdot\}$ is in practice estimated as an average over the samples, is a real symmetric matrix. Therefore, the following must hold for any two eigenvectors $\mathbf{v}_i$ and $\mathbf{v}_j$ of a covariance matrix:
\[
\lambda_i \mathbf{v}_i^T \mathbf{v}_j = (\mathbf{C}_x \mathbf{v}_i)^T \mathbf{v}_j = \mathbf{v}_i^T \mathbf{C}_x \mathbf{v}_j = \lambda_j \mathbf{v}_i^T \mathbf{v}_j.
\]
Clearly, when $\lambda_i \neq \lambda_j$ this is only possible if $\mathbf{v}_i^T \mathbf{v}_j = 0$, thus the eigenvectors must be orthogonal. Additionally, assuming the data has zero mean and the eigenvectors are normalized, that is, $E\{\mathbf{x}\} = \mathbf{0}$ and $\|\mathbf{v}_i\| = 1$, the following holds for the data projected along a principal direction $\mathbf{v}_i$:
\[
E\{(\mathbf{v}_i^T \mathbf{x})^2\} = \mathbf{v}_i^T E\{\mathbf{x}\mathbf{x}^T\} \mathbf{v}_i = \mathbf{v}_i^T \mathbf{C}_x \mathbf{v}_i = \lambda_i \mathbf{v}_i^T \mathbf{v}_i = \lambda_i.
\]
Thus, the eigenvalues corresponding to the orthonormal eigenvectors equal the variances of the data projected along the eigenvector directions. Therefore, decomposing the covariance matrix with EVD gives the principal components of the data. The usual convention is to make the one with the largest eigenvalue, or variance, the first principal component and so on.
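As a minimal sketch of this procedure (NumPy-based, with purely illustrative data and variable names), the principal components and their variances can be obtained from the EVD of the estimated covariance matrix:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data matrix X: each row is an observed signal,
# each column is one sample, as in the text.
n, T = 3, 1000
X = rng.standard_normal((n, T))

# Remove the mean of each row so the data has zero mean.
X = X - X.mean(axis=1, keepdims=True)

# Estimated covariance matrix C_x = E{x x^T}, averaged over samples.
C_x = (X @ X.T) / T

# EVD of the real symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C_x)

# Order by decreasing eigenvalue: the first principal component
# is the direction of largest variance.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Projections of the data onto the principal directions; their
# sample variances equal the corresponding eigenvalues.
Y = eigvecs.T @ X
assert np.allclose(Y.var(axis=1), eigvals)
\end{verbatim}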
Whitening observed data $\mathbf{x}$, that is, making the rows of $\mathbf{X}$ uncorrelated and their variances equal to unity, often makes further processing easier. Whitening can be expressed as a linear transformation $\mathbf{z} = \mathbf{V}\mathbf{x}$, and one way to find such a transformation is based on the EVD of the estimated covariance matrix of the data. Again, assuming the data has zero mean, this equals:
\[
\mathbf{C}_x = E\{\mathbf{x}\mathbf{x}^T\} = \mathbf{E}\mathbf{D}\mathbf{E}^T,
\]
where $\mathbf{E} = [\mathbf{v}_1 \cdots \mathbf{v}_n]$ is the orthogonal matrix of eigenvectors and $\mathbf{D} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$. The following whitening transformation can be constructed from the eigenvalues and eigenvectors of the decomposition:
\[
\mathbf{z} = \mathbf{V}\mathbf{x} = \mathbf{D}^{-1/2}\mathbf{E}^T\mathbf{x}.
\]
What makes this transformed model useful is that, for the linear mixing model $\mathbf{x} = \mathbf{A}\mathbf{s}$, the new mixing matrix $\tilde{\mathbf{A}} = \mathbf{V}\mathbf{A}$ of the whitened data $\mathbf{z} = \tilde{\mathbf{A}}\mathbf{s}$ is orthonormal, which reduces the number of free parameters. This is evident from (assuming unit-variance independent sources, $E\{\mathbf{s}\mathbf{s}^T\} = \mathbf{I}$):
\[
\mathbf{I} = E\{\mathbf{z}\mathbf{z}^T\} = \tilde{\mathbf{A}}\, E\{\mathbf{s}\mathbf{s}^T\}\, \tilde{\mathbf{A}}^T = \tilde{\mathbf{A}}\tilde{\mathbf{A}}^T.
\]
Additionally, using the EVD of the covariance matrix to form the whitening transformation allows the number of free parameters to be reduced in another way as well. By selecting only a subset of the eigenvalues, and the corresponding eigenvectors, for the transformation, the dimension of the whitened data can be lowered effectively. The dimension reduction also reduces noise, that is, improves the signal-to-noise ratio. Additive noise, for example Gaussian noise, often appears spread over several directions, each with a small energy. If the selected eigenvalues are the largest ones, which identify the directions with the highest variances, the dimension reduction is optimal in the sense that it retains as much of the original signal power as possible.
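A minimal sketch of EVD-based whitening with optional dimension reduction, under the same assumptions as above (zero-mean data with signals on the rows; the function and variable names are illustrative only):

\begin{verbatim}
import numpy as np

def whiten(X, num_components=None):
    """Whiten zero-mean data X (signals on rows, samples on columns)
    using the EVD of the estimated covariance matrix. If num_components
    is given, only the largest eigenvalues and their eigenvectors are
    kept, which lowers the dimension of the whitened data."""
    T = X.shape[1]
    C_x = (X @ X.T) / T                        # estimated covariance
    eigvals, E = np.linalg.eigh(C_x)           # EVD, ascending order
    order = np.argsort(eigvals)[::-1]          # sort descending
    eigvals, E = eigvals[order], E[:, order]
    if num_components is not None:             # keep strongest directions
        eigvals, E = eigvals[:num_components], E[:, :num_components]
    V = np.diag(1.0 / np.sqrt(eigvals)) @ E.T  # whitening matrix D^(-1/2) E^T
    Z = V @ X
    return Z, V

# Usage: the whitened rows are uncorrelated with unit variance.
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 2000))
X -= X.mean(axis=1, keepdims=True)
Z, V = whiten(X, num_components=3)
assert np.allclose((Z @ Z.T) / Z.shape[1], np.eye(3), atol=1e-10)
\end{verbatim}

Keeping only the largest eigenvalues in the sketch corresponds to the optimal dimension reduction described above.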
The orthogonal principal components identified with PCA, and therefore also the whitened data vectors, are uncorrelated, which means that the covariance between any two distinct components $y_i$ and $y_j$ is zero:
\[
\mathrm{cov}(y_i, y_j) = E\{y_i y_j\} - E\{y_i\} E\{y_j\} = 0, \qquad i \neq j.
\]
Uncorrelated vectors are sometimes, confusingly, called linearly independent. However, true statistical independence means that the joint probability density function (pdf) $p(y_1, \ldots, y_n)$ is factorisable into its marginal probability densities as:
\[
p(y_1, \ldots, y_n) = p_1(y_1)\, p_2(y_2) \cdots p_n(y_n).
\]
This means that the marginal distributions contain no information on each other whatsoever. It also means that statistical independence is a stronger restriction than uncorrelatedness, since the following holds for arbitrary functions $g_1$ and $g_2$ of independent variables $y_i$ and $y_j$:
\[
E\{g_1(y_i)\, g_2(y_j)\} = E\{g_1(y_i)\}\, E\{g_2(y_j)\}, \qquad i \neq j.
\]
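The difference can be illustrated with a small synthetic example (the particular variables below are chosen only for this sketch): a variable and its centered square are uncorrelated, yet they clearly fail the condition above for nonlinear functions.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)

# y1 is symmetric around zero, y2 depends deterministically on y1.
y1 = rng.standard_normal(100_000)
y2 = y1**2 - 1.0                      # zero mean by construction

# Covariance E{y1*y2} - E{y1}E{y2} is (close to) zero: uncorrelated.
print(np.mean(y1 * y2) - np.mean(y1) * np.mean(y2))   # ~0

# But for g1(u) = g2(u) = u^2 the factorization fails: dependent.
lhs = np.mean(y1**2 * y2**2)
rhs = np.mean(y1**2) * np.mean(y2**2)
print(lhs, rhs)                        # clearly different
\end{verbatim}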
Additionally, as directly seen from Equations A.9 and A.11, independence implies uncorrelatedness. This is actually the reason why whitening the data before ICA does not alter the independent components. The following introduces the key concepts in approximating statistical independence; for more information, see Cardoso (2003) and Hyvärinen and Oja (2000).
Since a closed-form solution would require exact determination of the pdfs, which is generally not possible, the independence has to be approximated. A good starting point is the entropy $H$ of a discrete random variable $Y$, defined as:
\[
H(Y) = -\sum_i P(Y = a_i) \log P(Y = a_i),
\]
where the $a_i$ are the possible values of $Y$.
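For example, the entropy of a discrete distribution can be computed directly from its probabilities, as in the following small sketch (the distributions are made up for illustration):

\begin{verbatim}
import numpy as np

def entropy(p):
    """Entropy H = -sum p_i log p_i of a discrete distribution,
    in nats; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))    # log 2 ~ 0.693, maximal for two outcomes
print(entropy([0.9, 0.1]))    # smaller: the outcome is more predictable
print(entropy([1.0, 0.0]))    # 0: no uncertainty at all
\end{verbatim}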
Mutual information measures how much information is shared among random variables. The mutual information $I$ between $n$ random variables $y_1, \ldots, y_n$ is defined using entropy as:
\[
I(y_1, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(\mathbf{y}),
\]
where $H(\mathbf{y})$ is the joint entropy of the vector $\mathbf{y} = (y_1, \ldots, y_n)^T$.
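A small sketch of this definition for two discrete variables (the joint probability table is invented for illustration, and the entropy function from the previous sketch is repeated so the snippet is self-contained):

\begin{verbatim}
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Joint probability table of two binary variables y1 (rows), y2 (columns).
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])

# Mutual information I = H(y1) + H(y2) - H(y1, y2).
I = entropy(P.sum(axis=1)) + entropy(P.sum(axis=0)) - entropy(P)
print(I)   # > 0: the variables share information

# For an independent pair the same formula gives (numerically) zero.
P_indep = np.outer(P.sum(axis=1), P.sum(axis=0))
print(entropy(P_indep.sum(axis=1)) + entropy(P_indep.sum(axis=0))
      - entropy(P_indep))   # ~0
\end{verbatim}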
Clearly, the mutual information between independent variables is zero. Thus, the independent components can be estimated by minimizing the mutual information between them. However, doing this directly in practice is computationally very demanding.
Gaussian variables have the largest entropy among all random variables of equal variance. Negentropy $J$ is defined as:
\[
J(y) = H(y_{\mathrm{gauss}}) - H(y),
\]
where $y_{\mathrm{gauss}}$ is a Gaussian random variable with the same variance as $y$. Negentropy is therefore always non-negative and zero only for a Gaussian variable.
Therefore, when the estimated components are constrained to be uncorrelated, maximizing their negentropy corresponds to minimizing their mutual information. Although negentropy is computationally simpler than mutual information, it is still based on the exact pdfs. To make the estimation feasible in practice, the requirement of knowing the exact pdfs should be eliminated. Fortunately, it turns out that negentropy can be approximated without knowing the exact pdfs.
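One approximation of this type, of the form used by Hyvärinen and Oja (2000), replaces the exact pdf with the expectation of a nonquadratic function $G$, $J(y) \approx \left(E\{G(y)\} - E\{G(\nu)\}\right)^2$, where $\nu$ is a standard Gaussian variable. The sketch below estimates both expectations by simple sample averages; the function names, sample sizes and test distributions are illustrative assumptions only.

\begin{verbatim}
import numpy as np

def negentropy_approx(y, rng=None):
    """Approximate J(y) ~ (E{G(y)} - E{G(nu)})^2 for a zero-mean,
    unit-variance sample y, with G(u) = log cosh(u) and nu a standard
    Gaussian reference (estimated here by Monte Carlo)."""
    rng = np.random.default_rng(3) if rng is None else rng
    G = lambda u: np.log(np.cosh(u))
    nu = rng.standard_normal(y.size)
    return (np.mean(G(y)) - np.mean(G(nu))) ** 2

rng = np.random.default_rng(4)
gauss = rng.standard_normal(100_000)
laplace = rng.laplace(size=100_000) / np.sqrt(2.0)   # unit variance
print(negentropy_approx(gauss))     # ~0: Gaussian has zero negentropy
print(negentropy_approx(laplace))   # > 0: non-Gaussian
\end{verbatim}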
Since negentropy essentially measures the difference between the Gaussian distribution and the distribution of the independent variables, it can be approximated by estimating the non-Gaussianity of the random variables directly. Classical measures of non-Gaussianity are skewness and kurtosis, or the third- and fourth-order cumulants. The kurtosis of $y$ is defined as:
\[
\mathrm{kurt}(y) = E\{y^4\} - 3\left(E\{y^2\}\right)^2.
\]
Actually, as $y$ is assumed to have unit variance, this simplifies to $\mathrm{kurt}(y) = E\{y^4\} - 3$. Kurtosis is zero for a Gaussian random variable, negative for a sub-Gaussian one and positive for a super-Gaussian one. Thus, the absolute value of kurtosis can be used as a measure of non-Gaussianity. Maximizing the absolute value of kurtosis, which is roughly equivalent to maximizing independence, is computationally very efficient, making ICA feasible in practice.
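A minimal sketch of this measure for unit-variance samples (the sub- and super-Gaussian test distributions are generated only for illustration):

\begin{verbatim}
import numpy as np

def kurtosis(y):
    """kurt(y) = E{y^4} - 3 for a zero-mean, unit-variance sample y."""
    return np.mean(y**4) - 3.0

rng = np.random.default_rng(5)
N = 100_000
gauss = rng.standard_normal(N)
uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), N)   # unit variance, sub-Gaussian
laplace = rng.laplace(size=N) / np.sqrt(2.0)        # unit variance, super-Gaussian

print(kurtosis(gauss))     # ~ 0
print(kurtosis(uniform))   # ~ -1.2 (negative: sub-Gaussian)
print(kurtosis(laplace))   # ~ +3   (positive: super-Gaussian)
\end{verbatim}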
A precise estimate of non-Gaussianity can be constructed as a weighted sum of both skewness and kurtosis, but in practice often only one of them is used. The possibility of using non-Gaussianity as a measure of independence is not so surprising when considering the central limit theorem. Basically, it states that the distribution of a sum of independent, identically distributed random variables tends to be more Gaussian than the distributions of the original variables. Therefore, a linear mixture of independent components is generally more Gaussian than the individual components, and the most non-Gaussian directions in the data correspond to the independent components themselves. Incidentally, this link between independence and non-Gaussianity is the reason why at most one of the components in ICA can be Gaussian.
One of the most widely used implementations of ICA is called FastICA. It uses a Newton-iteration-based fixed-point optimization scheme and an objective function based on negentropy. FastICA can search for the independent components one at a time, in deflation mode, or symmetrically, all at once. Its performance can also be tuned somewhat by choosing the nonlinearity used in the estimation, for example the cubic or the hyperbolic tangent. For more information on the FastICA algorithm and its derivation, consider reading Hyvärinen and Oja (1997, 2000). For the actual implementation, see FastICA (1998).
The independent components are estimated one by one, and the iteration simply prevents the estimate of the $p$th component from converging to the same solution as any of the previous $p-1$ components. The basic FastICA iteration takes the following form in deflation mode:

1. Initialize the weight vector $\mathbf{w}_p$, for example randomly.
2. Let $\mathbf{w}_p \leftarrow E\{\mathbf{z}\, g(\mathbf{w}_p^T \mathbf{z})\} - E\{g'(\mathbf{w}_p^T \mathbf{z})\}\, \mathbf{w}_p$.
3. Orthogonalize against the previously estimated vectors: $\mathbf{w}_p \leftarrow \mathbf{w}_p - \sum_{j=1}^{p-1} (\mathbf{w}_p^T \mathbf{w}_j)\, \mathbf{w}_j$.
4. Normalize: $\mathbf{w}_p \leftarrow \mathbf{w}_p / \|\mathbf{w}_p\|$.
5. If not converged, go back to step 2.

Since the signs of the independent components are not fixed, convergence means that the old and new values of $\mathbf{w}_p$ are parallel. The function $g$, and its derivative $g'$, in step 2 is the selectable nonlinearity, and using the cubic actually corresponds to estimating the kurtosis.
In deflation mode, the previously estimated components are privileged in step 3, that is, only the currently estimated weight vector is affected, and the estimation order matters. For example, estimation error accumulates in the last components. In symmetric mode, the order in which the components are estimated has no effect. This is achieved by estimating all the vectors $\mathbf{w}_p$, that is, the whole matrix $\mathbf{W}$, first, and orthogonalizing the matrix only after that as an additional step.
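The following is a minimal sketch of the one-unit deflation iteration described above, using the tanh nonlinearity and assuming already whitened, zero-mean data (signals on rows); the function and variable names are illustrative, and the sketch omits the refinements of the actual FastICA implementation.

\begin{verbatim}
import numpy as np

def fastica_deflation(Z, num_components, max_iter=200, tol=1e-6, seed=0):
    """Estimate num_components independent components from whitened
    data Z (components on rows, samples on columns) with one-unit
    FastICA iterations and deflationary orthogonalization."""
    rng = np.random.default_rng(seed)
    n, T = Z.shape
    W = np.zeros((num_components, n))

    g  = np.tanh                          # selectable nonlinearity
    dg = lambda u: 1.0 - np.tanh(u)**2    # its derivative

    for p in range(num_components):
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            u = w @ Z                                            # projections w^T z
            w_new = (Z * g(u)).mean(axis=1) - dg(u).mean() * w   # step 2
            w_new -= W[:p].T @ (W[:p] @ w_new)                   # step 3: deflation
            w_new /= np.linalg.norm(w_new)                       # step 4
            # Convergence: old and new w are parallel (sign is arbitrary).
            if abs(abs(w_new @ w) - 1.0) < tol:
                w = w_new
                break
            w = w_new
        W[p] = w
    return W, W @ Z   # unmixing vectors and estimated components
\end{verbatim}

A typical use would combine this with the whitening sketch given earlier, for example W, S_est = fastica_deflation(Z, num_components=3), where the rows of S_est are the estimated independent components.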