A. Mathematical Concepts


A.1 Principal Component Analysis

Principal component analysis (PCA) is a powerful statistical method for multivariate data analysis. It is efficient and simple to apply for noise reduction, feature selection, compression, decorrelation and whitening, and it appears in virtually all areas of signal processing. Consider reading Jolliffe (2002) and Haykin (1998) for more information on principal component analysis and algorithmic implementations.


A.1.1 Eigenvalue Decomposition

The eigenvalues $ \lambda$ and corresponding eigenvectors $ \mathbf{v}$ of an $ n \times n$ square matrix $ \mathbf{M}$ are defined as the solutions of:

$\displaystyle \mathbf{M}\mathbf{v}_i = \lambda_i\mathbf{v}_i \:\:\textrm{,}$ (A.1)


where $ i = 1, \ldots, n$ . The rank of matrix $ \mathbf{M}$ defines how many non-zero eigenvalues there are. Another way of expressing the same solutions is the eigenvalue decomposition (EVD):

$\displaystyle \mathbf{M} = \mathbf{V}\mathbf{D}\mathbf{V}^{-1} \:\:\textrm{,}$ (A.2)


where the columns of matrix $ \mathbf{V}$ contain the eigenvectors and $ \mathbf{D}$ is a diagonal matrix of the corresponding eigenvalues. Replacing a matrix with its EVD is very practical, since many mathematical operations are simpler and faster to perform on the decomposition. Additionally, the structure of the matrix is easy to study from its EVD.
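As an illustration, the decomposition and reconstruction of Equation A.2 can be computed numerically. The short sketch below uses NumPy, which is an assumed tool choice and not part of the text, and the matrix shown is purely an example.

import numpy as np

# Example square matrix (illustrative values only)
M = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# Eigenvalue decomposition: the columns of V are the eigenvectors v_i,
# D is the diagonal matrix of the corresponding eigenvalues lambda_i
eigvals, V = np.linalg.eig(M)
D = np.diag(eigvals)

# Reconstruct M = V D V^{-1} as in Equation A.2
M_reconstructed = V @ D @ np.linalg.inv(V)
assert np.allclose(M, M_reconstructed)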


A.1.2 Principal Components

The principal components of multivariate data $ \mathbf{\tilde{X}}$ are defined as the orthonormal basis vectors along which the variance of the data is maximal. Each row of the data matrix $ \mathbf{\tilde{X}}$ contains a different observation, with samples on the columns. One way of finding the principal components is to use the EVD of the estimated covariance matrix of the data. First of all, the covariance matrix $ E\{\mathbf{\tilde{x}}{\mathbf{\tilde{x}}}^T\}$ , where $ \mathbf{\tilde{x}}$ is a column of $ \mathbf{\tilde{X}}$ and the expectation operator $ E\{\cdot\}$ is in practice estimated by averaging over the samples, is a real symmetric matrix. Therefore, the following must hold for the EVD of a covariance matrix:

$\displaystyle \mathbf{M}^T = {\mathbf{V}^{-1}}^T\mathbf{D}^T\mathbf{V}^T = \mathbf{V}\mathbf{D}\mathbf{V}^{-1} = \mathbf{M} \:\:\textrm{.}$ (A.3)

Clearly, this is only possible when $ \mathbf{V}^{-1} = \mathbf{V}^T$ , thus the eigenvectors must be orthogonal. Additionally, assuming the data has zero mean and the eigenvectors are normalized, that is, $ E\{\mathbf{\tilde{x}}\} = 0$ and $ \vert\vert\mathbf{v}\vert\vert = 1$ , the following holds for the data projected along a principal direction $ {\mathbf{v}_i}^T\mathbf{\tilde{x}}$ :

\begin{displaymath}\begin{array}{l} E\{{\mathbf{v}_i}^T\mathbf{\tilde{x}}\} = {\mathbf{v}_i}^T E\{\mathbf{\tilde{x}}\} = 0 \\ E\{({\mathbf{v}_i}^T\mathbf{\tilde{x}})^2\} = {\mathbf{v}_i}^T E\{\mathbf{\tilde{x}}{\mathbf{\tilde{x}}}^T\}\mathbf{v}_i = \lambda_i \:\:\textrm{.} \end{array}\end{displaymath} (A.4)

Thus, the eigenvalues corresponding to the orthonormal eigenvectors equal the variances of the data projected along the eigenvector directions. Therefore, decomposing the covariance matrix with EVD gives the principal components of the data. The usual convention is to make the one with the largest eigenvalue, or variance, the first principal component and so on.
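A minimal sketch of this procedure, assuming NumPy and synthetic zero-mean data (both illustrative choices, not from the text), could look as follows; the sample variances along the eigenvector directions match the eigenvalues.

import numpy as np

rng = np.random.default_rng(0)
X_tilde = rng.standard_normal((3, 1000))          # rows = observations
X_tilde -= X_tilde.mean(axis=1, keepdims=True)    # remove the mean

# Estimated covariance matrix E{x x^T}
C = X_tilde @ X_tilde.T / X_tilde.shape[1]

# EVD of a real symmetric matrix; sort by decreasing eigenvalue so that
# the first principal component carries the largest variance
eigvals, V = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

# The variances of the projected data equal the eigenvalues
projections = V.T @ X_tilde
print(projections.var(axis=1))    # close to eigvals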


A.1.3 Whitening

Whitening observed data $ \mathbf{\tilde{X}}$ , that is, making the rows uncorrelated and their variances equal to unity, often makes further processing easier. Whitening can be expressed as a linear transformation and one way to find such a transformation is based on the EVD of the estimated covariance matrix of the data. Again, assuming the data has zero mean, this equals:

$\displaystyle E\{\mathbf{\tilde{x}}{\mathbf{\tilde{x}}}^T\} = \mathbf{V}\mathbf{D}\mathbf{V}^T \:\:\textrm{.}$ (A.5)

The following whitening transformation can be constructed from the eigenvalues and eigenvectors of the decomposition:

$\displaystyle \mathbf{X} = \mathbf{D}^{-1/2}\mathbf{V}^T\mathbf{\tilde{X}} \:\:\textrm{,}$ (A.6)


where the operations on the diagonal matrix $ \mathbf{D}$ can be calculated simply component-wise. Using the columns $ \mathbf{x}$ of the whitened data matrix $ \mathbf{X}$ , it is easy to check directly from Equations A.5 and A.6 that $ E\{\mathbf{x}{\mathbf{x}}^T\} = \mathbf{I}$ . For the subsequent ICA, this means that the whitening effectively transforms the mixing matrix into a new one. When the original ICA model is defined as $ \mathbf{\tilde{X}} = \mathbf{\tilde{A}}\mathbf{S}$ , it can be seen from Equations 3.1 and A.6 that:

$\displaystyle \mathbf{X} = \mathbf{A}\mathbf{S} = \mathbf{D}^{-1/2}\mathbf{V}^T\mathbf{\tilde{A}}\mathbf{S} \:\:\textrm{.}$ (A.7)

What makes this transformed model useful is that the new mixing matrix $ \mathbf{A}$ is orthonormal, which reduces the number of free parameters. This is evident from:

$\displaystyle E\{\mathbf{x}\mathbf{x}^T\} = \mathbf{A}E\{\mathbf{s}\mathbf{s}^T\}\mathbf{A}^T = \mathbf{A}\mathbf{A}^T = \mathbf{I} \:\:\textrm{,}$ (A.8)


where the independent components $ \mathbf{S}$ are also assumed white. This is not a very restrictive assumption, since they can later be transformed with the inverse of the whitening transformation anyway. Moreover, as shown later, independence implies uncorrelation. Therefore, assuming that $ E\{\mathbf{s}\mathbf{s}^T\} = \mathbf{I}$ is merely a matter of scaling, which causes no problems, since the scaling in ICA is ambiguous anyway.
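The whitening transformation of Equation A.6 is straightforward to implement once the covariance EVD is available. The sketch below is illustrative only: the data is synthetic and the variable names are assumptions, not part of the text.

import numpy as np

rng = np.random.default_rng(1)
X_tilde = rng.standard_normal((3, 1000))
X_tilde -= X_tilde.mean(axis=1, keepdims=True)    # zero-mean data

C = X_tilde @ X_tilde.T / X_tilde.shape[1]        # estimated covariance
eigvals, V = np.linalg.eigh(C)

# Whitening matrix D^{-1/2} V^T; the power acts component-wise on the diagonal
W_white = np.diag(eigvals ** -0.5) @ V.T
X = W_white @ X_tilde

# The whitened rows are uncorrelated with unit variance: E{x x^T} = I
assert np.allclose(X @ X.T / X.shape[1], np.eye(3), atol=1e-8)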


A.1.4 Reducing Dimension

Additionally, using the EVD of the covariance matrix to form the whitening transformation allows the number of free parameters to be reduced in yet another way. By selecting only a subset $ k \le n$ of the eigenvalues, and the corresponding eigenvectors, for the transformation, the dimension of the whitened data can be lowered. The dimension reduction also reduces noise, that is, improves the signal-to-noise ratio, because additive noise, for example Gaussian noise, is often spread over several directions with small energy. If the selected eigenvalues are the largest ones, which identify the directions with the highest variances, the dimension reduction is optimal in the sense that it retains as much of the original signal power as possible.
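For example, a reduced whitening transformation that keeps only the $ k$ largest eigenvalues might be sketched as follows, again with assumed synthetic data and illustrative names:

import numpy as np

rng = np.random.default_rng(2)
X_tilde = rng.standard_normal((5, 1000))
X_tilde -= X_tilde.mean(axis=1, keepdims=True)

C = X_tilde @ X_tilde.T / X_tilde.shape[1]
eigvals, V = np.linalg.eigh(C)

# Keep the k directions with the largest variances
k = 3
idx = np.argsort(eigvals)[::-1][:k]
D_k, V_k = eigvals[idx], V[:, idx]

# Reduced whitening: the output has only k rows
X = np.diag(D_k ** -0.5) @ V_k.T @ X_tilde
print(X.shape)    # (3, 1000)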


A.2 Estimating Independence

The orthogonal principal components identified with PCA and, therefore, also the whitened data vectors are uncorrelated, which means that their covariance is zero:

$\displaystyle E\{v_i v_j\} - E\{v_i\}E\{v_j\} = 0 \:\:\textrm{.}$ (A.9)

Uncorrelated vectors are sometimes, confusingly, called linearly independent. However, true statistical independence means that the joint probability density function (pdf) $ p(\mathbf{s})$ factorizes into its marginal probability densities as:

$\displaystyle p(\mathbf{s}) = p(\mathbf{s}_1, \ldots, \mathbf{s}_k) = \prod_{i=1}^k p(\mathbf{s}_i) \:\:\textrm{.}$ (A.10)

This means that the marginal distributions contain no information on each other whatsoever. It also means that statistical independence is a stronger restriction than uncorrelation, since the following holds for arbitrary functions $ f_i(\cdot)$ and $ f_j(\cdot)$ :

$\displaystyle E\{f_i(\mathbf{s}_i)f_j(\mathbf{s}_j)\} = E\{f_i(\mathbf{s}_i)\}E\{f_j(\mathbf{s}_j)\} \:\:\textrm{.}$ (A.11)

Additionally, as directly seen from Equations A.9 and A.11, independence implies uncorrelation. This is actually the reason why whitening the data before ICA will not alter the independent components. The following introduces the key concepts in approximating statistical independence, but for more information consider reading Cardoso (2003), Hyvärinen and Oja (2000).

Since a closed form solution would require exact determination of the pdfs, which is generally not possible, the independence has to be approximated. A good starting point is the entropy $ H$ of a discrete random variable $ \mathbf{s}$ , defined as:

$\displaystyle H(\mathbf{s}) = - \sum_i p(\mathbf{s}=a_i) \log p(\mathbf{s}=a_i) \:\:\textrm{,}$ (A.12)


where the $ a_i$ are all the possible values of $ \mathbf{s}$ . By definition, entropy measures the amount of information contained in the random variable $ \mathbf{s}$ . For continuous random variables, the corresponding quantity is the differential entropy, where the sum is replaced by an integral.
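As a small illustration, the entropy of Equation A.12 can be estimated from observed value frequencies; the estimator below and its names are an assumption for illustration only.

import numpy as np

def entropy(samples):
    # H(s) = - sum_i p(s = a_i) log p(s = a_i), with p estimated from counts
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

# Four roughly equiprobable values: entropy close to log 4
s = np.random.default_rng(3).integers(0, 4, size=10000)
print(entropy(s), np.log(4))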


A.2.1 Mutual Information

Mutual information measures how much information is shared among random variables. The mutual information $ I$ between $ k$ random variables is defined using entropy as:

$\displaystyle I(\mathbf{s}_1, \ldots, \mathbf{s}_k) = \sum_{i=1}^k H(\mathbf{s}_i)-H(\mathbf{s}) \:\:\textrm{.}$ (A.13)

Clearly, the mutual information between independent variables is zero. Thus, the independent components can be estimated by minimizing the mutual information between them. However, doing this directly is computationally very demanding in practice.
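To make Equation A.13 concrete, the following sketch estimates the mutual information of two continuous variables by discretizing them with histograms; this crude estimator and the example data are assumptions for illustration, not a method from the text.

import numpy as np

def entropy_from_counts(counts):
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(4)
s1 = rng.standard_normal(100_000)
s2 = 0.8 * s1 + 0.6 * rng.standard_normal(100_000)   # dependent on s1

# Approximate the marginal and joint pdfs with histograms
c1, _ = np.histogram(s1, bins=30)
c2, _ = np.histogram(s2, bins=30)
c12, _, _ = np.histogram2d(s1, s2, bins=30)

# I(s1, s2) = H(s1) + H(s2) - H(s1, s2); clearly positive here,
# and close to zero for independent variables
I = entropy_from_counts(c1) + entropy_from_counts(c2) - entropy_from_counts(c12)
print(I)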


A.2.2 Negentropy

Gaussian variables have the largest entropy among all random variables of equal variance. Negentropy $ J$ is defined as:

$\displaystyle J(\mathbf{s}) = H(\mathbf{s}_{gauss}) - H(\mathbf{s}) \:\:\textrm{,}$ (A.14)


where $ \mathbf{s}_{gauss}$ is a Gaussian random variable with the same covariance as $ \mathbf{s}$ . The important point is that, since $ \mathbf{s}$ is assumed white, the mutual information can be written in terms of the negentropies up to an additive constant:

$\displaystyle I(\mathbf{s}_1, \ldots, \mathbf{s}_k) = C - \sum_{i=1}^k J(\mathbf{s}_i) \:\:\textrm{.}$ (A.15)

Therefore, maximizing negentropy is equivalent to minimizing mutual information when estimating independence. Although negentropy is computationally simpler than mutual information, its definition still relies on the exact pdfs. Fortunately, it turns out that negentropy can be approximated without knowing them, which makes the estimation feasible in practice.


A.2.3 Non-Gaussianity

Since negentropy is essentially measuring the difference between the Gaussian distribution and that of the independent variables, it can be approximated by estimating the non-Gaussianity of the random variables directly. Classical measures of non-Gaussianity are skewness and kurtosis, or the third and fourth order cumulants. The kurtosis of $ s$ is defined as:

$\displaystyle kurt(s) = E\{s^4\}-3{E\{s^2\}}^2 \:\:\textrm{.}$ (A.16)

Actually, as $ s$ is assumed to have unit variance, this simplifies to $ E\{s^4\}-3$ . Kurtosis is zero for a Gaussian random variable, negative for a sub-Gaussian one and positive for a super-Gaussian one. Thus, the absolute value of kurtosis can be used as a measure of non-Gaussianity. Maximizing the absolute value of kurtosis, which is roughly equivalent to maximizing independence, is computationally very efficient, making ICA feasible in practice.
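A minimal sketch of this estimate, with example distributions chosen only to illustrate the sign of the kurtosis, could be:

import numpy as np

def kurt(s):
    s = (s - s.mean()) / s.std()      # enforce zero mean, unit variance
    return np.mean(s ** 4) - 3.0      # E{s^4} - 3 as in Equation A.16

rng = np.random.default_rng(5)
print(kurt(rng.standard_normal(100_000)))     # ~0, Gaussian
print(kurt(rng.uniform(-1, 1, 100_000)))      # < 0, sub-Gaussian
print(kurt(rng.laplace(size=100_000)))        # > 0, super-Gaussian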

A more refined estimate of non-Gaussianity can be constructed as a weighted sum of skewness and kurtosis, but in practice often only one of them is used. The possibility of using non-Gaussianity as a measure of independence is not so surprising when considering the central limit theorem. Basically, it states that the distribution of a sum of independent random variables is closer to Gaussian than the distributions of the original variables. Therefore, the more non-Gaussian a linear combination of the mixtures is, the closer it is to one of the original independent components. Incidentally, this link between independence and non-Gaussianity is also the reason why at most one of the components in ICA can be Gaussian.


A.3 Fast Fixed-Point Iteration

One of the most widely used implementations of ICA is called FastICA. It uses a fixed-point optimization scheme based on Newton iteration and an objective function based on negentropy. FastICA can search for the independent components either one at a time, in deflation mode, or all at once, in symmetric mode. Its performance can also be tuned somewhat by choosing the nonlinearity used in the estimation, for example the cubic or the tanh function. For more information on the FastICA algorithm and its derivation, consider reading Hyvärinen and Oja (1997, 2000). For the actual implementation, see FastICA (1998).


A.3.1 Deflation Mode

The independent components are estimated one by one and the iteration simply prevents the estimation of the $ i$ th component from converging to the same solution as any of the previous $ 1, \ldots, i-1$ components. The basic FastICA iteration takes the following form in deflation mode:

  1. Begin with an initial (e.g. random) weight vector $ \mathbf{w}_i$ .
  2. Let $ {\mathbf{w}_i}^+ = E\{\mathbf{x}g({\mathbf{w}_i}^T\mathbf{x})\}-E\{g'({\mathbf{w}_i}^T\mathbf{x})\}\mathbf{w}_i$
  3. Let $ {\mathbf{w}_i}^+ = {\mathbf{w}_i}^+ - \sum_{j=1}^{i-1} {{\mathbf{w}_i}^+}^T\mathbf{w}_j\mathbf{w}_j$
  4. Let $ \mathbf{w}_i = \mathbf{w}_i^+/\vert\vert\mathbf{w}_i^+\vert\vert$
  5. Iterate steps 2. - 4. until convergence.

Since the signs of the independent components are not fixed, convergence means that the old and new values of $ \mathbf{w}$ are parallel. The function $ g(\cdot)$ , and its derivative $ g'(\cdot)$ , in step 2. is the selectable nonlinearity and using the cubic actually corresponds to estimating the kurtosis.
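The steps above translate fairly directly into code. The sketch below is an illustrative re-implementation with the tanh nonlinearity, assuming the rows of $ \mathbf{X}$ are already whitened; it is not the reference FastICA implementation and the function and variable names are assumptions.

import numpy as np

def fastica_deflation(X, n_components, max_iter=200, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = np.zeros((n_components, n))
    for i in range(n_components):
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)                    # step 1: random unit vector
        for _ in range(max_iter):
            wx = w @ X
            g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
            w_new = (X * g).mean(axis=1) - g_prime.mean() * w   # step 2
            w_new -= W[:i].T @ (W[:i] @ w_new)                  # step 3
            w_new /= np.linalg.norm(w_new)                      # step 4
            if abs(abs(w_new @ w) - 1.0) < tol:   # converged when parallel
                w = w_new
                break
            w = w_new
        W[i] = w
    return W          # estimated components: S_est = W @ X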


A.3.2 Symmetric Mode

In deflation mode, the previously estimated components are privileged in step 3, that is, only the currently estimated weight vector is affected, and the estimation order matters. For example, estimation error accumulates in the last components. In symmetric mode, the order in which the components are estimated has no effect. This is achieved by first estimating all the vectors $ \mathbf{w}$ , that is, the whole matrix $ \mathbf{W}$ , and orthogonalizing the matrix only after that as an additional step.
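One common way to perform this orthogonalization step is the symmetric decorrelation $ \mathbf{W} \leftarrow (\mathbf{W}\mathbf{W}^T)^{-1/2}\mathbf{W}$ . The sketch below assumes this particular choice, which is not spelled out in the text, and uses the EVD of Section A.1.1 to compute the inverse square root.

import numpy as np

def symmetric_decorrelation(W):
    # Replace W by (W W^T)^{-1/2} W so that its rows become orthonormal
    d, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(d ** -0.5) @ E.T @ W

# After this step, W @ W.T is (numerically) the identity matrix
W = np.random.default_rng(6).standard_normal((3, 3))
W_orth = symmetric_decorrelation(W)
assert np.allclose(W_orth @ W_orth.T, np.eye(3), atol=1e-6)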