
3. Independent Component Analysis


3.1 Motivation

Imagine a room with many sound sources, for example, multiple people speaking simultaneously and perhaps some music in the background. The sounds in the room are recorded using multiple microphones. These recorded signals are weighted sums of the original signals emitted from the different sound sources. The mixing of the signals depends on, for example, the location of the microphones or the acoustic properties of the environment. It would be very useful to be able to estimate the original source signals from the observed mixed signals alone. As an illustration, consider the source signals in Figure 3.1(a) and the microphone recordings in Figure 3.1(b). These are not very realistic sound signals, but should illustrate the problem clearly. It may seem impossible to estimate the source signals from the observed ones.

Figure 3.1: Illustration of source separation. (a) Each of the signals is emitted from a different source. (b) The recorded signals are differently mixed observations of the original source signals. (c) The source signals are estimated using only the observed mixed signals. The estimates correspond very closely to the true source signals, with only minor differences, such as reversed signs.
\includegraphics[width=0.24\textwidth]{images/ica_source1.eps} \includegraphics[width=0.24\textwidth]{images/ica_source2.eps} \includegraphics[width=0.24\textwidth]{images/ica_source3.eps} \includegraphics[width=0.24\textwidth]{images/ica_source4.eps}
(a)
\includegraphics[width=0.24\textwidth]{images/ica_mix1.eps} \includegraphics[width=0.24\textwidth]{images/ica_mix2.eps} \includegraphics[width=0.24\textwidth]{images/ica_mix3.eps} \includegraphics[width=0.24\textwidth]{images/ica_mix4.eps}
(b)
\includegraphics[width=0.24\textwidth]{images/ica_ic1.eps} \includegraphics[width=0.24\textwidth]{images/ica_ic2.eps} \includegraphics[width=0.24\textwidth]{images/ica_ic3.eps} \includegraphics[width=0.24\textwidth]{images/ica_ic4.eps}
(c)

The problem of recovering the original source signals using only the observed signals, with unknown mixing and minimal, if any, information on the sources, is called blind source separation (BSS) (cf. Jutten and Herault, 1991; Cardoso, 1990). If the mixing were known, the problem could be solved using classical methods. However, solving both the mixing and the original sources at the same time is considerably harder. Approaches to solving it need to make some, hopefully minimal, assumptions. For example, assuming that the sources contain significant autocorrelations allows the use of temporal decorrelation algorithms, such as SOBI (Belouchrani et al., 1993) or TDSEP (Ziehe and Müller, 1998).

Independent component analysis (ICA) (cf. Comon, 1994; Jutten and Herault, 1991) is perhaps the most widely used method for performing blind source separation, and it is implemented in many algorithms, such as FastICA (Hyvärinen, 1999) or Infomax (Bell and Sejnowski, 1995; Amari et al., 1995). It is based on the assumption that the source signals are statistically independent. This is a natural assumption in many applications and, in fact, it does not have to hold exactly in practice for ICA to work. Figure 3.1(c) shows the source signals estimated using FastICA, based only on the observed mixed signals shown in Figure 3.1(b). The estimated signals are very close to the original source signals. The minor differences, such as the reversed signs, are explained in detail later. For additional background on independent component analysis and the different algorithms, see the textbooks by Hyvärinen et al. (2001) and Stone (2004).


3.2 Mixture Model

The model used in ICA is a statistical generative model for an instantaneous linear mixture of random variables. When the mixed signals are represented as a data matrix $ \mathbf{X}$ , the mixing model can be expressed in matrix form as:

$\displaystyle \mathbf{X} = \mathbf{A}\mathbf{S} \:\:\textrm{.}$ (3.1)

Each row $ \mathbf{s}_k^T$ of the source matrix $ \mathbf{S}$ contains one independent component and each column $ \mathbf{a}_k$ of the mixing matrix $ \mathbf{A}$ holds the corresponding mixing weights, for a total of $ K$ sources. This is simply the matrix form of the linearly weighted sums $ \mathbf{x}_t^T = a_{t1}\mathbf{s}_1^T + \cdots + a_{tK}\mathbf{s}_K^T$ , where $ \mathbf{x}_t^T$ is the $ t$ th row of $ \mathbf{X}$ and $ a_{tk}$ an element of $ \mathbf{A}$ .
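To make the mixing model concrete, the following minimal sketch (in Python with NumPy; the signal shapes and dimensions are arbitrary illustrative choices, not taken from the figures) generates a few source signals and mixes them as in Equation 3.1:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
K, T, N = 3, 4, 1000          # sources, observed mixtures, samples (arbitrary)
t = np.linspace(0, 1, N)

# Rows of S are the source signals s_k^T (illustrative examples only).
S = np.vstack([
    np.sin(2 * np.pi * 5 * t),            # sinusoid
    np.sign(np.sin(2 * np.pi * 3 * t)),   # square wave
    rng.laplace(size=N),                  # super-Gaussian noise
])

# Columns a_k of A hold the mixing weights of the corresponding sources.
A = rng.normal(size=(T, K))

# Each row of X is a differently weighted sum of the sources (Equation 3.1).
X = A @ S
\end{verbatim}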

Since both $ \mathbf{A}$ and $ \mathbf{S}$ are unknown, it immediately follows that the signs and scaling of the sources cannot be identified: one can multiply the mixing vector $ \mathbf{a}_k$ by any nonzero coefficient and divide the source vector $ \mathbf{s}_k$ by the same coefficient without changing the observations. Additionally, the order of the sources is not fixed. Fortunately, these ambiguities in the model are usually not crucial. For example, the sign and scaling of the components are often simply normalized after ICA, as sketched below.
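Continuing the sketch above, one possible (but by no means the only) normalization convention is to scale each source to unit variance and fix its sign so that its largest-magnitude value is positive, absorbing the corresponding factors into the mixing matrix so that the product $ \mathbf{A}\mathbf{S}$ is unchanged:

\begin{verbatim}
# Scale each row of S to unit variance and fix its sign; compensate in A,
# so that A_norm @ S_norm equals the original A @ S.
scale = S.std(axis=1, keepdims=True)
sign = np.sign(S[np.arange(K), np.abs(S).argmax(axis=1)])[:, None]
S_norm = S / (scale * sign)
A_norm = A * (scale * sign).T
\end{verbatim}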


3.3 Estimating Independence

Theoretically, statistical independence means that the sources do not contain any information about each other. In other words, the joint probability density function (pdf) of the sources factorizes into the product of its marginal densities, $ p(\mathbf{s}_1, ..., \mathbf{s}_K) = \prod_i p(\mathbf{s}_i)$ . Since a closed-form solution to the ICA problem would require exact knowledge of the pdfs, which is generally not available, the sources have to be estimated by approximating independence with an objective function. This objective function can be based on concepts such as mutual information or negentropy (cf. Hyvärinen and Oja, 2000) and essentially measures how non-Gaussian the estimated sources are.

One way of defining the connection between independence and non-Gaussianity, as explained in Appendix A, is given by the common measures of information theory. First, since mutual information measures the amount of information shared between random variables, it is a natural measure of dependence: it is zero exactly when the variables are independent. Mutual information, in turn, is closely tied to negentropy, which essentially compares a given density to a Gaussian one. Finally, the difference from a Gaussian can also be approximated, without the exact pdfs, directly from the observations by using measures of non-Gaussianity, such as skewness and kurtosis, that is, the third- and fourth-order cumulants. For example, kurtosis is defined as:

$\displaystyle \textrm{kurt}(s) = E\{s^4\}-3\left(E\{s^2\}\right)^2 \:\:\textrm{.}$ (3.2)

Another, more intuitive, explanation is offered by the central limit theorem. Roughly, it states that the distribution of a sum of independent random variables tends to be closer to a Gaussian than the distributions of the original variables. This means that, when the estimated components are made as non-Gaussian as possible, they move away from being mixtures, that is, they become more unmixed and independent.
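As a small numerical illustration of this effect, the sketch below estimates the kurtosis of Equation 3.2 from samples (the sketch enforces the zero-mean assumption explicitly) and shows that an equal-weight sum of two independent uniform variables has a kurtosis closer to zero, that is, closer to a Gaussian, than either variable alone:

\begin{verbatim}
import numpy as np

def kurtosis(s):
    """Sample version of Equation 3.2: kurt(s) = E{s^4} - 3 (E{s^2})^2."""
    s = s - s.mean()                       # enforce zero mean
    return np.mean(s**4) - 3 * np.mean(s**2)**2

rng = np.random.default_rng(0)
s1 = rng.uniform(-1, 1, 100_000)           # sub-Gaussian source (kurtosis < 0)
s2 = rng.uniform(-1, 1, 100_000)
x = (s1 + s2) / np.sqrt(2)                 # mixture with the same variance

print(kurtosis(s1))                        # roughly -0.13
print(kurtosis(x))                         # roughly -0.07, closer to the Gaussian value 0
\end{verbatim}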

Before estimating the independent components, the observed data $ \mathbf{\tilde{X}}$ can be whitened, that is, the signals are decorrelated and their variances set to one. As explained in Appendix A, whitening is a linear transformation and can be constructed, for example, using principal component analysis (PCA) (cf. Jolliffe, 2002). The direction $ \mathbf{v}$ of the first principal component is defined as the direction that maximizes the variance of the projection $ \mathbf{v}^T\mathbf{\tilde{x}}$ , where $ \mathbf{\tilde{x}}$ is a column of $ \mathbf{\tilde{X}}$ . More generally, the $ i$ th principal component lies along the direction of highest variance that is orthogonal to the previous $ i-1$ components. One way of finding all the principal components is based on the eigenvalue decomposition (EVD) of the covariance matrix of the data, which gives two matrices: $ \mathbf{D}$ , a diagonal matrix of the eigenvalues, and $ \mathbf{V}$ , a matrix of the corresponding eigenvectors.

The directions of the principal components define the uncorrelated basis and the corresponding variances give the scaling for the whitening transformation. Whitening the data does not constrain the ICA in any way, since the scaling is ambiguous anyway and independence implies uncorrelatedness. This is because uncorrelatedness means that the covariance matrix of the whitened signals is the identity, which always holds when the joint pdf factorizes in the way shown before. The end result of the whitening is that the original mixing matrix $ \mathbf{\tilde{A}}$ is also transformed in the following way:

$\displaystyle \mathbf{A} = \mathbf{D}^{-1/2}\mathbf{V}^T\mathbf{\tilde{A}} \:\:\textrm{.}$ (3.3)

Estimating the independent components from the whitened data matrix $ \mathbf{X}$ is easier, since the number of free parameters is reduced. For example, the mixing matrix $ \mathbf{A}$ becomes orthogonal, so its inverse $ \mathbf{W}$ is simply its transpose. If the whitening is done using PCA, the degrees of freedom can also be lowered in another way: by leaving out the weakest principal components, the dimension of the data can be reduced in an optimal, energy-preserving manner, which improves the signal-to-noise ratio of the data.
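A minimal sketch of such an EVD-based whitening (assuming the rows of the input matrix are the observed signals; the optional n_components argument implements the dimension reduction mentioned above) could look as follows:

\begin{verbatim}
import numpy as np

def whiten(X_obs, n_components=None):
    """Whiten observed signals (rows of X_obs) via the EVD of their covariance."""
    Xc = X_obs - X_obs.mean(axis=1, keepdims=True)    # remove the mean of each signal
    d, V = np.linalg.eigh(np.cov(Xc))                 # eigenvalues (ascending) and eigenvectors
    order = np.argsort(d)[::-1]                       # strongest principal components first
    d, V = d[order], V[:, order]
    if n_components is not None:                      # optional dimension reduction
        d, V = d[:n_components], V[:, :n_components]
    W_white = np.diag(1.0 / np.sqrt(d)) @ V.T         # whitening matrix D^{-1/2} V^T
    return W_white @ Xc, W_white                      # whitened data and the transformation
\end{verbatim}

For instance, whitening the mixtures generated in the earlier sketch with X_white, W_white = whiten(X, n_components=3) gives signals whose covariance matrix is, up to sampling error, the identity.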

Two examples of joint probability densities are shown in Figure 3.2. One is a mixture of arbitrary non-Gaussian densities, and the other one a mixture of Gaussians. The dashed curves around the densities plot the projected variance measured in all directions. The dashed line marks the direction of maximum variance, that is, the first principal component. Similarly, the values of kurtosis are shown using solid curves and the direction of maximum kurtosis with a solid line.

Figure 3.2: Example joint probability densities. (a) For non-Gaussian densities the principal (dashed line) and independent (solid line) directions can be identified, whereas (b) for Gaussian ones all directions are equal. The corresponding dashed and solid curves show the values of variance and kurtosis in all directions, respectively.
\includegraphics[width=0.48\textwidth]{images/pdf_nongauss.eps}
(a)
\includegraphics[width=0.48\textwidth]{images/pdf_gauss.eps}
(b)

Looking at the non-Gaussian case, it is clear that ICA, based on the kurtosis, is able to identify the component directions much better than PCA. The principal components can also be identified and, as mentioned before, can be used in the whitening of the data. However, the directions of highest variance do not generally coincide with the independent components, and the principal components are further restricted by always being orthogonal to each other.

The Gaussian case, on the other hand, looks very different. Because a Gaussian density is completely defined by its mean and covariance and, after whitening, is perfectly rotationally symmetric, all directions are equal even with respect to higher-order statistics, such as kurtosis. Therefore, the results of PCA and ICA in such a case would be essentially arbitrary. This special nature of the Gaussian density also means that, if more than one of the original sources is Gaussian, those sources cannot be separated using ICA.


3.3.1 FastICA Algorithm

One particularly useful and widely used implementation of ICA is the FastICA algorithm (FastICA, 1998; Hyvärinen, 1999), which is fast and robust. It performs well with large data sets and even under somewhat noisy conditions. FastICA uses a fixed-point optimization scheme based on Newton iteration and an objective function related to negentropy. The remainder of this thesis focuses mainly on the use of FastICA, but the considerations should be relatively easy to carry over to other algorithmic implementations of ICA. FastICA can search for the independent components one at a time or all at once. Its performance can also be tuned by choosing from a range of nonlinearities. See Appendix A for the mathematical details of the fixed-point optimization.

The basic idea is to first whiten the data using PCA and then, based on the whitened data matrix $ \mathbf{X}$ , search for solutions of the form $ s=\mathbf{w}^T\mathbf{x}$ , where $ \mathbf{x}$ is a column of the whitened data matrix and $ s$ is the corresponding sample of one estimated source. Or, equivalently, in matrix form:

$\displaystyle \mathbf{S} = \mathbf{W}\mathbf{X} \:\:\textrm{,}$ (3.4)


where $ \mathbf{W}=\mathbf{A}^T$ is the demixing matrix. The algorithm optimizes an objective function that estimates the sources $ \mathbf{S}$ by approximating statistical independence. It starts from an initial condition, for example, random demixing weights $ \mathbf{w}$ . Then, on each iteration step, the weights $ \mathbf{w}$ are first updated, so that the corresponding sources become more independent, and then normalized, so that $ \mathbf{W}$ stays orthonormal. The iteration is continued until the weights converge. For example, when using the cubic nonlinearity, which corresponds to estimating kurtosis, the fixed-point update rule becomes (cf. Hyvärinen and Oja, 1997):

$\displaystyle \mathbf{w}^+ = E\{\mathbf{x}(\mathbf{w}^T\mathbf{x})^3\}-3\Vert\mathbf{w}\Vert^2\mathbf{w} \:\:\textrm{.}$ (3.5)
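To make the iteration concrete, the following sketch implements a one-unit version of FastICA with this cubic update on whitened data (rows of X are the whitened signals); the tolerance and iteration limit are arbitrary choices, and estimating several components would additionally require orthogonalizing the weight vectors against each other:

\begin{verbatim}
import numpy as np

def fastica_one_unit(X, max_iter=200, tol=1e-6, seed=None):
    """Estimate one demixing vector w from whitened data X (rows = signals)
    using the kurtosis-based fixed-point update of Equation 3.5."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[0])
    w /= np.linalg.norm(w)                        # random initial unit vector
    for _ in range(max_iter):
        wx = w @ X                                # projections w^T x for all samples
        w_new = (X * wx**3).mean(axis=1) - 3 * w  # E{x (w^T x)^3} - 3 ||w||^2 w, with ||w|| = 1
        w_new /= np.linalg.norm(w_new)            # renormalize to keep the weights orthonormal
        if abs(abs(w_new @ w) - 1.0) < tol:       # converged (up to a sign flip)
            return w_new
        w = w_new
    return w                                      # estimated source: s = w^T x
\end{verbatim}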


3.3.2 Estimation Errors

It is clear that with such estimation schemes the solutions are only approximate and noisy, and the true solution may not always be found. For example, in addition to the ambiguities in the ICA model, the strict assumption of statistically independent sources may not hold for a given data set. Moreover, the model in Equation 3.1 is noise free, whereas real data usually contain noise, which makes finding the optimal solution harder. A further problem is that the solutions can be affected by the parameters of the algorithm, for example, the initial conditions. Even the stopping criterion used to detect convergence may prevent the algorithm from finding the optimal solution. In addition, algorithms often use adaptive tricks during the iteration to be faster and more robust, which can also influence the result.

The estimation errors can be thought of in terms of an error surface, whose shape is determined by the optimized objective function and the given data. For example, a 3-dimensional error surface is shown in Figure 3.3, where, in addition to the global minimum, the surface contains local minima and some noise. Different starting points and directions, that is, different initial conditions, cause the optimization to converge along different paths, shown as thick curves. Although a robust algorithm should not be affected by the noise, it can get stuck in a local minimum, and even the near-optimal solutions reached along different paths can differ slightly.

Figure 3.3: Error surface of the estimation. Different initial conditions cause the algorithm to converge along different paths (shown as curves) on the 3-dimensional error surface. Thus, the algorithm may get stuck in a local minimum, and even the optimal estimates may differ slightly.
\includegraphics[width=\textwidth]{images/ica_paths.eps}

Furthermore, since the number of free parameters can be very high, it is relatively easy for the ICA estimation to overfit the data, which can cause severe estimation errors. Using PCA to reduce the dimension of the data during whitening has been shown to help prevent such overfitting (Särelä and Vigário, 2003).


3.4 Application to fMRI Data

ICA was first applied to fMRI data by McKeown et al. (1998) and has since been used in many studies (cf. Kiviniemi et al., 2003; Jung et al., 2001). It has opened new possibilities in designing studies and analyzing measurements in functional brain research. To justify the application of ICA to the volumetric (3-dimensional) spatial fMRI signal, the data must match the assumptions and limitations of ICA at least sufficiently well, which has been shown to be the case (McKeown and Sejnowski, 1998). For thorough overviews of the use of ICA in fMRI studies, see, for example, McKeown et al. (2003) and Calhoun et al. (2003).


3.4.1 Spatial ICA

To use the fMRI signal in the ICA model, each scanned volume must be transformed into vector form in a reversible manner, so that the estimated components can be mapped back to volumes. Since ICA is only concerned with the statistics of the observations and not the order of samples within them, the voxels in the volumes can be reordered into vectors quite freely (cf. Calhoun et al., 2003). Naturally, all volumes must be transformed using the same reordering.

Usually, ICA is applied to temporal signals, such as EEG or MEG recordings, in the same way as with the illustrative sound signals in Section 3.1. However, the fMRI signal is a temporal sequence of scanned spatial volumes, and the activation patterns of interest are also spatial. Therefore, a transposed version of the ICA mixing model is used. The model is illustrated in Figure 3.4.

Figure 3.4: Spatial ICA of fMRI data. The rows of the data matrix $ \mathbf{X}$ and the source matrix $ \mathbf{S}$ are vectorized volumes. The corresponding columns of the mixing matrix $ \mathbf{A}$ are the time-courses. Note that the statistical independence applies to the volumes.
\includegraphics[width=\textwidth]{images/ica_spatial.eps}

The fMRI signal is represented by a $ T \times V$ data matrix $ \mathbf{X}$ , where $ T$ is the number of time points and $ V$ is the number of voxels in the volumes. This means that each row of $ \mathbf{S}$ contains an independent spatial pattern and the corresponding column of $ \mathbf{A}$ holds its activation time-course. Since the whole mixing model is transposed, the statistical constraint applies to the spatial domain.
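As a sketch of how the data could be arranged in practice (here volumes is an assumed 4-dimensional array holding one scanned volume per time point, the number of components is an arbitrary choice, and scikit-learn's FastICA is used as one possible off-the-shelf implementation rather than the specific tool used in this work):

\begin{verbatim}
import numpy as np
from sklearn.decomposition import FastICA

# volumes: assumed 4-D fMRI array of shape (T, nx, ny, nz).
T, nx, ny, nz = volumes.shape
V = nx * ny * nz

# Vectorize every volume with the same reversible reordering (row-major flattening).
X = volumes.reshape(T, V)                         # T x V data matrix, as in the text

# Spatial ICA: independence is imposed over voxels, so voxels act as the samples.
ica = FastICA(n_components=20, random_state=0)
maps = ica.fit_transform(X.T).T                   # rows of S: independent spatial maps (20 x V)
timecourses = ica.mixing_                         # columns of A: time-courses (T x 20)

spatial_volumes = maps.reshape(-1, nx, ny, nz)    # back to volumes for visualization
\end{verbatim}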

It is also important to note that, in contrast to the traditional fMRI analysis methods, the time-courses are not imposed in any way; in ICA they emerge in a purely data-driven manner. For more general information on the differences between supervised and unsupervised methods, see Haykin (1998).