
Fast EM-algorithm by Filtering of Gaussian Noise

With low noise variance $\sigma^2$, the convergence of the EM-algorithm to the optimal value takes time proportional to $1/\sigma^2$. We next show how the re-estimation step can be modified so that the convergence rate becomes independent of $\sigma^2$, which yields a significant speedup when $\sigma^2$ is small.

Consider the case that we estimate the sources one at a time and that the sources are assumed to be whitened and the mixing matrix $\mathbf{A}$ orthonormal. Denote one of the source signals in the optimal solution by $\hat{\mathbf{s}}_{opt}$. By optimality we mean that the standard EM-algorithm will eventually converge to $\hat{\mathbf{a}}_{opt} = \mathbf{X} \hat{\mathbf{s}}_{opt}^T$ with $\hat{\mathbf{s}}_{opt} = \operatorname{E}\{\mathbf{s} \mid \hat{\mathbf{a}}_{opt}, \mathbf{X}, \sigma^2 \mathbf{I}\}$.

When we have not yet found the optimal vector $\hat{\mathbf{a}}_{opt}$, the current source estimate $\mathbf{s}_0$ can be written as

\begin{displaymath}
\mathbf{s}_0 = \alpha \mathbf{s}_{opt} + \beta \mathbf{s}_G ,
\end{displaymath}

where $\alpha^2+\beta^2=1$. The noise $\mathbf{s}_G$ is mostly due to the other sources and only to a small extent due to the Gaussian noise in the data. We can think of the E-step as filtering away the noise by making use of the knowledge about the prior distribution of $\mathbf{s}$. This gives one point of view on the slow convergence: in the low-noise case most of the unwanted signal $\mathbf{s}_G$ is due to the other sources, and slow convergence results. From this point of view it is clear that we can speed up the convergence if we can also filter away the part of $\mathbf{s}_G$ that is due to the other sources.

When we are far from the optimal solution, it is natural to assume that $\beta\approx 1$ and $\alpha\approx 0$. Since $\hat{\mathbf{a}}$ and $\mathbf{s}_0$ are linearly related, the same decomposition holds for $\hat{\mathbf{a}}$, i.e. $\hat{\mathbf{a}} = \alpha \hat{\mathbf{a}}_{opt} + \beta \hat{\mathbf{a}}_G$, and we get

\begin{displaymath}
\alpha \hat{\mathbf{a}}_{opt} = \hat{\mathbf{a}} - \beta \hat{\mathbf{a}}_G \approx \hat{\mathbf{a}} - \hat{\mathbf{a}}_G .
\end{displaymath}

If we can compute $\hat{\mathbf{a}}_G$, we can adjust the vector $\hat{\mathbf{a}}$ to take into account the apparent noise due to the other sources. By the central limit theorem, the distribution of the sum of the contributions from several other sources approaches a Gaussian as the number of other sources increases. This leads to the following modification: we may estimate $\hat{\mathbf{a}}_G$ by applying the same re-estimation step to Gaussian data, whose result is approximately

\begin{displaymath}
\hat{\mathbf{a}}_G \approx \mathbf{a} + \sigma^2 \mathbf{X}_G \mathbf{F}(\mathbf{s}_{0G})^T / M ,
\end{displaymath}

where $\mathbf{X}_G$ is the set of mixtures replaced by Gaussian noise with the same covariance as $\mathbf{X}$, and $\mathbf{s}_{0G}$ is the source obtained as $\hat{\mathbf{a}}^T \mathbf{X}_G$. The Gaussian source $\mathbf{s}_{0G}$ is the projection of Gaussian noise onto the subspace spanned by $\hat{\mathbf{a}}$ and therefore represents the contribution of the other sources and some Gaussian noise to the estimated source $\mathbf{s}_0$. As derived above, we can eliminate much of this noise by updating $\hat{\mathbf{a}}$ using the difference $\hat{\mathbf{a}} - \hat{\mathbf{a}}_G$, which is then normalized. The normalization is admissible, since the scaling of the sources is an indeterminacy in ICA.
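
As a sketch of how $\hat{\mathbf{a}}_G$ might be computed under these assumptions (the surrogate $\mathbf{X}_G$ is drawn here as Gaussian noise with the sample covariance of $\mathbf{X}$; the helper reuses the hypothetical em_reestimate above):

\begin{verbatim}
def gaussian_reestimate(a, X, sigma2, F, rng=None):
    """Re-estimate from a Gaussian surrogate X_G with the same covariance as X,
    giving a_hat_G (the part of the update due to other sources and noise)."""
    rng = np.random.default_rng() if rng is None else rng
    n, M = X.shape
    C = X @ X.T / M                                           # sample covariance of the mixtures
    XG = np.linalg.cholesky(C) @ rng.standard_normal((n, M))  # Gaussian data with covariance C
    return em_reestimate(a, XG, sigma2, F)                    # approx. a + sigma^2 X_G F(s_0G)^T / M
\end{verbatim}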

Taking the difference yields approximately

\begin{displaymath}
\hat{\mathbf{a}} - \hat{\mathbf{a}}_G \approx \sigma^2 [\mathbf{X}\mathbf{F}(\mathbf{s}_0)^T - \mathbf{X}_G\mathbf{F}(\mathbf{s}_{0G})^T] / M ,
\end{displaymath}

which shows that the normalization cancels the effect of $\sigma^2$ from the learning rule:

\begin{displaymath}
\hat{\mathbf{a}}_{new} = \frac{\hat{\mathbf{a}} - \hat{\mathbf{a}}_G}{\Vert \hat{\mathbf{a}} - \hat{\mathbf{a}}_G \Vert} \, .
\end{displaymath}
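
Putting the pieces together, a minimal sketch of one iteration of the modified update, reusing the hypothetical helpers above:

\begin{verbatim}
def modified_em_step(a, X, sigma2, F, rng=None):
    """One iteration of the modified rule: a_new = (a_hat - a_hat_G) / ||a_hat - a_hat_G||."""
    a_hat = em_reestimate(a, X, sigma2, F)                # approx. a + sigma^2 X   F(s_0 )^T / M
    a_hat_G = gaussian_reestimate(a, X, sigma2, F, rng)   # approx. a + sigma^2 X_G F(s_0G)^T / M
    d = a_hat - a_hat_G                                   # approx. sigma^2 [X F(s_0)^T - X_G F(s_0G)^T] / M
    return d / np.linalg.norm(d)                          # normalization cancels the sigma^2 factor

# Illustrative use, assuming X is an (n, M) whitened data matrix:
#   a = np.ones(X.shape[0]) / np.sqrt(X.shape[0])
#   for _ in range(10):
#       a = modified_em_step(a, X, sigma2=1e-3, F=lambda s: -np.tanh(s))
\end{verbatim}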

We assumed above that there was a lot of Gaussian noise by approximating $\beta\approx 1$. It turns out, however, that the modification does not affect the optimal solutions of the algorithm, i.e., if $\hat{\mathbf{a}}_{opt}$ is a fixed point of the original EM-algorithm, it is also a fixed point of the modified algorithm. This follows immediately from the fact that $\hat{\mathbf{a}}_G$ is always parallel to $\mathbf{a}$, since $\mathbf{X}_G$ is spherically symmetric. To get a rough idea of why this is so, suppose there is a vector $\mathbf{b}$ which is orthogonal to $\mathbf{a}$, i.e., $\mathbf{b}^T \mathbf{a} = 0$. Then

\begin{displaymath}
\mathbf{b}^T \hat{\mathbf{a}}_G \approx \mathbf{b}^T \mathbf{a} + \sigma^2 \mathbf{b}^T \mathbf{X}_G \mathbf{F}(\mathbf{s}_{0G})^T / M = \sigma^2 \mathbf{s}_{0G}' \mathbf{F}(\mathbf{s}_{0G})^T / M = 0 \, .
\end{displaymath}

The last step follows from the fact that $\mathbf{s}_{0G}' = \mathbf{b}^T \mathbf{X}_G$ is a projection onto a direction orthogonal to that of $\mathbf{s}_{0G}$ and, by the Gaussianity of $\mathbf{X}_G$, statistically independent of $\mathbf{F}(\mathbf{s}_{0G})$. But since this must hold for all $\mathbf{b}$ which are orthogonal to $\mathbf{a}$, it follows that $\hat{\mathbf{a}}_G$ has to be parallel to $\mathbf{a}$.
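
This parallelism can also be checked numerically; the small sketch below (again using the hypothetical helpers above) measures the component of $\hat{\mathbf{a}}_G$ orthogonal to $\mathbf{a}$, which should shrink towards zero as the number of samples $M$ grows.

\begin{verbatim}
def orthogonal_fraction(a, X, sigma2, F, rng=None):
    """Relative size of the component of a_hat_G orthogonal to a."""
    aG = gaussian_reestimate(a, X, sigma2, F, rng)
    residual = aG - (aG @ a) / (a @ a) * a   # remove the component of a_hat_G along a
    return np.linalg.norm(residual) / np.linalg.norm(aG)
\end{verbatim}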

In the next section we lend further support to this result by showing that the FastICA algorithm follows from this procedure.

