
Initialisation

The posterior variances of the factors are initialised to small values, while a simple linear method is applied to find sensible posterior means for them. The model is similar to the FA model given by (3)-(5), except that the noise variance is the same, $\xi_{k}^{2}=\xi^{2}$, for every dimension $k$. In principal component analysis (PCA), the matrix $\mathbf{A}$ is formed from the eigenvectors of the covariance matrix $\mathbf{C}$ of the data. The eigenvectors corresponding to the largest eigenvalues are chosen, since the eigenvalues are the variances $\sigma_{l}^{2}$, which should be maximised. In this case, $\mathbf{C}$ is calculated from only those pairs of data values where both values are observed:

\begin{displaymath}C_{kj}=\frac{\sum_{t}(x_{k}(t)-\mu_{k})(x_{j}(t)-\mu_{j})i_{k}(t)i_{j}(t)}{\sum_{t}i_{k}(t)i_{j}(t)}
\end{displaymath} (9)


\begin{displaymath}\mu_{k}=\frac{\sum_{t}x_{k}(t)i_{k}(t)}{\sum_{t}i_{k}(t)}
\end{displaymath} (10)
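
As a concrete illustration, the statistics (9) and (10) and the subsequent eigendecomposition could be computed as in the following sketch (NumPy; the function name, the mask convention and the n_factors argument are illustrative assumptions, not taken from the actual implementation):

\begin{verbatim}
import numpy as np

def pca_init(X, I, n_factors):
    # X: (T, K) data matrix; I: (T, K) mask, 1 where x_k(t) is observed.
    X0 = np.where(I > 0, X, 0.0)       # zero out missing entries (also NaNs)
    # Eq. (10): mean of each dimension over its observed values only
    mu = X0.sum(axis=0) / I.sum(axis=0)
    # Eq. (9): covariance from pairs where both values are observed
    Xc = (X0 - mu) * I                 # centred, zero at unobserved entries
    n_pairs = I.T @ I                  # counts sum_t i_k(t) i_j(t)
    C = (Xc.T @ Xc) / n_pairs
    # PCA step: eigenvectors of C with the largest eigenvalues form A;
    # the corresponding eigenvalues give the factor variances sigma_l^2
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1][:n_factors]
    return eigvecs[:, order], mu, eigvals[order]
\end{verbatim}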

The maximum a posteriori (MAP) estimate for $\mathbf{s}(t)$, which is also the mean of the posterior distribution, is found by minimising

\begin{displaymath}\sum_{k}\frac{1}{2}i_{k}(t)\left(x_{k}(t)-\sum_{l}A_{kl}s_{l}(t)-b_{k}\right)^{2}+\sum_{l}\frac{1}{2}\frac{\xi^{2}}{\sigma_{l}^{2}}s_{l}(t)^{2}
\end{displaymath} (11)

and setting its gradient with respect to $\mathbf{s}(t)$ to zero gives the solution

\begin{displaymath}\mathbf{s}(t) = \left[\left(\mathbf{A}_{i=1}\right)^{T}\mathbf{A}_{i=1}+\text{diag}\left(\frac{\xi^{2}}{\boldsymbol{\sigma}^{2}}\right)\right]^{-1}\left(\mathbf{A}_{i=1}\right)^{T}\left(\mathbf{x}_{i=1}(t)-\mathbf{b}_{i=1}\right)\ ,
\end{displaymath} (12)

where the subscript $i=1$ denotes using only those rows (dimensions) for which the corresponding component of $\mathbf{i}(t)$ is one, and $\text{diag}\left(\frac{\xi^{2}}{\boldsymbol{\sigma}^{2}}\right)$ is a diagonal square matrix whose diagonal entries are $\frac{\xi^{2}}{\sigma_{l}^{2}}$. The noise variance $\xi^{2}$ is left as a free parameter.
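
A minimal sketch of evaluating (12) for a single observation follows directly from the formula; the function name and signature here are assumptions for illustration:

\begin{verbatim}
def map_factors(x_t, i_t, A, b, sigma2, xi2):
    # x_t: (K,) observation; i_t: (K,) mask; A: (K, L); b: (K,);
    # sigma2: (L,) factor variances; xi2: scalar noise variance.
    obs = i_t > 0
    A_o = A[obs]                                # rows with i_k(t) = 1
    M = A_o.T @ A_o + np.diag(xi2 / sigma2)     # bracketed matrix in (12)
    return np.linalg.solve(M, A_o.T @ (x_t[obs] - b[obs]))
\end{verbatim}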

The initialisation values of the factors are important, because they are kept fixed for the first 50 sweeps through the entire data set. This allows the network to find a meaningful mapping from the factors to the observations, thereby justifying the use of the factors as the representation. For the same reason, the parameters controlling the distributions of the factors, weights, noise and the hyperparameters are not adapted during the first 100 sweeps; they are adapted only after the network has found sensible values for the variables whose distributions they control. This schedule is important for the method because otherwise the network would effectively prune away parts that merely appear unused, leading to a local minimum from which it could never recover. The schedule can be summarised as in the sketch below.
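
The following sketch assumes three placeholder update callables and an arbitrary total number of sweeps, none of which are specified in the text; only the 50- and 100-sweep thresholds come from the description above:

\begin{verbatim}
def train(update_mapping, update_factor_means, update_dist_params,
          n_sweeps=500):
    # The three callables are placeholders for the actual update steps.
    for sweep in range(n_sweeps):
        update_mapping()              # the mapping is adapted from the start
        if sweep >= 50:               # factor means fixed for 50 sweeps
            update_factor_means()
        if sweep >= 100:              # distribution parameters and hyper-
            update_dist_params()      # parameters fixed for 100 sweeps
\end{verbatim}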

