
EXPERIMENTS

As an example, the method for learning nonlinear state-space models presented in Sec. 4.2 was applied to real-world speech data. Experiments were conducted with different data sizes to study the performance differences between the algorithms.

Figure 3: Part of the speech spectrum data used in the experiments.

The data used in this experiment was a 21-dimensional real-world speech data set. The full data set consisted of 2000 samples of mel-frequency log-power speech spectra of continuous human speech, corresponding to roughly 15 seconds of real time. Part of the data set is shown in Figure 3.

To study the performance differences between the natural conjugate gradient (NCG) method, the conjugate gradient (CG) method, and the heuristic algorithm of (Valpola and Karhunen, 2002), the algorithms were applied to parts of the speech data set of varying size. Unfortunately, a reasonable comparison with a variational EM algorithm was not possible because the extended Kalman smoother (Anderson and Moore, 1979) was unstable and the E-step thus failed.

The size of the data subsets varied between 100 and 500 samples. A five-dimensional state-space was used, and the MLP networks for the observation and dynamical mappings had 20 hidden nodes. Five different initializations were used to avoid problems with local minima, and the results were averaged over the resulting runs. A run was assumed to have converged when $ \vert\mathcal{B}^t-\mathcal{B}^{t-1}\vert<(5\cdot 10^{-3} /N)$ for 200 consecutive iterations, where $ \mathcal{B}^t$ is the bound on the marginal log-likelihood at iteration $ t$ and $ N$ is the size of the data set.
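A minimal sketch of this stopping rule is given below, assuming the bound values are collected into a list during optimization; the function and variable names are illustrative and not taken from the original implementation.

    def has_converged(bound_history, n_samples, tol_scale=5e-3, patience=200):
        """Stopping rule: |B^t - B^{t-1}| < tol_scale / N
        for `patience` consecutive iterations."""
        tol = tol_scale / n_samples
        if len(bound_history) < patience + 1:
            return False
        recent = bound_history[-(patience + 1):]
        diffs = [abs(b - a) for a, b in zip(recent[:-1], recent[1:])]
        return all(d < tol for d in diffs)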

The results are shown in Figure 4. In particular, as the data size increases, natural conjugate gradient tends to perform much better than the competing algorithms. The slightly anomalous behavior at the data size of 200 can be explained by a silent period in the speech data set between samples 150 and 200.

Figure 4: Convergence speed of the natural conjugate gradient (NCG) method, the conjugate gradient (CG) method, and the heuristic algorithm with different data sizes. Top: absolute computation times. Bottom: relative computation times with the computation time of the NCG method normalized to 1.

Figure 5: Ratio of the normalized posterior variances of the states and the observation network output-layer weights after convergence. The results are averaged over the different methods, as they all produced similar results.

The difference in the performance of the algorithms can be at least partially explained by the fact that the ratios of the variances of the different parameters change as the data size increases. The variances of the dynamical and observation mapping weights tend to decrease as the data size increases, but there will always be uncertainty left in the states. The variances of the Gaussian distributions scale the natural gradient, as seen in Eq. (25). A large relative difference in the variances therefore helps to explain the poor performance of methods based on flat geometry with larger data sets, as the corrections imposed by the Riemannian geometry become more significant. The effect of data size on the variances is illustrated in Figure 5, where the ratio of the minima of the normalized variances of the states and of the observation network output weights is plotted against the data size.
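As a concrete illustration of this scaling, consider a fully factorized Gaussian posterior: the Fisher information for a mean parameter is the reciprocal of its variance, so the natural gradient with respect to a mean is the plain gradient multiplied by the corresponding posterior variance. The sketch below (the numbers and names are purely illustrative, not the paper's implementation) shows how small weight variances shrink the natural-gradient steps for the weights relative to those for the states, whereas a flat-geometry method would take equally sized steps for both.

    import numpy as np

    def natural_gradient_means(plain_grad, posterior_var):
        """Natural gradient w.r.t. Gaussian mean parameters: for a factorized
        Gaussian q, the Fisher information of a mean is 1/variance, so the
        natural gradient is the plain gradient scaled by the variance."""
        return posterior_var * plain_grad

    # Illustrative numbers: with more data, weight variances shrink while
    # state variances remain of the same order.
    grad = np.array([1.0, 1.0])             # same plain gradient for a state and a weight
    var_small_data = np.array([0.1, 0.05])   # [state variance, weight variance]
    var_large_data = np.array([0.1, 0.001])

    print(natural_gradient_means(grad, var_small_data))  # [0.1   0.05 ]
    print(natural_gradient_means(grad, var_large_data))  # [0.1   0.001]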

As a slightly more realistic example, the full data set of 2000 samples was used to train a seven-dimensional state-space model. In this experiment both MLP networks of the NSSM had 30 hidden nodes.

Figure 6: Comparison of the performance of the natural conjugate gradient (NCG) method, the conjugate gradient (CG) method, and the heuristic algorithm with the full data set. The lower bound on the marginal log-likelihood $ \mathcal{B}$ is plotted against computation time.

The performance of the NCG method, the CG method, and the heuristic algorithm was compared. The results are shown in Figure 6. Five different initializations were used to avoid problems with poor local optima, and the results presented in Figure 6 are from the runs that converged to the best local optimum.

Natural conjugate gradient clearly outperformed the other algorithms in this experiment. In particular, conventional conjugate gradient learning converged very slowly with this larger data set and, regardless of the initialization, failed to reach a local optimum within a reasonable time. Natural conjugate gradient also outperformed the heuristic algorithm (Valpola and Karhunen, 2002) by a factor of more than 10.

