
DISCUSSION

In previous machine learning algorithms based on natural gradients (Amari, 1998), the aim has been to use maximum likelihood to directly update the model parameters $\boldsymbol{\theta}$, taking into account the geometry imposed by the predictive distribution of the data $p(\boldsymbol{X}\vert\boldsymbol{\theta})$. The resulting geometry is often much more complicated, as the effects of different parameters cannot be separated and the Fisher information matrix is relatively dense.

In this paper, only the simpler geometry of the approximating distributions $q(\boldsymbol{\theta}\vert\boldsymbol{\xi})$ is used. Because the approximations are often chosen to minimize dependencies between the different parameters $\boldsymbol{\theta}$, the resulting Fisher information matrix with respect to the variational parameters $\boldsymbol{\xi}$ will be mostly diagonal and hence easy to invert.
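As a concrete illustration (the fully factorized Gaussian parameterization below is an assumption made for this example, not a restatement of the equations of this paper): if $q(\boldsymbol{\theta}\vert\boldsymbol{\xi}) = \prod_i \mathcal{N}(\theta_i; \mu_i, \sigma_i^2)$ with $\boldsymbol{\xi} = (\mu_1, \sigma_1^2, \mu_2, \sigma_2^2, \ldots)$, the Fisher information matrix with respect to $\boldsymbol{\xi}$ is

\[ \mathcal{G}(\boldsymbol{\xi}) = \operatorname{diag}\left(\frac{1}{\sigma_1^2},\; \frac{1}{2\sigma_1^4},\; \frac{1}{\sigma_2^2},\; \frac{1}{2\sigma_2^4},\; \ldots\right), \]

so applying its inverse to a gradient reduces to elementwise scaling.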

While the plain natural gradient takes the structure of the approximation into account, it ignores the structure of the model and the global geometry of the parameters $\boldsymbol{\theta}$. This is addressed to some extent by using conjugate gradients, and more sophisticated optimization methods such as quasi-Newton or Gauss-Newton methods can be used if the size of the problem permits it.
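As a rough sketch of how conjugacy can be layered on top of the natural gradient, the following Python snippet combines successive natural gradients into a search direction with a Polak-Ribière-style coefficient. The use of plain Euclidean inner products between natural gradients and the absence of vector transport are simplifying assumptions made here, not the exact update of this paper.

\begin{verbatim}
import numpy as np

def natural_cg_direction(nat_grad, prev_nat_grad=None, prev_dir=None):
    """Combine the current natural gradient with the previous search
    direction using a Polak-Ribiere-style coefficient."""
    if prev_dir is None:
        # First iteration: fall back to plain natural gradient ascent.
        return nat_grad
    beta = nat_grad @ (nat_grad - prev_nat_grad) / (prev_nat_grad @ prev_nat_grad)
    beta = max(beta, 0.0)  # restart (beta = 0) if conjugacy would be lost
    return nat_grad + beta * prev_dir
\end{verbatim}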

While the natural conjugate gradient method has been formulated mainly for models outside the conjugate-exponential family, it can also be applied to conjugate-exponential models in place of the more common variational EM algorithms. In practice, simpler EM acceleration methods may still provide comparable results with less human effort.

The experiments in this paper show that even a diagonal approximation of the Riemannian metric tensor is enough to obtain a large speedup. For univariate Gaussian distributions, the regular gradient is too strong for model variables with small posterior variance and too weak for variables with large posterior variance, as seen from Equations (8)-(10). The posterior variance of latent variables is often much larger than that of model parameters, which means that the maximal benefit from the natural gradient can be attained by combining at least parts of the E and M steps of the variational EM.
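To make this scaling explicit, the following sketch applies the inverse of the diagonal Fisher information of the factorized Gaussian example above to a plain gradient. The parameterization by means and variances is again an assumption for illustration; Equations (8)-(10) of the paper are not reproduced here.

\begin{verbatim}
import numpy as np

def natural_gradient_factorized_gaussian(grad_mean, grad_var, var):
    """Rescale plain gradients w.r.t. the means and variances of a fully
    factorized Gaussian q by the inverse of its diagonal Fisher information.

    The gradient w.r.t. a mean is multiplied by var: it is damped where the
    posterior variance is small (where the plain gradient is too strong) and
    amplified where the variance is large (where it is too weak)."""
    nat_grad_mean = var * grad_mean
    nat_grad_var = 2.0 * var**2 * grad_var
    return nat_grad_mean, nat_grad_var

# Identical plain gradients, very different natural gradients:
print(natural_gradient_factorized_gaussian(
    np.array([1.0, 1.0]), np.zeros(2), np.array([1e-4, 1e2])))
\end{verbatim}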

When the data set is small, the regular conjugate gradient method works reasonably well. For larger data sets, however, the natural conjugate gradient method shows far superior performance.

Initial experiments with natural gradient ascent (without conjugacy) indicated that its performance is significantly worse than that of the other compared algorithms. However, it is possible that natural gradient ascent suffers more than the natural conjugate gradient method from the approximations made in the computation of the Riemannian metric tensor.

