
Discussion

In previous machine learning algorithms based on natural gradients [9], the aim has been to use maximum likelihood to directly update the model parameters $\boldsymbol{\theta}$, taking into account the geometry imposed by the predictive distribution of the data $p(\boldsymbol{X}\vert\boldsymbol{\theta})$. The resulting geometry is often much more complicated, as the effects of different parameters cannot be separated and the Fisher information matrix is relatively dense. In this paper, only the simpler geometry of the approximating distributions $q(\boldsymbol{\theta}\vert\boldsymbol{\xi})$ is used. Because the approximations are often chosen to minimize dependencies between the different parameters $\boldsymbol{\theta}$, the resulting Fisher information matrix with respect to the variational parameters $\boldsymbol{\xi}$ is mostly diagonal and hence easy to invert.
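As a concrete illustration of this point, the following sketch (in Python; not code from the paper, and the function name and gradient inputs are hypothetical) forms the natural gradient for a fully factorized Gaussian approximation $q(\boldsymbol{\theta}\vert\boldsymbol{\xi})$ with $\boldsymbol{\xi} = (\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$. Because the Fisher information matrix is diagonal in this case, inverting it reduces to element-wise scaling of the regular gradient.

    import numpy as np

    def natural_gradient_factorized_gaussian(grad_mu, grad_var, var):
        """Natural gradient for a fully factorized Gaussian q(theta | mu, sigma^2).

        The Fisher information of N(mu, sigma^2) with respect to (mu, sigma^2)
        is diag(1/sigma^2, 1/(2 sigma^4)), so applying its inverse reduces to
        element-wise multiplication instead of a matrix solve.
        """
        nat_grad_mu = var * grad_mu              # inverse Fisher block for the means
        nat_grad_var = 2.0 * var**2 * grad_var   # inverse Fisher block for the variances
        return nat_grad_mu, nat_grad_var

    # Hypothetical gradients of the variational objective:
    grad_mu = np.array([0.5, -1.2])
    grad_var = np.array([0.1, 0.3])
    var = np.array([1e-2, 4.0])
    print(natural_gradient_factorized_gaussian(grad_mu, grad_var, var))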

While it takes into account the structure of the approximation, the plain natural gradient in this case ignores the structure of the model and the global geometry of the parameters $\boldsymbol{\theta}$. This is addressed to some extent by using conjugate gradients, and more sophisticated optimization methods, such as quasi-Newton or Gauss-Newton methods, can be used if the size of the problem permits.
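As a rough sketch of how conjugate directions combine with the natural gradient, the hypothetical update below uses a Polak-Ribière-type coefficient computed from pairs of natural and regular gradients. It relies on the simplifying assumption that the metric changes slowly between iterations, so that inner products can be taken in the current coordinates; it is an illustration rather than a verbatim transcription of the algorithm used in the experiments.

    import numpy as np

    def natural_cg_direction(nat_grad, grad, prev_nat_grad, prev_grad, prev_dir):
        """One conjugate search direction built from natural gradients (sketch).

        Under the metric G, <g1~, g2~>_G = g1~^T G g2~ = g1~^T g2, where
        g~ = G^{-1} g is the natural gradient, so the Polak-Ribiere
        coefficient can be computed without forming G explicitly.
        """
        if prev_dir is None:
            return -nat_grad                      # first step: steepest (natural) descent
        beta = nat_grad @ (grad - prev_grad) / (prev_nat_grad @ prev_grad)
        beta = max(beta, 0.0)                     # restart when beta would be negative
        return -nat_grad + beta * prev_dir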

While the natural conjugate gradient method has been formulated mainly for models outside the conjugate-exponential family, it can also be applied to conjugate-exponential models in place of the more common VB EM algorithms. In practice, simpler EM acceleration methods may still provide comparable results with less human effort.

The experiments in this paper show that using even a greatly simplified variant of the Riemannian conjugate gradient method for some variables is enough to obtain a large speedup. For univariate Gaussian distributions, the regular gradient is too strong for model variables with small posterior variance and too weak for variables with large posterior variance, as seen from Eqs. (8)-(10). The posterior variance of latent variables is often much larger than that of model parameters, and the natural gradient takes this difference into account in a very natural manner.
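To make this scaling concrete, the toy comparison below (with hypothetical numbers, not an experiment from the paper) contrasts regular and natural gradient steps for the means of two Gaussian posteriors, one with a small variance as is typical for model parameters and one with a large variance as is typical for latent variables; the natural gradient simply rescales each component by its posterior variance.

    import numpy as np

    grad_mu = np.array([1.0, 1.0])      # identical regular gradients for both variables
    post_var = np.array([1e-3, 1e1])    # small variance (parameter) vs. large variance (latent)

    regular_step = -grad_mu             # ignores the posterior uncertainty
    natural_step = -post_var * grad_mu  # rescaled by the posterior variance

    print(regular_step)                 # equal steps for both variables
    print(natural_step)                 # tiny step for the parameter, large step for the latent variable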


