Introduction

Variational Bayesian (VB) methods provide an efficient and often sufficiently accurate deterministic approximation to exact Bayesian learning [1]. Most work on variational methods has focused on the class of conjugate exponential models for which simple EM-like learning algorithms can be derived easily.

Nevertheless, there are many interesting, more complicated models that are not in the conjugate exponential family. Similar variational approximations have been applied to many such models [2,3,4,5,6,7]. The approximating distribution $q(\boldsymbol{\theta}\vert \boldsymbol{\xi})$, where $\boldsymbol{\theta}$ includes both model parameters and latent variables, is often restricted to be Gaussian with a constrained covariance, for example a diagonal one. Values of the variational parameters $\boldsymbol{\xi}$ can then be found with a gradient-based optimization algorithm.
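
As a concrete toy illustration of this setting (not taken from the paper; the model, variable names, and step size below are assumptions made only for the example), the following sketch fits a Gaussian approximation $q(\theta\vert \boldsymbol{\xi})$ with mean $\mu$ and variance $e^{2\rho}$, $\boldsymbol{\xi} = (\mu, \rho)$, to the posterior of the mean of a unit-variance Gaussian under a standard normal prior, using plain gradient ascent on the variational lower bound, which is available in closed form for this model:

  import numpy as np

  # Toy model: x_i ~ N(theta, 1) with prior theta ~ N(0, 1).
  # Approximation q(theta | xi) = N(mu, s2) with s2 = exp(2*rho), xi = (mu, rho).
  rng = np.random.default_rng(0)
  x = rng.normal(1.5, 1.0, size=20)
  N = len(x)

  mu, rho = 0.0, 0.0   # initial variational parameters
  lr = 0.02            # fixed (Euclidean) gradient-ascent step size
  for _ in range(2000):
      s2 = np.exp(2.0 * rho)
      # Gradients of the lower bound
      # F(mu, rho) = -0.5*sum((x - mu)**2 + s2) - 0.5*(mu**2 + s2) + 0.5*log(s2) + const
      g_mu = np.sum(x - mu) - mu
      g_rho = 1.0 - (N + 1) * s2
      mu, rho = mu + lr * g_mu, rho + lr * g_rho

  # The exact posterior is N(sum(x)/(N+1), 1/(N+1)); the optimum of the bound recovers it.
  print(mu, np.exp(2.0 * rho), np.sum(x) / (N + 1), 1.0 / (N + 1))

In this toy case the bound and its gradients are analytic; for the more complicated models cited above only the general structure (a Gaussian $q$ whose parameters are tuned by gradient-based optimization) carries over.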

When a generic optimization algorithm is applied to such a problem, much of the background information on the geometry of the problem is lost. The parameters $\boldsymbol{\xi}$ of $q(\boldsymbol{\theta}\vert \boldsymbol{\xi})$ can play different roles as location, shape, and scale parameters, and changing one of them can alter the influence of the others; for example, the effect of a change in the mean of a Gaussian depends on its variance. This implies that the geometry of the problem is in most cases not Euclidean.

Information geometry studies the Riemannian geometric structure of the manifold of probability distributions [8]. It has been applied to derive efficient natural gradient learning rules for maximum likelihood algorithms in independent component analysis (ICA) and multilayer perceptron (MLP) networks [9]. The approach has been used in several other problems as well, for example in analyzing the properties of an on-line variational Bayesian EM method [10].
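
For reference (this is the standard definition behind [9], not a contribution of this paper, and the symbol $\mathcal{F}$ is used only for illustration), the natural gradient of an objective $\mathcal{F}(\boldsymbol{\xi})$ is obtained by premultiplying the ordinary gradient with the inverse of the Fisher information matrix of the distribution that defines the geometry:

$\tilde{\nabla} \mathcal{F}(\boldsymbol{\xi}) = \mathbf{G}^{-1}(\boldsymbol{\xi}) \nabla \mathcal{F}(\boldsymbol{\xi})$, where $G_{ij}(\boldsymbol{\xi}) = \mathrm{E}\left\{ \frac{\partial \ln q(\boldsymbol{\theta}\vert \boldsymbol{\xi})}{\partial \xi_i} \frac{\partial \ln q(\boldsymbol{\theta}\vert \boldsymbol{\xi})}{\partial \xi_j} \right\}$

when the geometry is taken from $q(\boldsymbol{\theta}\vert \boldsymbol{\xi})$, as proposed below; traditional natural gradient learning instead builds the metric from the model distribution.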

In this paper we propose using the Riemannian structure of the distributions $q(\boldsymbol{\theta}\vert \boldsymbol{\xi})$ to derive more efficient algorithms for approximate inference, and especially for mean-field-type VB. This is in contrast with traditional natural gradient learning [9], which uses the Riemannian structure of the predictive distribution $p(\boldsymbol{X}\vert \boldsymbol{\theta})$. The proposed method can be used to jointly optimize all the parameters $\boldsymbol{\xi}$ of the approximation $q(\boldsymbol{\theta}\vert \boldsymbol{\xi})$, or in conjunction with VB EM for some of the parameters. The method is especially useful for models that are not in the conjugate exponential family, such as nonlinear models [2,3,4,5,7] or non-conjugate variance models [6], which may not have a tractable exact VB EM algorithm.
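
To make the idea concrete, the sketch below shows one possible natural-conjugate-gradient style update of the variational parameters. It is not the paper's exact algorithm: the functions grad_F and fisher, the fixed step size in place of a line search, and the Polak-Ribiere formula for combining directions are assumptions made only for illustration; the actual choices are described in the following sections.

  import numpy as np

  def ncg_step(xi, grad_F, fisher, d_prev=None, g_nat_prev=None, step=0.1):
      """One conjugate-gradient step using natural gradients (illustrative sketch).

      xi         -- current variational parameters (1-D array)
      grad_F     -- function returning the Euclidean gradient of the objective at xi
      fisher     -- function returning the Fisher information matrix G(xi)
      d_prev     -- previous search direction (None on the first step)
      g_nat_prev -- previous natural gradient (None on the first step)
      """
      g_nat = np.linalg.solve(fisher(xi), grad_F(xi))   # natural gradient G^{-1} grad F
      if d_prev is None:
          d = g_nat                                     # first step: steepest natural ascent
      else:
          # Polak-Ribiere style combination of the new natural gradient with the
          # previous direction; other formulas for beta are equally possible.
          beta = max(g_nat @ (g_nat - g_nat_prev) / (g_nat_prev @ g_nat_prev), 0.0)
          d = g_nat + beta * d_prev
      return xi + step * d, d, g_nat                    # fixed step instead of a line search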

