Variational Bayesian (VB) methods provide an efficient and often sufficiently accurate deterministic approximation to exact Bayesian learning. Most work on variational methods has focused on the class of conjugate exponential models for which simple EM-like learning algorithms can be derived easily (Ghahramani and Beal, 2001; Winn and Bishop, 2005).
Nevertheless, there are many interesting but more complicated models that are not in the conjugate exponential family. Similar variational approximations have been applied to many such models (Barber and Bishop, 1998; Valpola and Karhunen, 2002; Honkela and Valpola, 2005; Lappalainen and Honkela, 2000; Seeger, 2000; Valpola et al., 2004).
The approximating distribution $q(\boldsymbol{\theta} \mid \boldsymbol{\xi})$, where $\boldsymbol{\theta}$ includes both model parameters and latent variables, is often restricted to be Gaussian with a suitably restricted covariance. Values of the variational parameters $\boldsymbol{\xi}$ can then be found with a gradient-based optimization algorithm.
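For instance, with a fully factorized Gaussian approximation the variational parameters are the component-wise means and variances,
\[
q(\boldsymbol{\theta} \mid \boldsymbol{\xi}) = \prod_i \mathcal{N}(\theta_i ; \mu_i, \sigma_i^2),
\qquad
\boldsymbol{\xi} = (\mu_1, \sigma_1^2, \mu_2, \sigma_2^2, \ldots),
\]
and the optimization follows the gradient of the variational objective with respect to $\boldsymbol{\xi}$. This factorized form is only an illustrative example; the covariance may be restricted in other ways.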
When a generic optimization algorithm is applied to such a problem, much of the background information on the geometry of the problem is lost. The parameters $\boldsymbol{\xi}$ of $q(\boldsymbol{\theta} \mid \boldsymbol{\xi})$ often play different roles, as the distribution has separate location, shape, and scale parameters. This implies that in most cases, and especially in more complicated ones, the geometry of the problem is not Euclidean.
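A simple example illustrates this: for univariate Gaussians, shifting the mean by a fixed amount $\delta$ changes the distribution far more when the variance is small, as
\[
D_{\mathrm{KL}}\!\left( \mathcal{N}(\mu, \sigma^2) \,\big\|\, \mathcal{N}(\mu + \delta, \sigma^2) \right) = \frac{\delta^2}{2 \sigma^2},
\]
so the Euclidean distance between parameter vectors does not reflect the distance between the corresponding distributions.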
Information geometry studies the Riemannian geometric structure of the manifold of probability distributions (Amari, 1985). It has previously been applied to derive efficient natural gradient learning rules for maximum likelihood estimation in problems such as independent component analysis (ICA) (Amari, 1998; Yang and Amari, 1997) and multilayer perceptron (MLP) networks (Amari, 1998), as well as to analyze the properties of general EM (Amari, 1995), mean-field variational learning (Tanaka, 2001), and online VB EM (Sato, 2001).
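In this framework, writing the objective as $\mathcal{F}(\boldsymbol{\xi})$, the direction of steepest ascent on the Riemannian manifold is given by the natural gradient
\[
\tilde{\nabla} \mathcal{F}(\boldsymbol{\xi}) = \mathbf{G}^{-1}(\boldsymbol{\xi}) \, \nabla \mathcal{F}(\boldsymbol{\xi}),
\]
where $\mathbf{G}(\boldsymbol{\xi})$ is the Fisher information matrix of $q(\boldsymbol{\theta} \mid \boldsymbol{\xi})$ (Amari, 1998).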
In this paper we propose using the Riemannian structure of the distributions $q(\boldsymbol{\theta} \mid \boldsymbol{\xi})$ to derive more efficient algorithms for approximate inference, and especially for mean-field-type VB. The method can be used to jointly optimize all the parameters $\boldsymbol{\xi}$ of the approximation $q(\boldsymbol{\theta} \mid \boldsymbol{\xi})$, or in conjunction with variational EM for some of the parameters.
The method is especially useful for models outside the conjugate exponential family, such as nonlinear models (Barber and Bishop, 1998; Valpola and Karhunen, 2002; Honkela and Valpola, 2005; Lappalainen and Honkela, 2000; Seeger, 2000) or non-conjugate variance models (Valpola et al., 2004), for which no tractable exact variational EM algorithm may exist.
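To make the idea concrete, the following minimal sketch (not the algorithm derived in this paper) applies natural-gradient ascent to a fully factorized Gaussian approximation of a toy one-dimensional posterior; the function grad_log_joint, the learning rate, and the sample size are hypothetical stand-ins for an actual model and tuning.

import numpy as np

rng = np.random.default_rng(0)

def grad_log_joint(theta):
    # Hypothetical toy model: N(0, 1) prior times an N(3, 1) likelihood,
    # so the exact posterior is N(1.5, 0.5); this returns the gradient
    # of log p(x, theta) with respect to theta.
    return -theta - (theta - 3.0)

def elbo_gradients(mu, sigma, n_samples=200):
    # Reparameterization (theta = mu + sigma * eps) estimates of the
    # Euclidean gradient of the variational lower bound w.r.t. mu and sigma.
    eps = rng.standard_normal((n_samples, mu.size))
    g = grad_log_joint(mu + sigma * eps)
    grad_mu = g.mean(axis=0)
    grad_sigma = (g * eps).mean(axis=0) + 1.0 / sigma  # entropy term
    return grad_mu, grad_sigma

mu, sigma = np.zeros(1), np.ones(1)
for step in range(500):
    grad_mu, grad_sigma = elbo_gradients(mu, sigma)
    # Natural gradient: precondition by the inverse Fisher information of q.
    # For N(mu, sigma^2) parameterized by (mu, sigma), G = diag(1/sigma^2,
    # 2/sigma^2), so the preconditioners are sigma^2 and sigma^2 / 2.
    mu = mu + 0.1 * sigma**2 * grad_mu
    sigma = np.maximum(sigma + 0.1 * 0.5 * sigma**2 * grad_sigma, 1e-6)

print(mu, sigma)  # approaches the exact posterior: mean 1.5, std ~0.71

The Fisher preconditioning rescales each update by the local scale of the distribution, so steps in $\mu$ are measured in units of $\sigma$ rather than in raw parameter coordinates.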