Variational Bayesian (VB) methods provide an efficient and often sufficiently accurate deterministic approximation to exact Bayesian learning. Most work on variational methods has focused on the class of conjugate exponential models for which simple EM-like learning algorithms can be derived easily (Ghahramani and Beal, 2001; Winn and Bishop, 2005).
Nevertheless, there are many interesting but more complicated models that are not in the conjugate exponential family. Similar variational approximations have been applied to many such models (Barber and Bishop, 1998; Valpola and Karhunen, 2002; Honkela and Valpola, 2005; Lappalainen and Honkela, 2000; Seeger, 2000; Valpola et al., 2004).
The approximating distribution $q(\boldsymbol{\theta} \mid \boldsymbol{\xi})$, where $\boldsymbol{\theta}$ includes both model parameters and latent variables, is often restricted to be Gaussian with a suitably restricted covariance. Values of the variational parameters $\boldsymbol{\xi}$ can then be found with a gradient-based optimization algorithm.
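For instance, with a fully factorized Gaussian approximation the variational parameters are the component-wise means and variances,
\[
q(\boldsymbol{\theta} \mid \boldsymbol{\xi}) = \prod_i \mathcal{N}(\theta_i ; \mu_i, \sigma_i^2),
\qquad
\boldsymbol{\xi} = (\mu_1, \sigma_1^2, \mu_2, \sigma_2^2, \ldots),
\]
and the optimization follows the gradient of the variational objective with respect to $\boldsymbol{\xi}$. This factorized form is only an illustrative example; the covariance may be restricted in other ways.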
When a generic optimization algorithm is applied to such a problem, much of the background information on the geometry of the problem is lost. The parameters $\boldsymbol{\xi}$ of $q(\boldsymbol{\theta} \mid \boldsymbol{\xi})$ often play different roles, as the distribution has separate location, shape, and scale parameters. This implies that in most cases, and especially in more complicated ones, the geometry of the problem is not Euclidean.
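A simple example illustrates this: for univariate Gaussians, shifting the mean by a fixed amount $\delta$ changes the distribution far more when the variance is small, as
\[
D_{\mathrm{KL}}\!\left( \mathcal{N}(\mu, \sigma^2) \,\big\|\, \mathcal{N}(\mu + \delta, \sigma^2) \right) = \frac{\delta^2}{2 \sigma^2},
\]
so the Euclidean distance between parameter vectors does not reflect the distance between the corresponding distributions.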
Information geometry studies the Riemannian geometric structure of the manifold of probability distributions (Amari, 1985). It has previously been applied to derive efficient natural gradient learning rules for maximum likelihood estimation in problems such as independent component analysis (ICA) (Amari, 1998; Yang and Amari, 1997) and multilayer perceptron (MLP) networks (Amari, 1998), as well as to analyze the properties of general EM (Amari, 1995), mean-field variational learning (Tanaka, 2001), and online VB EM (Sato, 2001).
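In this framework, writing the objective as $\mathcal{F}(\boldsymbol{\xi})$, the direction of steepest ascent on the Riemannian manifold is given by the natural gradient
\[
\tilde{\nabla} \mathcal{F}(\boldsymbol{\xi}) = \mathbf{G}^{-1}(\boldsymbol{\xi}) \, \nabla \mathcal{F}(\boldsymbol{\xi}),
\]
where $\mathbf{G}(\boldsymbol{\xi})$ is the Fisher information matrix of $q(\boldsymbol{\theta} \mid \boldsymbol{\xi})$ (Amari, 1998).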
In this paper we propose using the Riemannian structure of the distributions $q(\boldsymbol{\theta} \mid \boldsymbol{\xi})$ to derive more efficient algorithms for approximate inference, and especially for mean-field-type VB. The method can be used to jointly optimize all the parameters $\boldsymbol{\xi}$ of the approximation $q(\boldsymbol{\theta} \mid \boldsymbol{\xi})$, or in conjunction with variational EM for some of the parameters.
The method is especially useful for models outside the conjugate exponential family, such as nonlinear models (Barber and Bishop, 1998; Valpola and Karhunen, 2002; Honkela and Valpola, 2005; Lappalainen and Honkela, 2000; Seeger, 2000) or non-conjugate variance models (Valpola et al., 2004), for which no tractable exact variational EM algorithm may exist.
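To make the idea concrete, the following minimal sketch (not the algorithm derived in this paper) applies natural-gradient ascent to a fully factorized Gaussian approximation of a toy one-dimensional posterior; the function grad_log_joint, the learning rate, and the sample size are hypothetical stand-ins for an actual model and tuning.

import numpy as np

rng = np.random.default_rng(0)

def grad_log_joint(theta):
    # Hypothetical toy model: N(0, 1) prior times an N(3, 1) likelihood,
    # so the exact posterior is N(1.5, 0.5); this returns the gradient
    # of log p(x, theta) with respect to theta.
    return -theta - (theta - 3.0)

def elbo_gradients(mu, sigma, n_samples=200):
    # Reparameterization (theta = mu + sigma * eps) estimates of the
    # Euclidean gradient of the variational lower bound w.r.t. mu and sigma.
    eps = rng.standard_normal((n_samples, mu.size))
    g = grad_log_joint(mu + sigma * eps)
    grad_mu = g.mean(axis=0)
    grad_sigma = (g * eps).mean(axis=0) + 1.0 / sigma  # entropy term
    return grad_mu, grad_sigma

mu, sigma = np.zeros(1), np.ones(1)
for step in range(500):
    grad_mu, grad_sigma = elbo_gradients(mu, sigma)
    # Natural gradient: precondition by the inverse Fisher information of q.
    # For N(mu, sigma^2) parameterized by (mu, sigma), G = diag(1/sigma^2,
    # 2/sigma^2), so the preconditioners are sigma^2 and sigma^2 / 2.
    mu = mu + 0.1 * sigma**2 * grad_mu
    sigma = np.maximum(sigma + 0.1 * 0.5 * sigma**2 * grad_sigma, 1e-6)

print(mu, sigma)  # approaches the exact posterior: mean 1.5, std ~0.71

The Fisher preconditioning rescales each update by the local scale of the distribution, so steps in $\mu$ are measured in units of $\sigma$ rather than in raw parameter coordinates.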