Variational Bayesian (VB) methods provide an efficient and often sufficiently accurate deterministic approximation to exact Bayesian learning [1]. Most work on variational methods has focused on the class of conjugate exponential models for which simple EM-like learning algorithms can be derived easily.
Nevertheless, many interesting, more complicated models are not in the conjugate exponential family. Similar variational approximations have been applied to many such models [2,3,4,5,6,7].
The approximating distribution $q(\boldsymbol{\theta} \mid \boldsymbol{\xi})$, where $\boldsymbol{\theta}$ includes both model parameters and latent variables, is often restricted to be Gaussian with a suitably restricted covariance. Values of the variational parameters $\boldsymbol{\xi}$ can be found using a gradient-based optimization algorithm.
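As a minimal illustration of such gradient-based fitting (a toy sketch in notation of my own, not a model from this paper), consider fitting a Gaussian approximation $q(\theta) = N(\mu, \sigma^2)$ to a known Gaussian target $p(\theta) = N(2, 1)$ by gradient descent on $\mathrm{KL}(q \,\|\, p)$, the quantity VB minimizes; for Gaussians the divergence and its gradients are available in closed form:

```python
import math

def kl_gauss(mu, sigma, mu_p=2.0, sigma_p=1.0):
    """KL( N(mu, sigma^2) || N(mu_p, sigma_p^2) ) in closed form."""
    return (math.log(sigma_p / sigma)
            + (sigma**2 + (mu - mu_p)**2) / (2 * sigma_p**2) - 0.5)

def grad_kl(mu, sigma, mu_p=2.0, sigma_p=1.0):
    """Euclidean gradient of the KL divergence w.r.t. (mu, sigma)."""
    d_mu = (mu - mu_p) / sigma_p**2
    d_sigma = -1.0 / sigma + sigma / sigma_p**2
    return d_mu, d_sigma

# Variational parameters xi = (mu, sigma), updated by plain
# fixed-step gradient descent.
mu, sigma = 0.0, 0.5
lr = 0.1
for _ in range(200):
    d_mu, d_sigma = grad_kl(mu, sigma)
    mu -= lr * d_mu
    sigma -= lr * d_sigma

# The optimum recovers the target: mu -> 2, sigma -> 1, KL -> 0.
print(round(mu, 3), round(sigma, 3))  # 2.0 1.0
```

In this toy case the KL has a closed form; in the models the paper targets, only a bound on the marginal likelihood is available, but the optimization loop has the same structure.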
When a generic optimization algorithm is applied to such a problem, much of the background information on the geometry of the problem is lost. The parameters $\boldsymbol{\xi}$ of $q(\boldsymbol{\theta} \mid \boldsymbol{\xi})$ can have different roles as location, shape, and scale parameters, and they can change the influence of other parameters. This implies that the geometry of the problem is in most cases not Euclidean.
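The non-Euclidean geometry can be made concrete with a standard fact: for a univariate Gaussian $N(\mu, \sigma^2)$, the Fisher information matrix in the parameters $(\mu, \sigma)$ is $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$, so the Riemannian length of a parameter step depends on where in parameter space it is taken. A small illustrative sketch (my own example, not from the paper):

```python
import math

def fisher_gauss(sigma):
    """Diagonal of the Fisher information of N(mu, sigma^2) in (mu, sigma)."""
    return 1.0 / sigma**2, 2.0 / sigma**2

def riemann_length(d_mu, d_sigma, sigma):
    """Length of the step (d_mu, d_sigma) under the Fisher metric."""
    g_mu, g_sigma = fisher_gauss(sigma)
    return math.sqrt(g_mu * d_mu**2 + g_sigma * d_sigma**2)

# The same Euclidean step d_mu = 0.1 is ten times "longer" when
# sigma = 0.1 than when sigma = 1: shifting the mean of a narrow
# distribution changes it far more than shifting a wide one.
print(riemann_length(0.1, 0.0, 1.0))   # 0.1
print(riemann_length(0.1, 0.0, 0.1))   # 1.0
```

This is exactly the scale-dependent interaction between parameters that a generic Euclidean optimizer ignores.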
Information geometry studies the Riemannian geometric structure of the manifold of probability distributions [8]. It has been applied to derive efficient natural gradient learning rules for maximum likelihood algorithms in independent component analysis (ICA) and multilayer perceptron (MLP) networks [9]. The approach has been used in several other problems as well, for example in analyzing the properties of an on-line variational Bayesian EM method [10].
In this paper we propose using the Riemannian structure of the distributions $q(\boldsymbol{\theta} \mid \boldsymbol{\xi})$ to derive more efficient algorithms for approximate inference and especially mean-field-type VB. This is in contrast with traditional natural gradient learning [9], which uses the Riemannian structure of the predictive distribution $p(\boldsymbol{X} \mid \boldsymbol{\theta})$. The proposed method can be used to jointly optimize all the parameters $\boldsymbol{\xi}$ of the approximation $q(\boldsymbol{\theta} \mid \boldsymbol{\xi})$, or in conjunction with VB EM for some parameters.
The method is especially useful for
models that are not in the conjugate exponential family, such as
nonlinear models [2,3,4,5,7] or non-conjugate
variance models [6] that may not have a tractable
exact VB EM algorithm.
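To make the core idea concrete, a toy sketch (my own example, not the paper's algorithm): a natural gradient step premultiplies the ordinary gradient by the inverse Fisher information of $q$. For $q = N(\mu, \sigma^2)$ the Fisher matrix in $(\mu, \sigma)$ is $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$, so its inverse scales the $\mu$-gradient by $\sigma^2$ and the $\sigma$-gradient by $\sigma^2/2$:

```python
# Natural gradient descent on KL( N(mu, sigma^2) || N(2, 1) ),
# a toy stand-in for the variational objective.

def grad_kl(mu, sigma, mu_p=2.0, sigma_p=1.0):
    """Euclidean gradient of KL(N(mu,sigma^2) || N(mu_p,sigma_p^2))."""
    return (mu - mu_p) / sigma_p**2, -1.0 / sigma + sigma / sigma_p**2

mu, sigma = 0.0, 0.5
lr = 0.5
for _ in range(100):
    d_mu, d_sigma = grad_kl(mu, sigma)
    # Premultiply by the inverse Fisher metric of q(theta | mu, sigma):
    # diag(sigma^2, sigma^2 / 2).
    mu -= lr * sigma**2 * d_mu
    sigma -= lr * 0.5 * sigma**2 * d_sigma

print(round(mu, 3), round(sigma, 3))  # converges to the target (2, 1)
```

The natural gradient automatically takes larger steps in the mean when the approximation is wide and smaller ones when it is narrow, which is the kind of geometry-aware behavior a plain Euclidean gradient step lacks.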