
Variational Bayes

Variational Bayesian learning [1,5] is based on approximating the posterior distribution $ p(\boldsymbol{\theta}\vert \boldsymbol{X})$ with a tractable approximation $ q(\boldsymbol{\theta}\vert \boldsymbol{\xi})$, where $ \boldsymbol{X}$ is the data, $ \boldsymbol{\theta}$ are the unknown variables (including both the parameters of the model and the latent variables), and $ \boldsymbol{\xi}$ are the variational parameters of the approximation (such as the mean and the variance of a Gaussian variable). The approximation is fitted by maximizing a lower bound on the marginal log-likelihood

$\displaystyle \mathcal{B}(q(\boldsymbol{\theta}\vert \boldsymbol{\xi})) = \left\langle \log \frac{p(\boldsymbol{X}, \boldsymbol{\theta})}{q(\boldsymbol{\theta}\vert \boldsymbol{\xi})} \right\rangle = \log p(\boldsymbol{X}) - D_{\mathrm{KL}}(q(\boldsymbol{\theta}\vert \boldsymbol{\xi}) \Vert p(\boldsymbol{\theta}\vert \boldsymbol{X})),$ (1)

where $ \langle \cdot \rangle$ denotes expectation over $ q$. This is equivalent to minimizing the Kullback-Leibler divergence $ D_{\mathrm{KL}}(q \Vert p)$ between $ q$ and $ p$ [1,5].
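The identity in Eq. (1) can be checked concretely on a hypothetical conjugate toy model (not from the paper): $\theta \sim N(0,1)$, $x \mid \theta \sim N(\theta, 1)$, with a Gaussian approximation $q(\theta) = N(\mu, \sigma^2)$. Here the bound, the evidence, and the KL term all have closed forms, so both sides of the identity can be evaluated and compared:

```python
import math

# Hypothetical toy model (not from the paper): theta ~ N(0, 1), x | theta ~ N(theta, 1).
# Approximation q(theta) = N(mu, var). All terms of Eq. (1) are available in closed form.

def lower_bound(x, mu, var):
    # B = <log p(x, theta) - log q(theta)>_q
    exp_log_joint = (-math.log(2 * math.pi)
                     - 0.5 * ((x - mu) ** 2 + var)   # <log p(x | theta)> part
                     - 0.5 * (mu ** 2 + var))        # <log p(theta)> part
    entropy = 0.5 * math.log(2 * math.pi * math.e * var)
    return exp_log_joint + entropy

def log_evidence(x):
    # Marginalizing theta gives p(x) = N(x; 0, 2)
    return -0.5 * math.log(2 * math.pi * 2.0) - x ** 2 / 4.0

def kl_q_posterior(x, mu, var):
    # Exact posterior is N(x/2, 1/2); KL between two Gaussians in closed form
    m, s2 = x / 2.0, 0.5
    return 0.5 * (math.log(s2 / var) + (var + (mu - m) ** 2) / s2 - 1.0)

x, mu, var = 1.0, 0.3, 0.4
lhs = lower_bound(x, mu, var)
rhs = log_evidence(x) - kl_q_posterior(x, mu, var)
print(abs(lhs - rhs))  # agrees up to floating point rounding
```

Since $\log p(\boldsymbol{X})$ is constant in $\boldsymbol{\xi}$, maximizing the bound is the same as minimizing the KL term, which is what the identity expresses.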

Finding the optimal approximation can be cast as an optimization problem, in which the lower bound $ \mathcal{B}(q(\boldsymbol{\theta}\vert \boldsymbol{\xi}))$ is maximized with respect to the variational parameters $ \boldsymbol{\xi}$. This is often solved using a VB EM algorithm that updates one set of parameters at a time while keeping the others fixed. For conjugate exponential family models, both the VB-E and VB-M steps implicitly make optimal use of the Riemannian structure of $ q(\boldsymbol{\theta}\vert \boldsymbol{\xi})$ [10]. Nevertheless, EM based methods are prone to slow convergence, especially under low noise, even though more elaborate optimization schemes can speed up their convergence somewhat.
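The alternating updates, and their slowdown under low noise, can be illustrated with a minimal mean-field sketch for a hypothetical toy model (not from the paper): $\theta_1, \theta_2 \sim N(0,1)$ a priori, $x \mid \boldsymbol{\theta} \sim N(\theta_1 + \theta_2, \sigma_n^2)$, and a factorized Gaussian $q(\theta_1)q(\theta_2)$ whose means are updated one at a time, VB-EM style. Low noise couples the posterior strongly, so the alternating sweeps contract slowly:

```python
# Hypothetical mean-field toy model (not from the paper):
# theta1, theta2 ~ N(0, 1), x | theta ~ N(theta1 + theta2, noise_var).
# q(theta1) q(theta2) factorized Gaussian; each step maximizes the bound
# w.r.t. one factor's mean while the other factor is held fixed.

def sweeps_to_converge(x, noise_var, tol=1e-6, max_sweeps=100000):
    exact = x / (2.0 + noise_var)          # joint optimum of the mean-field means
    m1 = m2 = 0.0
    for sweep in range(1, max_sweeps + 1):
        m1 = (x - m2) / (1.0 + noise_var)  # update q(theta1) with q(theta2) fixed
        m2 = (x - m1) / (1.0 + noise_var)  # update q(theta2) with q(theta1) fixed
        if abs(m1 - exact) < tol and abs(m2 - exact) < tol:
            return sweep
    return max_sweeps

print(sweeps_to_converge(1.0, 1.0))    # high noise: converges in a handful of sweeps
print(sweeps_to_converge(1.0, 0.01))   # low noise: takes hundreds of sweeps
```

Each sweep contracts the error roughly by a factor $(1 + \sigma_n^2)^{-2}$, which approaches 1 as the noise variance shrinks, so the number of sweeps needed grows without bound.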

Formulating VB as an optimization problem makes it possible to apply generic optimization algorithms to maximize $ \mathcal{B}(q(\boldsymbol{\theta}\vert \boldsymbol{\xi}))$, but this is rarely done in practice because the problems are very high dimensional. Moreover, the parameters interact: the effect of changing one parameter may depend strongly on the values of the others, and generic optimization tools that lack this specific knowledge of the geometry of the problem can be seriously hindered.

Assuming the approximation $ q(\boldsymbol{\theta}\vert \boldsymbol{\xi})$ is Gaussian, it is often enough to apply generic optimization tools only to the mean of the distribution. This is because the negative entropy of a Gaussian $ q(\boldsymbol{\theta}\vert \boldsymbol{\mu}, \mathbf{\Sigma})$ with mean $ \boldsymbol{\mu}$ and covariance $ \mathbf{\Sigma}$ is $ \left\langle \log q(\boldsymbol{\theta}\vert \boldsymbol{\xi}) \right\rangle = - \frac{1}{2} \log \det(2 \pi e \mathbf{\Sigma})$, and thus straightforward differentiation of Eq. (1) yields a fixed point update rule for the covariance

$\displaystyle \mathbf{\Sigma}^{-1} = -2 \nabla_{\mathbf{\Sigma}} \left\langle \log p(\boldsymbol{X}, \boldsymbol{\theta}) \right\rangle.$ (2)

If the covariance is assumed diagonal, the same update rule applies for the diagonal terms.
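The fixed point rule of Eq. (2) can be verified on the same kind of conjugate toy setting (a hypothetical example, not from the paper): for $\theta \sim N(0,1)$, $x \mid \theta \sim N(\theta, 1)$ with scalar $q(\theta) = N(\mu, \sigma^2)$, differentiating $\left\langle \log p(\boldsymbol{X}, \boldsymbol{\theta}) \right\rangle$ with respect to the (here scalar) covariance and applying Eq. (2) recovers the exact posterior variance in one step:

```python
import math

# Hypothetical toy model (not from the paper): theta ~ N(0, 1), x | theta ~ N(theta, 1).
# q(theta) = N(mu, var); the expected log-joint is available in closed form.

def expected_log_joint(x, mu, var):
    # <log p(x, theta)>_q
    return (-math.log(2 * math.pi)
            - 0.5 * ((x - mu) ** 2 + var)
            - 0.5 * (mu ** 2 + var))

def grad_wrt_var(x, mu, var, eps=1e-6):
    # Central finite difference w.r.t. the scalar covariance
    return (expected_log_joint(x, mu, var + eps)
            - expected_log_joint(x, mu, var - eps)) / (2 * eps)

x, mu = 1.0, 0.3
var = 1.0                                   # arbitrary starting value
inv_var = -2.0 * grad_wrt_var(x, mu, var)   # Eq. (2): Sigma^{-1} = -2 * gradient
print(1.0 / inv_var)                        # approx. 0.5, the exact posterior variance
```

Because the expected log-joint is linear in the covariance for this model, the update converges immediately; in general it is iterated alongside updates of the mean.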


Tapani Raiko 2007-09-11