
Cost function

The basic idea in variational Bayesian learning is to minimise the misfit between the exact posterior pdf $ p(\boldsymbol{\Theta}\mid \boldsymbol{X},\mathcal{H})$ and its parametric approximation $ q(\boldsymbol{\Theta})$. The misfit is measured here with the Kullback-Leibler (KL) divergence

$\displaystyle \mathcal{C}_{KL} = D(q(\boldsymbol{\Theta}) \parallel p(\boldsymbol{\Theta}\mid \boldsymbol{X},\mathcal{H})) = \left< \ln \frac{q(\boldsymbol{\Theta})}{p(\boldsymbol{\Theta}\mid \boldsymbol{X}, \mathcal{H})} \right>$     (4.2)
$\displaystyle = \int q(\boldsymbol{\Theta}) \ln \frac{q(\boldsymbol{\Theta})}{p(\boldsymbol{\Theta}\mid \boldsymbol{X}, \mathcal{H})} \, d\boldsymbol{\Theta},$

where the operator $ \left< \cdot \right>$ denotes an expectation over the distribution $ q(\boldsymbol{\Theta})$. Evaluating $ \mathcal{C}_{KL}$ directly would require the exact posterior $ p(\boldsymbol{\Theta}\mid \boldsymbol{X},\mathcal{H}) = p(\boldsymbol{X}, \boldsymbol{\Theta}\mid \mathcal{H}) / p(\boldsymbol{X}\mid \mathcal{H})$, whose normaliser, the marginal likelihood $ p(\boldsymbol{X}\mid \mathcal{H})$, is hard to evaluate. Therefore the cost function $ \mathcal{C}$ that is actually used replaces the posterior with the joint distribution:

$\displaystyle \mathcal{C}= \left< \ln \frac{q(\boldsymbol{\Theta})}{p(\boldsymbol{X}, \boldsymbol{\Theta}\mid \mathcal{H})} \right> = \mathcal{C}_{KL} - \ln p(\boldsymbol{X}\mid \mathcal{H}).$     (4.3)
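Since the KL divergence is non-negative, Equation (4.3) implies that $ -\mathcal{C}$ is a lower bound on the log evidence $ \ln p(\boldsymbol{X}\mid \mathcal{H})$. The following sketch (again my own illustration, not from the thesis) verifies the identity $ \mathcal{C}= \mathcal{C}_{KL} - \ln p(\boldsymbol{X}\mid \mathcal{H})$ numerically on a toy conjugate-Gaussian model, where both the exact posterior and the evidence are tractable; the model, the choice of $ q$, and all numerical values are assumptions made for the example:

import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)

# Toy model (illustrative): x_i ~ N(theta, 1) with prior theta ~ N(0, 1)
x = rng.normal(0.5, 1.0, size=10)
n = len(x)

# Conjugacy makes the exact posterior N(m, s2) and the evidence tractable
s2 = 1.0 / (n + 1.0)
m = s2 * x.sum()
log_evidence = multivariate_normal(mean=np.zeros(n),
                                   cov=np.eye(n) + np.ones((n, n))).logpdf(x)

# Some Gaussian approximation q(theta), deliberately not the exact posterior
q = norm(loc=0.3, scale=0.4)
theta = rng.normal(0.3, 0.4, size=200_000)

# ln p(X, theta | H) = ln p(theta) + sum_i ln p(x_i | theta)
log_joint = norm(0.0, 1.0).logpdf(theta) \
    + norm(theta[:, None], 1.0).logpdf(x).sum(axis=1)

cost = np.mean(q.logpdf(theta) - log_joint)                         # C, Eq. (4.3)
kl = np.mean(q.logpdf(theta) - norm(m, np.sqrt(s2)).logpdf(theta))  # C_KL, Eq. (4.2)

print(cost, kl - log_evidence)   # agree up to Monte Carlo error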

A typical choice of posterior approximation $ q(\boldsymbol{\Theta})$ is a Gaussian with a restricted covariance matrix, that is, all or most of the off-diagonal elements are fixed to zero. Often the posterior approximation is assumed to be a product of independent factors. This factorial approximation, combined with a factorisation of the joint probability as in Equation (3.1), splits the cost function in Equation (4.3) into a sum of simple terms, as sketched below, and thus leads to a relatively low computational complexity.
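To make the decomposition concrete, consider as an example a model with two independent parameters, $ p(\boldsymbol{X}, \theta_1, \theta_2 \mid \mathcal{H}) = p(\boldsymbol{X}\mid \theta_1, \theta_2) p(\theta_1) p(\theta_2)$, and a factorial approximation $ q(\boldsymbol{\Theta}) = q_1(\theta_1) q_2(\theta_2)$. Substituting these into Equation (4.3) gives

$\displaystyle \mathcal{C}= \left< \ln q_1(\theta_1) \right> + \left< \ln q_2(\theta_2) \right> - \left< \ln p(\theta_1) \right> - \left< \ln p(\theta_2) \right> - \left< \ln p(\boldsymbol{X}\mid \theta_1, \theta_2) \right>,$

where all but the last expectation depend on a single factor $ q_i$ only, and for Gaussian factors each term has a simple closed form.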

Miskin and MacKay (2001) used VB learning for ICA (see Section 3.1.4). They compared two posterior approximations: the first was a Gaussian with a full covariance matrix, and the second a Gaussian with a diagonal covariance matrix. They found that the factorial approximation is computationally more efficient, still gives a bound on the evidence, and does not suffer from overfitting. On the other hand, Ilin and Valpola (2005) showed that the factorial approximation favours solutions with an orthogonal mixing matrix, which can deteriorate the performance.

