Next: Nonlinear factor analysis and Up: Missing Values in Hierarchical Previous: Introduction

Variational Bayesian learning for nonlinear models

Variational Bayesian (VB) learning techniques approximate the true posterior probability density of the unknown variables of the model by a function with a restricted form. Currently the most common technique is ensemble learning [8], in which the Kullback-Leibler divergence measures the misfit between the approximation and the true posterior. It has been applied to ICA and a wide variety of other models (see [1,9] for some references).

In ensemble learning, the posterior approximation $ q(\boldsymbol{\theta})$ of the unknown variables $ \boldsymbol{\theta}$ is required to have a suitably factorial form $ q(\boldsymbol{\theta}) = \prod_i q_i(\boldsymbol{\theta}_i)$, where $ \boldsymbol{\theta}_i$ are the subsets of unknown variables. The misfit between the true posterior $ p(\boldsymbol{\theta}\mid \mathbf{X})$ and its approximation $ q(\boldsymbol{\theta})$ is measured by the Kullback-Leibler divergence. An additional term $ -\log p(\mathbf{X})$ is included so that the model evidence $ p(\mathbf{X})=\int p(\mathbf{X},\boldsymbol{\theta}) d\boldsymbol{\theta}$ never has to be computed. The cost function is

$\displaystyle \mathcal{C}= D( q(\boldsymbol{\theta}) \parallel p(\boldsymbol{\theta}\mid \mathbf{X}) ) - \log p(\mathbf{X}) = \left< \log \frac{ q(\boldsymbol{\theta}) }{ p(\mathbf{X},\boldsymbol{\theta}) } \right> \, ,$ (2)

where $ \left< \cdot \right>$ denotes the expectation over the distribution $ q(\boldsymbol{\theta})$. Since $ D( q \parallel p) \geq 0$, the cost function yields a lower bound on the model evidence: $ p(\mathbf{X}) \geq \exp(-\mathcal{C})$. For a more detailed discussion, see [9].
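
The relation between the cost function, the Kullback-Leibler divergence, and the evidence bound can be checked numerically on a small discrete example. The sketch below is only an illustration: the joint probabilities in `joint` and the approximation `q` are made-up values, and the names `cost` and `kl` are hypothetical helpers, not part of any model discussed here.

```python
import math

# Hypothetical toy model: theta takes three discrete values and the data X
# is fixed, so joint[k] stands for p(X, theta_k). The numbers are invented.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                      # p(X) = sum_k p(X, theta_k)
posterior = [j / evidence for j in joint]  # p(theta_k | X)


def cost(q):
    """C = < log q(theta) / p(X, theta) >, expectation taken over q."""
    return sum(qk * math.log(qk / jk) for qk, jk in zip(q, joint) if qk > 0)


def kl(q, p):
    """Kullback-Leibler divergence D(q || p) for discrete distributions."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)


q = [0.5, 0.3, 0.2]  # an arbitrary approximation of the posterior
c = cost(q)

# C equals D(q || p(theta | X)) - log p(X), so exp(-C) never exceeds p(X).
print(c, kl(q, posterior) - math.log(evidence))
print(math.exp(-c), "<=", evidence)
```

Setting `q` equal to `posterior` makes the divergence vanish, and the bound then becomes tight: `cost(posterior)` equals `-math.log(evidence)` up to rounding.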

Missing values in the data behave like other latent variables and are therefore handled as part of $ \boldsymbol{\theta}$ instead of $ \mathbf{X}$. The posterior approximation $ q(\boldsymbol{\theta})$ is estimated during learning, and its part covering the missing values can be used as their reconstruction. The fraction of missing values in the data does not affect computational complexity substantially.
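
The idea of treating missing values as latent variables can be illustrated with a deliberately simple stand-in model. The sketch below assumes a linear one-source Gaussian model with known parameters, $ x_i = a_i s + n_i$ with $ s \sim N(0,1)$ and noise variances $ v_i$; this is not the nonlinear model of this paper, and in this special case the posterior is exact, so no approximation is needed. The function names and the numbers are hypothetical.

```python
def posterior_of_source(x, a, v):
    """Posterior mean and variance of the source s given observed entries.

    Entries of x equal to None are missing and simply contribute no
    likelihood term, so the work grows with the number of observed values.
    """
    observed = [i for i, xi in enumerate(x) if xi is not None]
    precision = 1.0 + sum(a[i] ** 2 / v[i] for i in observed)
    mean = sum(a[i] * x[i] / v[i] for i in observed) / precision
    return mean, 1.0 / precision


def reconstruct_missing(x, a, v):
    """Posterior mean and variance of each missing entry of x."""
    s_mean, s_var = posterior_of_source(x, a, v)
    return {
        i: (a[i] * s_mean, a[i] ** 2 * s_var + v[i])
        for i, xi in enumerate(x)
        if xi is None
    }


a = [1.0, 2.0, 0.5]   # assumed known mixing weights
v = [1.0, 1.0, 1.0]   # assumed known noise variances
x = [1.0, None, 0.5]  # the second component is missing

recon = reconstruct_missing(x, a, v)
print(recon)  # maps each missing index to (mean, variance)
```

The reconstruction comes with a variance as well as a mean, which reflects that the posterior over a missing value is a full distribution rather than a point estimate.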

Beal and Ghahramani [10] compare the VB method of handling incomplete data to annealed importance sampling (AIS). In their example, the variational method works more reliably and about 100 times faster than AIS. Chan et al. [11] successfully used ICA with VB learning to reconstruct missing values. A competing approach without VB by Welling and Weber [12] has exponential complexity with respect to the data dimensionality. ICA can be seen as FA with a non-Gaussian source model. Rather than going in that direction, we keep the Gaussian source model and concentrate on extending the mapping to be nonlinear.

Tapani Raiko 2003-07-01