

Variational Bayesian Learning

Variational Bayesian (VB) learning provides even stronger tools against overfitting. The VB version of PCA [10] approximates the joint posterior of the unknown quantities with a simple multivariate distribution. Each model parameter is described a posteriori by an independent Gaussian distribution: $ q(a_{ik}) = \mathcal{N}\left(a_{ik};\overline{a}_{ik},\widetilde{a}_{ik}\right)$ and $ q(s_{kj}) = \mathcal{N}\left(s_{kj};\overline{s}_{kj},\widetilde{s}_{kj}\right)$, where $ \overline{a}_{ik}$ and $ \overline{s}_{kj}$ denote the means of the solution and $ \widetilde{a}_{ik}$ and $ \widetilde{s}_{kj}$ denote the variances of the parameters. The means $ \overline{a}_{ik}$, $ \overline{s}_{kj}$ can then be used as point estimates of the parameters, while the variances $ \widetilde{a}_{ik}$, $ \widetilde{s}_{kj}$ define the reliability of the estimates (or credible regions). A direct extension of the method in [10] to missing values can be computationally very demanding. VB-PCA has been used to reconstruct missing values in [11,12] with algorithms that complete the data matrix, which is also very inefficient when a large part of the data is missing. In this article, we implement VB learning using a gradient-based procedure similar to the subspace learning algorithm described in Section 3.
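As a concrete illustration of this parameterization, the following sketch stores the variational parameters as arrays for a $d \times n$ data matrix with a boolean observation mask. The names (Abar, Atil, Sbar, Stil, vx, vs) and the initialization choices are our own illustrative assumptions, not taken from [10-12].

import numpy as np

def init_vb_pca(X, M, c, seed=0):
    # X: d x n data matrix (missing entries may hold anything, e.g. NaN)
    # M: d x n boolean mask, True where x_ij is observed
    # c: number of principal components
    rng = np.random.default_rng(seed)
    d, n = X.shape
    Abar = 0.01 * rng.standard_normal((d, c))  # posterior means of a_ik
    Atil = np.ones((d, c))                     # posterior variances of a_ik
    Sbar = 0.01 * rng.standard_normal((c, n))  # posterior means of s_kj
    Stil = np.ones((c, n))                     # posterior variances of s_kj
    vx = float(np.var(X[M]))                   # noise variance v_x
    vs = np.ones(c)                            # prior variances v_sk for rows of S
    return Abar, Atil, Sbar, Stil, vx, vs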

By applying the framework described in [12] to the model in Eqs. (7-8), the cost function becomes:

$\displaystyle C_\text{KL} = \operatorname{E}_q\left\{ \ln\frac{q(\mathbf{A},\mathbf{S})}{p(\mathbf{X},\mathbf{A},\mathbf{S})} \right\} = \sum_{(i,j)\in O} C_{xij} + \sum_{i=1}^d\sum_{k=1}^c C_{aik} + \sum_{k=1}^c\sum_{j=1}^n C_{skj} \, ,$ (10)

where individual terms are

$\displaystyle C_{xij} = \frac{\left(x_{ij}-\sum_{k=1}^c \overline{a}_{ik}\overline{s}_{kj}\right)^2 + \sum_{k=1}^c\left(\overline{a}_{ik}^2\widetilde{s}_{kj} + \widetilde{a}_{ik}\overline{s}_{kj}^2 + \widetilde{a}_{ik}\widetilde{s}_{kj}\right)}{2 v_x} + \frac{\ln 2 \pi v_x}{2} \, ,$
$\displaystyle C_{aik} = \frac{\overline{a}_{ik}^2+\widetilde{a}_{ik}}{2} - \frac{1}{2}\ln \widetilde{a}_{ik} - \frac{1}{2} \, ,$
$\displaystyle C_{skj} = \frac{\overline{s}_{kj}^2+\widetilde{s}_{kj}}{2 v_{sk}} - \frac{1}{2}\ln \frac{\widetilde{s}_{kj}}{v_{sk}} - \frac{1}{2} \, .$
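For reference, these terms can be evaluated with a few dense matrix products. The sketch below assumes the array layout of the earlier initialization sketch (Abar, Atil are $d \times c$; Sbar, Stil are $c \times n$; M is the boolean observation mask) and is only a restatement of the formulas above, not the authors' implementation.

import numpy as np

def vb_pca_cost(X, M, Abar, Atil, Sbar, Stil, vx, vs):
    # Evaluates C_KL of Eq. (10) from the terms C_xij, C_aik, C_skj above.
    R = np.where(M, X, 0.0) - Abar @ Sbar                 # x_ij - sum_k abar_ik sbar_kj
    second = (Abar**2) @ Stil + Atil @ (Sbar**2) + Atil @ Stil
    Cx = (R**2 + second) / (2.0 * vx) + 0.5 * np.log(2.0 * np.pi * vx)
    Ca = 0.5 * (Abar**2 + Atil) - 0.5 * np.log(Atil) - 0.5
    Cs = ((Sbar**2 + Stil) / (2.0 * vs[:, None])
          - 0.5 * np.log(Stil / vs[:, None]) - 0.5)
    return Cx[M].sum() + Ca.sum() + Cs.sum()              # only observed C_xij contribute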

We update $ \widetilde{a}$ and $ \widetilde{s}$ to minimize the cost by setting the gradient of the cost function to zero:

$\displaystyle \widetilde{a}_{ik} \leftarrow \left[1+ \sum_{j\mid (i,j)\in O} \frac{\overline{s}_{kj}^2+\widetilde{s}_{kj}}{v_x} \right]^{-1} \, , \qquad \widetilde{s}_{kj} \leftarrow \left[\frac{1}{v_{sk}}+ \sum_{i\mid (i,j)\in O} \frac{\overline{a}_{ik}^2+\widetilde{a}_{ik}}{v_x} \right]^{-1} \, .$ (11)
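In matrix form, the fixed-point updates (11) amount to masked sums over the observed entries. The sketch below follows the same array conventions as above; updating $ \widetilde{s}$ with the freshly updated $ \widetilde{a}$ is one possible ordering, not necessarily the authors' choice.

import numpy as np

def update_variances(M, Abar, Atil, Sbar, Stil, vx, vs):
    # Fixed-point updates of Eq. (11); the sums run only over observed entries.
    Mf = M.astype(float)                        # d x n observation mask as 0/1
    Atil = 1.0 / (1.0 + Mf @ (Sbar**2 + Stil).T / vx)
    Stil = 1.0 / (1.0 / vs[:, None] + (Abar**2 + Atil).T @ Mf / vx)
    return Atil, Stil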

The gradients for learning $ \overline{a}$ and $ \overline{s}$ are somewhat similar to (4):

$\displaystyle \frac{\partial C_\text{KL}}{\partial \overline{a}_{il}} = \overline{a}_{il} + \sum_{j\mid (i,j)\in O} \frac{-\left(x_{ij}-\sum_{k=1}^c \overline{a}_{ik}\overline{s}_{kj}\right) \overline{s}_{lj} + \overline{a}_{il} \widetilde{s}_{lj}}{v_x} \, ,$ (12)
$\displaystyle \frac{\partial C_\text{KL}}{\partial \overline{s}_{lj}} = \frac{\overline{s}_{lj}}{v_{sl}} + \sum_{i\mid (i,j)\in O} \frac{-\left(x_{ij}-\sum_{k=1}^c \overline{a}_{ik}\overline{s}_{kj}\right) \overline{a}_{il} + \widetilde{a}_{il} \overline{s}_{lj}}{v_x} \, .$ (13)
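The gradients (12)-(13) can likewise be accumulated over the observed entries with masked matrix products. The sketch below again assumes the array conventions introduced earlier.

import numpy as np

def gradients(X, M, Abar, Atil, Sbar, Stil, vx, vs):
    # Gradients of C_KL w.r.t. the posterior means, Eqs. (12) and (13).
    Mf = M.astype(float)
    R = np.where(M, X - Abar @ Sbar, 0.0)       # residuals, zero at missing entries
    gA = Abar + (-R @ Sbar.T + Abar * (Mf @ Stil.T)) / vx
    gS = Sbar / vs[:, None] + (-Abar.T @ R + (Atil.T @ Mf) * Sbar) / vx
    return gA, gS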

We can use the speed-up described in Section 3 by computing the second derivatives. These derivatives happen to coincide with the inverses of the updated variances given in (11): $ \partial^2 C_\text{KL} / \partial \overline{a}_{il}^2 = \widetilde{a}_{il}^{-1}$ and $ \partial^2 C_\text{KL} / \partial \overline{s}_{lj}^2 = \widetilde{s}_{lj}^{-1}$. The $ v_x$ and $ v_{sk}$ parameters are updated to minimize the cost $ C_\text{KL}$ with all the other parameters fixed. The complete algorithm works by alternating four update steps: $ \{\widetilde{a}\}$, $ \{\widetilde{s}\}$, $ \{\overline{a},\overline{s}\}$, and $ \{v_x, v_{sk}\}$.
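The sketch below puts the four steps together in one loop. The damping factor gamma, the exact ordering of the steps, and the diagonal Newton step $ \overline{a} \leftarrow \overline{a} - \gamma\,\widetilde{a}\,\partial C_\text{KL}/\partial\overline{a}$ are our own simplifications standing in for the speed-up of Section 3; the closed-form updates for $ v_x$ and $ v_{sk}$ are the stationary points of the cost terms given above with the other parameters fixed.

import numpy as np

def vb_pca_fit(X, M, Abar, Atil, Sbar, Stil, vx, vs, n_iter=100, gamma=0.5):
    # X: d x n data, M: d x n boolean mask of observed entries.
    _, n = X.shape
    Mf = M.astype(float)
    X0 = np.where(M, X, 0.0)                    # zero out missing entries
    for _ in range(n_iter):
        # steps 1-2: posterior variances, Eq. (11)
        Atil = 1.0 / (1.0 + Mf @ (Sbar**2 + Stil).T / vx)
        Stil = 1.0 / (1.0 / vs[:, None] + (Abar**2 + Atil).T @ Mf / vx)
        # step 3: posterior means; gradients (12)-(13) scaled by the inverse
        # second derivatives, which equal the variances Atil and Stil
        R = np.where(M, X0 - Abar @ Sbar, 0.0)
        gA = Abar + (-R @ Sbar.T + Abar * (Mf @ Stil.T)) / vx
        gS = Sbar / vs[:, None] + (-Abar.T @ R + (Atil.T @ Mf) * Sbar) / vx
        Abar = Abar - gamma * Atil * gA
        Sbar = Sbar - gamma * Stil * gS
        # step 4: v_x and v_sk at the stationary points of the cost terms
        R = np.where(M, X0 - Abar @ Sbar, 0.0)
        second = (Abar**2) @ Stil + Atil @ (Sbar**2) + Atil @ Stil
        vx = ((R**2 + second) * Mf).sum() / Mf.sum()
        vs = (Sbar**2 + Stil).sum(axis=1) / n
    return Abar, Atil, Sbar, Stil, vx, vs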

