Variational Bayesian (VB) learning [1,5] is based on approximating the posterior distribution $p(\theta \mid X)$ with a tractable approximation $q(\theta \mid \xi)$, where $X$ is the data, $\theta$ are the unknown variables (including both the parameters of the model and the latent variables), and $\xi$ are the variational parameters of the approximation (such as the mean and the variance of a Gaussian variable). The approximation is fitted by maximizing a lower bound on the marginal log-likelihood
$$
\mathcal{B}(\xi) = \left\langle \log p(X, \theta) \right\rangle_{q(\theta \mid \xi)} - \left\langle \log q(\theta \mid \xi) \right\rangle_{q(\theta \mid \xi)} = \log p(X) - D_{\mathrm{KL}}\!\left( q(\theta \mid \xi) \,\|\, p(\theta \mid X) \right). \tag{1}
$$
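For a concrete illustration, the sketch below evaluates this bound in closed form for a simple conjugate toy model (a Gaussian mean with a Gaussian prior and known noise variance). The model, the function `lower_bound`, and the parameter names `m`, `s2`, and `sigma2` are assumptions made for illustration only, not models or notation from this paper. Because the exact posterior is Gaussian and therefore lies within the approximating family, the bound meets the exact marginal log-likelihood at the optimal variational parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy conjugate model (assumed for illustration):
#   theta ~ N(0, 1),  x_i | theta ~ N(theta, sigma2),  i = 1..N
# Gaussian approximation q(theta | xi) = N(theta | m, s2) with xi = (m, s2).

def lower_bound(m, s2, x, sigma2=1.0):
    """Evaluate B(xi) = <log p(X, theta)>_q - <log q(theta)>_q in closed form."""
    n = len(x)
    # <log p(X | theta)>_q : uses <(x_i - theta)^2>_q = (x_i - m)^2 + s2
    exp_loglik = -0.5 * n * np.log(2 * np.pi * sigma2) \
                 - 0.5 * np.sum((x - m) ** 2 + s2) / sigma2
    # <log p(theta)>_q with the N(0, 1) prior: uses <theta^2>_q = m^2 + s2
    exp_logprior = -0.5 * np.log(2 * np.pi) - 0.5 * (m ** 2 + s2)
    # Entropy of q: -<log q(theta)>_q = 0.5 * log(2 * pi * e * s2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2)
    return exp_loglik + exp_logprior + entropy

rng = np.random.default_rng(0)
x = rng.normal(1.5, 1.0, size=20)

# Exact posterior (conjugate case): q at the optimum matches it exactly.
s2_opt = 1.0 / (1.0 + len(x))
m_opt = s2_opt * np.sum(x)

# Marginal likelihood of the toy model: X ~ N(0, 1*1^T + sigma2*I).
cov = np.ones((len(x), len(x))) + np.eye(len(x))
log_evidence = multivariate_normal(np.zeros(len(x)), cov).logpdf(x)

print(lower_bound(0.0, 1.0, x))        # a loose lower bound
print(lower_bound(m_opt, s2_opt, x))   # equals log p(X) at the optimum
print(log_evidence)
```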
Finding the optimal approximation can be seen as an optimization problem, where the lower bound $\mathcal{B}(\xi)$ is maximized with respect to the variational parameters $\xi$. This is often solved using a VB EM algorithm that updates sets of parameters alternately while keeping the others fixed. For conjugate exponential family models, both the VB-E and VB-M steps implicitly make optimal use of the Riemannian structure of $q(\theta \mid \xi)$ [10]. Nevertheless, EM-based methods are prone to slow convergence, especially under low noise, even though more elaborate optimization schemes can speed up their convergence somewhat.
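The following sketch illustrates such an alternating coordinate-ascent scheme on a standard conjugate toy model (Gaussian data with a Normal-Gamma prior and a factorized Gaussian-Gamma approximation). The model, priors, and variable names are assumptions chosen for illustration; this is not one of the models or algorithms evaluated in this paper.

```python
import numpy as np

# Data x_i ~ N(mu, 1/tau) with priors mu | tau ~ N(mu0, 1/(lam0*tau)) and
# tau ~ Gamma(a0, b0); factorized approximation q(mu) q(tau) =
# N(mu | m, 1/lam) Gamma(tau | a, b). Each block is updated with the other fixed.

rng = np.random.default_rng(1)
x = rng.normal(2.0, 0.5, size=50)
N, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3   # broad priors (assumed values)
m, lam, a, b = 0.0, 1.0, 1.0, 1.0          # initial variational parameters

for it in range(100):
    # Update q(mu) with q(tau) fixed (uses E[tau] = a / b).
    e_tau = a / b
    m = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam = (lam0 + N) * e_tau

    # Update q(tau) with q(mu) fixed (uses second moments under q(mu)).
    a = a0 + 0.5 * (N + 1)
    e_sq = np.sum((x - m) ** 2 + 1.0 / lam) + lam0 * ((m - mu0) ** 2 + 1.0 / lam)
    b = b0 + 0.5 * e_sq

print("posterior mean of mu ~", m, "  posterior mean of tau ~", a / b)
```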
The formulation of VB as an optimization problem allows applying generic optimization algorithms to maximize $\mathcal{B}(\xi)$, but this is rarely done in practice because the problems are quite high dimensional. Additionally, the values of some parameters can change how other parameters affect the bound, and a generic optimizer that lacks this specific knowledge of the geometry of the problem can be seriously hindered.
Assuming the approximation $q(\theta \mid \xi)$ is Gaussian, it often suffices to use generic optimization tools to update only the mean of the distribution. This is because the negative entropy of a Gaussian $q(\theta \mid \mu, \Sigma)$ with mean $\mu$ and covariance $\Sigma$ is $\langle \log q(\theta \mid \mu, \Sigma) \rangle = -\tfrac{1}{2} \log \det(2 \pi e \Sigma)$, and thus straightforward differentiation of Eq. (1) yields a fixed point update rule for the covariance
$$
\Sigma^{-1} = -2\, \nabla_{\Sigma} \left\langle \log p(X, \theta) \right\rangle_{q(\theta \mid \xi)}. \tag{2}
$$
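As a quick numerical check of this fixed point update, the sketch below uses a linear-Gaussian toy model (an assumption for illustration, not one of the models studied here), where $\nabla_{\Sigma} \langle \log p(X, \theta) \rangle$ is available in closed form and the update recovers the exact posterior covariance in a single step.

```python
import numpy as np

# Linear-Gaussian toy model (assumed): theta ~ N(0, I), x = A theta + noise,
# noise ~ N(0, sigma2 * I). With Gaussian q(theta | mu, Sigma),
#   <log p(X, theta)>_q = const - ||x - A mu||^2 / (2 sigma2)
#                         - tr(A Sigma A^T) / (2 sigma2) - (||mu||^2 + tr(Sigma)) / 2,
# so grad_Sigma <log p(X, theta)>_q = -A^T A / (2 sigma2) - I / 2, and the fixed
# point Sigma^{-1} = -2 grad_Sigma <log p(X, theta)>_q equals the exact
# posterior precision A^T A / sigma2 + I.

rng = np.random.default_rng(2)
d, n, sigma2 = 3, 10, 0.1
A = rng.normal(size=(n, d))
x = A @ rng.normal(size=d) + np.sqrt(sigma2) * rng.normal(size=n)

grad_Sigma = -A.T @ A / (2 * sigma2) - np.eye(d) / 2
Sigma = np.linalg.inv(-2 * grad_Sigma)             # fixed point update for the covariance

exact_posterior_cov = np.linalg.inv(A.T @ A / sigma2 + np.eye(d))
print(np.allclose(Sigma, exact_posterior_cov))      # True
```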