One of the distinguishing factors between different Bayesian approaches is the type of posterior approximation. We have concentrated on large unsupervised learning tasks, where point estimates are too prone to overfitting and the sampling used in MCMC methods and particle filters is often too slow. The problems tackled with particle filters in (Doucet et al., 2001) vary from 1 to 10 in dimensionality, whereas the latent space in Section 7.3 is 128-dimensional. Variational Bayesian learning seems to provide a good compromise between point estimates and sampling methods.
Often the posterior distribution consists of clusters or solution modes. Whether only one of the clusters or all of them are modelled again depends on the posterior approximation. In our case, the expectation in the cost function is taken over the approximate distribution $q(\boldsymbol{\theta})$, which in practice leads to modelling a single mode. In expectation propagation (Minka, 2001), the Kullback-Leibler divergence is formed in the opposite direction (both directions are written out below), leading to modelling of all the modes. Sampling methods likewise aim to cover all the modes. For the purpose of finding a single good representative of the posterior probability mass, the first approach should be better. In fact, the expectation over the true posterior, also known as the Bayes estimate, is often degenerate due to symmetry. For instance, in factor analysis type models the posterior is symmetric with respect to the permutation of factors. The number of such permutations also gives a hint of the infeasibility of accurately modelling all the modes in a high-dimensional problem. Perhaps it would be best to find one mode for the parameters, but all modes for the time-dependent variables when feasible.
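The difference can be made explicit by the direction of the Kullback-Leibler divergence. Written in generic notation (not tied to the notation used earlier in this paper), variational Bayesian learning minimises the divergence from the approximation $q$ to the true posterior, whereas expectation propagation is based on divergences of the opposite direction:
\[
\mathcal{C}_{\mathrm{VB}} = D_{\mathrm{KL}}\big(q(\boldsymbol{\theta}) \,\big\|\, p(\boldsymbol{\theta} \mid \boldsymbol{X})\big)
= \int q(\boldsymbol{\theta}) \ln \frac{q(\boldsymbol{\theta})}{p(\boldsymbol{\theta} \mid \boldsymbol{X})} \, d\boldsymbol{\theta},
\qquad
\mathcal{C}_{\mathrm{EP}} = D_{\mathrm{KL}}\big(p(\boldsymbol{\theta} \mid \boldsymbol{X}) \,\big\|\, q(\boldsymbol{\theta})\big).
\]
The first form heavily penalises $q$ for placing probability mass where the true posterior has little, so the optimum concentrates on a single mode; the second form penalises $q$ for missing mass of the true posterior, so the optimum spreads over all the modes.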
Variational Bayesian methods vary further depending on the form of the posterior approximation. In this paper, all variables are assumed to be independent a posteriori, and the individual distributions are modelled as Gaussians. Often conjugate distributions are used instead; for instance, the precision (inverse variance) of a Gaussian variable is modelled with a Gamma distribution. Conjugate distributions are accurate and in some sense practical, but by restricting ourselves to Gaussians, the nodes can be connected more freely, allowing for example hierarchical modelling of variances. It should be noted that the effect of assuming independencies is far more significant than the effect of the approximations made in modelling the individual distributions.
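As an illustration of the added flexibility, the variance of a Gaussian variable can be parameterised through another Gaussian variable, for example via an exponential (a generic sketch, not necessarily the exact parameterisation used in this paper):
\[
x \sim \mathcal{N}\big(m, \; e^{-v}\big), \qquad
v \sim \mathcal{N}\big(\mu_v, \; e^{-u}\big),
\]
so the log-variance of $x$ is itself a Gaussian variable $v$, which can in turn have Gaussian parents such as $u$, yielding a hierarchical variance model. With a Gamma distribution on the precision, stacking a further layer on top of the variance in this manner would fall outside the conjugate family.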
The scope of this paper has been restricted to models which can be learned using purely local computations. This is possible if the parents of each node are independent a posteriori, which can be accomplished by using a factorial posterior approximation and by not allowing multiple computational paths between variables. Purely local computations result in a computational complexity that is linear with respect to the number of connections in the model. In small models, one could afford to take all the posterior dependencies into account. In larger models, it might be desirable to model posterior dependencies within disjoint groups of variables but to assume the groups statistically independent, as is done by Winn and Bishop (2005).
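The following sketch illustrates what purely local computations mean in practice. The Node class and its update rule are hypothetical placeholders rather than the actual implementation described in this paper; the point is that each update only reads the node's Markov blanket, so one sweep costs time proportional to the number of connections.

# Sketch: one variational update sweep using only local information.
# Node and update_posterior() are illustrative placeholders; the real
# update rules depend on the node type (Gaussian, addition, product, ...).

class Node:
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)    # incoming connections
        self.children = []              # outgoing connections
        for p in self.parents:
            p.children.append(self)
        self.mean, self.var = 0.0, 1.0  # posterior statistics of q(node)

    def update_posterior(self):
        # Reads only its Markov blanket: the posterior statistics of its
        # parents and of its children. No global state is touched.
        parent_stats = [(p.mean, p.var) for p in self.parents]
        child_stats = [(c.mean, c.var) for c in self.children]
        # ... node-type-specific minimisation of the cost w.r.t. q(node) ...
        _ = (parent_stats, child_stats)  # placeholder for the real rule

def sweep(nodes):
    # One pass over all nodes; the total work is proportional to the
    # number of connections, i.e. linear in the size of the model.
    for node in nodes:
        node.update_posterior()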
In our experience, an almost maximally factorial posterior pdf approximation suffices in many cases. A good model structure usually seems to be more important than a good approximation of the posterior pdf of the model. Therefore the available computation time is often better invested in a larger model with a simple posterior approximation. In any case, density estimates of continuous-valued latent variables offer an important advantage over point estimates: they are robust against overfitting and provide a cost function suitable for learning model structures. With variational Bayesian learning employing a factorial posterior pdf approximation, the density estimates are almost as efficient as point estimates. Moreover, latent variable models often exhibit rotational and other invariances which variational Bayesian learning can exploit by choosing a solution where the factorial approximation is most accurate.
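The reason the density estimates support structure learning is that the variational cost upper-bounds the negative log evidence of the model structure $\mathcal{H}$. In generic notation (which may differ from the notation used earlier in this paper),
\[
\mathcal{C} = \int q(\boldsymbol{\theta}) \ln \frac{q(\boldsymbol{\theta})}{p(\boldsymbol{X}, \boldsymbol{\theta} \mid \mathcal{H})} \, d\boldsymbol{\theta}
= D_{\mathrm{KL}}\big(q(\boldsymbol{\theta}) \,\big\|\, p(\boldsymbol{\theta} \mid \boldsymbol{X}, \mathcal{H})\big) - \ln p(\boldsymbol{X} \mid \mathcal{H})
\;\geq\; -\ln p(\boldsymbol{X} \mid \mathcal{H}),
\]
so comparing the converged costs of two candidate structures approximates comparing their evidences, something a point estimate of $\boldsymbol{\theta}$ cannot offer.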
The basic algorithm for learning and inference is based on updating one variable at a time while keeping the other variables fixed. It has the benefits of being completely local and guaranteed to converge. A drawback is that the flow of information through time can be slow when performing inference in a dynamical model. There are alternative inference algorithms where the updates are carried out in forward and backward sweeps, including particle smoothing (Doucet et al., 2001), extended Kalman smoothing (Anderson and Moore, 1979), and expectation propagation (Minka, 2001). When the model needs to be learned at the same time, many iterations are required anyway, so the variational Bayesian algorithm that makes small but consistent improvements at every sweep may be preferable.
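The outer loop can be sketched as follows. The function names are placeholders: sweep() stands for one cyclic pass of single-variable updates and global_cost() for the variational cost over all variables of the model at hand.

# Sketch of the basic learning/inference loop: repeated sweeps of
# single-variable updates, each of which can only decrease the global
# cost, so the iteration is guaranteed to converge.
# sweep() and global_cost() are hypothetical placeholders.

def fit(nodes, global_cost, sweep, max_sweeps=10000, tol=1e-6):
    cost = global_cost(nodes)
    for _ in range(max_sweeps):
        sweep(nodes)                   # update each variable in turn
        new_cost = global_cost(nodes)  # cost over all the variables
        if cost - new_cost < tol:      # improvements are small but consistent
            break
        cost = new_cost
    return cost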
It is an important design choice that each node is updated while keeping the other nodes fixed. If new node types are added later on, there is no need to change the global learning algorithm; it suffices to design an update rule for the new node type. There is also the option of updating some nodes more often than others. When different parameters are coupled and cyclic updating is therefore slow, it can be sped up by a line search, as described by Honkela et al. (2003). Note that all the updates are done in order to minimise a global cost function, that is, a cost function over all the variables. Expectation propagation (Minka, 2001) also updates one approximation at a time. An important difference is that its updates are based on a local cost function, that is, the local approximation is fitted to the local true posterior assuming that the rest of the approximation is accurate. This is the reason why expectation propagation may diverge.
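The acceleration idea can be sketched as follows: after an ordinary cyclic sweep, extrapolate further along the direction of the combined parameter change and keep the step length that gives the lowest global cost. This is a hedged sketch of the general idea with hypothetical helper functions, not the exact procedure of Honkela et al. (2003).

# Sketch of accelerating cyclic updates by a line search along the
# direction of the change produced by one sweep. get_params(),
# set_params(), sweep() and global_cost() are hypothetical placeholders
# operating on a flat list of posterior parameters.

def accelerated_sweep(nodes, get_params, set_params, sweep, global_cost,
                      step_lengths=(2.0, 4.0, 8.0)):
    old = get_params(nodes)
    sweep(nodes)                       # ordinary cyclic update
    new = get_params(nodes)
    direction = [n - o for n, o in zip(new, old)]

    best_cost, best = global_cost(nodes), new
    for a in step_lengths:             # try extrapolating further
        candidate = [o + a * d for o, d in zip(old, direction)]
        set_params(nodes, candidate)
        c = global_cost(nodes)
        if c < best_cost:
            best_cost, best = c, candidate
    set_params(nodes, best)            # keep the best point found
    return best_cost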
Large nonlinear problems often have numerous suboptimal local solutions that should be avoided. We have used a number of tricks to avoid them, as discussed in Section 6.2. Which tricks work best depends on the application. Making this procedure as simple as possible for the user is an important direction for future work.