

Introduction

Various generative modelling approaches have provided powerful statistical learning methods for neural networks and graphical models in recent years. Such methods aim at finding an appropriate model which explains the internal structure or regularities found in the observations. It is assumed that these regularities are caused by certain latent variables (also called factors, sources, hidden variables, or hidden causes) which have generated the observed data through an unknown mapping Bishop99LVM. In unsupervised learning, the goal is to identify both the unknown latent variables and the generative mapping, while in supervised learning it suffices to estimate the generative mapping.

The expectation-maximisation (EM) algorithm has often been used for learning latent variable models Bishop95,Bishop99LVM,Jordan99,JordSejn01. The distribution of the latent variables is modelled, but the model parameters are found using maximum likelihood or maximum a posteriori estimators. With such point estimates, however, determining the correct model order and avoiding overfitting are ubiquitous and often difficult problems. Therefore, fully Bayesian approaches making use of the complete posterior distribution have recently gained a lot of attention. Exact treatment of the posterior distribution is intractable except in simple toy problems, and hence one must resort to suitable approximations. The so-called Laplace approximation method MacKay92,Bishop95 employs a Gaussian approximation around the peak of the posterior distribution. However, this method still suffers from overfitting. In real-world problems it often does not perform adequately, and it has therefore largely given way to better alternatives. Among them, Markov chain Monte Carlo (MCMC) techniques MacKay99,MacKay03,Neal96 are popular in supervised learning tasks, providing good estimation results. Unfortunately, their computational load is high, which restricts the use of MCMC in large-scale unsupervised learning problems where the parameters and variables to be estimated are numerous. For instance, Rowe03book presents a case study in unsupervised learning from brain imaging data: he used MCMC for a scaled-down toy example but resorted to point estimates with real data.

Ensemble learning Hinton93COLT,MacKay95Ens,MacKay03,Barber98NNML,Lappal-Miskin00, which is one of the variational Bayesian methods Jordan-etal99,attias00variational,JordSejn01, has gained increasing attention in recent years. This is because it largely avoids overfitting, allows for estimation of the model order and structure, and has a reasonable computational load compared to MCMC methods. Variational Bayesian learning was first employed in supervised problems Wallace90,Hinton93COLT,MacKay95Ens,Barber98NNML, but it has since become popular in unsupervised modelling as well. Recently, several authors have successfully applied such techniques to linear factor analysis, independent component analysis (ICA) ICABook01,Roberts01Book,Choudrey00,HojenSorensen02NC, and their various extensions. These include linear independent factor analysis Attias99, several other extensions of the basic linear ICA model Attias01RE,Chan01ICA,Miskin01RE,Roberts04, as well as MLP networks for modelling nonlinear observation mappings Lappalainen00,ICABook01 and nonlinear dynamics of the latent variables (source signals) Ilin-I3NN,Valpola02NC,Valpola03IEICE. Variational Bayesian learning has also been applied to large discrete models Murphy99 such as nonlinear belief networks Frey99NC and hidden Markov models MacKay97.

In this paper, we introduce a small number of basic blocks for building latent variable models that are learned using variational Bayesian learning. The blocks were introduced earlier in two conference papers Valpola01ICA, Harva05UAI, and their applications have been presented in Valpola03ICA_Nonlin, Honkela03NPL, Honkela05ESANN, Raiko03ICONIP, Raiko04IJCNN, Raiko05ICANN. Valpola04SigProc studied hierarchical models for variance sources from a signal-processing point of view. This paper is the first comprehensive presentation of the block framework itself. Our approach is most suitable for the unsupervised learning tasks considered in this paper, but in principle at least, it could be applied to supervised learning as well. A wide variety of factor-analysis-type latent-variable models can be constructed by suitably combining the basic blocks. Variational Bayesian learning then provides a cost function which can be used both for updating the variables and for optimising the model structure. The blocks are designed to fit together and to yield efficient update rules. By using a maximally factorial posterior approximation, all the required computations can be performed locally. This results in linear computational complexity as a function of the number of connections in the model. The Bayes Blocks software package BayesBlocks is an open-source C++/Python implementation that can be freely downloaded.
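
As a brief preview of the cost function discussed in the next section (where the exact notation is introduced), variational Bayesian learning minimises a free-energy-type cost of the form

\[
\mathcal{C}(q) \;=\; \mathrm{E}_{q(\boldsymbol{\theta})}\!\left[ \ln \frac{q(\boldsymbol{\theta})}{p(\boldsymbol{X}, \boldsymbol{\theta})} \right]
\;=\; D_{\mathrm{KL}}\!\left( q(\boldsymbol{\theta}) \,\Vert\, p(\boldsymbol{\theta} \mid \boldsymbol{X}) \right) \;-\; \ln p(\boldsymbol{X}),
\]

where \(\boldsymbol{X}\) denotes the observations, \(\boldsymbol{\theta}\) all latent variables and parameters, and \(q\) the posterior approximation. Since the Kullback-Leibler divergence is non-negative, \(-\mathcal{C}\) is a lower bound on the log evidence \(\ln p(\boldsymbol{X})\), which is why the same quantity can be used both for updating the variables and for comparing model structures.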

The basic building block is a Gaussian variable (node). It takes both a mean and a variance as its inputs. The other building blocks include addition and multiplication nodes, a delay, and a Gaussian variable followed by a nonlinearity. Several known model structures can be constructed using these blocks. We also introduce some novel model structures by extending known linear structures using nonlinearities and variance modelling. Examples are presented later in this paper.
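
To illustrate how local computations with these nodes proceed, the following short sketch (our own illustration; the function names are not part of the Bayes Blocks interface) propagates posterior means and variances through addition and multiplication nodes. Under a factorial posterior approximation the inputs are treated as independent, for which these identities are exact:

    # Posterior mean and variance of a sum and a product of two variables that
    # are independent under the factorial posterior approximation. Function
    # names are illustrative only, not part of the Bayes Blocks interface.

    def add_node(mean_a, var_a, mean_b, var_b):
        """s = a + b: means and variances of independent inputs simply add."""
        return mean_a + mean_b, var_a + var_b

    def prod_node(mean_a, var_a, mean_b, var_b):
        """s = a * b: exact moments of a product of independent variables."""
        mean = mean_a * mean_b
        var = mean_a ** 2 * var_b + mean_b ** 2 * var_a + var_a * var_b
        return mean, var

    print(add_node(1.0, 0.1, 2.0, 0.2))   # -> mean 3.0, variance 0.3
    print(prod_node(1.0, 0.1, 2.0, 0.2))  # -> mean 2.0, variance 0.62

Because each node only needs such low-order moments of its neighbours, the computations stay local, which is what yields the linear computational complexity mentioned above.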

The key idea behind developing these blocks is that once the connections between the blocks in the chosen model have been fixed (that is, a particular model has been selected and specified), the cost function and the update rules needed in learning can be computed automatically. The user does not need to understand the underlying mathematics, since the derivations are done within the software package. This allows for rapid prototyping. Bayes Blocks can also be used to bring different methods into a unified framework by implementing the corresponding structure from blocks and using the results of those methods for initialisation. Different methods can then be compared directly using the cost function, and perhaps combined to find even better models. Updates that minimise a global cost function are guaranteed to converge, unlike algorithms such as loopy belief propagation Pearl88, extended Kalman smoothing Anderson79, or expectation propagation Minka01.
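
The convergence guarantee can be made concrete with a small self-contained example, written here purely for illustration (it does not use the Bayes Blocks package). For the toy model a ~ N(0, 1), b ~ N(0, 1), x | a, b ~ N(a + b, 1) with one observation x and a fully factorised Gaussian posterior approximation q(a)q(b), each coordinate-wise update below is the exact minimiser of the global variational Bayesian cost with respect to one factor, so the printed cost never increases:

    import math

    # Coordinate-wise variational updates for the toy model
    #   a ~ N(0, 1),  b ~ N(0, 1),  x | a, b ~ N(a + b, 1)
    # with a fully factorised Gaussian posterior q(a)q(b).

    x = 3.0                      # observed value
    m_a, v_a = 0.0, 1.0          # q(a) = N(m_a, v_a), initialised to the prior
    m_b, v_b = 0.0, 1.0          # q(b) = N(m_b, v_b), initialised to the prior

    def cost():
        """Variational Bayesian cost C = E_q[ln q(a, b) - ln p(a, b, x)]."""
        c = 0.0
        for m, v in ((m_a, v_a), (m_b, v_b)):
            c += -0.5 * math.log(2 * math.pi * math.e * v)         # E[ln q]
            c += 0.5 * math.log(2 * math.pi) + 0.5 * (m * m + v)   # -E[ln prior]
        c += (0.5 * math.log(2 * math.pi)
              + 0.5 * ((x - m_a - m_b) ** 2 + v_a + v_b))          # -E[ln likelihood]
        return c

    for sweep in range(5):
        m_a, v_a = (x - m_b) / 2.0, 0.5    # exact minimiser of the cost w.r.t. q(a)
        m_b, v_b = (x - m_a) / 2.0, 0.5    # exact minimiser of the cost w.r.t. q(b)
        print(sweep, cost())               # non-increasing over sweeps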

Winn05JMLR have introduced a general-purpose algorithm called variational message passing. It resembles our framework in that it uses variational Bayesian learning and factorised approximations. Their VIBES framework allows for discrete variables, but not for nonlinearities or nonstationary variance. The posterior approximation does not need to be fully factorised, which can lead to a more accurate approximation. Optimisation proceeds by cycling through each factor and revising the approximate posterior distribution. Messages containing certain expectations over the posterior approximation are sent through the network.

Beal03, GhahBeal01NIPS, and BealPhD view variational Bayesian learning as an extension of the EM algorithm. Their algorithms apply to combinations of discrete and linear Gaussian models. In their experiments, variational Bayesian model structure selection outperformed the Bayesian information criterion Schwarz78BIC at a relatively small computational cost, while being more reliable than annealed importance sampling even when the latter used so many samples that its computational cost was a hundred times higher.

A major difference between our approach and the related methods of Winn05JMLR and Beal03 is that the latter concentrate mainly on situations where a convenient conjugate prior Gelman95 is available for the posterior distributions. This makes life easier, but on the other hand our blocks can be combined more freely, allowing richer model structures. For instance, the modelling of variance in the way described in Section 5.1 would not be possible using a gamma distribution for the precision parameter of the Gaussian node. The price we pay for this advantage is that the minimum of the cost function must be found iteratively, whereas it can be solved analytically when conjugate distributions are used. The cost function itself can nevertheless always be evaluated analytically in the Bayes Blocks framework as well. Note that the different approaches could well be combined.
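
As a standard illustration of what conjugacy buys (a textbook fact, not specific to either framework): if the precision \(\tau\) of a Gaussian observation has a gamma prior, the corresponding posterior factor stays in the gamma family and can be written in closed form,

\[
p(\tau) = \mathrm{Gamma}(\tau;\,\alpha,\beta), \quad
p(x \mid \tau) = \mathcal{N}\!\left(x;\, m,\, \tau^{-1}\right)
\;\;\Longrightarrow\;\;
p(\tau \mid x) = \mathrm{Gamma}\!\left(\tau;\,\alpha + \tfrac{1}{2},\; \beta + \tfrac{1}{2}(x-m)^2\right),
\]

where \(\beta\) is the rate parameter. If the variance is instead controlled by another Gaussian variable, for instance through a parameterisation such as \(\exp(-v)\) with a Gaussian \(v\), no such closed-form factor exists and the corresponding update must be computed iteratively, as discussed above.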

Similar graphical models can be learned with sampling-based algorithms instead of variational Bayesian learning. For instance, the BUGS software package by Spiegelhalter et al. (1995) uses Gibbs sampling for Bayesian inference. It supports mixture models, nonlinearities, and nonstationary variance. There are also many software packages concentrating on discrete Bayesian networks. Notably, the Bayes Net Toolbox by Murphy01 can be used for Bayesian learning and inference in many types of directed graphical models using several methods. It also includes decision-theoretic nodes, and is hence in this sense more general than our work. A limitation of the Bayes Net Toolbox Murphy01 is that it supports continuous latent nodes only with Gaussian or conditional Gaussian distributions.

Autobayes Gray02 is a system that generates code for efficient implementations of algorithms used in Bayes networks. Currently its algorithm schemas include EM, k-means, and discrete model selection. The system does not yet support continuous hidden variables, nonlinearities, variational methods, MCMC, or temporal models. One of the greatest strengths of the code generation approach compared to a software library is the possibility of automatically optimising the code using domain information.

Traditionally, the observation noise has not been modelled in any way in the independent component analysis community. Even when it is modelled, the noise variance is assumed to be a constant that is estimated from the available observations when required. However, more flexible variance models would be highly desirable in a wide variety of situations. It is well known that many real-world signals or data sets are nonstationary, being roughly stationary only on fairly short intervals. Quite often the amplitude level of a signal varies markedly as a function of time or position, which means that its variance is nonstationary. Examples include financial data sets, speech signals, and natural images.

Recently, Parra00NIPS have demonstrated that several higher-order statistical properties of natural images and signals are well explained by a stochastic model in which an otherwise stationary Gaussian process has a nonstationary variance. Variance models are also useful in explaining the volatility of financial time series and in detecting outliers in the data. By utilising the nonstationarity of variance it is possible to perform blind source separation under certain conditions ICABook01,PhamCard01.
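
As a toy illustration of such a process (our own example, not taken from the cited works), white Gaussian noise modulated by a slowly drifting log-variance exhibits the volatility clustering typical of financial and speech data:

    import numpy as np

    # Toy nonstationary-variance signal: white Gaussian noise whose log-variance
    # follows a slow random walk. The signal is roughly stationary over short
    # windows, but its amplitude level changes markedly over longer stretches.
    rng = np.random.default_rng(0)
    T = 2000
    log_var = np.cumsum(0.05 * rng.standard_normal(T))   # slowly drifting log-variance
    x = np.exp(0.5 * log_var) * rng.standard_normal(T)   # observed signal

    # The local standard deviation differs strongly between segments:
    print(x[:200].std(), x[-200:].std())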

Several authors have introduced hierarchical models related to those discussed in this paper. These models use subspaces of dependent features instead of single feature components. Models of this kind have been proposed at least in the context of independent component analysis Cardoso98ICASSP,HyvHoy99NIPS,HyvHoy00NC,Park04ICA and of topographic or self-organising maps Kohonen97NC,GhahHint97NIPS. A problem with these methods is that it is difficult to learn the structure of the model or to compare different model structures.

The remainder of this paper is organised as follows. In the following section, we briefly present the basic concepts of variational Bayesian learning. In Section 3, we introduce the building blocks (nodes), and in Section 4 we discuss variational Bayesian computations with them. In Section 5, we show examples of different types of models which can be constructed using the building blocks. Section 6 deals with learning and the potential problems related to it, and in Section 7 we present experimental results on several structures given in Section 5. The paper ends with a short discussion and conclusions in the last section.

