The final rightmost variance model in Fig. 6 is somewhat involved in that it contains both nonlinearities and hierarchical modelling of variances. Before going into its mathematical details and into the two simpler models in Fig. 6, we point out that we have considered related but simpler block models in our earlier papers. In (Valpola03ICA_Nonlin), a hierarchical nonlinear model for the data is discussed without modelling the variance. Such a model can be applied for example to nonlinear ICA or blind source separation. Experimental results (Valpola03ICA_Nonlin) show that this block model performs adequately in the nonlinear BSS problem, even though the results are slightly poorer than for our earlier, computationally more demanding model (Lappalainen00, Valpola03IEICE, Honkela05NIPS) with multiple computational paths.
In another paper (Valpola04SigProc), we have considered hierarchical modelling of variance using the block approach without nonlinearities. Experimental results on biomedical MEG (magnetoencephalography) data demonstrate the usefulness of hierarchical modelling of variances and the existence of variance sources in real-world data.
Learning starts from the simple structure shown in the left subfigure of Fig. 6. There a variance source is attached to each Gaussian observation node. The nodes represent vectors: the output vector of the variance source and the corresponding observation (data) vector have the same dimension, and each component of the variance vector models the variance of the respective component of the observation vector.
Mathematically, this simple first model obeys the equations given below.
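For concreteness, denote by $\mathbf{x}(t)$ the observation vector, by $\mathbf{u}_x(t)$ the output of its variance source, and by $\mathbf{a}_1$ and $\mathbf{b}_1$ bias vectors, and assume the Gaussian node parameterisation in which a variance input $u$ corresponds to the variance $\exp(-u)$; the symbols here are chosen for illustration only. The model can then be written schematically as
\begin{align*}
\mathbf{x}(t) &= \mathbf{a}_1 + \mathbf{n}_x(t) \tag{33} \\
\mathbf{u}_x(t) &= \mathbf{b}_1 + \mathbf{n}_u(t) \tag{34} \\
\mathbf{n}_x(t) &\sim \mathcal{N}\bigl(\mathbf{0}, \operatorname{diag}(\exp[-\mathbf{u}_x(t)])\bigr) \tag{35} \\
\mathbf{n}_u(t) &\sim \mathcal{N}\bigl(\mathbf{0}, \operatorname{diag}(\exp[-\mathbf{v}_u])\bigr) \tag{36}
\end{align*}
so that the noise $\mathbf{n}_x(t)$ of the observations gets its variance componentwise from the variance source through Eq. (35), while the variance of the noise $\mathbf{n}_u(t)$ of the variance source itself is governed by a hyperparameter vector $\mathbf{v}_u$.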
Consider then the intermediate model shown in the middle subfigure of Fig. 6. In this second learning stage, a layer of sources with variance sources attached to them is added. These sources are collected into a second-layer source vector, and their variances are given by the respective components of an associated variance vector, quite similarly as in the left subfigure. The (vector) node between this source vector and the variance vector of the data represents an affine transformation, that is, multiplication by a transformation matrix followed by the addition of a bias term. Hence the prior mean fed into the Gaussian variance source of the data is an affine transformation of the new sources passed through a vector of componentwise nonlinear functions (9). Quite similarly, the vector node between the second-layer source vector and the observation vector outputs another affine transformation, with its own mixing matrix and bias vector, of the nonlinearly transformed sources. This in turn provides the prior mean for the Gaussian node modelling the observation vector.
The mathematical equations corresponding to the model represented graphically in the middle subfigure of Fig. 6 are given below.
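Continuing with the same illustrative notation, let $\mathbf{s}_2(t)$ denote the added source vector and $\mathbf{u}_2(t)$ its variance vector, let $\mathbf{A}_1$ and $\mathbf{B}_1$ be the mixing matrices of the two affine transformations with bias vectors $\mathbf{a}_1$ and $\mathbf{b}_1$, let $\mathbf{a}_2$ and $\mathbf{b}_2$ be the bias vectors of the second layer, and let $\mathbf{f}(\cdot)$ be the vector of componentwise nonlinearities of Eq. (9); these symbols are again chosen here for illustration. A schematic form of the model is
\begin{align*}
\mathbf{x}(t) &= \mathbf{A}_1 \mathbf{f}(\mathbf{s}_2(t)) + \mathbf{a}_1 + \mathbf{n}_x(t) \tag{37} \\
\mathbf{u}_x(t) &= \mathbf{B}_1 \mathbf{f}(\mathbf{s}_2(t)) + \mathbf{b}_1 + \mathbf{n}_u(t) \tag{38} \\
\mathbf{s}_2(t) &= \mathbf{a}_2 + \mathbf{n}_s(t) \tag{39} \\
\mathbf{u}_2(t) &= \mathbf{b}_2 + \mathbf{n}_{u_s}(t) \tag{40}
\end{align*}
where the noise term $\mathbf{n}_s(t)$ of the sources gets its variance componentwise from $\mathbf{u}_2(t)$ in the same way as $\mathbf{n}_x(t)$ gets its variance from $\mathbf{u}_x(t)$ in Eq. (35).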
Compared with the simplest model (33)-(34), one can observe that the source vector of the second (upper) layer and the associated variance vector are of quite similar form, given in Eqs. (39)-(40). The models (37)-(38) of the data vector and the associated variance vector in the first (bottom) layer differ from the simple first model (33)-(34) in that each contains an additional term, in which the nonlinear transformation of the source vector coming from the upper layer is multiplied by a layer-specific linear mixing matrix. All four ``noise'' terms in Eqs. (37)-(40) are modelled by zero-mean Gaussian distributions similar to those in Eqs. (35) and (36).
In the last stage of learning, another layer is added on top of the network shown in the middle subfigure of Fig. 6. The resulting structure is shown in the right subfigure. The added layer is quite similar to the layer added in the second stage. A new variance node models the prior variances of the added top-layer source vector, which in turn affects, via an affine transformation, the mean of the mediating variance node of the second layer. The top-layer source vector also provides the prior mean of the second-layer sources via another affine transformation.
The model equations (37)-(38) for the data vector and its associated variance vector remain the same as in the intermediate model shown graphically in the middle subfigure of Fig. 6. The model equations of the second- and third-layer sources, as well as of their respective variance vectors, in the rightmost subfigure of Fig. 6 are given below.
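In the same illustrative notation, with $\mathbf{s}_3(t)$ denoting the third-layer source vector, $\mathbf{u}_3(t)$ its variance vector, and $\mathbf{A}_2$, $\mathbf{B}_2$, $\mathbf{a}_3$, $\mathbf{b}_3$ further mixing matrices and bias vectors (again assumptions made here for concreteness), they take the schematic form
\begin{align*}
\mathbf{s}_2(t) &= \mathbf{A}_2 \mathbf{f}(\mathbf{s}_3(t)) + \mathbf{a}_2 + \mathbf{n}_s(t) \\
\mathbf{u}_2(t) &= \mathbf{B}_2 \mathbf{f}(\mathbf{s}_3(t)) + \mathbf{b}_2 + \mathbf{n}_{u_s}(t) \\
\mathbf{s}_3(t) &= \mathbf{a}_3 + \mathbf{n}_{s_3}(t) \\
\mathbf{u}_3(t) &= \mathbf{b}_3 + \mathbf{n}_{u_3}(t)
\end{align*}
where all the noise terms are zero-mean Gaussian, the variances of $\mathbf{n}_s(t)$ and $\mathbf{n}_{s_3}(t)$ being determined componentwise by the variance vectors $\mathbf{u}_2(t)$ and $\mathbf{u}_3(t)$, respectively.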
It should be noted that in the resulting network the number of scalar-valued nodes (the size of a layer) can be different for different layers. Additional layers could be appended in the same manner. The final network of the right subfigure in Fig. 6 utilises variance nodes in building a hierarchical model for both the means and the variances. Without the variance sources the model would correspond to a nonlinear model with latent variables in the hidden layer. As already mentioned, we have considered such a nonlinear hierarchical model in (Valpola03ICA_Nonlin). Note that using computation nodes as hidden nodes would result in multiple paths from the latent variables of the upper layer to the observations. This type of structure was used in (Lappalainen00), and it has a quadratic computational complexity as opposed to the linear complexity of the networks in Fig. 6.
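To make the generative structure of the final network concrete, the following short sketch draws data from such a three-layer hierarchy by ancestral sampling. Everything in it is an illustrative assumption rather than part of the method itself: the layer sizes, the fixed parameter values (in the actual model the mixing matrices, bias vectors and hyperparameters are latent variables learned by variational Bayesian learning, not constants), the use of tanh as a stand-in for the nonlinearity of Eq. (9), and the helper names f and gaussian; only the $\exp(-u)$ variance parameterisation of the Gaussian nodes is carried over from the assumptions above.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes; the layers may have different numbers of nodes.
dim_x, dim_s2, dim_s3 = 10, 5, 3
T = 1000  # number of observation vectors

def f(s):
    """Componentwise nonlinearity of the sources (tanh is only a
    stand-in for the actual nonlinearity of Eq. (9))."""
    return np.tanh(s)

def gaussian(mean, u):
    """Sample from a Gaussian node whose variance is parameterised as
    exp(-u), u being the output of the attached variance node (or a
    constant hyperparameter)."""
    std = np.exp(-0.5 * np.asarray(u, dtype=float))
    return mean + std * rng.standard_normal(np.shape(mean))

# Illustrative fixed parameter values.
A1, a1 = rng.normal(size=(dim_x, dim_s2)), rng.normal(size=dim_x)
B1, b1 = rng.normal(size=(dim_x, dim_s2)), rng.normal(size=dim_x)
A2, a2 = rng.normal(size=(dim_s2, dim_s3)), rng.normal(size=dim_s2)
B2, b2 = rng.normal(size=(dim_s2, dim_s3)), rng.normal(size=dim_s2)
a3, b3 = rng.normal(size=dim_s3), rng.normal(size=dim_s3)

X = np.empty((T, dim_x))
for t in range(T):
    # Top (third) layer: sources with a variance node attached to them.
    u3 = gaussian(b3, 2.0)                 # variance vector of s3
    s3 = gaussian(a3, u3)                  # third-layer sources
    # Second layer: both the mean and the variance are driven by f(s3).
    u2 = gaussian(B2 @ f(s3) + b2, 2.0)    # variance vector of s2
    s2 = gaussian(A2 @ f(s3) + a2, u2)     # second-layer sources
    # Bottom layer: observations, mean and variance driven by f(s2).
    ux = gaussian(B1 @ f(s2) + b1, 2.0)    # variance vector of x(t)
    X[t] = gaussian(A1 @ f(s2) + a1, ux)   # observation vector x(t)
\end{verbatim}
Each observation vector is generated by a single top-down pass through the layers, which reflects the single-path structure responsible for the linear computational complexity discussed above.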