The data were generated by first choosing whether the vertical orientation, the horizontal orientation, both, or neither was active, each case with a given probability. When an orientation is active, each bar in that orientation is active with a given probability. For each orientation there are 6 regular bars, one per row or column, and 3 variance bars, each two rows or columns wide. The intensities (grey-level values) of the bars were drawn from a normalised positive exponential distribution with the pdf $p(z) = e^{-z}$ for $z \ge 0$. Regular bars are additive, whereas each variance bar produces additive Gaussian noise with a standard deviation equal to the bar's intensity. Finally, Gaussian noise with a standard deviation of 0.1 was added to each pixel.
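A minimal NumPy sketch of this generation process is given below. The orientation and bar-activation probabilities, the random seed, and the sample size are assumptions made only for this illustration, since their exact values are not restated here.

```python
import numpy as np

rng = np.random.default_rng(0)
P_BAR = 0.3   # assumed probability that a bar is active when its orientation is

def bar_mask(orientation, start, width):
    """6x6 mask covering `width` adjacent rows ('h') or columns ('v')."""
    mask = np.zeros((6, 6))
    if orientation == 'h':
        mask[start:start + width, :] = 1.0
    else:
        mask[:, start:start + width] = 1.0
    return mask

def sample_image():
    # Choose which orientations are active (here the four cases are taken equally likely).
    case = rng.integers(4)   # 0: vertical, 1: horizontal, 2: both, 3: neither
    active = [o for o, on in (('v', case in (0, 2)), ('h', case in (1, 2))) if on]
    img = np.zeros((6, 6))
    for orient in active:
        # 6 regular bars (one per row/column): additive contributions.
        for i in range(6):
            if rng.random() < P_BAR:
                img += rng.exponential(1.0) * bar_mask(orient, i, 1)
        # 3 variance bars (two rows/columns wide): Gaussian noise whose
        # standard deviation equals the bar's intensity.
        for i in (0, 2, 4):
            if rng.random() < P_BAR:
                img += rng.exponential(1.0) * bar_mask(orient, i, 2) * rng.standard_normal((6, 6))
    # Background Gaussian noise with standard deviation 0.1 on every pixel.
    img += 0.1 * rng.standard_normal((6, 6))
    return img.ravel()   # 36-dimensional data vector

data = np.stack([sample_image() for _ in range(1000)])   # assumed sample size
```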
The network was built up following the stages shown in Figure 6.
It was initialised with a single layer of 36 nodes corresponding to the 36-dimensional data vectors.
The second layer of 30 nodes was created at sweep 20, and the third layer of 5 nodes at sweep 100.
After creating a layer, only its sources were updated for 10 sweeps, and pruning was discouraged for 50 sweeps.
New nodes were added twice, at sweeps 300 and 400: 3 to the second layer and 2 to the third layer.
After each addition, only the sources were updated for 5 sweeps, and pruning was again discouraged for 50 sweeps.
The source activations were reset at sweeps 500, 600, and 700, and only the sources were updated for the following 40 sweeps.
Dead nodes were removed every 20 sweeps.
The multistage training procedure was designed to avoid suboptimal local
solutions, as discussed in Section 6.2.
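The sweep schedule can be summarised programmatically. The sketch below is only schematic: the `model` object, its methods (`create_layer`, `add_nodes`, `reset_sources`, `update`, `remove_dead_nodes`), the total number of sweeps, and the simplification that all sources (rather than only the new layer's) are updated after a layer is created are placeholders and assumptions; only the schedule itself follows the text.

```python
def train(model, n_sweeps=800):
    # The model is assumed to start with the single 36-node layer.
    sources_only_until = 0   # last sweep during which only the sources are updated
    no_pruning_until = 0     # last sweep during which pruning is discouraged

    for sweep in range(1, n_sweeps + 1):
        if sweep == 20:                    # create the second layer (30 nodes)
            model.create_layer(n_nodes=30)
            sources_only_until, no_pruning_until = sweep + 10, sweep + 50
        elif sweep == 100:                 # create the third layer (5 nodes)
            model.create_layer(n_nodes=5)
            sources_only_until, no_pruning_until = sweep + 10, sweep + 50
        elif sweep in (300, 400):          # add nodes (read here as both additions
            model.add_nodes(layer=2, n=3)  # happening at each of the two sweeps)
            model.add_nodes(layer=3, n=2)
            sources_only_until, no_pruning_until = sweep + 5, sweep + 50
        elif sweep in (500, 600, 700):     # reset the source activations
            model.reset_sources()
            sources_only_until = sweep + 40

        model.update(sources_only=sweep <= sources_only_until,
                     allow_pruning=sweep > no_pruning_until)

        if sweep % 20 == 0:                # remove dead nodes every 20 sweeps
            model.remove_dead_nodes()
```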
Figure 9 demonstrates that the algorithm finds a generative model quite similar to the generation process. The two sources on the third layer correspond to the horizontal and vertical orientations, and the 18 sources on the second layer correspond to the bars. Each element of the weight matrices is depicted in Fig. 9 as a pixel with the corresponding grey-level value. The pixels of the third-layer weight matrices are ordered in the same way as the patches of the second-layer matrices, that is, vertical bars on the left and horizontal bars on the right. The regular bars, present in the mixing matrix of the means, are reconstructed accurately, whereas the variance bars in the mixing matrix of the variances exhibit some noise. The distinction between horizontal and vertical orientations is clearly visible in the third-layer mixing matrix.
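For reference, a figure of this kind can be rendered by reshaping each column of a learned 36-by-K mixing matrix into a 6x6 grey-level patch. The sketch below is illustrative only; `W` stands for a hypothetical learned weight matrix, and the 6x6 reshape assumes the row-major pixel ordering used in the data-generation sketch above.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_weight_patches(W, n_cols=6):
    """Show each column of a (36 x K) weight matrix as a 6x6 grey-level patch."""
    K = W.shape[1]
    n_rows = int(np.ceil(K / n_cols))
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols, n_rows), squeeze=False)
    for k, ax in enumerate(axes.ravel()):
        ax.axis('off')
        if k < K:
            ax.imshow(W[:, k].reshape(6, 6), cmap='gray')
    plt.show()
```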
A comparison experiment with a simplified learning procedure was run to demonstrate the importance of avoiding local minima. The creation and pruning of layers were done as before, but the other methods for avoiding local minima (addition of nodes, discouraging of pruning, and resetting of sources) were disabled. The resulting weights can be seen in Figure 10. This time the learning ends up in a suboptimal local minimum of the cost function. One of the bars was not found (the second horizontal bar from the bottom), some bars were mixed into the same source (most variance bars share a source with a regular bar), the fourth vertical bar from the left appears twice, and one of the sources merely suppresses the variance everywhere. The resulting value of the cost function (5) is worse by 5292 compared to the main experiment. The ratio of the model evidences is thus roughly $e^{5292}$ in favour of the main experiment.
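The last step assumes that the cost function (5) is the standard ensemble-learning (variational) cost, that is, the Kullback-Leibler misfit of the approximation minus the log model evidence. Under that assumption, and taking the misfits of the two approximations to be comparable, the cost difference translates into an evidence ratio; the notation $\mathcal{C}_i$, $\mathcal{H}_i$ below is introduced only for this illustration:
\begin{align*}
  \mathcal{C}_i &= D_{\mathrm{KL}}\bigl(q_i(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathbf{X}, \mathcal{H}_i)\bigr) - \ln P(\mathbf{X} \mid \mathcal{H}_i) \;\ge\; -\ln P(\mathbf{X} \mid \mathcal{H}_i), \\
  \frac{P(\mathbf{X} \mid \mathcal{H}_{\text{main}})}{P(\mathbf{X} \mid \mathcal{H}_{\text{simplified}})}
  &\approx e^{\mathcal{C}_{\text{simplified}} - \mathcal{C}_{\text{main}}} = e^{5292}.
\end{align*}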
Figure 11 illustrates the formation of the posterior distribution of a typical single variable, namely the first component of the variance source in the comparison experiment. The prior here means the distribution of the variable given its parents, and the likelihood means the potential induced by its children. Assuming the posterior approximations of the other variables to be accurate, we can plot the true posterior of this variable and compare it to the Gaussian posterior approximation. Their difference, measured by the Kullback-Leibler divergence, is only 0.007.
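This kind of comparison can be reproduced in spirit with a small numerical check. In the sketch below, everything concrete (the parametrisation of the variance source as a log-variance, the prior parameters, the observed child value, and the moment-matched Gaussian, instead of the approximation actually produced by minimising the cost) is an assumption for illustration; the code only shows how a Gaussian approximation can be compared to the true posterior of one variable on a grid, not the actual terms behind Figure 11.

```python
import numpy as np

u = np.linspace(-6.0, 6.0, 6001)   # grid for the variable of interest
du = u[1] - u[0]

# Hypothetical model terms: a Gaussian prior from the parents, and a child
# observed as x ~ N(0, exp(-u)), which makes the likelihood non-Gaussian in u.
x_child = 0.8
log_prior = -0.5 * (u - 0.0)**2 / 1.0**2
log_like = 0.5 * u - 0.5 * x_child**2 * np.exp(u)   # log N(x_child; 0, exp(-u)), up to a constant

# True posterior on the grid: normalised product of prior and likelihood terms.
log_post = log_prior + log_like
p = np.exp(log_post - log_post.max())
p /= p.sum() * du

# Gaussian approximation q(u); here simply matched to the posterior moments.
m = np.sum(u * p) * du
v = np.sum((u - m)**2 * p) * du
q = np.exp(-0.5 * (u - m)**2 / v) / np.sqrt(2 * np.pi * v)

# Kullback-Leibler divergence D(q || p) evaluated numerically on the grid.
kl = np.sum(q * (np.log(q + 1e-300) - np.log(p + 1e-300))) * du
print(f"D_KL(q || p) ~= {kl:.4f}")
```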