

Bars problem

The first experiment tested the hierarchical nonlinear variance model of Figure 6 on an extension of the well-known bars problem [dayanzemel95]. The data set consisted of 1000 image patches, each having $ 6 \times 6$ pixels and containing horizontal and vertical bars. In addition to the regular bars, the problem was extended to include horizontal and vertical variance bars, which are characterised by their higher variance rather than a higher grey level. Samples of the image patches used are shown in Figure 8.

Figure 8: Samples from the 1000 image patches used in the extended bars problem. The bars include both standard and variance bars in the horizontal and vertical directions. For instance, the patch in the bottom left corner shows an active standard horizontal bar above the horizontal variance bar in the middle.
[Figure 8: bar_data.eps]

The data were generated by first choosing whether the vertical, the horizontal, both, or neither orientation was active, each with probability $ 1/4$. Whenever an orientation was active, there was a probability of $ 1/3$ for each bar of that orientation to be active. For both orientations there are 6 regular bars, one for each row or column, and 3 variance bars, each 2 rows or columns wide. The intensities (grey-level values) of the active bars were drawn from the normalised positive exponential distribution with pdf $ p(z) = \exp(-z)$ for $ z \ge 0$ and $ p(z) = 0$ for $ z < 0$. Regular bars add their intensity to the pixels they cover, whereas each variance bar adds Gaussian noise whose standard deviation equals the bar's intensity. Finally, Gaussian noise with standard deviation 0.1 was added to each pixel.
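For concreteness, the generation process can be summarised as a short program. The following Python sketch is a reconstruction of the procedure described above, not the original code; in particular, it assumes that each variance bar is also activated with probability $ 1/3$, and the random seed and helper names are arbitrary.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def generate_patch(size=6, p_bar=1/3, pixel_noise=0.1):
    """Generate one patch of the extended bars data (illustrative sketch)."""
    patch = np.zeros((size, size))

    # Vertical, horizontal, both or neither orientation, each with probability 1/4.
    orientation = rng.integers(4)
    active = {"vertical": orientation in (0, 2), "horizontal": orientation in (1, 2)}

    for direction in ("vertical", "horizontal"):
        if not active[direction]:
            continue
        # 6 regular bars, one per row/column, each active with probability 1/3;
        # intensities follow the positive exponential pdf p(z) = exp(-z), z >= 0.
        for i in range(size):
            if rng.random() < p_bar:
                z = rng.exponential(1.0)
                if direction == "horizontal":
                    patch[i, :] += z
                else:
                    patch[:, i] += z
        # 3 variance bars, each 2 rows/columns wide; an active variance bar adds
        # Gaussian noise whose standard deviation equals the bar's intensity.
        for i in range(0, size, 2):
            if rng.random() < p_bar:
                z = rng.exponential(1.0)
                noise = rng.normal(0.0, z, (2, size))
                if direction == "horizontal":
                    patch[i:i+2, :] += noise
                else:
                    patch[:, i:i+2] += noise.T

    # Finally, independent pixel noise with standard deviation 0.1.
    return patch + rng.normal(0.0, pixel_noise, (size, size))

data = np.stack([generate_patch().ravel() for _ in range(1000)])  # 1000 x 36
\end{verbatim}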

The network was built up following the stages shown in Figure 6. It was initialised with a single layer of 36 nodes corresponding to the 36-dimensional data vectors. The second layer of 30 nodes was created at sweep 20, and the third layer of 5 nodes at sweep 100. After creating a layer, only its sources were updated for 10 sweeps, and pruning was discouraged for 50 sweeps. New nodes were added twice: 3 to the second layer at sweep 300 and 2 to the third layer at sweep 400. After these additions, only the sources were updated for 5 sweeps, and pruning was again discouraged for 50 sweeps. The source activations were reset at sweeps 500, 600 and 700, and only the sources were updated for the following 40 sweeps. Dead nodes were removed every 20 sweeps. This multistage training procedure was designed to avoid suboptimal local solutions, as discussed in Section 6.2.
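To make the schedule easier to follow, its timing logic is written out below as a Python sketch. Only the per-sweep control flags are reproduced; the actual model updates are not shown, the total number of sweeps is not stated in the text (800 is a placeholder), and the assumption that the 5-sweep and 50-sweep periods follow each node addition separately is an interpretation of the description above.

\begin{verbatim}
TOTAL_SWEEPS = 800                    # placeholder; the text gives no total

layer_creations = {20: "layer 2, 30 nodes", 100: "layer 3, 5 nodes"}
node_additions  = {300: "3 nodes to layer 2", 400: "2 nodes to layer 3"}
source_resets   = {500, 600, 700}

sources_only_until = 0   # update only the (new) sources until this sweep
no_pruning_until   = 0   # discourage pruning until this sweep

for sweep in range(1, TOTAL_SWEEPS + 1):
    if sweep in layer_creations:      # new layer: its sources only for 10 sweeps
        sources_only_until = sweep + 10
        no_pruning_until   = sweep + 50
    if sweep in node_additions:       # new nodes: sources only for 5 sweeps
        sources_only_until = sweep + 5
        no_pruning_until   = sweep + 50
    if sweep in source_resets:        # reset: sources only for the next 40 sweeps
        sources_only_until = sweep + 40

    update_weights    = sweep >= sources_only_until
    allow_pruning     = sweep >= no_pruning_until
    remove_dead_nodes = sweep % 20 == 0
    # ... one learning sweep would be run here with these flags ...
\end{verbatim}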

Figure 9 demonstrates that the algorithm finds a generative model quite similar to the true generation process. The two sources on the third layer correspond to the horizontal and vertical orientations, and the 18 sources on the second layer correspond to the bars. Each element of the weight matrices is depicted in Fig. 9 as a pixel with the corresponding grey-level value. The pixels of $ {\mathbf{A}}_{2}$ and $ {\mathbf{B}}_{2}$ are ordered in the same way as the patches of $ {\mathbf{A}}_{1}$ and $ {\mathbf{B}}_{1}$, that is, vertical bars on the left and horizontal bars on the right. The regular bars, present in the mixing matrix $ {\mathbf{A}}_{1}$, are reconstructed accurately, whereas the variance bars in the mixing matrix $ {\mathbf{B}}_{1}$ exhibit some noise. The distinction between the horizontal and vertical orientations is clearly visible in the mixing matrix $ {\mathbf{A}}_{2}$.
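The patch-wise display used in Figure 9 amounts to reshaping each column of a mixing matrix back into a $ 6 \times 6$ image. A minimal Python sketch of this kind of visualisation is given below; the matrix A1 is a random placeholder for the learned posterior mean, and the $3 \times 6$ grid is an arbitrary layout rather than the one used in the actual figure.

\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt

A1 = np.random.rand(36, 18)    # placeholder for the learned 36 x 18 posterior mean

fig, axes = plt.subplots(3, 6, figsize=(6, 3))
for k, ax in enumerate(axes.ravel()):
    ax.imshow(A1[:, k].reshape(6, 6), cmap="gray")  # column k as a 6 x 6 patch
    ax.set_xticks([])
    ax.set_yticks([])
plt.show()
\end{verbatim}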

Figure 9: Results of the extended bars problem: posterior means of the weight matrices after learning. The sources of the second layer have been ordered for visualisation purposes according to the weight (mixing) matrices $ {\mathbf{A}}_2$ and $ {\mathbf{B}}_2$. The elements of the matrices are depicted as pixels with the corresponding grey-level values. The 18 pixels of the weight matrices $ {\mathbf{A}}_{2}$ and $ {\mathbf{B}}_{2}$ correspond to the 18 patches of the weight matrices $ {\mathbf{A}}_{1}$ and $ {\mathbf{B}}_{1}$.
[Figure 9: grey-level images of the weight matrices $ {\mathbf{A}}_{2}$ and $ {\mathbf{B}}_{2}$ ($ 18 \times 2$) and $ {\mathbf{A}}_{1}$ and $ {\mathbf{B}}_{1}$ ($ 36 \times 18$); image files including bar_B1.eps]
Figure 10: Left: the cost function plotted against the number of learning sweeps. The solid curve is from the main experiment and the dashed curve from the comparison experiment; the peaks appear when nodes are added. Right: the resulting weights of the comparison experiment, plotted as in Figure 9.
[Figure 10: cost-function curves (kls.eps); comparison-experiment weight matrices $ {\mathbf{A}}_{2}$ and $ {\mathbf{B}}_{2}$ ($ 14 \times 2$, compar_AB2.eps) and $ {\mathbf{A}}_{1}$ and $ {\mathbf{B}}_{1}$ ($ 36 \times 14$, compar_AB1.eps)]

A comparison experiment with a simplified learning procedure was run to demonstrate the importance of the measures taken against local optima. The creation of layers and the pruning were done as before, but the other methods for avoiding local minima (addition of nodes, discouraging of pruning and resetting of sources) were disabled. The resulting weights are shown in Figure 10. This time learning ended up in a suboptimal local optimum of the cost function: one of the bars (the second horizontal bar from the bottom) was not found, several bars were mixed into the same source (most variance bars share a source with a regular bar), the fourth vertical bar from the left appears twice, and one of the sources merely suppresses variance everywhere. The final value of the cost function (5) is 5292 higher than in the main experiment. Since the cost function is essentially the negative logarithm of (a lower bound on) the model evidence, the ratio of the model evidences is roughly $ \exp(5292)$.
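A one-line derivation makes the evidence-ratio statement explicit. It assumes, as is standard in variational Bayesian learning, that the cost function of Eq. (5) consists of a Kullback-Leibler term plus the negative log evidence, and that the KL terms of the two solutions are comparable; the notation below ($q$, $\boldsymbol{\theta}$, $\mathcal{H}$) is generic rather than that of the paper.

\begin{align*}
  \mathcal{C} &= D_{\mathrm{KL}}\bigl(q(\boldsymbol{\theta}) \,\Vert\, p(\boldsymbol{\theta}\mid\mathbf{X},\mathcal{H})\bigr) - \log p(\mathbf{X}\mid\mathcal{H}) \;\ge\; -\log p(\mathbf{X}\mid\mathcal{H}),\\
  \frac{p(\mathbf{X}\mid\mathcal{H}_{\mathrm{main}})}{p(\mathbf{X}\mid\mathcal{H}_{\mathrm{comparison}})} &\approx \exp\bigl(\mathcal{C}_{\mathrm{comparison}} - \mathcal{C}_{\mathrm{main}}\bigr) = \exp(5292).
\end{align*}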

Figure 11 illustrates the formation of the posterior distribution of a typical single variable: the first component of the variance source $ {\mathbf{u}}_1(1)$ in the comparison experiment. Here the prior means the distribution of the variable given its parents (especially $ {\mathbf{s}}_2(1)$ and $ {\mathbf{B}}_1$), and the likelihood means the potential arising from its children (the first component of $ {\mathbf{x}}(1)$). Assuming the posterior approximations of the other variables to be accurate, the true posterior of this variable can be plotted and compared to its Gaussian posterior approximation. The Kullback-Leibler divergence between the two is only 0.007.
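The comparison behind Figure 11 can be imitated numerically for a single one-dimensional variable: multiply the prior by the likelihood on a grid, normalise, and measure the Kullback-Leibler divergence from a Gaussian approximation. The Python sketch below is only an illustration with arbitrary parameter values; in particular, it assumes that the variance source $u$ controls the variance of its child through $\exp(-u)$, and it fits the Gaussian by moment matching, whereas the actual algorithm minimises the divergence $D(q \Vert p)$ directly.

\begin{verbatim}
import numpy as np

# Grid over the variance source u (one-dimensional illustration).
u = np.linspace(-10.0, 10.0, 20001)
du = u[1] - u[0]

# Prior from the parents: a Gaussian with arbitrary illustrative parameters.
prior_mean, prior_var = 0.0, 1.0
log_prior = -0.5 * (u - prior_mean) ** 2 / prior_var

# Likelihood from one child, assuming x ~ N(0, exp(-u)); x_obs is arbitrary.
x_obs = 1.5
log_lik = 0.5 * u - 0.5 * x_obs ** 2 * np.exp(u)

# True posterior on the grid: unnormalised product, then normalised.
log_post = log_prior + log_lik
log_post -= log_post.max()
post = np.exp(log_post)
post /= post.sum() * du

# Gaussian approximation q, here fitted by moment matching.
mean_q = np.sum(u * post) * du
var_q = np.sum((u - mean_q) ** 2 * post) * du
q = np.exp(-0.5 * (u - mean_q) ** 2 / var_q) / np.sqrt(2.0 * np.pi * var_q)

# Kullback-Leibler divergence D(q || p) evaluated on the grid.
mask = (q > 0) & (post > 0)
kl = np.sum(q[mask] * (np.log(q[mask]) - np.log(post[mask]))) * du
print(f"KL(q || p) = {kl:.4f}")
\end{verbatim}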

Figure 11: A typical example illustrating the posterior approximation of a variance source.
[Figure 11: vsource_appr.eps]

