The resulting weights are shown in Figure . The neurons in the upper right corner of A1 are specialised to just a few data samples. Two neurons in the lower left corner have connections only in the B1 matrix, thus adding only noise to the reconstruction, and three others have connections in both A1 and B1. The other B connections are close to zero. Reading from the upper left corner of A2, the first neuron of the third layer activates the specialised neurons and inhibits the neurons that add noise. The second and fourth neurons activate the noise neurons and inhibit the low-frequency neurons in the lower right of A1. The third neuron activates the neurons with diagonal features running from the upper left to the lower right corner, and the fifth neuron activates the horizontal features.
The initialisation seems to affect the results considerably. This was already noticed with the simple example in Section . The algorithm can get stuck in a local minimum of the cost function, and even regeneration of the neurons does not always help: once there is already an adequate reconstruction of the data, it is very hard for a new neuron to fit in. The asymmetric initialisation with VQ seems to result in more asymmetric features than the ones presented here, but more experiments are needed to confirm this. I would like to stress that a local optimum of a good model is usually better than a global optimum of a bad one.
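As an illustration of what the VQ initialisation refers to, the sketch below initialises the first-layer weights with vector quantisation, i.e. with k-means centroids of the data patches. The variable names and the use of scikit-learn are assumptions made here for illustration, not the original implementation.

# Illustrative sketch only: initialise the columns of A1 with vector
# quantisation (k-means centroids of the data patches). The names A1,
# patches and the use of scikit-learn are assumptions, not the original code.
import numpy as np
from sklearn.cluster import KMeans

def vq_init(patches, n_neurons, seed=0):
    """patches: (n_samples, patch_dim). Returns A1 of shape (patch_dim, n_neurons)."""
    km = KMeans(n_clusters=n_neurons, n_init=10, random_state=seed).fit(patches)
    return km.cluster_centers_.T

def random_init(patch_dim, n_neurons, scale=0.01, seed=0):
    """A symmetric alternative: small random weights."""
    rng = np.random.default_rng(seed)
    return scale * rng.standard_normal((patch_dim, n_neurons))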
In the simple example in Section , the terms of the cost function corresponding to the third layer were greater than those corresponding to the second layer. This means that the reconstructions are effectively decided on the third layer and only refined on the second layer, or in other words that the neurons on the second layer act like computational units. In this experiment, however, the proportion is one in the third layer to eight in the second layer. This means that the reconstruction of the image is decided on the second layer, and the third layer only guides it.
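To make the comparison concrete, the cost function can be thought of as decomposing into per-layer terms; the notation below is assumed here for illustration only:
\[
C \;=\; C_{\mathrm{data}} + C_{2} + C_{3},
\]
where C_2 and C_3 collect the terms attributed to the second and the third layer. In the simple example C_3 was larger than C_2, whereas here C_3 / C_2 is roughly 1/8, which is why the reconstruction is now effectively decided on the second layer.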
At the end of the iteration process, the cost function could still have been made smaller by removing some neurons that were stuck in local minima. After the network is rebooted and updated further, the number of neurons alive diminishes still more. Since there are sources on the third layer that correspond to horizontal and diagonal features on the second layer, there should also be sources corresponding to the vertical and the other diagonal directions. Work remains to be done to improve the results.
The resulting patches in A1 resemble the wavelets or the features found by ICA. They are more localised than the features of PCA, but only the features that fill most of the patch seem to survive. They seem to follow the prior set for them in Section : each patch is either close to zero or active over the whole patch or a large part of it. This suggests that a sparse prior for the weights could help to obtain more local features, like the ones from the ICA algorithm in Figure . Preliminary experiments support this assumption. A sparse prior would also help to keep more neurons alive and make the reconstructions more accurate. Using a larger data set would also help by making the relative cost of describing the weights smaller.
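The effect of such a prior can be sketched by comparing the description cost per weight; the Gaussian and Laplacian forms below are illustrative assumptions, not the priors actually used in the experiments:
\[
-\log p_{\mathrm{Gauss}}(a_{ij}) = \frac{a_{ij}^{2}}{2\sigma^{2}} + \mathrm{const},
\qquad
-\log p_{\mathrm{Laplace}}(a_{ij}) = \frac{|a_{ij}|}{b} + \mathrm{const}.
\]
The quadratic cost tolerates many medium-sized weights, whereas the linear cost favours setting most weights to zero and keeping a few large ones, which is exactly what localised features require.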
Sparse connectivity would also help the upper layer. As the number of features grows, an increasing amount of cost goes into describing that a particular neuron on the third layer does not affect the activities of most of the neurons on the second layer. With sparse connections the situation would be very different: it would be useful to have a neuron on the third layer that simply states that two particular features are typically active together. When the size of the image patches is increased, this locality becomes increasingly important.
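A rough description-length comparison illustrates the point; the formulas are back-of-the-envelope estimates, not part of the model:
\[
C_{\mathrm{dense}} \approx n \, C(w \approx 0),
\qquad
C_{\mathrm{sparse}} \approx k \, C(w) + \log \binom{n}{k}, \quad k \ll n,
\]
where n is the number of second-layer features and k the number of features a third-layer neuron actually connects to. The dense form grows linearly with n even when the neuron does almost nothing, while the sparse form pays mainly for the few active connections and their locations.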
The assumptions made in the posterior approximation might explain why, for a particular neuron, either the A connections to the neuron itself or the B connections to its variance neuron tend to get turned off. The upper layer cannot be connected straight to the mean and variance prior of a neuron, because that would violate the independence assumption. Adding the variance neuron in between restores the independence, but not taking the dependency into account increases the cost. The increase is smallest when the dependency is smallest. If the connections A and B were similar to each other, not only would the incorrect independence assumption raise the cost function, but the cost would also include the description of the weights, which are likewise assumed to be independent of each other.
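The argument can be stated in standard variational terms; the notation is generic and assumed here for illustration. The posterior approximation is fully factorial,
\[
q(\boldsymbol{\theta}) = \prod_i q(\theta_i),
\qquad
C = \int q(\boldsymbol{\theta}) \log \frac{q(\boldsymbol{\theta})}{p(\mathbf{X}, \boldsymbol{\theta})}\, d\boldsymbol{\theta}
  = D_{\mathrm{KL}}\!\left(q(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathbf{X})\right) - \log p(\mathbf{X}),
\]
so any dependency in the true posterior that q cannot represent shows up directly as an increase in the Kullback-Leibler term. The network therefore prefers configurations, such as turning off either A or B, in which the ignored dependencies are weak.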
The experiments on image data using topographic ICA and independent subspace analysis [30,29,31] are perhaps the closest to the present ones. The features found with these methods that correspond to A1 are similar to those of basic ICA. The topography or the collection of subspaces corresponds approximately to a fixed, predetermined matrix B2. In the case of the topography, there would be a Gaussian spot in each patch of B2; in the case of independent subspace analysis, each patch of B2 would activate a distinct group of features, i.e. an independent subspace. These methods have no parts corresponding to B1 and A2. It seems that ICA has not previously been used to learn all parts of a hierarchical model successfully.
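For concreteness, the fixed B2 implied by these methods could be constructed as below; the grid size, neighbourhood width and subspace size are arbitrary choices for illustration.

# Illustrative sketch of the fixed, predetermined B2 implied by topographic ICA
# and by independent subspace analysis. All sizes are arbitrary assumptions.
import numpy as np

def topographic_B2(grid=(10, 10), sigma=1.0):
    # Each column is a Gaussian spot centred on one unit of a 2-D feature grid.
    h, w = grid
    ys, xs = np.mgrid[0:h, 0:w]
    cols = [np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2)).ravel()
            for cy in range(h) for cx in range(w)]
    return np.array(cols).T          # shape (h*w, h*w); fixed, not learned

def subspace_B2(n_features=120, subspace_size=4):
    # Each column activates one distinct group of features (an independent subspace).
    n_groups = n_features // subspace_size
    B2 = np.zeros((n_features, n_groups))
    for g in range(n_groups):
        B2[g * subspace_size:(g + 1) * subspace_size, g] = 1.0
    return B2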