The resulting weights are shown in Figure . The neurons in the upper right corner of A1 are specialised to just a few data samples. Two neurons in the lower left corner have connections only in the B1 matrix and thus add only noise to the reconstruction, and three others have connections in both A1 and B1. The remaining B connections are close to zero. Reading from the upper left corner of A2, the first neuron of the third layer activates the specialised neurons and inhibits the neurons that add noise. The second and fourth neurons activate the noise neurons and inhibit the low-frequency neurons in the lower right of A1. The third neuron activates the neurons with diagonal features running from the upper left to the lower right corner, and the fifth neuron activates the horizontal features.
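To make the roles of the matrices concrete, the following sketch shows schematically how they could enter a two-layer variance model; the exponential parameterisation of the variance neurons and all of the names used here are assumptions made for this illustration only, not a description of the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_patch(s3, A1, B1, A2, B2):
    """Schematic two-layer variance model (illustration only): the sources
    of the upper layer set the means (A) and the variance-neuron inputs (B)
    of the layer below."""
    # Second-layer sources: means from A2, variances modulated through B2.
    s2 = A2 @ s3 + rng.normal(scale=np.exp(-0.5 * (B2 @ s3)))
    # Data patch: means from A1, variances modulated through B1.
    x = A1 @ s2 + rng.normal(scale=np.exp(-0.5 * (B1 @ s2)))
    return x
```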
The initialisation seems to affect the results considerably, as was already noticed with the simple example in Section . The algorithm can get stuck in a local minimum of the cost function, and even regeneration of the neurons does not always help: once there is an adequate reconstruction of the data, it is very hard for a new neuron to fit in. The asymmetric initialisation with VQ seems to result in more asymmetric features than the ones presented here, but more experiments are needed to confirm that. I would like to stress that a local optimum of a good model is usually better than a global optimum of a bad one.
In the simple example in Section , the terms of the cost function corresponding to the third layer were greater than those corresponding to the second layer. This means that the reconstructions were effectively decided on the third layer and only refined on the second layer, or in other words that the neurons on the second layer were acting like computational units. In this experiment, however, the proportion is one part on the third layer against eight parts on the second layer, which means that the reconstruction of the image is decided on the second layer and the third layer only guides it.
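To illustrate how such a proportion is read off the learning results, the following sketch sums per-neuron cost contributions by layer; the numbers are invented for the example and only reproduce the rough one-to-eight ratio mentioned above.

```python
import numpy as np

# Invented per-neuron cost contributions (in nats), grouped by layer.
layer2_terms = np.array([4.1, 3.8, 5.0, 4.6])   # second-layer neurons
layer3_terms = np.array([0.6, 0.5, 0.4, 0.7])   # third-layer neurons

ratio = layer2_terms.sum() / layer3_terms.sum()
print(f"second layer carries {ratio:.1f} times the cost of the third layer")
```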
At the end of the iteration process, the cost function could still have been made smaller by removing some neurons that were stuck in local minima. After the network is rebooted and updated some more, the number of neurons alive diminishes further. Since there are sources on the third layer that correspond to horizontal and diagonal features on the second layer, there should also be sources corresponding to the vertical and the other diagonal directions. There is still work to be done to improve the results.
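The pruning referred to above can be sketched as follows; the set-based representation of the network and the cost function handle are hypothetical stand-ins for the actual implementation.

```python
def prune_stuck_neurons(active, cost):
    """Greedy sketch: drop every neuron whose removal lowers the cost.

    `active` is a set of neuron indices and `cost(active)` is assumed to
    evaluate the cost function for that set of neurons; both are stand-ins
    for the actual implementation.
    """
    for n in sorted(active):
        without = active - {n}
        if cost(without) < cost(active):
            active = without
    return active
```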
The resulting patches in A1 resemble the wavelets or the features found by ICA. They are more localised than the features of PCA, but only the features that fill most of the patch seem to survive. They seem to follow the prior set for them in Section . Each patch is either close to zero or active over the whole patch or a large part of it. This suggests that a sparse prior for the weights could help to obtain more local features, like the ones found by the ICA algorithm in Figure . Preliminary experiments support this assumption. A sparse prior would also help to keep more neurons alive and make the reconstructions more accurate. Using a larger data set would also help by making the relative cost of describing the weights smaller.
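One concrete form such a sparse prior could take, given here only as an example since the choice is not fixed above, is a Laplace prior on the weights,
\[
  p(a_{ij}) = \frac{1}{2b}\exp\!\left(-\frac{|a_{ij}|}{b}\right),
\]
whose heavy peak at zero favours weight vectors in which most elements vanish, so that only a local group of pixels remains active in each patch.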
Sparse connectivity would also help the upper layer. As the number of features grows, an increasing part of the cost goes into describing that a particular neuron on the third layer does not affect the activities of most of the neurons on the second layer. With sparse connections the situation would be very different: it would be useful to have a neuron on the third layer that simply states that two particular features are typically active together. When the size of the image patches is increased, this locality would become increasingly important.
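As a toy illustration of the idea (the sizes and indices are invented), a third-layer neuron with a sparse outgoing weight vector only needs to describe the couple of features it binds together:

```python
import numpy as np

n_features = 36                     # second-layer neurons (invented size)
a2_column = np.zeros(n_features)    # outgoing weights of one third-layer neuron

# A dense column must describe all 36 weights, most of them near zero,
# whereas a sparse one only states that features 7 and 19 co-occur.
a2_column[[7, 19]] = 1.0
print("non-zero connections to describe:", np.count_nonzero(a2_column))
```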
The assumptions made in the posterior approximation might explain why either A or B, describing the connections to a particular neuron and to its variance neuron, tends to get turned off. The upper layer cannot be connected straight to the mean and variance prior of a neuron, because that would violate the independence assumption. Adding the variance neuron in between restores the independence, but leaving the dependency unmodelled increases the cost. The increase is smallest when the dependency is smallest. If the connections in A and B were similar to each other, not only would the incorrect independence assumption raise the cost function, but the cost would also include the description of the weights, which are likewise assumed to be independent of each other.
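The mechanism can be spelled out by recalling the general form of the ensemble-learning cost function, sketched here with S standing for the sources and \(\boldsymbol{\theta}\) for the other parameters,
\[
  C = \int q(\boldsymbol{S}, \boldsymbol{\theta})
      \ln \frac{q(\boldsymbol{S}, \boldsymbol{\theta})}{p(\boldsymbol{X}, \boldsymbol{S}, \boldsymbol{\theta})}
      \, d\boldsymbol{S} \, d\boldsymbol{\theta},
  \qquad
  q(\boldsymbol{S}, \boldsymbol{\theta}) = \prod_i q_i .
\]
Any posterior dependency between a neuron and the variance neuron mediating its prior is left unmodelled by the factorial q and therefore shows up as an additive penalty in C.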
The experiments on image data using topographic ICA and independent subspace analysis [30,29,31] are perhaps the closest to the ones presented here. The features found with these methods that correspond to A1 are similar to those of basic ICA. The topography, or the collection of subspaces, corresponds approximately to a fixed, predetermined matrix B2. In the case of the topography, each patch of B2 would contain a Gaussian spot; in the case of independent subspace analysis, each patch of B2 would activate a distinct group of features, that is, an independent subspace. These methods have no parts corresponding to B1 and A2. It seems that ICA has not previously been used to learn all parts of such hierarchical models successfully.
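As a toy illustration of this comparison (the grid and subspace sizes are invented), the fixed B2 of topographic ICA would contain a Gaussian spot in each column, whereas that of independent subspace analysis would be a group indicator:

```python
import numpy as np

n = 16                                   # second-layer features on a 4x4 grid (invented)
grid = np.stack(np.unravel_index(np.arange(n), (4, 4)), axis=1)

# Topographic ICA: each column of B2 is a Gaussian spot around one feature.
dist2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
B2_topographic = np.exp(-dist2 / 2.0)

# Independent subspace analysis: each column activates one group of 4 features.
B2_isa = np.kron(np.eye(4), np.ones((4, 1)))
```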