The linear reconstruction mapping in equation 3.8 can be
used to derive the learning rule as explained in
section 2.1. The parameters of the network are the
weight vectors, and the learning rule can be derived by
gradient descent on the quadratic reconstruction error. The outputs
of the network approximately minimise the same error function, and
therefore the derivatives of the outputs can be ignored, as in
equation 2.2.
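Written out in the notation used below ($E$ for the quadratic reconstruction error, $y_i$ for the outputs), the approximation is
\[
\frac{\mathrm{d}E}{\mathrm{d}\mathbf{w}_j}
  = \frac{\partial E}{\partial \mathbf{w}_j}
  + \sum_i \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial \mathbf{w}_j}
  \approx \frac{\partial E}{\partial \mathbf{w}_j},
\]
since at an approximate minimum of $E$ with respect to the outputs, $\partial E / \partial y_i \approx 0$ and the chain-rule term through the outputs can be neglected.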
We would like our network to learn sparse codes but, as has been noted
before, linear reconstruction does not promote sparsity in any way.
We know that the system gives the best reconstruction for an input
when all the neurons participate in the representation. If all the
correlations between the weight vectors were negative or zero, the
neurons would have no competition and they could all participate in
the representation. Therefore, if we only minimised the linear
reconstruction error of equation 3.8, the correlations between the
weight vectors would decrease until all competition vanished. The
absence of competition would automatically mean that all the outputs
are active for every input, and the code would not be sparse at all.
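A small numerical sketch of this point (NumPy; the orthonormal weight vectors and the least-squares outputs are an idealisation of the zero-competition case, not the network's actual output rule):

import numpy as np

rng = np.random.default_rng(0)

# Orthonormal weight vectors: the correlations w_i . w_j are zero for i != j,
# so there is no competition between the neurons.
W, _ = np.linalg.qr(rng.standard_normal((8, 8)))
W = W.T                                      # rows are the weight vectors w_j

x = rng.standard_normal(8)                   # an arbitrary input vector
y = W @ x                                    # least-squares outputs for orthonormal rows
x_tilde = y @ W                              # linear reconstruction sum_j y_j w_j

print(np.allclose(x_tilde, x))               # True: perfect reconstruction
print(np.count_nonzero(np.abs(y) > 1e-12))   # 8: every neuron participates, nothing is sparse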
In order to learn sparse codes, we have to make a minor, biologically plausible addition to the learning rule. The result will be a combination of gradient descent on the reconstruction error and the use of neighbourhoods as in SOM [Kohonen, 1995]. If we only minimised the linear reconstruction error, the weight vectors would be pulled away from each other. The use of neighbourhoods introduces an opposing force which tries to turn the vectors more parallel. Those directions where the input density is large get more weight vectors, and the resolution of the representation becomes proportionally sharper. This result is qualitatively similar to that obtained with the SOM algorithm [Ritter and Schulten, 1986]. (For more information about SOM, see the Master's thesis of Jaakko Hollmén.)
We shall first derive the first part of the learning rule by using
gradient descent. In order to do this, we have to find the gradient
of the quadratic reconstruction error $E = \| \mathbf{x} - \tilde{\mathbf{x}} \|^2$
with respect to the weight vectors $\mathbf{w}_j$, which are the
parameters of the network:
\[
-\frac{\partial E}{\partial \mathbf{w}_j}
  = -\frac{\partial}{\partial \mathbf{w}_j}
    \Big\| \mathbf{x} - \sum_i y_i\, \mathbf{w}_i \Big\|^2
  \approx 2\, y_j\, (\mathbf{x} - \tilde{\mathbf{x}}).
\]
Here the reconstruction error is $\mathbf{x} - \tilde{\mathbf{x}}$, where
$\tilde{\mathbf{x}} = \sum_i y_i\, \mathbf{w}_i$ is the linear
reconstruction of equation 3.8. The last step follows from equation 2.2, as
explained in the beginning of this section. The standard gradient
descent learning rule would be
\[
\Delta \mathbf{w}_j = \alpha\, y_j\, (\mathbf{x} - \tilde{\mathbf{x}}). \tag{3.14}
\]
Here $\alpha$ is the learning rate governing the length of the steps
taken in the descending direction of the gradient.
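As a concrete illustration, a minimal sketch of the update of equation 3.14 in NumPy; the outputs are computed elsewhere in the network, so the sketch takes them as given:

import numpy as np

def gradient_step(W, x, y, alpha=0.01):
    """One step of equation 3.14: Delta w_j = alpha * y_j * (x - x_tilde).

    W     -- weight vectors as rows, shape (n_neurons, dim)
    x     -- input vector, shape (dim,)
    y     -- outputs of the network, shape (n_neurons,), computed elsewhere
    alpha -- learning rate
    """
    x_tilde = y @ W                           # linear reconstruction of equation 3.8
    return W + alpha * np.outer(y, x - x_tilde)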
The second part of the learning rule will look somewhat similar to the SOM
learning rule [Kohonen, 1995]. SOM is a vector quantisation
algorithm and is derived under the restriction that exactly one of the
outputs is one and the rest are zero. The main difference between SOM
and most other VQ algorithms is the use of neighbourhoods in adaptation:
in addition to the winning neuron, its neighbours are adapted as well.
In the SOM algorithm, the learning rule for the model vectors is
\[
\Delta \mathbf{w}_j = \alpha\, h_{cj}\, (\mathbf{x} - \mathbf{w}_j). \tag{3.15}
\]
The amount of adaptation due to the cooperation is governed by the
neighbourhood function $h_{cj}$. The reconstruction of the input for each
neuron in SOM is the neuron's own model vector $\mathbf{w}_j$. There is also
only one winner $c$, and thus the reconstruction produced by the map is
$\tilde{\mathbf{x}} = \mathbf{w}_c$.
Therefore the learning rule could also be written
\[
\Delta \mathbf{w}_j = \alpha\, h_{cj}
  \left[ (\mathbf{x} - \tilde{\mathbf{x}}) + (\tilde{\mathbf{x}} - \mathbf{w}_j) \right]. \tag{3.16}
\]
Here the weight vector is adapted in the direction of the reconstruction
error, $\mathbf{x} - \tilde{\mathbf{x}}$, plus the direction of the reconstruction,
$\tilde{\mathbf{x}} - \mathbf{w}_j$. In our case there can be many winners, and
the reconstruction $\mathbf{w}_c$ is different for each of them.
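For concreteness, a minimal sketch of the SOM update of equation 3.15 on a one-dimensional map (the Gaussian form and the width of the neighbourhood are hypothetical choices):

import numpy as np

def som_step(W, x, alpha=0.1, sigma=1.0):
    """One SOM adaptation step (equation 3.15) for a one-dimensional map.

    W rows are the model vectors; the neighbourhood h_cj is taken to be a
    Gaussian of the distance between the map indices c and j.
    """
    c = np.argmin(np.linalg.norm(W - x, axis=1))      # winner: closest model vector
    j = np.arange(W.shape[0])
    h = np.exp(-((j - c) ** 2) / (2.0 * sigma ** 2))  # neighbourhood function h_cj
    return W + alpha * h[:, None] * (x - W)           # pull winner and neighbours toward x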
We can combine the use of neighbourhoods in equation 3.16 and the
reconstruction error minimisation in equation 3.14 and sum the
adaptations over all winners $c$:
\[
\Delta \mathbf{w}_j = \alpha \sum_{c} h_{cj}\, y_c
  \left[ (\mathbf{x} - \tilde{\mathbf{x}}) + \beta\, (\mathbf{w}_c - \mathbf{w}_j) \right]. \tag{3.17}
\]
We have introduced a coefficient $\beta$, which can be used to
regulate the ratio between the adaptation due to the reconstruction error
minimisation and the adaptation due to the cooperation in the
neighbourhood. The weight vectors have to be normalised after each
adaptation. Due to the normalisation we can drop away the term
$-\beta\, \mathbf{w}_j$: it points along the direction of $\mathbf{w}_j$
itself, and therefore the normalisation cancels its effect anyway.
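A sketch of one combined adaptation step under these conventions (the outputs y and the neighbourhood matrix H, with entries h_cj, are taken as given; the array names are ours):

import numpy as np

def combined_step(W, x, y, H, alpha=0.01, beta=0.5):
    """One step of equation 3.17 with the parallel term dropped:
    Delta w_j = alpha * sum_c h_cj * y_c * [(x - x_tilde) + beta * w_c],
    followed by normalisation of the weight vectors.

    W -- weight vectors as rows, shape (n, dim); y -- outputs, shape (n,)
    H -- neighbourhood matrix with H[c, j] = h_cj
    """
    x_tilde = y @ W                         # linear reconstruction (equation 3.8)
    g = H.T @ y                             # sum_c h_cj * y_c, for each neuron j
    coop = H.T @ (y[:, None] * W)           # sum_c h_cj * y_c * w_c, for each neuron j
    W_new = W + alpha * (np.outer(g, x - x_tilde) + beta * coop)
    # Normalise after the adaptation; this is what allows dropping -beta * w_j above.
    return W_new / np.linalg.norm(W_new, axis=1, keepdims=True)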
The determination of the winning ratios in equation 3.6
required knowledge of the correlations between the weight
vectors. It would be quite a heavy operation to compute the
correlations after each input, and therefore we shall use a batch
version of the adaptation algorithm. The adaptations are summed over
the set $T$ of indices of the inputs in the batch, and the normalisation
of the weight vectors can likewise be done only after each batch. When
using the batch version of the learning rule, the learning rate parameter
$\alpha$ can be dropped away:
\[
\Delta \mathbf{w}_j = \sum_{c} h_{cj} \sum_{t \in T} y_c(t)
  \left[ \big( \mathbf{x}(t) - \tilde{\mathbf{x}}(t) \big) + \beta\, \mathbf{w}_c \right]. \tag{3.18}
\]
Here we have supposed that the neighbourhood function does not
change in time. As can be seen from the equation, the batch
version enables us to move the summation over the neighbourhoods
outside the summation over the batch, and therefore it needs to be
carried out only once after each batch. On the other hand, writing the
summations in this order prevents us from adapting only the set of
winners and their neighbours. If the batch is sufficiently long,
however, there are so many winners that this order is computationally
more efficient.
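A sketch of the batch rule along these lines (the outputs and the neighbourhood matrix are again taken as given, so the winner selection of equation 3.6 is not shown; Y stacks the outputs for the whole batch):

import numpy as np

def batch_step(W, X, Y, H, beta=0.5):
    """Batch version (equation 3.18): accumulate over the batch first, apply the
    neighbourhood summation once per batch, then normalise the weight vectors.

    W -- weight vectors as rows, shape (n, dim)
    X -- batch of inputs, shape (T, dim); Y -- outputs for the batch, shape (T, n)
    H -- neighbourhood matrix with H[c, j] = h_cj, constant in time
    """
    X_tilde = Y @ W                          # reconstructions for the whole batch
    A = Y.T @ (X - X_tilde)                  # sum_t y_c(t) (x(t) - x_tilde(t)), per c
    A += beta * Y.sum(axis=0)[:, None] * W   # plus beta * w_c * sum_t y_c(t)
    W_new = W + H.T @ A                      # neighbourhood summation, once per batch
    # No separate learning rate: the weights are normalised after each batch.
    return W_new / np.linalg.norm(W_new, axis=1, keepdims=True)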