We shall first consider the algorithms based on reconstruction error
minimisation. Let us denote the inputs to the network by , and the outputs of the network by . Then the network can be described by a
mapping from the inputs to the outputs: . The reconstruction of the input is then a mapping
from the outputs to the inputs .
We shall assume that some information is lost and the reconstruction
is not exact (cf. pseudoinverse of a matrix). A reasonable way to
measure the goodness of the representation is to measure
the reconstruction error
with a given error function . Usually a
*quadratic reconstruction error* is
used due to its mathematical tractability. Figure 2.1
illustrates the architecture underlying unsupervised learning models
based on reconstruction error minimisation.

**Figure 2.1:**
Architecture underlying unsupervised learning models based on
reconstruction error minimisation.

Due to the information loss in mapping there is no unique
way to choose the reconstruction mapping . This can be
illustrated by considering a linear mapping from 2-dimensional input
to 1-dimensional output *y*. The mapping
is many-to-one, and therefore it does not have a unique inverse. If
we are given the output *y*, there are many candidate inputs
which could have generated the output, thus we have many
possible choices for the reconstruction mapping . There
is no obvious way to know which one of them would minimise the
reconstruction error. If the reconstruction mapping is
given instead of , however, we can simply define the mapping
to be the one that minimises the reconstruction error.
Among all different possible outputs we choose the one that minimises
the reconstruction error.

In our example this would mean that we first define a linear
reconstruction mapping from *y* to and then solve the
that minimises the quadratic reconstruction error. It
turns out that is a unique linear mapping.

The example shows how it is often difficult to find the reconstruction mapping for a given mapping , but equation 2.1 makes it easy to find the mapping from a given reconstruction mapping . By using equation 2.1 it is also easy to impose various constraints on the outputs , because we can restrict the search for the minimum in equation 2.1 to the set of allowed outputs.

A further advantage of using equation 2.1 is that the derivatives of the reconstruction error with respect to the outputs are zero because the outputs minimise the reconstruction error. This greatly simplifies the derivation of the learning rule for the network. If we denote the parameters of the network by , then, for a given input , the reconstruction error is a function of the outputs and the weights: . The goal of the learning is to adapt the parameters so that the average reconstruction error made by the network is minimised. We can derive the adaptation rule for the weights by applying gradient descent. To do this we have to be able to compute the derivatives of with respect to the weights .

The last step follows from the fact that in equation 2.1
the outputs are defined to minimise the reconstruction error and thus for all
*j*. The last step holds even if there were some constraints for the
outputs, which do not depend on the parameters . Let us denote
. This vector points to the
direction of the change in the output, when the parameter
changes by an infinitesimal amount. If the outputs are constrained,
can only point to a direction where is
allowed to change. The sum in equation 2.2 can be
interpreted as the directional derivative of in the
direction . The outputs minimise and
therefore the derivative is zero in all allowed directions of change
of the outputs.

The outputs could, for example, be restricted to be non-negative. Suppose that the output would be zero due to this constraint. If the output had been allowed to be negative, the reconstruction error could have been further decreased by decreasing . This means that the derivative cannot be zero. This time, however, the output would not change if the parameters changed only by an infinitesimal amount. The output would still remain exactly zero. Therefore, in this example it always holds that either or .

Problems may arise if equation 2.1 has degenerate solutions. If we are using the network in feedforward phase, that is, we are not adapting the parameters, then we can just pick one of the solutions. We might run into problems if we try to adapt the network, because the mapping has discontinuities where the equation 2.1 has degenerate solutions, and it is not possible to compute the derivatives of outputs in equation 2.2. Luckily in most practical cases the probability of degenerate solutions is zero.

Equations 2.1 and 2.2 provide us a general framework for developing neural networks based on reconstruction error minimisation. The development starts by defining the reconstruction error, the parametrised reconstruction mapping, and possible constraints for the outputs. Sometimes we are able to solve equation 2.1 in a closed form, but in general we have to solve it numerically.

We shall first consider a simple case where we can find the solution
to equation 2.1 in a closed form. If we require the
reconstruction to be a linear mapping and there are no
constraints imposed on the outputs, it turns out that, assuming the
quadratic reconstruction error, also the mapping will be
linear. There are many degrees of freedom in the mapping and we can
require the mapping to be orthogonal without any loss of generality. The *principal
component analysis* (PCA) gives this kind of orthogonal mapping
which is optimal with respect to the quadratic reconstruction error
(see e.g. Oja, 1995). The code is maximally dense in
the sense that all the neurons participate in representing any given
input.
(For more
information about PCA see the Master's Thesis of
Jaakko Hollmén).

If we require the reconstruction to be a linear mapping and
furthermore we require the outputs to be binary and exactly
one neuron to be active at any one time, then we have made sure that
the code we obtain is maximally local. The resulting coding scheme is
called *vector quantisation* (VQ) (see e.g. Abut,
1990), because we can simply associate each neuron with the
corresponding reconstruction and then decide the active neuron by
finding the neuron whose reconstruction vector is closest to the input
according to equation 2.1. This can be interpreted as quantifying the input.

When the same basic scheme can produce both maximally dense and local codes, it is obvious to try to produce also sparse codes. Saund (1994, 1995) has proposed a few different reconstruction mappings , which can be used for sparse coding. They are chosen so that the resulting coding is sparse whenever the input is sparse, i.e., is describable with a sparse code. Unfortunately the inputs have to be binary, which restricts the use of the algorithm. It might be possible to generalise the approach for graded inputs, but it does not seem very easy. Olshausen and Field (1995) have developed an algorithm which uses the same kind of linear reconstruction scheme as PCA and VQ, but have added an extra term which promotes sparsity to the reconstruction error function. Equation 2.1 does not have a simple closed form solution in either of these algorithms, and the outputs have to be computed iteratively by gradient descent of the reconstruction error . This makes the computational complexity depend quadratically on the number of the neurons in the network.

Although the mappings and can be related by using equation 2.1, it is also possible to parametrise and learn them both separately. Algorithms using this kind of autoencoder framework have been proposed by several authors, and there has also been modifications which yield sparse codes [Dayan and Zemel, 1995, Zemel, 1993, Hinton and Zemel, 1994]. These algorithms can compute the outputs without any time consuming iterations, but unfortunately learning seems to be very slow. This is probably due to the fact that the network has to learn the same things twice as the learning does not take into account the relation between and in any way.

Thu May 9 14:06:29 DST 1996