We shall first consider the algorithms based on reconstruction error minimisation. Let us denote the inputs to the network by $\mathbf{x}$ and the outputs of the network by $\mathbf{y}$. Then the network can be described by a mapping from the inputs to the outputs: $\mathbf{y} = f(\mathbf{x})$. The reconstruction of the input is then a mapping from the outputs back to the inputs: $\hat{\mathbf{x}} = g(\mathbf{y})$. We shall assume that some information is lost and the reconstruction is not exact (cf. the pseudoinverse of a matrix). A reasonable way to measure the goodness of the representation is to measure the reconstruction error with a given error function $E(\mathbf{x}, \hat{\mathbf{x}})$. Usually a quadratic reconstruction error is used due to its mathematical tractability. Figure 2.1 illustrates the architecture underlying unsupervised learning models based on reconstruction error minimisation.
Figure 2.1: Architecture underlying unsupervised learning models based on reconstruction error minimisation.
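To make the setting concrete, the following minimal sketch (in Python with NumPy; the linear forms of $f$ and $g$ and all variable names are illustrative assumptions, not taken from the text) shows the three ingredients of the architecture: a mapping $f$, a reconstruction mapping $g$, and a quadratic error function.

import numpy as np

# A minimal sketch of the architecture in Figure 2.1, assuming a linear
# mapping f and a linear reconstruction mapping g (illustrative choices).

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))    # parameters of the mapping f: x -> y
A = rng.standard_normal((5, 3))    # parameters of the reconstruction g: y -> x_hat

def f(x):
    """Mapping from the inputs to the outputs (the internal representation)."""
    return W @ x

def g(y):
    """Reconstruction mapping from the outputs back to the input space."""
    return A @ y

def reconstruction_error(x, x_hat):
    """Quadratic reconstruction error E(x, x_hat)."""
    return 0.5 * np.sum((x - x_hat) ** 2)

x = rng.standard_normal(5)
y = f(x)
print("reconstruction error:", reconstruction_error(x, g(y)))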
Due to the information loss in the mapping $f$ there is no unique way to choose the reconstruction mapping $g$. This can be illustrated by considering a linear mapping from a 2-dimensional input $\mathbf{x}$ to a 1-dimensional output $y$. The mapping is many-to-one, and therefore it does not have a unique inverse. If we are given the output $y$, there are many candidate inputs which could have generated it, and thus many possible choices for the reconstruction mapping $g$. There is no obvious way to know which one of them would minimise the reconstruction error. If the reconstruction mapping $g$ is given instead of $f$, however, we can simply define the mapping $f$ to be the one that minimises the reconstruction error: among all possible outputs we choose the one that minimises the reconstruction error,

$$\mathbf{y}(\mathbf{x}) = \arg\min_{\mathbf{y}} E(\mathbf{x}, g(\mathbf{y})). \qquad (2.1)$$
In our example this would mean that we first define a linear reconstruction mapping $g$ from $y$ to $\hat{\mathbf{x}}$ and then solve the $y$ that minimises the quadratic reconstruction error. It turns out that the resulting mapping $f$ is a unique linear mapping.
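Written out explicitly (a worked version of the example above; the reconstruction vector $\mathbf{a}$ is introduced here purely for illustration), with $g(y) = \mathbf{a}\,y$ the quadratic reconstruction error and its minimising output are

$$E(\mathbf{x}, \hat{\mathbf{x}}) = \|\mathbf{x} - \mathbf{a}\,y\|^2, \qquad \frac{\partial E}{\partial y} = -2\,\mathbf{a}^{T}(\mathbf{x} - \mathbf{a}\,y) = 0 \;\;\Longrightarrow\;\; y = \frac{\mathbf{a}^{T}\mathbf{x}}{\mathbf{a}^{T}\mathbf{a}},$$

so the optimal output is a fixed linear function of the input, i.e., $f(\mathbf{x}) = \mathbf{a}^{T}\mathbf{x} / \mathbf{a}^{T}\mathbf{a}$.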
The example shows how it is often difficult to find the reconstruction mapping $g$ for a given mapping $f$, but equation 2.1 makes it easy to find the mapping $f$ from a given reconstruction mapping $g$. By using equation 2.1 it is also easy to impose various constraints on the outputs $\mathbf{y}$, because we can restrict the search for the minimum in equation 2.1 to the set of allowed outputs.
A further advantage of using equation 2.1 is that the derivatives of the reconstruction error with respect to the outputs are zero, because the outputs minimise the reconstruction error. This greatly simplifies the derivation of the learning rule for the network. If we denote the parameters of the network by $\mathbf{w}$, then, for a given input $\mathbf{x}$, the reconstruction error is a function of the outputs and the weights: $E = E(\mathbf{x}; \mathbf{y}, \mathbf{w})$. The goal of the learning is to adapt the parameters $\mathbf{w}$ so that the average reconstruction error made by the network is minimised. We can derive the adaptation rule for the weights by applying gradient descent. To do this we have to be able to compute the derivatives of $E$ with respect to the weights $w_i$:

$$\frac{\mathrm{d} E}{\mathrm{d} w_i} = \frac{\partial E}{\partial w_i} + \sum_j \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial w_i} = \frac{\partial E}{\partial w_i}. \qquad (2.2)$$
The last step follows from the fact that in equation 2.1 the outputs are defined to minimise the reconstruction error, and thus $\partial E / \partial y_j = 0$ for all $j$. The last step holds even if there are constraints on the outputs, as long as they do not depend on the parameters $\mathbf{w}$. Let us denote $\mathbf{d}_i = \partial \mathbf{y} / \partial w_i$. This vector points in the direction in which the output changes when the parameter $w_i$ changes by an infinitesimal amount. If the outputs are constrained, $\mathbf{d}_i$ can only point in a direction where $\mathbf{y}$ is allowed to change. The sum in equation 2.2 can be interpreted as the directional derivative of $E$ in the direction $\mathbf{d}_i$. The outputs minimise $E$, and therefore the derivative is zero in all allowed directions of change of the outputs.
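A quick numerical check of equation 2.2 (a sketch, assuming a linear reconstruction $\hat{\mathbf{x}} = \mathbf{A}\mathbf{y}$, no output constraints, and illustrative variable names): the total derivative of $E$ with respect to a parameter, computed by finite differences with the outputs re-optimised, agrees with the explicit partial derivative evaluated at the optimal outputs.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))    # parameters of the reconstruction mapping g
x = rng.standard_normal(5)

def optimal_y(A, x):
    # Closed-form solution of equation 2.1 for quadratic error and linear g
    return np.linalg.lstsq(A, x, rcond=None)[0]

def E(A, x, y):
    return 0.5 * np.sum((x - A @ y) ** 2)

y = optimal_y(A, x)
eps = 1e-6
i, j = 2, 1                        # an arbitrary parameter A[i, j]

A_pert = A.copy()
A_pert[i, j] += eps
total = (E(A_pert, x, optimal_y(A_pert, x)) - E(A, x, y)) / eps   # total derivative
partial = -(x - A @ y)[i] * y[j]   # explicit partial derivative dE/dA_ij at fixed y

print(total, partial)              # the two agree up to O(eps)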
The outputs could, for example, be restricted to be non-negative. Suppose that the output $y_j$ is zero due to this constraint. If the output had been allowed to be negative, the reconstruction error could have been decreased further by decreasing $y_j$. This means that the derivative $\partial E / \partial y_j$ cannot be zero. In this case, however, the output would not change if the parameters changed only by an infinitesimal amount; it would still remain exactly zero, i.e., $\partial y_j / \partial w_i = 0$. Therefore, in this example it always holds that either $\partial E / \partial y_j = 0$ or $\partial y_j / \partial w_i = 0$.
Problems may arise if equation 2.1 has degenerate solutions. If we are using the network in the feedforward phase, that is, we are not adapting the parameters, then we can simply pick one of the solutions. We may run into problems if we try to adapt the network, because the mapping $f$ has discontinuities where equation 2.1 has degenerate solutions, and it is then not possible to compute the derivatives of the outputs in equation 2.2. Luckily, in most practical cases the probability of degenerate solutions is zero.
Equations 2.1 and 2.2 provide us with a general framework for developing neural networks based on reconstruction error minimisation. The development starts by defining the reconstruction error, the parametrised reconstruction mapping and possible constraints on the outputs. Sometimes we are able to solve equation 2.1 in closed form, but in general we have to solve it numerically.
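When no closed form is available, equation 2.1 can be solved numerically, for instance by gradient descent on the outputs. The following sketch (Python/NumPy; the linear reconstruction, the non-negativity constraint, the step size and the iteration count are all illustrative assumptions) finds the outputs by projected gradient descent:

import numpy as np

# A sketch of solving equation 2.1 numerically: projected gradient descent
# on the outputs y, under a non-negativity constraint, for a linear
# reconstruction x_hat = A y and quadratic error. Names are illustrative.

def solve_outputs(A, x, steps=500, lr=0.01):
    y = np.zeros(A.shape[1])
    for _ in range(steps):
        grad = -A.T @ (x - A @ y)            # dE/dy for E = 0.5 * ||x - A y||^2
        y = np.maximum(y - lr * grad, 0.0)   # gradient step + projection onto y >= 0
    return y

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 4))
x = rng.standard_normal(8)
y = solve_outputs(A, x)
print("outputs:", y)
print("reconstruction error:", 0.5 * np.sum((x - A @ y) ** 2))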
We shall first consider a simple case where we can find the solution to equation 2.1 in closed form. If we require the reconstruction $g$ to be a linear mapping and impose no constraints on the outputs, it turns out that, assuming the quadratic reconstruction error, the mapping $f$ will also be linear. There are many degrees of freedom in the mapping, and we can require the mapping to be orthogonal without any loss of generality. Principal component analysis (PCA) gives this kind of orthogonal mapping, which is optimal with respect to the quadratic reconstruction error (see e.g. Oja, 1995). The code is maximally dense in the sense that all the neurons participate in representing any given input. (For more information about PCA, see the Master's Thesis of Jaakko Hollmén.)
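As an illustration of this case (a sketch only; the toy data, the number of components and the eigendecomposition route are illustrative choices), the orthogonal mapping can be taken to consist of the leading eigenvectors of the data covariance, and the reconstruction mapping is then simply its transpose:

import numpy as np

# PCA as reconstruction error minimisation (sketch): an orthogonal mapping
# f(x) = W x with reconstruction g(y) = W^T y, where the rows of W are the
# leading eigenvectors of the data covariance matrix.

rng = np.random.default_rng(3)
X = rng.standard_normal((1000, 10)) @ rng.standard_normal((10, 10))  # toy data
X = X - X.mean(axis=0)                       # centre the data

n_components = 3
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :n_components].T     # rows = leading principal directions

Y = X @ W.T                                  # outputs y = f(x): a maximally dense code
X_hat = Y @ W                                # reconstructions g(y) = W^T y
print("mean quadratic reconstruction error:",
      np.mean(np.sum((X - X_hat) ** 2, axis=1)))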
If we require the reconstruction $g$ to be a linear mapping and furthermore require the outputs to be binary with exactly one neuron active at any one time, then we have made sure that the code we obtain is maximally local. The resulting coding scheme is called vector quantisation (VQ) (see e.g. Abut, 1990), because we can simply associate each neuron with the corresponding reconstruction vector and then decide the active neuron by finding the neuron whose reconstruction vector is closest to the input according to equation 2.1. This can be interpreted as quantising the input.
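A corresponding sketch for vector quantisation (again with illustrative choices: a small toy data set, a codebook of eight reconstruction vectors, and a simple batch, k-means style update rather than any particular rule from the references):

import numpy as np

# Vector quantisation as reconstruction error minimisation (sketch).
# Exactly one neuron is active per input; the winner is the neuron whose
# reconstruction vector is closest to the input (equation 2.1 under this
# constraint).

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 2))            # toy inputs
codebook = rng.standard_normal((8, 2))       # reconstruction vectors, one per neuron

for _ in range(20):                          # simple batch (k-means style) training
    dists = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    winners = dists.argmin(axis=1)           # active neuron for each input
    for k in range(len(codebook)):
        if np.any(winners == k):             # move each reconstruction vector towards
            codebook[k] = X[winners == k].mean(axis=0)  # the mean of the inputs it won

dists = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
winners = dists.argmin(axis=1)
print("mean quadratic reconstruction error:",
      ((X - codebook[winners]) ** 2).sum(axis=1).mean())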
Since the same basic scheme can produce both maximally dense and maximally local codes, it is natural to try to produce sparse codes as well. Saund (1994, 1995) has proposed a few different reconstruction mappings $g$ which can be used for sparse coding. They are chosen so that the resulting code is sparse whenever the input is sparse, i.e., describable with a sparse code. Unfortunately, the inputs have to be binary, which restricts the use of the algorithm. It might be possible to generalise the approach to graded inputs, but this does not seem easy. Olshausen and Field (1995) have developed an algorithm which uses the same kind of linear reconstruction scheme as PCA and VQ, but which adds an extra term promoting sparsity to the reconstruction error function. Equation 2.1 does not have a simple closed-form solution in either of these algorithms, and the outputs have to be computed iteratively by gradient descent on the reconstruction error $E$. This makes the computational complexity depend quadratically on the number of neurons in the network.
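The following sketch follows the general idea of the Olshausen and Field scheme rather than their exact algorithm: a linear reconstruction plus a sparsity-promoting penalty on the outputs, with the outputs computed by gradient descent because equation 2.1 has no closed-form solution; the penalty, dimensions and step sizes are illustrative choices.

import numpy as np

# Sparse coding sketch: linear reconstruction x_hat = A y plus a sparsity
# penalty on y, with the outputs found by gradient descent on the total cost
# E(y) = 0.5 * ||x - A y||^2 + lam * sum(log(1 + y^2)).

rng = np.random.default_rng(5)
A = rng.standard_normal((16, 32))            # reconstruction mapping (basis vectors)
A /= np.linalg.norm(A, axis=0, keepdims=True)

def sparse_outputs(A, x, lam=0.1, steps=300, lr=0.05):
    y = np.zeros(A.shape[1])
    for _ in range(steps):
        grad = -A.T @ (x - A @ y) + lam * 2 * y / (1 + y ** 2)
        y -= lr * grad
    return y

x = A @ (rng.random(32) * (rng.random(32) < 0.1))   # input generated from a sparse code
y = sparse_outputs(A, x)
print("active outputs (|y| > 0.1):", int(np.sum(np.abs(y) > 0.1)))
print("reconstruction error:", 0.5 * np.sum((x - A @ y) ** 2))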
Although the mappings $f$ and $g$ can be related by using equation 2.1, it is also possible to parametrise and learn them both separately. Algorithms using this kind of autoencoder framework have been proposed by several authors, and there have also been modifications which yield sparse codes [Dayan and Zemel, 1995; Zemel, 1993; Hinton and Zemel, 1994]. These algorithms can compute the outputs without any time-consuming iterations, but unfortunately learning seems to be very slow. This is probably due to the fact that the network has to learn the same things twice, as the learning does not take the relation between $f$ and $g$ into account in any way.
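For comparison, a sketch of the autoencoder alternative (a generic linear autoencoder with illustrative dimensions and learning rate, not any of the cited sparse variants): the mappings $f$ and $g$ are parametrised separately, the outputs are obtained by a single feedforward pass, and both sets of parameters are adapted by gradient descent on the reconstruction error.

import numpy as np

# Autoencoder sketch: f and g are separate parametrised mappings, both
# trained by gradient descent on the quadratic reconstruction error.

rng = np.random.default_rng(6)
X = rng.standard_normal((1000, 10))
X = X - X.mean(axis=0)

W_enc = 0.1 * rng.standard_normal((4, 10))   # parameters of f
W_dec = 0.1 * rng.standard_normal((10, 4))   # parameters of g
lr = 0.01

for _ in range(200):                         # plain batch gradient descent
    Y = X @ W_enc.T                          # outputs: a single feedforward pass
    X_hat = Y @ W_dec.T                      # reconstructions
    R = X_hat - X                            # residuals
    grad_dec = R.T @ Y / len(X)              # dE/dW_dec
    grad_enc = (R @ W_dec).T @ X / len(X)    # dE/dW_enc (chain rule through Y)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print("mean quadratic reconstruction error:",
      np.mean(np.sum((X - X @ W_enc.T @ W_dec.T) ** 2, axis=1)))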