
Reconstruction error minimisation

 

We shall first consider the algorithms based on reconstruction error minimisation. Let us denote the inputs to the network by $\mathbf{x}$ and the outputs of the network by $\mathbf{y}$. The network can then be described by a mapping from the inputs to the outputs, $\mathbf{y} = f(\mathbf{x})$. The reconstruction of the input is a mapping from the outputs back to the inputs, $\hat{\mathbf{x}} = g(\mathbf{y})$. We shall assume that some information is lost and the reconstruction is not exact (cf. the pseudoinverse of a matrix). A reasonable way to measure the goodness of the representation $\mathbf{y}$ is to measure the reconstruction error $E(\mathbf{x}, \hat{\mathbf{x}})$ with a given error function $E$. Usually the quadratic reconstruction error $E(\mathbf{x}, \hat{\mathbf{x}}) = \|\mathbf{x} - \hat{\mathbf{x}}\|^2$ is used due to its mathematical tractability. Figure 2.1 illustrates the architecture underlying unsupervised learning models based on reconstruction error minimisation.

Figure 2.1: Architecture underlying unsupervised learning models based on reconstruction error minimisation.

Due to the information loss in the mapping $f$ there is no unique way to choose the reconstruction mapping $g$. This can be illustrated by considering a linear mapping from a 2-dimensional input $\mathbf{x}$ to a 1-dimensional output $y$. The mapping $f$ is many-to-one and therefore does not have a unique inverse. If we are given the output $y$, there are many candidate inputs $\mathbf{x}$ which could have generated it, and thus many possible choices for the reconstruction mapping $g$. There is no obvious way to know which of them would minimise the reconstruction error. If the reconstruction mapping $g$ is given instead of $f$, however, we can simply define the mapping $f$ to be the one that minimises the reconstruction error: among all possible outputs we choose the one that minimises the reconstruction error,

\[
  \mathbf{y}(\mathbf{x}) = \arg\min_{\mathbf{y}} E(\mathbf{x}, g(\mathbf{y})). \qquad (2.1)
\]

In our example this would mean that we first define a linear reconstruction mapping from $y$ to $\hat{\mathbf{x}}$ and then solve for the $f$ that minimises the quadratic reconstruction error. It turns out that the resulting $f$ is a unique linear mapping.
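
The following short Python sketch illustrates this example; the reconstruction vector $\mathbf{a}$, the particular input and the use of NumPy are illustrative assumptions. With a linear reconstruction $g(y) = \mathbf{a}\,y$ and the quadratic error, equation 2.1 gives the unique linear encoding $y = \mathbf{a}^{T}\mathbf{x} / (\mathbf{a}^{T}\mathbf{a})$.

import numpy as np

a = np.array([2.0, 1.0])          # illustrative reconstruction vector defining g

def f(x):
    # the unique linear mapping x -> y implied by equation 2.1
    return a @ x / (a @ a)

def g(y):
    # linear reconstruction mapping y -> x_hat
    return a * y

x = np.array([1.5, -0.5])
y = f(x)
x_hat = g(y)
print(y, x_hat, np.sum((x - x_hat) ** 2))   # reconstruction error at the minimum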

The example shows that it is often difficult to find the reconstruction mapping $g$ for a given mapping $f$, whereas equation 2.1 makes it easy to find the mapping $f$ from a given reconstruction mapping $g$. Using equation 2.1 it is also easy to impose various constraints on the outputs $\mathbf{y}$, because we can restrict the search for the minimum in equation 2.1 to the set of allowed outputs.

A further advantage of using equation 2.1 is that the derivatives of the reconstruction error with respect to the outputs are zero, because the outputs minimise the reconstruction error. This greatly simplifies the derivation of the learning rule for the network. If we denote the parameters of the network by $\mathbf{w}$, then, for a given input $\mathbf{x}$, the reconstruction error is a function of the outputs and the weights: $E(\mathbf{y}, \mathbf{w})$. The goal of learning is to adapt the parameters so that the average reconstruction error made by the network is minimised. We can derive the adaptation rule for the weights by applying gradient descent. To do this we need to be able to compute the derivatives of $E$ with respect to the weights $w_i$:

\[
  \frac{d E}{d w_i} = \frac{\partial E}{\partial w_i} + \sum_j \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial w_i} = \frac{\partial E}{\partial w_i}. \qquad (2.2)
\]

The last step follows from the fact that in equation 2.1 the outputs are defined to minimise the reconstruction error $E$, and thus $\partial E / \partial y_j = 0$ for all $j$. The last step holds even if there are constraints on the outputs, as long as they do not depend on the parameters $w_i$. Let us denote $\mathbf{d}_i = \partial \mathbf{y} / \partial w_i$. This vector points in the direction in which the output changes when the parameter $w_i$ changes by an infinitesimal amount. If the outputs are constrained, $\mathbf{d}_i$ can only point in a direction in which $\mathbf{y}$ is allowed to change. The sum in equation 2.2 can be interpreted as the directional derivative of $E$ in the direction $\mathbf{d}_i$. The outputs minimise $E$, and therefore this derivative is zero in all allowed directions of change of the outputs.
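
As a numerical sanity check, the following sketch compares the two sides of equation 2.2 by finite differences for a linear reconstruction $g(\mathbf{y}) = \mathbf{W}\mathbf{y}$ with the quadratic error; the dimensions, random data and step size are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))                  # illustrative reconstruction weights
x = rng.normal(size=4)

def outputs(W, x):
    # closed-form solution of equation 2.1 for linear g and quadratic error
    return np.linalg.lstsq(W, x, rcond=None)[0]

def error(W, x, y):
    return np.sum((x - W @ y) ** 2)

y = outputs(W, x)
eps = 1e-6
i, j = 1, 0                                  # perturb a single weight W[i, j]
Wp = W.copy()
Wp[i, j] += eps

total = (error(Wp, x, outputs(Wp, x)) - error(W, x, y)) / eps   # y re-minimised
partial = (error(Wp, x, y) - error(W, x, y)) / eps              # y held fixed
print(total, partial)                        # the two agree up to terms of order eps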

The outputs could, for example, be restricted to be non-negative. Suppose that the output $y_j$ is zero due to this constraint. If the output had been allowed to be negative, the reconstruction error could have been decreased further by decreasing $y_j$, which means that the derivative $\partial E / \partial y_j$ cannot be zero. In this case, however, the output $y_j$ does not change if the parameters change only by an infinitesimal amount: it remains exactly zero. Therefore, in this example it always holds that either $\partial y_j / \partial w_i = 0$ or $\partial E / \partial y_j = 0$.
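
The same point can be checked numerically. The sketch below assumes a linear reconstruction and uses the non-negative least-squares routine of SciPy to solve the constrained version of equation 2.1; at the solution each component satisfies either $y_j = 0$ or $\partial E / \partial y_j = 0$ up to numerical precision. The dimensions and random data are illustrative.

import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 3))
x = rng.normal(size=5)

y, _ = nnls(W, x)                       # solves min ||x - W y||^2 subject to y >= 0
grad = -2.0 * W.T @ (x - W @ y)         # dE/dy for the quadratic error

for j in range(len(y)):
    print(j, y[j], grad[j])             # for each j, either y_j = 0 or grad_j is ~0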

Problems may arise if equation 2.1 has degenerate solutions. If we are using the network in the feedforward phase, that is, we are not adapting the parameters, then we can simply pick one of the solutions. We may run into problems if we try to adapt the network, however, because the mapping $f$ has discontinuities where equation 2.1 has degenerate solutions, and it is not possible to compute the derivatives of the outputs $y_j$ in equation 2.2. Luckily, in most practical cases the probability of degenerate solutions is zero.

Equations 2.1 and 2.2 provide us with a general framework for developing neural networks based on reconstruction error minimisation. The development starts by defining the reconstruction error, the parametrised reconstruction mapping, and the possible constraints on the outputs. Sometimes we are able to solve equation 2.1 in closed form, but in general it has to be solved numerically.
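
As a rough illustration of the framework, the sketch below assumes a linear reconstruction, the quadratic error and no output constraints. The outputs are found by gradient descent on $E$ as a stand-in for a numerical solution of equation 2.1, and the weights are then updated using only $\partial E / \partial \mathbf{W}$ with the outputs held fixed, as equation 2.2 permits. All sizes and step sizes are illustrative.

import numpy as np

def reconstruct(W, y):
    return W @ y                                   # parametrised reconstruction g

def solve_outputs(W, x, steps=200, lr=0.01):
    # numerical stand-in for equation 2.1 when no closed form is available
    y = np.zeros(W.shape[1])
    for _ in range(steps):
        y -= lr * (-2.0 * W.T @ (x - reconstruct(W, y)))   # gradient of E w.r.t. y
    return y

def learn(W, data, lr=0.01):
    # one sweep of gradient descent on the reconstruction error;
    # by equation 2.2 only the partial derivative w.r.t. W is needed
    for x in data:
        y = solve_outputs(W, x)
        W -= lr * (-2.0 * np.outer(x - reconstruct(W, y), y))
    return W

rng = np.random.default_rng(2)
W = rng.normal(size=(6, 3)) * 0.1
data = rng.normal(size=(20, 6))
W = learn(W, data)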

We shall first consider a simple case where the solution to equation 2.1 can be found in closed form. If we require the reconstruction $g$ to be a linear mapping and impose no constraints on the outputs, it turns out that, assuming the quadratic reconstruction error, the mapping $f$ will also be linear. There are many degrees of freedom in the mapping, and we can require the mapping to be orthogonal without any loss of generality. Principal component analysis (PCA) gives this kind of orthogonal mapping, which is optimal with respect to the quadratic reconstruction error (see e.g. Oja, 1995). The code is maximally dense in the sense that all the neurons participate in representing any given input. (For more information about PCA see the Master's Thesis of Jaakko Hollmén.)
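
The sketch below illustrates this closed-form case with a PCA projection computed via the singular value decomposition; the toy data and the choice of three output neurons are assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))   # toy data, one input per row
X -= X.mean(axis=0)                                       # centre the data

k = 3                                                     # number of output neurons
U = np.linalg.svd(X, full_matrices=False)[2][:k].T        # orthonormal basis, 8 x k

Y = X @ U                  # dense code: every neuron responds to every input
X_hat = Y @ U.T            # linear reconstruction g(y) = U y
print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))          # average reconstruction error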

If we require the reconstruction to be a linear mapping and furthermore require the outputs $\mathbf{y}$ to be binary with exactly one neuron active at any one time, then we have made sure that the code we obtain is maximally local. The resulting coding scheme is called vector quantisation (VQ) (see e.g. Abut, 1990), because we can simply associate each neuron with the corresponding reconstruction vector and then decide the active neuron by finding the neuron whose reconstruction vector is closest to the input according to equation 2.1. This can be interpreted as quantising the input.
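
In this maximally local case, equation 2.1 reduces to a nearest-neighbour search over the reconstruction vectors, as in the following sketch; the codebook size and the data are illustrative.

import numpy as np

rng = np.random.default_rng(4)
codebook = rng.normal(size=(10, 4))     # one reconstruction vector per neuron

def encode(x):
    # binary code with exactly one active neuron: the nearest reconstruction vector
    winner = np.argmin(np.sum((codebook - x) ** 2, axis=1))
    y = np.zeros(len(codebook))
    y[winner] = 1.0
    return y

x = rng.normal(size=4)
y = encode(x)
x_hat = codebook.T @ y                  # reconstruction given by the winning neuron
print(np.argmax(y), np.sum((x - x_hat) ** 2))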

Since the same basic scheme can produce both maximally dense and maximally local codes, it is natural to try to produce sparse codes as well. Saund (1994, 1995) has proposed a few different reconstruction mappings $g$ which can be used for sparse coding. They are chosen so that the resulting code is sparse whenever the input is sparse, i.e., describable with a sparse code. Unfortunately, the inputs have to be binary, which restricts the use of the algorithm. It might be possible to generalise the approach to graded inputs, but this does not seem very easy. Olshausen and Field (1995) have developed an algorithm which uses the same kind of linear reconstruction scheme as PCA and VQ, but adds to the reconstruction error function an extra term which promotes sparsity. Equation 2.1 does not have a simple closed-form solution in either of these algorithms, and the outputs have to be computed iteratively by gradient descent on the reconstruction error $E$. This makes the computational complexity depend quadratically on the number of neurons in the network.
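
The sketch below is only a rough caricature of this kind of scheme, not the algorithm of Olshausen and Field itself: the cost is the quadratic reconstruction error plus a generic smooth sparsity penalty on the outputs, and the outputs are found iteratively by gradient descent. The penalty, the overcomplete basis and the step sizes are all illustrative assumptions.

import numpy as np

def sparse_outputs(W, x, lam=0.5, steps=300, lr=0.01):
    # iterative solution of the outputs: quadratic reconstruction error
    # plus a smooth sparsity penalty lam * sum(log cosh y_j)
    y = np.zeros(W.shape[1])
    for _ in range(steps):
        grad = -2.0 * W.T @ (x - W @ y) + lam * np.tanh(y)
        y -= lr * grad
    return y

rng = np.random.default_rng(5)
W = rng.normal(size=(8, 12)) * 0.3      # overcomplete set of basis vectors
x = rng.normal(size=8)
print(np.round(sparse_outputs(W, x), 3))   # code for x; larger lam gives sparser codes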

Although the mappings $f$ and $g$ can be related by using equation 2.1, it is also possible to parametrise and learn them both separately. Algorithms using this kind of autoencoder framework have been proposed by several authors, and there have also been modifications which yield sparse codes [Dayan and Zemel, 1995; Zemel, 1993; Hinton and Zemel, 1994]. These algorithms can compute the outputs without any time-consuming iterations, but unfortunately learning seems to be very slow. This is probably due to the fact that the network has to learn the same things twice, as the learning does not take the relation between $f$ and $g$ into account in any way.
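
For comparison, the sketch below parametrises a linear encoder and a linear decoder separately and trains both by gradient descent on the quadratic reconstruction error. It is a minimal caricature of the autoencoder framework; the sizes, data and learning rate are illustrative.

import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(3, 6)) * 0.1       # encoder:  y = A x
B = rng.normal(size=(6, 3)) * 0.1       # decoder:  x_hat = B y
data = rng.normal(size=(200, 6))
lr = 0.01

for x in data:
    y = A @ x
    r = x - B @ y                           # reconstruction residual
    B += lr * 2.0 * np.outer(r, y)          # -dE/dB = 2 r y^T
    A += lr * 2.0 * np.outer(B.T @ r, x)    # -dE/dA = 2 B^T r x^T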


