An important aspect of any kind of information processing is the way in which the information is represented. In the brain it could be represented by the activity of single, individually meaningful neurons, or it could be that only the global activity pattern across a whole neuron population corresponds to interpretable states. There are strong theoretical reasons and experimental evidence [Földiák and Young, 1995, Thorpe, 1995] suggesting that the brain adopts a compromise between these extremes, often referred to as sparse coding.
Let us suppose that there are binary coding units, neurons, which can be either ``active'' or ``passive''. An important characteristic of such a code is the activity ratio: the fraction of active neurons at any one time. At its lowest value the code is a local representation, in which each state is represented by a single active unit while all other units in the pool are silent, much as the letters on a typewriter keyboard are locally encoded. In a dense distributed representation, each state is represented on average by about half of the units being active; examples are the binary (ASCII) encoding of characters used in computers and the coding of visual images by the retinal photoreceptor array. Codes with low activity ratios are called sparse codes.
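As an illustrative sketch (not part of the original text), the activity ratio of a binary code is simply the mean fraction of active units across its codewords; the codewords below are hypothetical.

```python
import numpy as np

# Hypothetical binary codewords (one per row), 8 units each.
local  = np.eye(8, dtype=int)                  # local code: one unit active per state
dense  = np.array([[1, 0, 1, 1, 0, 1, 0, 0],
                   [0, 1, 1, 0, 1, 0, 1, 0]])  # dense code: about half the units active
sparse = np.array([[1, 0, 0, 0, 0, 1, 0, 0],
                   [0, 0, 1, 0, 0, 0, 0, 1]])  # sparse code: few units active

for name, code in [("local", local), ("dense", dense), ("sparse", sparse)]:
    # The activity ratio is the fraction of active units at any one time.
    print(name, "activity ratio:", code.mean())
```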
When the neurons have graded outputs it is more difficult to give a simple measure of sparseness. One such measure is the entropy of the outputs [Attneave, 1954, Barlow, 1960, Watanabe, 1981], which decreases as the outputs become sparser. Entropy is difficult to measure directly, but in some cases the kurtosis of the outputs can be used instead [Field, 1994]: the sparser the outputs, the larger the kurtosis. Although sparse codes do have large kurtosis, the converse does not necessarily hold, as it is possible to construct dense codes with large kurtosis [Baddeley, 1996]. In this work, a somewhat heuristic definition of sparseness for graded outputs is used: the code is said to be sparse if a small fraction of the inputs convey most of the information.
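As a rough illustrative sketch (not part of the original text), the excess kurtosis of a set of graded outputs can serve as a proxy for sparseness: a response distribution that is mostly near zero with occasional large values has higher kurtosis than a dense, Gaussian-like one. The distributions below are hypothetical.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)

# Hypothetical graded outputs of 10,000 units.
dense_outputs  = rng.normal(size=10_000)                            # Gaussian: excess kurtosis near 0
sparse_outputs = rng.laplace(size=10_000) * rng.binomial(1, 0.1, 10_000)
# Most sparse outputs are exactly zero, with occasional large responses.

# Fisher (excess) kurtosis: larger values indicate a sparser response distribution.
print("dense :", kurtosis(dense_outputs))
print("sparse:", kurtosis(sparse_outputs))
```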
The activity ratio affects several aspects of information processing such as the architecture and robustness of networks, the number of distinct states that can be represented and stored, generalisation properties, and the speed and rules of learning (Table 1.1).
Table 1.1: Properties of coding schemes, according to Földiák and Young (1995).
The representational capacity of local codes is small: they can
represent only as many states as the number of units in the pool,
which is insufficient for any but the most trivial tasks. Even when
the number of units is as high as the number of neurons in primate
cortex, the number of discriminable states
far exceeds it. Making associations between a locally
encoded item and an output, however, is easy and fast. Single-layer
networks can learn any output association in a single trial by local,
Hebbian strengthening of connections between active representation and
output units, and the linear separability problem does not arise. In
such a lookup table, there is no interference between
associations to other discriminable states, and learning information
about new states does not interfere with old associations. This,
however, also means that there will be no generalisation to other
discriminable states -- which is a fundamental flaw, as we can expect
a system never to experience precisely the same pattern of stimulation
twice.
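As an illustrative sketch (not from the original text), one-trial Hebbian association with a local code amounts to filling in a lookup table: for each locally coded item, only the single active unit's connections to the desired output are strengthened, so associations for different items never interfere. The sizes and targets below are hypothetical.

```python
import numpy as np

n_items, n_outputs = 5, 3
W = np.zeros((n_outputs, n_items))   # weights from local input units to output units

def hebbian_one_shot(item, target):
    x = np.zeros(n_items)
    x[item] = 1.0                    # local code: exactly one active unit
    W[:] += np.outer(target, x)      # Hebbian update: pre * post, only one column changes

def recall(item):
    x = np.zeros(n_items)
    x[item] = 1.0
    return W @ x                     # single-layer linear readout

hebbian_one_shot(0, np.array([1.0, 0.0, 0.0]))
hebbian_one_shot(3, np.array([0.0, 1.0, 1.0]))
print(recall(0), recall(3))          # each association is recalled exactly; no interference
```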
Dense distributed codes, on the other hand, can represent a very high
number ($2^N$, where $N$ is the number of units) of different
states by combinatorial use of units. They are best suited for
minimising the number of neurons needed to represent information. If
the number of available neurons is high enough, the representational
power is largely superfluous, as the number of patterns ever
experienced by the system will never approach the available capacity,
and therefore dense codes usually have high statistical redundancy.
The price to pay for the potential (but unused) high information
content of each pattern is that the number of such patterns that an
associative memory can store is unnecessarily low. The mapping
between a dense representation and an output can be complex (a
linearly nonseparable function), therefore requiring multilayer
networks and learning algorithms that are hard to implement
biologically. Even efficient supervised algorithms are prohibitively
slow, requiring many training trials and large amounts of the kind of
training data that is labelled with either an appropriate output or
reinforcement. Such data is often too risky, time-consuming, or
expensive to obtain. Distributed representations in intermediate
layers of such networks ensure a kind of automatic generalisation
[Hinton et al., 1986]. However, this often manifests itself as unwanted
interference between patterns. A further serious problem is that no
new associations can be added without retraining the network with the
complete training set.
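As a hedged illustration of the separability point above (not from the original text), a densely coded mapping such as XOR cannot be computed by a single linear layer, whereas re-coding the same four states locally (one active unit per state) turns any Boolean mapping into a trivially learnable lookup:

```python
import numpy as np

# XOR over a dense two-unit code: not linearly separable.
X_dense = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y       = np.array([0, 1, 1, 0], dtype=float)

# Local re-coding of the same four states: one active unit per state.
X_local = np.eye(4)

def perceptron_fits(X, y, epochs=100):
    """Train a simple perceptron; return True if it classifies the data exactly."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = float(w @ xi + b > 0)
            w += (yi - pred) * xi
            b += (yi - pred)
    return all(float(w @ xi + b > 0) == yi for xi, yi in zip(X, y))

print("dense XOR learned by one layer:", perceptron_fits(X_dense, y))   # False
print("local XOR learned by one layer:", perceptron_fits(X_local, y))   # True
```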
Sparse codes combine advantages of local and dense codes while avoiding most of their drawbacks. Codes with small activity ratios can still have sufficiently high representational capacity, while the number of input-output pairs that can be stored in an associative memory is far greater for sparse than for dense patterns [Meunier and Nadal, 1995]. This is achieved by decreasing the amount of information in the representation of any individual stored pattern. Because a much larger fraction of all input-output functions is linearly separable under sparse coding, a single supervised layer with simple learning rules, perhaps following several unsupervised layers, is more likely to be sufficient for learning target outputs, avoiding the problems associated with supervised training in multilayer networks. As generalisation takes place only between overlapping patterns, new associations will not interfere with previous associations to nonoverlapping patterns.
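To make the capacity comparison concrete (an illustrative sketch, not from the original text), the number of distinct codewords over $N$ binary units is $N$ for a local code, $\binom{N}{k}$ for a sparse code with $k$ active units, and $2^N$ for an unconstrained dense code; even a small activity ratio yields an enormous number of representable states. The values of $N$ and $k$ below are hypothetical.

```python
from math import comb

N = 100            # number of units (hypothetical)
k = 5              # active units per sparse pattern (activity ratio 0.05)

print("local  code capacity:", N)            # one active unit per state
print("sparse code capacity:", comb(N, k))   # about 7.5e7 distinct patterns
print("dense  code capacity:", 2 ** N)       # about 1.3e30 distinct patterns
```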
Sparseness can also be defined with respect to components. A scene may be encoded in a distributed representation while, at the same time, object features may be represented locally. The number of items that can be represented simultaneously decreases as the activity ratio increases, because the addition of active units eventually activates ``ghost'' subsets, corresponding to items that were not intended to be represented.
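As a hedged sketch of the ``ghost'' problem (not from the original text), superimposing the codes of several items can accidentally contain the full code of an item that was never presented, and the denser the codes, the sooner this happens. The codewords below are hypothetical.

```python
import numpy as np

# Hypothetical binary codewords for four items over 8 units.
items = {
    "A": np.array([1, 1, 0, 0, 0, 0, 0, 0]),
    "B": np.array([0, 0, 1, 1, 0, 0, 0, 0]),
    "C": np.array([0, 0, 0, 0, 1, 1, 0, 0]),
    "D": np.array([1, 0, 1, 0, 1, 0, 0, 0]),   # overlaps with A, B and C
}

# Present A, B and C together by superimposing (OR-ing) their codes.
scene = items["A"] | items["B"] | items["C"]

for name, code in items.items():
    present = np.all(scene >= code)            # is this item's full code contained in the scene?
    print(name, "read out as present:", present)
# D is reported present even though it was never shown: a 'ghost' item.
```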