Recent experiments suggest that a document encoding method simpler than the one used in the Publications may produce even better results.
It is a standard practice in information retrieval (IR)
[Salton and McGill, 1983] to encode documents with vectors, in which each
component corresponds to a different word, and the value of the
component reflects the frequency of occurrence of the word in the
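document. If word $k$ occurs $n_{jk}$ times in document $j$, $n_j$ is defined to be equal to $\sum_k n_{jk}$, and $\mathbf{e}_k$ is the unit vector corresponding to the $k$th vector component, it is possible to code the document $j$ by
$$\mathbf{x}_j = \frac{1}{n_j} \sum_k n_{jk} \mathbf{e}_k . \qquad (11)$$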
A problem with this encoding method is that if the vocabulary is
very large, the dimensionality of the vector is also high. In
the Publications this problem was solved by eliminating some of the most common
and some of the rarest words, and by clustering the words into word
categories. (In fact, an extended version of Equation 11 was
used, in which probabilistic information about the similarity of use of
different words was incorporated into the coefficients of the unit vectors.)
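As an illustration of the basic word-frequency encoding of Equation 11, the following minimal Python sketch builds one document vector whose dimensionality equals the vocabulary size. The function name `encode_document`, the `vocabulary` mapping, and the normalization by the total word count are assumptions made for this example; the sketch does not include the word categories or the extended coefficients used in the Publications.

```python
from collections import Counter


def encode_document(tokens, vocabulary):
    """Return a word-frequency vector for one document.

    `vocabulary` (hypothetical helper structure) maps each word to the index
    of its vector component. Words outside the vocabulary are ignored.
    Dividing by the total count is an assumption of this sketch.
    """
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())
    vector = [0.0] * len(vocabulary)
    if total == 0:
        return vector  # no known words: all-zero vector
    for word, n in counts.items():
        # Adding n / total to component k is the same as adding
        # (n / total) times the unit vector of word k.
        vector[vocabulary[word]] = n / total
    return vector


# Toy example: three-word vocabulary, one four-word document.
vocabulary = {"map": 0, "neural": 1, "text": 2}
document = "neural map text map".split()
x = encode_document(document, vocabulary)
```

With a realistic vocabulary of tens of thousands of words, `x` would have one component per word, which is the dimensionality problem discussed above.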
An alternative approach for reducing the dimensionality is simply to
reduce the dimensionality of the unit vectors that in effect
represent the words. A simple method is to project them randomly to a
space of lower dimensionality [Ritter and Kohonen, 1989]. Experiments
indicate that using the reduced-dimensional version of
Equation 11 without any contextual information results in
better separability of different Usenet discussion areas (accuracy of
the newsgroup separation was 69.1 %) than using either the current
WEBSOM method (accuracy 62.6 %), or a method in which vectors estimated
from contextual information are used in place of the unit vectors
[Gallant et al., 1992, Gallant, 1994] (accuracy 65.4 %). The
experimental procedures have been described in more detail in
Publication 6. If this reduced-dimensional version of
Equation 11 is used, however, the fast processing of
documents by table lookups and subsequent convolutions, used in
Publications 3, 4, 5, and
6, would become impossible.
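The random projection described above can be sketched as follows: each word is assigned a fixed random vector of low dimensionality in place of its unit vector, and a document is coded as the count-weighted average of the random vectors of its words. This is only a sketch under stated assumptions. The Gaussian components, the normalization to unit length, the seed, and the names `random_word_vectors` and `encode_document_projected` are choices made for the example and are not taken from the experiments reported here.

```python
import random
from collections import Counter


def random_word_vectors(vocabulary, dim, seed=0):
    """Assign each word a random unit-length vector of dimension `dim`.

    Gaussian components and unit normalization are assumptions of this sketch.
    """
    rng = random.Random(seed)
    vectors = {}
    for word in vocabulary:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = sum(c * c for c in v) ** 0.5
        vectors[word] = [c / norm for c in v]
    return vectors


def encode_document_projected(tokens, word_vectors):
    """Code a document as the count-weighted average of its word vectors."""
    dim = len(next(iter(word_vectors.values())))
    counts = Counter(t for t in tokens if t in word_vectors)
    total = sum(counts.values())
    x = [0.0] * dim
    if total == 0:
        return x  # no known words: all-zero vector
    for word, n in counts.items():
        for i, c in enumerate(word_vectors[word]):
            x[i] += n * c / total
    return x


# Toy example: a five-word vocabulary projected to two dimensions.
words = ["map", "neural", "text", "word", "document"]
word_vectors = random_word_vectors(words, dim=2)
projected = encode_document_projected("neural map text map".split(), word_vectors)
```

The length of `projected` is set by `dim` rather than by the vocabulary size, so the document vectors stay short however many words are indexed.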
These preliminary results need more thorough validation. It may in any case be concluded that the methods for utilizing contextual information can still be improved.