Recent developments


Recent experiments suggest that a document encoding method simpler than the one used in the Publications may produce even better results.

It is standard practice in information retrieval (IR) [Salton and McGill, 1983] to encode documents as vectors in which each component corresponds to a different word, and the value of the component reflects the frequency of occurrence of the word in the document. If word $k$ occurs $n_{jk}$ times in document $j$, $n_j$ is defined to be equal to $\sum_k n_{jk}$, and $\mathbf{e}_k$ is the unit vector corresponding to the $k$th vector component, it is possible to code the document $j$ by

\[ \mathbf{x}_j = \sum_k \frac{n_{jk}}{n_j} \, \mathbf{e}_k \qquad (11) \]
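The word-frequency encoding described above can be illustrated with a minimal sketch. It assumes that each count is normalized by the total number of in-vocabulary word occurrences in the document; the function name `encode_document`, the toy vocabulary, and the whitespace tokenization are hypothetical and chosen only for illustration.

```python
from collections import Counter


def encode_document(tokens, vocabulary):
    """Encode a tokenized document as a normalized word-frequency vector.

    Component k holds the relative frequency of vocabulary word k.
    Assumption: counts are divided by the number of in-vocabulary tokens.
    """
    index = {word: k for k, word in enumerate(vocabulary)}
    # Count only the tokens that belong to the vocabulary.
    counts = Counter(t for t in tokens if t in index)
    total = sum(counts.values())
    vector = [0.0] * len(vocabulary)
    if total == 0:
        return vector  # no known words: return the zero vector
    for word, n in counts.items():
        vector[index[word]] = n / total
    return vector


# Toy example (hypothetical data).
vocabulary = ["map", "neural", "network", "self"]
doc = "self organizing map is a neural network map".split()
x = encode_document(doc, vocabulary)
print(x)
```

The length of the resulting vector equals the size of the vocabulary, which is what makes the encoding expensive for large vocabularies.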

A problem with this encoding method is that if the vocabulary is very large, the dimensionality of the vector is also high. In the Publications this problem was solved by eliminating some of the most common and some of the rarest words, and by clustering the words into word categories. (Actually an extended version of Equation 11 was used, in which probabilistic information about the similarity of use of different words was incorporated into the coefficients of the unit vectors.)

An alternative approach is to reduce the dimensionality of the vectors that in effect represent the words. A simple method is to project them randomly onto a space of lower dimensionality [Ritter and Kohonen, 1989]. In experiments, using the reduced-dimensional version of Equation 11 without any contextual information separated different Usenet discussion areas better (accuracy of the newsgroup separation 69.1 %) than either the current WEBSOM method (accuracy 62.6 %) or a method in which vectors estimated from contextual information are used in place of the word vectors [Gallant et al., 1992, Gallant, 1994] (accuracy 65.4 %). The experimental procedures are described in more detail in Publication 6. If this reduced-dimensional version of Equation 11 is used, however, the fast processing of documents by table lookups and subsequent convolutions, used in Publications 3, 4, 5, and 6, becomes impossible.
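The random projection idea can be sketched as follows: each word is assigned a fixed random vector of low dimensionality, and a document is coded as the frequency-weighted sum of the vectors of its words. This is only an illustration under stated assumptions, not the procedure of Publication 6: the Gaussian components, the unit-length normalization, the normalization of counts by the total in-vocabulary count, and the names `random_word_vectors` and `encode_reduced` are all hypothetical choices.

```python
import random
from collections import Counter


def random_word_vectors(vocabulary, dim, seed=0):
    """Assign each word a random unit-length vector of dimension `dim`.

    Assumption: components are drawn from a standard normal distribution.
    """
    rng = random.Random(seed)
    vectors = {}
    for word in vocabulary:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = sum(c * c for c in v) ** 0.5
        vectors[word] = [c / norm for c in v]
    return vectors


def encode_reduced(tokens, word_vectors, dim):
    """Code a document as the frequency-weighted sum of random word vectors."""
    counts = Counter(t for t in tokens if t in word_vectors)
    total = sum(counts.values())
    x = [0.0] * dim
    if total == 0:
        return x  # no known words: return the zero vector
    for word, n in counts.items():
        weight = n / total
        for i, c in enumerate(word_vectors[word]):
            x[i] += weight * c
    return x


# Toy example (hypothetical data): 4 words projected to 3 dimensions.
vocabulary = ["map", "neural", "network", "self"]
word_vecs = random_word_vectors(vocabulary, dim=3, seed=42)
doc = "self organizing map is a neural network map".split()
x_reduced = encode_reduced(doc, word_vecs, dim=3)
print(len(x_reduced))
```

The document vector now has `dim` components regardless of the vocabulary size. In a realistic setting the vocabulary would be far larger than the target dimensionality, which is where the reduction pays off.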

These preliminary results need more thorough validation. It may in any case be concluded that the methods for utilizing contextual information can still be improved.

Sami Kaski
Mon Mar 31 23:43:35 EET DST 1997