
Recent developments

Recent experiments suggest that a document encoding method simpler than the one used in the Publications may produce even better results.

It is standard practice in information retrieval (IR) [Salton and McGill, 1983] to encode documents with vectors, in which each component corresponds to a different word, and the value of the component reflects the frequency of occurrence of the word in the document. If word $w_k$ occurs $n_{jk}$ times in document $j$, $a_{jk}$ is defined to be equal to $n_{jk}$, and $\mathbf{e}_k$ is the unit vector corresponding to the $k$th vector component, it is possible to code the document $j$ by

  \[ \mathbf{x}_j = \sum_k a_{jk} \mathbf{e}_k . \tag{11} \]
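
As a minimal sketch of the count-based encoding of Equation 11, the following Python fragment builds one vector component per vocabulary word and fills it with the number of occurrences of that word in the document. The function name `encode_document`, the toy vocabulary, and the whitespace tokenization are assumptions made for illustration only; they do not come from the Publications.

```python
from collections import Counter


def encode_document(tokens, vocabulary):
    """Return the count vector of one document.

    Component k holds the number of times vocabulary[k] occurs in tokens.
    Tokens that are not in the vocabulary are ignored.
    """
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical toy data, chosen only to illustrate the encoding.
vocabulary = ["map", "neural", "network", "word"]
document = "neural network map of a neural word".split()
vector = encode_document(document, vocabulary)
```

With a realistic vocabulary the resulting vector has one component per distinct word, so its dimensionality grows with the vocabulary size.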

A problem with this encoding method is that if the vocabulary is very large, the dimensionality of the vector is also high. In the Publications this problem was solved by eliminating some of the most common and some of the rarest words, and by clustering the words into word categories. (Actually, an extended version of Equation 11 was used, in which probabilistic information about the similarity of use of different words was incorporated into the coefficients that weight the words in Equation 11.)
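
The first of these steps, discarding the most common and the rarest words, might be sketched as follows. The thresholds `min_count` and `drop_top`, the function name `prune_vocabulary`, and the toy documents are hypothetical; the selection criteria actually used in the Publications are not specified here, and the clustering into word categories is not shown.

```python
from collections import Counter


def prune_vocabulary(documents, min_count=2, drop_top=1):
    """Keep words that are neither among the most common nor too rare.

    documents: list of token lists.
    min_count: assumed threshold; words occurring fewer times are dropped.
    drop_top:  assumed number of most frequent words to drop.
    """
    totals = Counter(token for doc in documents for token in doc)
    most_common = {word for word, _ in totals.most_common(drop_top)}
    return sorted(
        word
        for word, count in totals.items()
        if count >= min_count and word not in most_common
    )


# Hypothetical toy collection.
documents = [
    "the map of the words".split(),
    "the words on the map".split(),
    "a rare term".split(),
]
vocabulary = prune_vocabulary(documents)
```

Here the most frequent word and all words that occur only once are removed, which shortens the document vectors without touching the encoding itself.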

An alternative approach is simply to reduce the dimensionality of the unit vectors of Equation 11 that in effect represent the words. A simple method is to project them randomly to a space of lower dimensionality [Ritter and Kohonen, 1989]. In experiments, using the reduced-dimensional version of Equation 11 without any contextual information has resulted in better separability of different Usenet discussion areas (accuracy of the newsgroup separation was 69.1%) than using either the current WEBSOM method (accuracy 62.6%) or a method in which vectors estimated from contextual information are used in place of the unit vectors [Gallant et al., 1992, Gallant, 1994] (accuracy 65.4%). The experimental procedures have been described in more detail in Publication 6. If this reduced-dimensional version of Equation 11 is used, however, the fast processing of documents by table lookups and subsequent convolutions, used in Publications 3, 4, 5, and 6, becomes impossible.
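
The random projection described above can be illustrated with a short sketch in which every word is assigned a fixed random vector of low dimensionality, and a document is coded as the count-weighted sum of the vectors of its words. The Gaussian components, the normalization to unit length, the dimensionality of 3, and all function names are assumptions for illustration; they are not taken from the experiments reported above.

```python
import math
import random
from collections import Counter


def random_word_vectors(vocabulary, dim, seed=0):
    """Assign each word a random unit-length vector of dimension dim."""
    rng = random.Random(seed)
    vectors = {}
    for word in vocabulary:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        vectors[word] = [c / norm for c in v]
    return vectors


def encode_reduced(tokens, word_vectors, dim):
    """Code a document as the count-weighted sum of its word vectors."""
    counts = Counter(t for t in tokens if t in word_vectors)
    x = [0.0] * dim
    for word, n in counts.items():
        for i, c in enumerate(word_vectors[word]):
            x[i] += n * c
    return x


# Hypothetical toy data: four words projected to three dimensions.
vocabulary = ["map", "neural", "network", "word"]
word_vectors = random_word_vectors(vocabulary, dim=3)
reduced = encode_reduced(
    "neural network map of a neural word".split(), word_vectors, dim=3
)
```

The document vector now has the chosen reduced dimensionality regardless of the vocabulary size, at the cost that the word vectors are no longer exactly orthogonal.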

These preliminary results need more thorough validation. It may in any case be concluded that the methods for utilizing contextual information can still be improved.



Sami Kaski
Mon Mar 31 23:43:35 EET DST 1997