We introduce a method for assigning colors to displays of cluster structures of high-dimensional data, such that the perceptual differences of the colors reflect the distances in the original data space as faithfully as possible. The cluster structure is first discovered with the Self-Organizing Map (SOM), and then a new nonlinear projection method is applied to map the cluster structure into the CIELab color space. The projection method best preserves the local data distances, which are the most important ones, while the global order remains discernible from the colors. This allows the method to conform flexibly to the available color space. The output space of the projection need not be the color space, however: projections onto, say, two dimensions can be visualized as well.
Back to my online publications
Self-Organizing Maps (SOMs) are widely used in engineering and data-analysis tasks, but so far rarely in very large-scale problems. The reason is the amount of computation: while small SOMs can be computed starting from the basic principles, rapid computation of large maps of high-dimensional data requires special methods. Winner search, finding the position of a data sample on the map, is the computational bottleneck: comparison between the data vector and all of the model vectors of the map is required. In this paper a method is proposed for reducing the amount of computation by restricting the search to certain small-dimensional subspaces of the original space. The method is suitable for applications in which the map can be computed off-line, for instance in data monitoring, classification, and information retrieval. In a case study with the WEBSOM system that organizes text document collections on a SOM, the amount of computation was reduced to about 14% of the original, and even to 6.6% when approximations were utilized.
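The exhaustive winner search that the abstract above identifies as the bottleneck can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and a map stored as a plain array of model vectors; the names `find_winner`, `models`, and `x` are hypothetical, and the proposed restriction to low-dimensional subspaces is not shown.

```python
import numpy as np

def find_winner(models: np.ndarray, x: np.ndarray) -> int:
    """Return the index of the model vector closest to the data vector x.

    models has shape (n_models, dim); x has shape (dim,).
    Every model vector is compared with x, so the cost grows with
    n_models * dim. Euclidean distance is an assumption of this sketch.
    """
    diffs = models - x                                  # (n_models, dim)
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)      # squared distances
    return int(np.argmin(sq_dists))

# Toy map with three two-dimensional model vectors (illustrative values).
models = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
winner = find_winner(models, np.array([0.9, 1.2]))
```

Because each search touches every model vector, the cost per data sample is proportional to the map size times the data dimensionality, which is why large maps of high-dimensional data call for shortcuts.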
When the SOM is applied to the mapping of documents, one can represent them statistically by their weighted word frequency histograms or some reduced representations of the histograms that can be regarded as data vectors. We have made such a SOM of about seven million documents, viz. of all of the patent abstracts in the world that have been written in English and are available in electronic form. The map consists of about one million models (nodes). Keywords or key texts can be used to search for the most relevant documents first. New effective coding and computational schemes of the mapping are described.
With the WEBSOM method a textual document collection may be organized onto a graphical map display that provides an overview of the collection and facilitates interactive browsing. Interesting documents can be located on the map using a content-directed search. Each document is encoded as a histogram of word categories which are formed by the Self-Organizing Map (SOM) algorithm based on the similarities in the contexts of the words. The encoded documents are organized on another Self-Organizing Map, a document map, on which nearby locations contain similar documents. Special consideration is given to the computation of very large document maps which is possible with general-purpose computers if the dimensionality of the word category histograms is first reduced with a random mapping method and if computationally efficient algorithms are used in computing the SOMs.
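The encoding step described above, in which each document becomes a histogram of word categories, might look like the sketch below. It assumes that a word category map has already been computed and is available as a dictionary from words to category indices; the names `encode_document` and `word_to_category`, the skipping of unknown words, and the normalization to unit sum are all assumptions of this example rather than details taken from the abstract.

```python
import numpy as np

def encode_document(tokens, word_to_category, n_categories):
    """Encode a tokenized document as a histogram over word categories.

    word_to_category is a hypothetical lookup produced by a previously
    trained word category map. Words missing from it are ignored.
    """
    hist = np.zeros(n_categories)
    for token in tokens:
        category = word_to_category.get(token)
        if category is not None:
            hist[category] += 1.0
    total = hist.sum()
    # Normalizing to unit sum is an illustrative choice, not from the source.
    return hist / total if total > 0 else hist

# Toy lookup with three categories (illustrative values).
word_to_category = {"map": 0, "som": 0, "text": 1, "patent": 2}
vector = encode_document(["map", "som", "text", "unknown"], word_to_category, 3)
```

The resulting vectors are what a document map would then be trained on, possibly after a dimensionality reduction step.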
Keywords: Data mining; Information retrieval; Self-Organising Map; SOM; WEBSOM
The Self-Organizing Map (SOM) can be used for forming overviews of multivariate data sets and for visualizing them on graphical map displays. Each map location represents certain kinds of data items and the value of a variable in the representations can be visualized in the corresponding locations on the map display. Such component plane displays contain all the information needed for interpreting the map but information about the relations of the variables remains implicit. We have developed methods that visualize explicitly the contribution of each variable in the organization of the map at different locations. It is also possible to measure the contribution of each variable in the cluster structure within an area of the map to summarize, for instance, the characteristics of clusters.
When the data vectors are high-dimensional it is computationally infeasible to use data analysis or pattern recognition algorithms which repeatedly compute similarities or distances in the original data space. It is therefore necessary to reduce the dimensionality before, for example, clustering the data. If the dimensionality is very high, as in the WEBSOM method which organizes textual document collections on a Self-Organizing Map, then even commonly used dimensionality reduction methods such as principal component analysis may be too costly. It will be demonstrated that the document classification accuracy obtained after the dimensionality has been reduced using a random mapping method will be almost as good as the original accuracy if the final dimensionality is sufficiently large (about 100 out of 6000). In fact, it can be shown that the inner product (similarity) between the mapped vectors closely follows the inner product of the original vectors.
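The claim that similarities survive a random mapping can be illustrated with a small numerical sketch. It assumes a Gaussian random matrix scaled by the inverse square root of the output dimension, so that inner products are preserved in expectation; this particular construction, and the names `random_mapping` and `cosine`, are assumptions of the example and not necessarily the mapping used in the paper. The dimensions 6000 and 100 follow the figures quoted above.

```python
import numpy as np

def random_mapping(X, out_dim, seed=0):
    """Project the rows of X onto out_dim random directions.

    The matrix R has independent Gaussian entries with variance 1/out_dim
    (an assumption of this sketch), which makes the inner product of two
    mapped vectors equal to the original inner product in expectation.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], out_dim)) / np.sqrt(out_dim)
    return X @ R

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two similar synthetic 6000-dimensional vectors (illustrative data).
rng = np.random.default_rng(1)
x = rng.random(6000)
y = x + 0.5 * rng.random(6000)

Z = random_mapping(np.vstack([x, y]), out_dim=100)
original_similarity = cosine(x, y)
mapped_similarity = cosine(Z[0], Z[1])
```

Comparing `original_similarity` with `mapped_similarity` for a few vector pairs gives a quick feel for how the approximation error shrinks as `out_dim` grows.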
WEBSOM is a novel method for organizing document collections onto map displays to enhance the interactive browsing and retrieval of the documents. The map is organized automatically according to the contents of the full-text documents by the Self-Organizing Map algorithm. The map display provides a visual overview of the whole document collection. The overview, the map display, aids in the exploration since similar documents are located close to each other. In this paper we describe the WEBSOM system in a statistically oriented fashion and discuss its relations to other methods. Particular emphasis is put on how effective the methods are in treating large document collections. The two-phase architecture of the WEBSOM system makes it possible to build contextual information about the relations of words off-line into a word category representation, which can then be utilized rapidly on-line, when the documents are being encoded. The construction of large map displays from the encoded document representations is a computationally intensive operation when done in a straightforward manner. There exist, however, several effective computational shortcuts.
The Adaptive-Subspace SOM (ASSOM) is a modular neural-network architecture, the modules of which learn to identify input patterns subject to some simple transformations. The learning process is unsupervised, competitive, and related to that of the traditional SOM (Self-Organizing Map). Each neural module becomes adaptively specific to some restricted class of transformations, and modules close to each other in the network become tuned to similar features in an orderly fashion. If different transformations exist in the input signals, different subsets of ASSOM units become tuned to these transformation classes.
Formulating suitable search expressions for information retrieval from large full-text databases may currently require considerable effort. Changing the scope of the search when, e.g., too many or too few hits have been obtained requires reformulating the search expression. As an alternative scheme we suggest an explorative full-text information retrieval method, where the Self-Organizing Map (SOM) algorithm is used to order documents based on their full textual contents. The visualized order can then be utilized for an explorative search or exploration of novel knowledge areas, whereby the scope can be changed interactively. The ordering of the documents is achieved by a two-level analysis: first, word categories are extracted from the text by a "semantic" SOM; second, the textual context of the documents is encoded on the basis of the histograms of words formed on the word category map.
Powerful methods for interactive exploration and search from collections of free-form textual documents are needed to manage the ever-increasing flood of digital information. In this article we present a method, WEBSOM, for automatic organization of full-text document collections using the self-organizing map (SOM) algorithm. The document collection is ordered onto a map in an unsupervised manner utilizing statistical information of short word contexts. The resulting ordered map where similar documents lie near each other thus presents a general view of the document space. With the aid of a suitable (WWW-based) interface, documents in interesting areas of the map can be browsed. The browsing can also be interactively extended to related topics, which appear in nearby areas on the map. Along with the method we present a case study of its use.
On January 19, 1996 we published on the Internet a demo of how to use Self-Organizing Maps (SOMs) for the organization of large collections of full-text files. Later we added other newsgroups to the demo. It can be found at the address http://websom.hut.fi/websom/. In the present paper we describe the main features of this system, called the WEBSOM, as well as some newer developments of it.
In exploratory analysis of high-dimensional data the self-organizing map can be used to illustrate relations between the data items. We have developed two measures for comparing how different maps represent these relations. One combines an index of discontinuities in the mapping from the input data set to the map grid with an index of the accuracy with which the map represents the data set. This measure can be used for determining the goodness of single maps. The other measure has been used to directly compare how similarly two maps represent relations between data items. Such a measure of the dissimilarity of maps is useful, e.g., for analyzing the sensitivity of maps to variations in their inputs or in the learning process. The similarity of two data sets can also be compared indirectly by comparing the maps that represent them.
The availability of large full-text document collections in electronic form has created a need for intelligent information retrieval techniques. The expanding World Wide Web in particular calls for methods for systematic exploration of miscellaneous document collections. In this paper we introduce a new method, the WEBSOM, for this task. Self-Organizing Maps (SOMs) are used to represent documents on a map that provides an insightful view of the text collection. This view visualizes similarity relations between the documents, and the display can be utilized for orderly exploration of the material rather than having to rely on traditional search expressions. The complete WEBSOM method involves a two-level SOM architecture comprising a word category map and a document map, and means for interactive exploration of the database.
The self-organizing map (SOM) is a method that represents statistical data sets in an ordered fashion, as a natural groundwork on which the distributions of the individual indicators in the set can be displayed and analyzed. As a case study that instructs how to use the SOM to compare states of economic systems, the standard of living of different countries is analyzed using the SOM. Based on a great number (39) of welfare indicators the SOM illustrates rather refined relationships between the countries two-dimensionally. This method is directly applicable to the financial grading of companies, too.