
INTRODUCTION

It is relatively easy to answer well-specified questions about the statistical nature of a well-understood data set, such as "how large should an aeroplane cockpit be to accommodate 95% of the potential pilots?" In this example the data may be assumed to be normally distributed, and it is straightforward to estimate the threshold. The more data there is available, the more accurately the answer can be given.
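The threshold estimate in the cockpit example can be sketched as follows. This is only an illustration under stated assumptions: a single body dimension (here called sitting height) is taken to be normally distributed, the threshold is read as a one-sided 95% quantile, and the sample values and the function name `upper_threshold` are made up for the example.

```python
from statistics import NormalDist, mean, stdev


def upper_threshold(samples, coverage=0.95):
    """Estimate the value below which `coverage` of a normal population falls.

    Assumes the samples come from a normal distribution; the mean and the
    standard deviation are estimated from the samples themselves.
    """
    fitted = NormalDist(mean(samples), stdev(samples))
    return fitted.inv_cdf(coverage)


# Hypothetical sitting heights in cm (made-up values, for illustration only).
sitting_heights_cm = [88.1, 91.4, 93.0, 89.7, 95.2, 90.8, 92.5, 87.9, 94.1, 91.0]

threshold = upper_threshold(sitting_heights_cm)
print(f"Estimated 95% threshold: {threshold:.1f} cm")
```

With more samples the estimated mean and standard deviation become more stable, and so does the threshold, which is the sense in which more data gives a more accurate answer here.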

If, on the other hand, the data is not well understood and the problem is not well specified, an increase in the amount of data may even have the opposite effect. This holds for multivariate data in particular. If the goal is simply to make sense of a data set, to generate sensible hypotheses, or to find some interesting novel patterns, it paradoxically seems that the more data there is available, the more difficult it is to understand the data set. The structures are hidden among the large amounts of multivariate data. When exploring a data set for new insights, only methods that effectively discover and illustrate the structures in the data can be of help. Such methods, applied to large data sets, are the topic of this work.

A data-driven search for statistical insights and models is traditionally called exploratory data analysis [Hoaglin, 1982, Jain and Dubes, 1988, Tukey, 1977, Velleman and Hoaglin, 1981] in the statistical literature. The process of making statistical inferences often consists of an exploratory, data-driven phase, followed by a confirmatory phase in which the reproducibility of the results is investigated. There thus exists a wealth of applications in which data sets need to be summarized to gain insight into them; the goal in this work is to present a data set in a form that is easily understandable but that at the same time preserves as much of the essential information in the data as possible.

Exploratory data analysis methods can be used as tools in knowledge discovery in databases (KDD) [Fayyad, 1996, Fayyad et al., 1996a, Fayyad et al., 1996c, Simoudis, 1996]. In this relatively recently established field the emphasis is on the whole interactive process of knowledge discovery, that is, the discovery of novel patterns or structures in the data. The process consists of a multitude of steps, starting from setting up the goals and ending with evaluating the results and possibly reformulating the goals based on them. Data mining is one step in the discovery process, a step in which suitable tools from many other disciplines, including exploratory data analysis, are used to find interesting patterns in the data. Depending on the goals of the data mining process, essentially any kind of pattern recognition [Devijver and Kittler, 1982, Fu, 1974, Fukunaga, 1972, Therrien, 1989, Schalkoff, 1992], machine learning [Forsyth, 1989, Langley, 1996, Michalski, 1983], or multivariate analysis [Cooley and Lohnes, 1971, Hair, Jr. et al., 1984, Kendall, 1975] algorithm may be useful; for recent examples, cf. Fayyad et al. (1996b). An essential novelty in the field thus lies in emphasizing the discovery of previously unknown structures from vast databases, and in emphasizing the importance of considering the whole process.

In this work, exploratory data analysis methods that illustrate the structures in data sets are applied to large databases. The tool in this endeavor will be the self-organizing map (SOM) [Kohonen, 1982, Kohonen, 1995c]. Some properties that distinguish the SOM from other data mining tools are that it is numerical instead of symbolic, nonparametric, and capable of learning without supervision. The numerical nature of the method enables it to treat numerical statistical data naturally, and to represent graded relationships. Because the method does not require supervision and is nonparametric, used here in the sense that no assumptions about the distribution of the data need to be made, it may even find quite unexpected structures in the data.
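To make "numerical" and "learning without supervision" concrete, the following is a minimal sketch of the commonly described online SOM update on a one-dimensional map. It is not the implementation used in this work: the function names, the linear decay schedules, the Gaussian neighbourhood, and all parameter values are assumptions chosen for illustration, and the data points are made up.

```python
import math
import random


def best_matching_unit(units, x):
    """Return the index of the unit whose model vector is closest to x."""
    return min(
        range(len(units)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(units[i], x)),
    )


def train_som(data, n_units=5, epochs=50, lr0=0.5, radius0=2.0, seed=0):
    """Train a one-dimensional SOM on a list of equal-length numeric vectors.

    No class labels and no distributional assumptions are used: only the
    numeric input vectors themselves drive the updates.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialise each unit's model vector at a randomly chosen data point.
    units = [list(rng.choice(data)) for _ in range(n_units)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                    # decaying learning rate
            radius = max(radius0 * (1.0 - frac), 0.5)  # shrinking neighbourhood
            winner = best_matching_unit(units, x)
            for i, w in enumerate(units):
                # Gaussian neighbourhood on the map grid, centred on the winner.
                h = math.exp(-((i - winner) ** 2) / (2.0 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
            step += 1
    return units


# Made-up two-dimensional points forming two loose groups (illustrative only).
points = [[0.1, 0.2], [0.0, 0.4], [0.3, 0.1], [4.9, 5.1], [5.2, 4.8], [5.0, 5.3]]
model_vectors = train_som(points, n_units=4)
for index, vector in enumerate(model_vectors):
    print(index, [round(v, 2) for v in vector])
```

Because neighbouring units on the map are pulled toward the same inputs, nearby units end up with similar model vectors, which is one way the map can represent graded relationships between data items.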

In this thesis the relation of the SOM to some other data visualization and clustering methods is first analyzed in Section 6. Then, recipes on how to use the SOM in exploratory data analysis are given in Section 7. The areas of application that are treated in the Publications are introduced in Section 8, and finally two recent developments in the methodology are discussed in Section 9.


Sami Kaski
Mon Mar 31 23:43:35 EET DST 1997