
Removing erroneous data

Errors in the data must be removed. If the data is retrieved from a database by a query, the result is likely to include erroneous data because of a lack of database integrity. Erroneous data must be filtered out using a priori knowledge of the problem domain and common sense. For example, missing values in databases are usually represented as zeros, so spurious zeros are a typical error caused by the lack of database integrity. Errors of this kind show up in the probability density function of a variable as a peak at zero, possibly outside the normal range of the variable. In cases of uncertainty, such values can be replaced with "don't care" values. Input vectors with missing values can be used in the training of a SOM [36]. Another approach is to remove vectors with missing values from the training set. This has the negative side effect of reducing the size of the training set.
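The two approaches described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the procedure used in this work: the marker MISSING, the function names, and the per-variable normal ranges are all hypothetical. The last function shows one common way to compare an incomplete input vector with a SOM weight vector, namely computing the distance over the known components only. This is not necessarily the method of [36].

```python
import math

# Hypothetical "don't care" marker; None is an assumption of this sketch.
MISSING = None


def mark_suspect_zeros(vectors, normal_ranges):
    """Replace zeros lying outside a variable's normal range with MISSING.

    normal_ranges holds one (low, high) pair per variable, chosen from
    a priori knowledge of the problem domain.
    """
    cleaned = []
    for vec in vectors:
        row = []
        for value, (low, high) in zip(vec, normal_ranges):
            if value == 0 and not (low <= value <= high):
                row.append(MISSING)  # zero is implausible here: treat as missing
            else:
                row.append(value)
        cleaned.append(row)
    return cleaned


def drop_incomplete(vectors):
    """Alternative approach: discard every vector that has a missing value."""
    return [vec for vec in vectors if MISSING not in vec]


def masked_distance(x, w):
    """Euclidean distance between input x and weight vector w,
    computed over the components of x that are not MISSING."""
    return math.sqrt(
        sum((xi - wi) ** 2 for xi, wi in zip(x, w) if xi is not MISSING)
    )


# Made-up example data: two variables with assumed normal ranges.
data = [[36.5, 120.0], [0, 118.0], [37.1, 0]]
ranges = [(30.0, 45.0), (50.0, 200.0)]

marked = mark_suspect_zeros(data, ranges)   # zeros become MISSING
complete_only = drop_incomplete(marked)     # smaller training set
```

In this made-up example, marking keeps all three vectors available for training, whereas dropping incomplete vectors leaves only one. This is the reduction in training set size mentioned above.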



Jaakko Hollmen
Fri Mar 8 13:44:32 EET 1996