Randomization Methods for Assessing Data Mining Results on Matrices
This page contains the datasets and the implementations of the methods described in the following papers:
Markus Ojala:
Assessing Data Mining Results on Matrices with Randomization.
In ICDM'10: Proceedings of the 10th IEEE International Conference on Data Mining, pp. 959-964.
Markus Ojala, Niko Vuokko, Aleksi Kallio, Niina Haiminen, Heikki Mannila:
Randomization Methods for Assessing Data Analysis Results on Real-Valued Matrices.
In Statistical Analysis and Data Mining, 2(4):209-230, 2009.
In these papers, a data mining result is considered interesting if it cannot be explained by the row and column value distributions alone. The randomization methods given here produce random matrices that approximately share the row and column statistics of the original matrix.
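The idea of such a randomization test can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the implementation distributed on this page: it uses the classic swap randomization for 0/1 matrices, which keeps row and column sums exactly, whereas the methods in the papers handle real-valued matrices and preserve the statistics approximately. The function names, the co-occurrence statistic, and the default parameter values are all hypothetical choices made for the example.

```python
import random


def swap_randomize(matrix, attempts, rng):
    """Return a shuffled copy of a 0/1 matrix (at least 2x2).

    Every accepted swap leaves all row sums and column sums unchanged.
    """
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(attempts):
        i1, i2 = rng.sample(range(n_rows), 2)
        j1, j2 = rng.sample(range(n_cols), 2)
        # Accept only "checkerboard" 2x2 submatrices:
        # equal values on each diagonal, different values across diagonals.
        if (m[i1][j1] == m[i2][j2] and m[i1][j2] == m[i2][j1]
                and m[i1][j1] != m[i1][j2]):
            m[i1][j1], m[i1][j2] = m[i1][j2], m[i1][j1]
            m[i2][j1], m[i2][j2] = m[i2][j2], m[i2][j1]
    return m


def cooccurrence(matrix, j1, j2):
    """Example statistic: number of rows with a 1 in both columns."""
    return sum(row[j1] * row[j2] for row in matrix)


def empirical_p_value(original_stat, randomized_stats):
    """One-sided empirical p-value with the usual +1 correction."""
    at_least = sum(1 for s in randomized_stats if s >= original_stat)
    return (at_least + 1) / (len(randomized_stats) + 1)


def randomization_test(matrix, statistic, n_samples=200, attempts=1000, seed=0):
    """Compare the statistic on the data with its values on random matrices."""
    rng = random.Random(seed)
    original = statistic(matrix)
    samples = [statistic(swap_randomize(matrix, attempts, rng))
               for _ in range(n_samples)]
    return empirical_p_value(original, samples)
```

A small p-value from randomization_test suggests that the statistic is not explained by the row and column sums alone. How many swap attempts are needed for the chain to mix depends on the data, so the defaults above should be treated as placeholders.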
Updates
- 2010-07-01: Implementation of the new SwapConstrained method available (ICDM 2010).
- 2008-12-16: Implementations of the SAM journal paper methods available.
Datasets
All generated, artificial datasets in a zip archive (SAM paper):
Links to pages where the real datasets used in the experiments can be downloaded:
Implementations
The randomization methods are implemented in Java 1.5 and integrated with Matlab, so the Java virtual machine (JVM) that Matlab uses must be version 1.5 or newer. If your Matlab version is older than 7.5 (R2007b), Matlab has to be configured to use a newer JVM; see
Matlab support for more help. To increase the heap space for the JVM, see
Matlab support.
SwapConstrained implementation (newer; use this unless you have a special reason to use GeneralMetropolis)
The ICDM 2010 paper introduced a new SwapConstrained method that needs no manual tuning of parameters and supports matrices containing:
- similar / dissimilar features,
- nominal / integer / real / missing values,
- non-smooth / non-Gaussian value distributions,
- full / sparse structure.
To use the methods, download and unzip the following archive, read README.txt, and call "help swap" and "help discretize" in Matlab to get started.
SAM implementations (less versatile SwapDiscretized, includes GeneralMetropolis)
In the SAM paper, methods for randomizing real-valued matrices with similar features were given. To use them, download and unzip the following archives, start Matlab, and call "help randomizeMatrix" for more information. See also the SAM article for reference.
Page maintained by Markus.Ojala at tkk.fi,
last updated Monday, 20-Dec-2010 08:30:25 EET