|
|
We consider a simple randomization technique for producing random datasets that have the same row and column margins as the given dataset. One can then test the significance of a data mining result by computing the results of interest on the randomized instances and comparing them against the results on the actual data. This randomization technique can be used to assess the results of many different types of data mining algorithms, such as frequent sets, clustering, and rankings. To generate random datasets with given margins, we use variations of a Markov chain approach, which is based on a simple swap operation. We give theoretical results on the efficiency of different randomization methods, and apply the swap randomization method to several well-known datasets. Our results indicate that for some datasets the structure discovered by the data mining algorithms is a random artifact, while for other datasets the discovered structure conveys meaningful information.
The code is available.
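The swap operation driving the Markov chain can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name `swap_randomize` and the list-of-lists matrix representation are assumptions. The key invariant is that exchanging the corners of a 2x2 "checkerboard" submatrix leaves every row and column sum unchanged.

```python
import random

def swap_randomize(matrix, num_swaps):
    """Randomize a 0/1 matrix while preserving row and column margins.

    Repeatedly pick two 1-entries (r1, c1) and (r2, c2); if the opposite
    corners (r1, c2) and (r2, c1) are 0, swapping the four corners keeps
    every row and column sum unchanged (a sketch of the swap operation).
    """
    ones = [(r, c) for r, row in enumerate(matrix)
                   for c, v in enumerate(row) if v == 1]
    for _ in range(num_swaps):
        (r1, c1), (r2, c2) = random.sample(ones, 2)
        # If r1 == r2 or c1 == c2 one of these cells is a 1, so the
        # condition fails and the state is kept (a "lazy" step).
        if matrix[r1][c2] == 0 and matrix[r2][c1] == 0:
            matrix[r1][c1] = matrix[r2][c2] = 0
            matrix[r1][c2] = matrix[r2][c1] = 1
            ones.remove((r1, c1)); ones.remove((r2, c2))
            ones.append((r1, c2)); ones.append((r2, c1))
    return matrix
```

Running enough swaps mixes the chain toward a roughly uniform sample from the set of matrices with the given margins; how many swaps are "enough" is exactly the efficiency question studied in the paper.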
We consider bucket orders, i.e., total orders with ties. They can be used to capture the essential order information without overfitting the data: they form a useful concept class between total orders and arbitrary partial orders. We address the question of finding a bucket order for a set of items, given pairwise precedence information between the items. We also discuss methods for computing the pairwise precedence data. We describe simple and efficient algorithms for finding good bucket orders. Several of the algorithms have a provable approximation guarantee, and they scale well to large datasets. We provide experimental results on artificial and real data that show the usefulness of bucket orders and demonstrate the accuracy and efficiency of the algorithms.
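A pivot-style heuristic gives a feel for how such algorithms work. The sketch below is illustrative, not the paper's algorithm: it assumes `prec[i][j]` holds the fraction of evidence that item `i` precedes item `j` (with `prec[i][j] + prec[j][i] == 1`), and the threshold `beta` is a hypothetical parameter controlling when two items are considered tied.

```python
import random

def bucket_pivot(items, prec, beta=0.25):
    """Recursive pivot construction of a bucket order (a sketch).

    Items whose precedence relative to a random pivot is close to 1/2
    (within beta) join the pivot's bucket; the rest recurse left/right.
    Returns a list of buckets, each bucket a list of tied items.
    """
    if not items:
        return []
    pivot = random.choice(items)
    left, bucket, right = [], [pivot], []
    for x in items:
        if x == pivot:
            continue
        if prec[x][pivot] > 0.5 + beta:
            left.append(x)       # x clearly precedes the pivot
        elif prec[x][pivot] < 0.5 - beta:
            right.append(x)      # the pivot clearly precedes x
        else:
            bucket.append(x)     # near-tie: same bucket as the pivot
    return (bucket_pivot(left, prec, beta)
            + [bucket]
            + bucket_pivot(right, prec, beta))
```

With clear-cut precedence data every item lands in its own bucket; noisy, near-1/2 entries collapse items into shared buckets, which is exactly the middle ground between total and partial orders described above.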
This paper looks at the seriation problem in paleontology. Given a collection of fossil sites, a set of taxa, and the presence/absence information for all taxa, find a good ordering for the sites. We describe a probabilistic model for the seriation problem, and show how MCMC techniques can be used to obtain estimates for the ordering of the sites, taxon lifetimes, etc. Compared to the spectral method described in another paper, the MCMC method gives better estimates of the uncertainty in the results, but is much slower. The code for the methods is available.
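The flavor of an MCMC approach to seriation can be conveyed with a small Metropolis sampler over site orderings. This is purely illustrative and not the paper's probabilistic model: the score below simply counts absences falling inside each taxon's observed lifetime (violations of the consecutive-ones property), the adjacent-swap proposal and the temperature parameter `lam` are assumptions.

```python
import math
import random

def mcmc_seriate(data, iters=10000, lam=1.0, seed=0):
    """Metropolis sampler over orderings of the rows of a 0/1
    site-by-taxon matrix (a sketch, not the paper's model).

    Lower scores mean fewer absences inside a taxon's lifetime;
    the chain favours orderings with few such violations.
    """
    rng = random.Random(seed)
    n = len(data)

    def gaps(order):
        g = 0
        for col in zip(*[data[i] for i in order]):
            ones = [i for i, v in enumerate(col) if v]
            if ones:
                # zeros between the first and last occurrence
                g += (ones[-1] - ones[0] + 1) - len(ones)
        return g

    order = list(range(n))
    score = gaps(order)
    best, best_score = order[:], score
    for _ in range(iters):
        i = rng.randrange(n - 1)
        cand = order[:]
        cand[i], cand[i + 1] = cand[i + 1], cand[i]  # adjacent swap
        cand_score = gaps(cand)
        # Metropolis acceptance: always accept improvements, sometimes worse
        if (cand_score <= score
                or rng.random() < math.exp(lam * (score - cand_score))):
            order, score = cand, cand_score
            if score < best_score:
                best, best_score = order[:], score
    return best, best_score
```

Because the sampler visits many orderings rather than returning a single one, the visited states can be used to estimate the uncertainty of the ordering, which is the advantage over the spectral method noted above.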
A collection of papers on constraints in pattern discovery and on the related concept of inductive databases.
We consider the problem of approximating the frequency of a query, given a collection of frequent itemsets. We study the algorithm that truncates the inclusion-exclusion sum to include only the frequencies of known itemsets, give a bound for its performance on disjunctions of attributes that is smaller than the previously known bound, and show that this bound is in fact achievable. We also show how to generalize the algorithm to approximate arbitrary Boolean queries.
Sequence segmentation and dimensionality reduction have been used as methods for studying high-dimensional sequences: they both reduce the complexity of the representation of the original data. In this paper we study the interplay of these two techniques. We formulate the problem of segmenting a sequence while modeling it with a basis of small size, thus essentially reducing the dimension of the input sequence. We give three different algorithms for this problem: all combine existing methods for sequence segmentation and dimensionality reduction. For two of the proposed algorithms we prove guarantees for the quality of the solutions obtained. We describe experimental results on synthetic and real datasets, including data on exchange rates and genomic sequences. Our experiments show that the algorithms indeed discover underlying structure in the data, including both segmental structure and interdependencies between the dimensions. The code for the methods is available.
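The segmentation half of the problem can be sketched with the classic O(n²k) dynamic program for optimal k-segmentation of a one-dimensional sequence under squared error. This covers only one ingredient of the combined problem (the paper additionally fits the segments with a small shared basis), and the function name and interface are assumptions.

```python
def segment(seq, k):
    """Optimal k-segmentation of a 1-D sequence under squared error
    (a sketch of one ingredient of the paper's problem).

    Returns k half-open (start, end) segment boundaries.
    """
    n = len(seq)
    # Prefix sums let us evaluate the squared-error cost of fitting
    # seq[i:j] with its mean in O(1).
    prefix, prefix2 = [0.0], [0.0]
    for v in seq:
        prefix.append(prefix[-1] + v)
        prefix2.append(prefix2[-1] + v * v)

    def cost(i, j):  # squared error of seq[i:j] around its mean
        s = prefix[j] - prefix[i]
        s2 = prefix2[j] - prefix2[i]
        return s2 - s * s / (j - i)

    INF = float("inf")
    # dp[j][m]: best cost of splitting seq[:j] into m segments
    dp = [[INF] * (k + 1) for _ in range(n + 1)]
    back = [[0] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for j in range(1, n + 1):
        for m in range(1, k + 1):
            for i in range(m - 1, j):
                c = dp[i][m - 1] + cost(i, j)
                if c < dp[j][m]:
                    dp[j][m] = c
                    back[j][m] = i
    # Recover segment boundaries by walking the backpointers.
    bounds, j = [], n
    for m in range(k, 0, -1):
        i = back[j][m]
        bounds.append((i, j))
        j = i
    return list(reversed(bounds))
```

The algorithms in the paper interleave a step of this kind with fitting a low-dimensional basis to the segment representatives.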
This paper looks at the seriation problem in paleontology. Given a collection of fossil sites, a set of taxa, and the presence/absence information for all taxa, find a good ordering for the sites. The biological background knowledge used is that each species appears, lives for a certain period, and then becomes extinct; i.e., in error-free data the correct ordering is characterized as the one giving the consecutive-ones property for the matrix. Real data, however, contains lots of noise, and finding the optimal ordering is a hard problem. We show that spectral methods give very good results. Basically, one constructs a similarity matrix for the sites, computes its Laplacian, and uses one of the eigenvectors as the ordering criterion. The code is available.
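The similarity-Laplacian-eigenvector pipeline can be sketched in a few lines. This is a minimal illustration of the spectral idea, not the paper's implementation: using the dot product of presence/absence rows as the similarity, the unnormalized Laplacian, and the eigenvector of the second-smallest eigenvalue (the Fiedler vector) are all assumptions of this sketch.

```python
import numpy as np

def spectral_order(data):
    """Order the rows of a 0/1 site-by-taxon matrix by the Fiedler
    vector of the Laplacian of a row-similarity matrix (a sketch).

    Returns a permutation of row indices; the ordering is determined
    only up to reversal.
    """
    X = np.asarray(data, dtype=float)
    S = X @ X.T                       # similarity between sites
    L = np.diag(S.sum(axis=1)) - S    # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)    # eigenvalues in ascending order
    fiedler = vecs[:, 1]              # 2nd-smallest eigenvalue's vector
    return np.argsort(fiedler)
```

On noise-free data with the consecutive-ones property, sorting sites by the Fiedler vector recovers the correct ordering (up to reversal); its robustness on noisy fossil data is what the paper demonstrates.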
Here are links to some papers that previously were unlinked in the full list of publications.