Jefrey Lijffijt

Bootstrap Test

The statistical test for comparing two corpora as proposed in

Jefrey Lijffijt, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila. Significance testing of word frequencies in corpora. Forthcoming.

and presented in

Jefrey Lijffijt, Tanja Säily, Terttu Nevalainen. Chi-square test considered harmful: Better methods for testing the significance of word frequencies. ICAME 33, 30 May - 3 June, Leuven, Belgium, 2012. (Presentation)

is now available for R and Matlab.

The syntax is bootstraptest(data1, data2, N), where data1 should be a vector of normalised frequencies of a linguistic unit in Corpus 1 (one value per text) and data2 should be a vector of normalised frequencies of the same unit in Corpus 2 (one value per text). N is the number of bootstrap samples and determines the smallest possible p-value: the test can only output p-values greater than or equal to 1/(1+N). For example, N = 9,999 is a good choice in practice, as it gives a smallest possible p-value of 1/10,000 = 0.0001.
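To make the call signature and the 1/(1+N) lower bound concrete, here is a minimal Python sketch of a two-sided bootstrap test with the same argument order. It is not the R or Matlab code offered here: the test statistic (the mean of the per-text frequencies), the handling of ties, the p-value formula, and the extra seed parameter are all assumptions chosen for illustration.

```python
import random


def bootstraptest(data1, data2, N, seed=None):
    """Illustrative two-sided bootstrap test on per-text normalised frequencies.

    Assumption: the statistic is the mean per-text frequency in each corpus.
    This is a sketch, not the released R/Matlab implementation.
    """
    rng = random.Random(seed)  # 'seed' is a hypothetical convenience parameter
    score = 0.0
    for _ in range(N):
        # Resample texts with replacement within each corpus.
        mean1 = sum(rng.choices(data1, k=len(data1))) / len(data1)
        mean2 = sum(rng.choices(data2, k=len(data2))) / len(data2)
        if mean1 > mean2:
            score += 1.0
        elif mean1 == mean2:
            score += 0.5  # assumed convention: a tie counts as half
    p_one_sided = score / N
    # Two-sided p-value, smoothed so that it can never be below 1/(1+N).
    return (1 + 2 * N * min(p_one_sided, 1 - p_one_sided)) / (1 + N)


# Hypothetical per-text normalised frequencies for two small corpora.
corpus1 = [1.0, 1.2, 1.1, 1.3]
corpus2 = [5.0, 5.5, 5.2, 5.1]

# The corpora do not overlap, so the p-value hits its floor of 1/(1+N).
print(bootstraptest(corpus1, corpus2, 9999))  # 0.0001
```

In this sketch, a larger N only lowers the floor and reduces the sampling noise of the p-value; it cannot compensate for having few texts per corpus.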

NB. The frequency of the unit does not have to be normalised against the number of tokens; it is often preferable to normalise against the number of opportunities. For more information, see e.g. Sean Wallis's blog.
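The difference between the two normalisations can be sketched as follows, again in Python. The counts, the helper name normalised_frequencies, and the reading of "opportunities" as the number of places in a text where the unit could have occurred are all hypothetical and serve only as an illustration.

```python
def normalised_frequencies(counts, baselines, per=1.0):
    """Return one normalised frequency per text: per * count / baseline."""
    if len(counts) != len(baselines):
        raise ValueError("counts and baselines must have one entry per text")
    return [per * c / b for c, b in zip(counts, baselines)]


# Hypothetical data for three texts of one corpus.
hits = [12, 7, 30]             # occurrences of the unit in each text
tokens = [2000, 1500, 5000]    # total tokens in each text
opportunities = [40, 35, 60]   # places where the unit could have occurred

# Normalised against tokens (here per 1,000 tokens).
per_token = normalised_frequencies(hits, tokens, per=1000)

# Normalised against opportunities (a proportion between 0 and 1).
per_opportunity = normalised_frequencies(hits, opportunities)
```

Either list has one value per text, so either could be passed as data1 or data2 to the test.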

NNB. The Matlab code is more optimised than the R code. Suggestions for optimising the code are very welcome.