Bootstrap Test

The statistical test for comparing two corpora as proposed in

and presented in is now available for R and Matlab.

The syntax is bootstraptest(data1, data2, N), where data1 is a vector of normalised frequencies of a linguistic unit in Corpus 1 (one value per text) and data2 is a vector of normalised frequencies of the same unit in Corpus 2 (one value per text). N is the number of bootstrap samples and determines the smallest possible p-value: the test can only output p-values greater than or equal to 1/(1+N). For example, N = 9,999 gives a minimum p-value of 1/10,000 = 0.0001 and is a good choice in practice.
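To make the role of N concrete, here is a minimal Python sketch of one common way to set up a two-sided bootstrap comparison of per-text frequencies. It is not the R or Matlab code offered here and may differ from it in detail: the function name bootstrap_test_sketch, the seed argument, the comparison of resampled means, and the smoothing formula are all assumptions made for illustration. The sketch does reproduce the property described above, namely that the p-value can never fall below 1/(1+N).

```python
import random
from statistics import fmean


def bootstrap_test_sketch(data1, data2, n_samples, seed=None):
    """Illustrative two-sided bootstrap test (hypothetical, not the released code).

    data1, data2: per-text normalised frequencies for the two corpora.
    n_samples:    number of bootstrap samples (N in the text above).
    """
    rng = random.Random(seed)
    score = 0.0
    for _ in range(n_samples):
        # Resample texts with replacement within each corpus.
        mean1 = fmean(rng.choices(data1, k=len(data1)))
        mean2 = fmean(rng.choices(data2, k=len(data2)))
        if mean1 > mean2:
            score += 1.0
        elif mean1 == mean2:
            score += 0.5  # count ties as half, an assumption of this sketch
    p_one_sided = score / n_samples
    # Two-sided p-value, smoothed so that it is at least 1 / (1 + n_samples).
    return (1 + n_samples * 2 * min(p_one_sided, 1 - p_one_sided)) / (1 + n_samples)


# Made-up frequencies, one value per text.
corpus1 = [10.0, 11.0, 12.0, 10.5]
corpus2 = [1.0, 2.0, 1.5, 2.5]
p_value = bootstrap_test_sketch(corpus1, corpus2, 999, seed=1)
```

Because every resampled mean of corpus1 exceeds every resampled mean of corpus2 in this toy data, the result sits exactly at the floor of 1/(1+N), which shows why a larger N is needed to report smaller p-values.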

NB. The frequency of the unit does not have to be normalised against the number of tokens; it is often preferable to normalise against the number of opportunities. For more information, see, e.g., Sean Wallis's blog.
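As a small illustration of the two normalisations, the following Python sketch builds per-text input vectors both ways. The helper name normalise and all numbers are invented for the example, and "opportunities" is taken here to mean the number of places in a text where the unit could have occurred, which is an assumption of this sketch.

```python
def normalise(counts, bases):
    """Divide each per-text count by its per-text base (hypothetical helper)."""
    if len(counts) != len(bases):
        raise ValueError("counts and bases must have one value per text")
    return [count / base for count, base in zip(counts, bases)]


# Made-up figures for three texts of one corpus.
counts = [12, 30, 7]            # occurrences of the unit in each text
tokens = [2000, 5000, 1000]     # total tokens in each text
opportunities = [40, 60, 35]    # places where the unit could have occurred

per_token = normalise(counts, tokens)
per_opportunity = normalise(counts, opportunities)
```

Either list could serve as data1 or data2, as long as both corpora are normalised in the same way.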

NNB. The Matlab code is more optimised than the R code, but suggestions for optimising the code are very welcome.