## Bootstrap Test

The statistical test for comparing two corpora as proposed in

- Jefrey Lijffijt, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila. **Significance testing of word frequencies in corpora**. *Forthcoming*.

and presented in

- Jefrey Lijffijt, Tanja Säily, Terttu Nevalainen. **Chi-square test considered harmful: Better methods for testing the significance of word frequencies.** ICAME 33, 30 May - 3 June, Leuven, Belgium, 2012. (Presentation)

is now available for R and Matlab.

The syntax is *bootstraptest(data1, data2, N)*, where *data1*
should be a vector of normalised frequencies of a linguistic unit in Corpus 1
(one value per text) and *data2* should be a vector of normalised
frequencies of the same unit in Corpus 2 (one value per text). *N* is
the number of bootstrap samples and determines the smallest possible p-value,
that is, the test can only output p-values greater than or equal to 1/(1+N).
For example, *N = 9,999* is a good choice in practice; it allows p-values as small as 1/10,000 = 0.0001.
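To make the role of *N* concrete, here is a minimal Python sketch of a two-sided bootstrap test on per-text frequencies. It is not the R or Matlab code offered on this page: the name `bootstrap_test_sketch`, the `seed` parameter, the use of the mean as the corpus-level statistic, and the tie handling are all assumptions made for illustration. The final smoothing step is chosen so that the result can never fall below 1/(1+N), matching the bound stated above.

```python
import random


def bootstrap_test_sketch(data1, data2, n_samples=9999, seed=None):
    """Two-sided bootstrap p-value for a difference in mean frequency.

    Hypothetical illustration only; not the authors' implementation.
    data1, data2: per-text normalised frequencies (one value per text).
    """
    rng = random.Random(seed)
    score = 0.0
    for _ in range(n_samples):
        # Resample texts with replacement within each corpus.
        mean1 = sum(rng.choices(data1, k=len(data1))) / len(data1)
        mean2 = sum(rng.choices(data2, k=len(data2))) / len(data2)
        if mean1 > mean2:
            score += 1.0
        elif mean1 == mean2:
            score += 0.5  # assumption: ties count as half
    p_one_sided = score / n_samples
    p_two_sided = 2.0 * min(p_one_sided, 1.0 - p_one_sided)
    # Smoothing keeps the result in [1/(1+N), 1].
    return (1.0 + n_samples * p_two_sided) / (1.0 + n_samples)
```

With this sketch, two corpora whose resampled means never overlap yield exactly 1/(1+N), so increasing `n_samples` is the only way to report a smaller p-value.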

NB. The frequency of the unit does not have to be normalised against the
number of tokens; it is often preferable to normalise against the number of
opportunities. For more information, see, e.g., Sean Wallis's blog.

NNB. The Matlab code is more optimised than the R code. Suggestions for
optimising either version are very welcome.