Big and Rich Data in English Corpus Linguistics, Methods and Explorations · Studies in Variation, Contacts and Change in English, to appear
We demonstrate the use of the types2 tool to explore, visualize, and assess the significance of variation in word frequencies. Based on accumulation curves and the statistical technique of permutation testing, this freely available tool is especially well suited to the study of types and hapax legomena, which are common measures of morphological productivity and lexical diversity. We have developed a new version of the tool that provides improved linking between the visualizations, metadata, and corpus texts, which facilitates the analysis of rich data.
We demonstrate the new version on two data sets extracted from the Corpora of Early English Correspondence (CEEC) and the British National Corpus (BNC), both of which are rich in sociolinguistic metadata. We show how to analyse such data sets with our software and how the results can be turned into interactive web pages whose visualizations are linked to the underlying data and metadata. We also illustrate how this linking facilitates exploring and interpreting the results.
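The idea of assessing type counts by permutation testing, as mentioned in the abstract, can be illustrated with a minimal Python sketch. This is not the types2 implementation: the function names `type_count_at` and `permutation_p_value`, the representation of a corpus as a list of token lists, and the rule of accumulating whole texts until a token budget is reached are all assumptions made for illustration. The sketch asks how often a random ordering of the texts yields at most as many types as the subcorpus of interest once a comparable amount of running text has accumulated.

```python
import random


def type_count_at(texts, token_budget):
    """Accumulate whole texts in order until at least `token_budget`
    tokens have been seen, then return the number of distinct types."""
    seen = set()
    tokens = 0
    for text in texts:
        if tokens >= token_budget:
            break
        seen.update(text)
        tokens += len(text)
    return len(seen)


def permutation_p_value(texts, subset_indices, n_perm=2000, seed=0):
    """Estimate how unusual the type count of a subcorpus is.

    `texts` is a list of token lists; `subset_indices` selects the
    subcorpus (hypothetically, all texts by one social group).
    Returns the observed type count and the fraction of random text
    orderings whose type count at the same token budget is at most
    the observed one (a one-sided, lower-tail estimate).
    """
    rng = random.Random(seed)
    subset = [texts[i] for i in subset_indices]
    budget = sum(len(t) for t in subset)
    observed = len({w for t in subset for w in t})

    shuffled = list(texts)
    at_most = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # one random point on the accumulation curve
        if type_count_at(shuffled, budget) <= observed:
            at_most += 1
    return observed, at_most / n_perm
```

Because whole texts are accumulated, a random ordering may slightly overshoot the token budget; a more careful treatment would account for that. A small fraction in the second return value suggests that the subcorpus has fewer types than would be expected from its size alone, while a value near 1 suggests the opposite.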
Turo Hiltunen, Joseph McVeigh, and Tanja Säily (Eds.): Big and Rich Data in English Corpus Linguistics, Methods and Explorations, volume 19 of Studies in Variation, Contacts and Change in English, Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki, Helsinki, Finland, to appear