Tanja Säily · Jukka Suomela

types2: Exploring word-frequency differences in corpora

Big and Rich Data in English Corpus Linguistics, Methods and Explorations · Studies in Variation, Contacts and Change in English, to appear

authors’ version

Abstract

We demonstrate the use of the types2 tool to explore, visualize, and assess the significance of variation in word frequencies. Based on accumulation curves and the statistical technique of permutation testing, this freely available tool is especially well suited to the study of types and hapax legomena, which are common measures of morphological productivity and lexical diversity. We have developed a new version of the tool that provides improved linking between the visualizations, metadata, and corpus texts, which facilitates the analysis of rich data.

The new version of our tool is demonstrated using two data sets extracted from the Corpora of Early English Correspondence (CEEC) and the British National Corpus (BNC), both of which are rich in sociolinguistic metadata. We show how to use our software to analyse such data sets, and how the new version of our tool can turn the results into interactive web pages with visualizations that are linked to the underlying data and metadata. Our paper illustrates how the linked data facilitates exploring and interpreting the results.
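
As a rough illustration of the permutation-testing idea mentioned above, the following Python sketch asks whether a subcorpus has unusually few word types compared with random sets of the same number of texts. It is a minimal sketch under stated assumptions, not the implementation used in types2: the function names, the toy texts, and the labels "f" and "m" are all hypothetical, and it shuffles group labels over whole texts rather than building accumulation curves.

```python
import random


def count_types(texts):
    """Return the number of distinct word types in a list of tokenized texts."""
    return len({token for text in texts for token in text})


def permutation_p_low(texts, labels, group, n_perm=2000, seed=0):
    """Estimate how often a random relabelling yields as few types as `group`.

    Hypothetical helper for illustration only. Whole texts are kept intact;
    only the assignment of texts to groups is shuffled.
    """
    rng = random.Random(seed)
    observed = count_types([t for t, l in zip(texts, labels) if l == group])
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        sample = [t for t, l in zip(texts, shuffled) if l == group]
        if count_types(sample) <= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (invented): five repetitive texts labelled "f", five varied texts labelled "m".
texts = [["a", "b"]] * 5 + [[f"w{i}", f"v{i}"] for i in range(5)]
labels = ["f"] * 5 + ["m"] * 5

observed, p = permutation_p_low(texts, labels, "f")
print(observed, p)
```

Shuffling labels over whole texts, rather than over individual words, keeps each text intact, so words that cluster within a single text do not inflate the apparent significance.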

Publication

Turo Hiltunen, Joseph McVeigh, and Tanja Säily (Eds.): Big and Rich Data in English Corpus Linguistics, Methods and Explorations, volume 19 of Studies in Variation, Contacts and Change in English, Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki, Helsinki, Finland, to appear

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.