types2: Type and Hapax Accumulation Curves

Jukka Suomela, 2016



types2 is a freely available corpus tool for comparing the frequencies of words, types, and hapax legomena across subcorpora. The tool uses accumulation curves and the statistical technique of permutation testing to compare the subcorpora with a “typical” corpus of a similar size, in order to visualize the frequencies and to identify statistically significant findings.
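The idea of comparing a subcorpus against "typical" corpora of a similar size can be illustrated with a small Python sketch. This is not the tool's own implementation. It assumes the corpus is a list of samples (each a list of word tokens), builds a type accumulation curve by adding whole samples one at a time, and estimates how often a randomly ordered corpus has at most, or at least, as many types as the subcorpus once it reaches the same number of tokens. The function names `accumulation_curve`, `types_at`, and `permutation_test` are hypothetical, and reading the curve at the first point that reaches the target size is a simplification chosen for brevity.

```python
import random


def accumulation_curve(samples):
    """Return [(tokens, types), ...] after adding each sample in order."""
    seen = set()
    tokens = 0
    curve = [(0, 0)]
    for sample in samples:
        seen.update(sample)
        tokens += len(sample)
        curve.append((tokens, len(seen)))
    return curve


def types_at(curve, target_tokens):
    """Type count at the first curve point with at least target_tokens tokens."""
    for tokens, types in curve:
        if tokens >= target_tokens:
            return types
    return curve[-1][1]


def permutation_test(samples, in_subcorpus, iterations=1000, seed=0):
    """Compare the subcorpus type count with randomly ordered corpora.

    samples      -- list of token lists (assumed input format)
    in_subcorpus -- parallel list of booleans marking subcorpus samples
    Returns (observed_types, share_at_most, share_at_least).
    """
    rng = random.Random(seed)
    sub = [s for s, flag in zip(samples, in_subcorpus) if flag]
    sub_tokens, observed = accumulation_curve(sub)[-1]

    at_most = at_least = 0
    for _ in range(iterations):
        order = list(samples)
        rng.shuffle(order)  # one random permutation of the samples
        random_types = types_at(accumulation_curve(order), sub_tokens)
        if random_types <= observed:
            at_most += 1
        if random_types >= observed:
            at_least += 1
    return observed, at_most / iterations, at_least / iterations
```

A small share of permutations with at most as many types would suggest that the subcorpus is unusually poor in types for its size; a small share with at least as many would suggest that it is unusually rich. The same scheme applies to hapax legomena if the set of seen types is replaced with a count of words that have occurred exactly once.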

The software is written by Jukka Suomela, and the system is designed and developed in collaboration with Tanja Säily. The sample data sets are provided by Tanja Säily. Please see the paper “types2: Exploring word-frequency differences in corpora” for more information on how to use the tool.


Usage and examples

Source code and installation instructions

Sample data

Live examples

Workflow in brief

  1. Copy the database template from template/types.sqlite to db/types.sqlite
  2. Populate the database db/types.sqlite with your input data
  3. Run bin/types-run to perform data analysis
  4. Run bin/types-web to create the web user interface
  5. Open web/index.html in a web browser

Quick start

    git clone https://github.com/suomela/types.git
    git clone https://github.com/suomela/types-examples.git
    cd types
    mkdir db
    cp ../types-examples/bnc-input/db/types.sqlite db/types.sqlite
    bin/types-run --citer=100000 --piter=100000
    bin/types-web
    open web/index.html