ICS-E4020 exercise: Miscellaneous


Task MI1 — optional

Subdirectory: mi1 (no template provided).

Choose at least three different tasks that you have solved in this course with efficient multithreaded programs. For example, you can use your solutions to tasks MF2, CP4, IS1, and SO1.

Get access to the CSC's Pouta cloud service, following these instructions.

Benchmark your programs in Pouta. Study the scalability at least in the following settings:

Put the raw benchmark data in your GitHub repository (in any format of your choice). Analyse the speedups and write a report. At minimum, your report has to contain figures that clearly show how the speedups that you get on different machines compare with each other and with linear speedups.
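As a reminder of what the figures should show, here is a minimal sketch of the speedup calculation. It assumes that you have measured one wall-clock time per thread count, including a single-threaded run. The function names and the timings are made up for illustration and are not part of the course material.

```python
def speedups(times_by_threads):
    """Map thread count -> speedup relative to the single-threaded run."""
    base = times_by_threads[1]
    return {p: base / t for p, t in sorted(times_by_threads.items())}


def efficiencies(times_by_threads):
    """Map thread count -> speedup / thread count (1.0 means linear speedup)."""
    return {p: s / p for p, s in speedups(times_by_threads).items()}


# Hypothetical timings in seconds, for illustration only.
example = {1: 8.0, 2: 4.0, 4: 2.5, 8: 2.0}
for p, s in speedups(example).items():
    print(f"{p} threads: speedup {s:.2f}, linear speedup would be {p}")
```

Plotting the speedup against the thread count, together with the line y = x, makes the gap to linear speedup visible at a glance.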

Task MI2 — optional

Subdirectory: mi2 (no template provided).

Use your solutions to the correlated pairs exercises to detect similarities and dissimilarities between various written languages.

Here are four data files that you can use directly:

These files contain lines of the following form:

Here “LANGUAGE” is the ISO 639-3 language code, “NGRAM” is a combination of letters that has occurred at least once in the words written in this language, and “COUNT” is the total number of such occurrences in our text collection. File “ngram.txt” contains all n-grams, i.e., combinations that consist of exactly n letters, plus some shorter letter combinations that are words by themselves.

All NGRAMs consist of plain ASCII lower-case letters a, b, …, z.
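The following sketch shows one way to read such data. It assumes that each line holds the three fields LANGUAGE, NGRAM, and COUNT in that order, separated by whitespace; check this against the actual files before relying on it. The function name `parse_counts` and the sample lines are hypothetical.

```python
import re

# NGRAMs are documented to consist of the letters a..z only.
NGRAM_RE = re.compile(r"[a-z]+")


def parse_counts(lines):
    """Return {language: {ngram: count}} from 'LANGUAGE NGRAM COUNT' lines.

    The whitespace-separated field layout is an assumption.
    """
    counts = {}
    for lineno, raw in enumerate(lines, start=1):
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields, got {len(fields)}")
        language, ngram, count = fields
        if not NGRAM_RE.fullmatch(ngram):
            raise ValueError(f"line {lineno}: unexpected n-gram {ngram!r}")
        per_language = counts.setdefault(language, {})
        # Sum repeated entries instead of overwriting them.
        per_language[ngram] = per_language.get(ngram, 0) + int(count)
    return counts


# Made-up sample lines, not taken from the real data files.
sample = ["eng the 120", "fin ja 7", "", "eng the 5"]
print(parse_counts(sample))
```

To read a file, pass the open file object directly, for example `parse_counts(open("3gram.txt"))`.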

These files are derived from Tatoeba, an open data set of sentences written in numerous languages; the original data set is freely available for download under the CC-BY license. We chose 44 languages, took all sentences written in these languages, filtered the data to remove garbage, transliterated all text to plain ASCII equivalents (discarding, e.g., all diacritical marks), and calculated the n-gram counts.

In this task you will write a program that can read any of the four data sets, as well as any other data sets that are given in the same form. Name your program “ngrams” and design it so that you can run it as follows:

    ./ngrams N FILE

For example:

    ./ngrams 3 3gram.txt
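If you prototype in Python, the command-line interface above could be handled roughly as follows. This is only a sketch using the standard `argparse` module; the function name `parse_args` is an illustrative choice, and a solution in any other language is equally acceptable as long as it accepts the same two arguments.

```python
import argparse


def parse_args(argv):
    """Parse the two positional arguments N and FILE."""
    parser = argparse.ArgumentParser(prog="ngrams")
    parser.add_argument("n", metavar="N", type=int,
                        help="length of the n-grams in FILE")
    parser.add_argument("file", metavar="FILE",
                        help="path to the n-gram count file")
    return parser.parse_args(argv)


# Equivalent to running: ./ngrams 3 3gram.txt
args = parse_args(["3", "3gram.txt"])
print(args.n, args.file)
```

In a real script you would call `parse_args(sys.argv[1:])` instead of passing a fixed list.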

Your program has to do at least the following:

You are also encouraged (but not required) to perform further analysis on the correlation matrix and, e.g., try to find a good clustering of the languages based on their similarities.
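One simple way to start on such further analysis is sketched below. It assumes that the counts are held in a nested dictionary `{language: {ngram: count}}`, uses the plain Pearson correlation coefficient as a stand-in for your own correlated-pairs solution, and groups languages by linking every pair whose correlation reaches a threshold (single-linkage clustering cut at that threshold). All names, the toy data, and the threshold are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den if den else 0.0


def correlation_matrix(counts):
    """Return (sorted language list, matrix of pairwise correlations)."""
    languages = sorted(counts)
    ngrams = sorted({g for c in counts.values() for g in c})
    # Missing n-grams count as zero occurrences.
    vectors = [[counts[lang].get(g, 0) for g in ngrams] for lang in languages]
    return languages, [[pearson(u, v) for v in vectors] for u in vectors]


def threshold_clusters(languages, matrix, threshold):
    """Connected components of the graph with an edge where r >= threshold."""
    parent = list(range(len(languages)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(languages)):
        for j in range(i + 1, len(languages)):
            if matrix[i][j] >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i, lang in enumerate(languages):
        groups.setdefault(find(i), []).append(lang)
    return sorted(groups.values())


# Toy data with made-up language codes and counts.
toy = {
    "aaa": {"ab": 10, "cd": 1, "ef": 1},
    "bbb": {"ab": 9, "cd": 2, "ef": 1},
    "ccc": {"ab": 1, "cd": 1, "ef": 10},
}
langs, corr = correlation_matrix(toy)
print(threshold_clusters(langs, corr, 0.9))
```

The quadratic pure-Python loop is only meant to show the idea; for the real data sets, the correlation step is where your efficient correlated-pairs implementation belongs.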

The output of your program can be in any form that is convenient for human consumption. Your report should also contain a brief analysis of how meaningful the results are. For example, are the isolated languages and isolated language pairs that your program identifies meaningful from the perspective of what you know (or can find out) about these languages?