Subdirectory: mi1 (no template provided).
Get access to CSC's Pouta cloud service, following these instructions.
Benchmark your programs in Pouta. Study the scalability at least in the following settings:
Put the raw benchmark data in your GitHub repository (in any format of your choice). Analyse the speedups and write a report. At minimum, your report has to contain figures that clearly show how the speedups that you get on different machines compare with each other and with linear speedups.
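As a reminder of what the figures should show, here is a minimal sketch of the speedup calculation. It assumes that the speedup with p threads is the single-thread running time divided by the p-thread running time, so that a linear speedup equals p. The function names and the timings are hypothetical; the numbers are made up for illustration and are not real measurements.

```python
def speedups(timings):
    """Map thread count -> speedup relative to the single-thread run.

    `timings` maps a thread count to wall-clock seconds and must contain key 1.
    """
    base = timings[1]
    return {p: base / t for p, t in sorted(timings.items())}


def efficiencies(timings):
    """Speedup divided by thread count; 1.0 means perfectly linear scaling."""
    return {p: s / p for p, s in speedups(timings).items()}


if __name__ == "__main__":
    # Made-up timings in seconds, for illustration only.
    example = {1: 8.0, 2: 4.0, 4: 2.5, 8: 2.0}
    for p, s in speedups(example).items():
        print(f"{p} threads: speedup {s:.2f}, linear speedup would be {p}")
```

Plotting the speedup against the thread count, together with the line y = x, makes the gap to linear scaling visible at a glance.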
Subdirectory: mi2 (no template provided).
Use your solutions to the correlated pairs exercises to detect similarities and dissimilarities between various written languages.
Here are four data files that you can use directly:
These files contain lines of the following form:
Here “LANGUAGE” is the ISO 639-3 language code, “NGRAM” is a combination of letters that has occurred at least once in the words written in this language, and “COUNT” is the total number of such occurrences in our text collection. File “ngram.txt” (for example, “3gram.txt” for n = 3) contains all n-grams, i.e., combinations that consist of exactly n letters, plus some shorter letter combinations that form a complete word by themselves.
All NGRAMs consist of plain ASCII lower-case letters a, b, …, z.
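Reading such a file could be sketched as follows. This is only an illustration under an assumption: that each line holds the three fields LANGUAGE, NGRAM and COUNT in this order, separated by whitespace. Check the actual data files before relying on it. The function name `parse_ngram_lines` is hypothetical.

```python
from collections import defaultdict


def parse_ngram_lines(lines):
    """Return {language: {ngram: count}} from an iterable of text lines.

    Assumption: each non-blank line is 'LANGUAGE NGRAM COUNT', with the
    fields separated by whitespace.
    """
    counts = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        language, ngram, count = fields
        # Add up repeated (language, ngram) entries instead of overwriting.
        counts[language][ngram] = counts[language].get(ngram, 0) + int(count)
    return dict(counts)
```

Because the function accepts any iterable of lines, it can be given an open file object directly, for example `parse_ngram_lines(open("3gram.txt"))`.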
These files are derived from Tatoeba, which is an open data set of sentences written in numerous languages — the original data set is freely available for download under the CC-BY license. We have chosen 44 languages, taken all sentences written in these languages, filtered the data to remove garbage, transliterated all text to plain ASCII equivalents (throwing away e.g. all diacritical marks), and calculated the n-gram counts.
In this task you will write a program that can read any of the four data sets, as well as any other data sets that are given in the same form. Name your program “ngrams” and design it so that you can run it as follows:
./ngrams N FILE
./ngrams 3 3gram.txt
Your program has to do at least the following:
Using your most efficient solution to the correlated pairs exercises, calculate the correlation between all pairs of rows in this array.
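For checking the output of your efficient implementation on small inputs, a plain reference version can be handy. The sketch below assumes that "correlation" means the Pearson correlation coefficient: each row is shifted to zero mean and scaled to unit length, after which the correlation of two rows is simply their dot product. It is deliberately simple and slow, and it is not a substitute for your optimised solution. The function names are hypothetical.

```python
import math


def normalize(row):
    """Shift a row to zero mean and scale it to unit Euclidean norm."""
    mean = sum(row) / len(row)
    centered = [x - mean for x in row]
    norm = math.sqrt(sum(x * x for x in centered))
    if norm == 0:
        raise ValueError("constant row: correlation is undefined")
    return [x / norm for x in centered]


def correlate(matrix):
    """Return the full symmetric matrix of Pearson correlations between rows."""
    rows = [normalize(r) for r in matrix]
    n = len(rows)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # After normalisation, the dot product is the correlation.
            c = sum(a * b for a, b in zip(rows[i], rows[j]))
            result[i][j] = result[j][i] = c
    return result
```

Every row must have the same length, so n-grams that are missing for some language need an explicit count of zero in that row.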
In essence, the goal is to find which written languages are (on this fairly superficial level of simplified n-grams) similar to each other, and which languages are dissimilar.
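One simple way to read off similar and dissimilar languages is to sort all language pairs by their correlation. The sketch below assumes that `names` is a list of language codes and `corr` is the square correlation matrix in the same order; both names, and the function `rank_pairs`, are hypothetical.

```python
def rank_pairs(names, corr):
    """Return (correlation, name_a, name_b) triples, most similar pair first."""
    pairs = [
        (corr[i][j], names[i], names[j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
    ]
    return sorted(pairs, reverse=True)
```

The head of the returned list holds the most similar pairs and the tail the most dissimilar ones.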
You are also encouraged (but not required) to perform further analysis on the correlation matrix and, e.g., try to find a good clustering of the languages based on their similarities.
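As one possible starting point for clustering, languages can be grouped whenever they are linked by a chain of pairs whose correlation reaches a chosen threshold (in effect, single-linkage clustering cut at a fixed level). This is only one of many options and not a recommended method; the threshold is an assumption that you would have to tune, and the names `cluster_by_threshold`, `names` and `corr` are hypothetical.

```python
def cluster_by_threshold(names, corr, threshold):
    """Group names into connected components of the graph that has an edge
    between i and j whenever corr[i][j] >= threshold."""
    parent = list(range(len(names)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if corr[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return sorted(sorted(group) for group in groups.values())
```

Languages that end up alone in a group at a given threshold are candidates for the "isolated" languages discussed in the report.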
The output of your algorithm can be in any form that is convenient for human consumption. Your report should also contain a brief analysis of how meaningful the results are. For example, are the isolated languages and isolated language pairs that your program identifies meaningful from the perspective of what you know (or can find out) about these languages?