In this task, we are given *m* input vectors, each with *n* numbers. Our task is to calculate the correlations between all pairs of input vectors.

Input: vector *i* = row *i* of the image.

Output: red pixel at (*i*, *j*) = positive correlation between rows *i* and *j*, blue pixel = negative correlation.

We will implement a command line tool `pngcorrelate`

that you can use to find pairwise correlations between pixel rows in PNG images. It outputs the result as another PNG image, similar to what you see in the above examples. The program accepts images with colours, but automatically converts them to greyscale before analysing correlations.

The program can be used as follows:

./pngcorrelate INPUT OUTPUT1 OUTPUT2

where INPUT is the input file, OUTPUT1 will be a greyscale version of the input file, and OUTPUT2 will show the pairwise correlations.

For example, if you have a PNG file called *example.png* in the current directory, you can find the pairwise correlations as follows:

./pngcorrelate example.png result1.png result2.png

You can find an almost-working implementation in our shared code repository, in the subdirectories `cp*`

. There is one directory per task; the templates are almost identical, with minor changes in e.g. benchmarks.

The only part that is missing is the implementation of the subroutine `correlate()`

in file `cp*/cp.cc`

. This is the function that does the actual calculations. See `cp*/cp.h`

for the description of the interface.

Once you have implemented the function correctly, you should be able to run `make`

to compile your code, `make test`

to check that it works correctly (at least for some test cases) and `make benchmark`

to see how well it performs.

You need to implement the following function:

void correlate(int ny, int nx, const float* data, float* result)

Here `data`

is a pointer to the input matrix, with `ny`

rows and `nx`

columns. For all `0 <= y < ny`

and `0 <= x < nx`

, the element at row `y`

and column `x`

is stored in `data[x + y*nx]`

.

The function has to solve the following task: for all `i`

and `j`

with `0 <= j <= i < ny`

, calculate the correlation coefficient between row `i`

of the input matrix and row `j`

of the input matrix, and store the result in `result[i + j*ny]`

.

Note that the correlations are symmetric, so we will only compute the upper triangle of the result matrix. You can leave the lower triangle `i < j`

undefined.

The arrays `data`

and `result`

are already allocated by whoever calls this function; you do not need to do any memory management.

The input and output are given as single-precision floating point numbers (type `float`

). However, if you calculate the correlations in a naive manner using single-precision floating point numbers, it will easily produce very large rounding errors.

The test suite allows for an absolute error of at most 0.00001 (for CP9, at most 0.0001), and it tries out inputs with up to 10000 rows or columns. Your code has to pass all tests.

In tasks CP1–CP3 and CP8 you must use double-precision floating point numbers (type `double`

) for all intermediate results, and only round the final result to single precision. This should make it easy to avoid any issues with numerical stability.

In tasks CP4 and CP9, however, you are encouraged to take advantage of single-precision floating point numbers to speed up calculations. Be careful.

Subdirectory: **cp1**, file to edit: **cp1/cp.cc**.

Start with the template solution and complete it so that it works correctly.

Do not use multiple threads or vector instructions yet. Advanced cache blocking is not needed, either; just make sure that you process data elements in a reasonable order. Use double-precision floating point numbers (type `double`

) for all intermediate results.

If you run `make benchmark`

and you get running times in the following ballpark, you can be happy:

cp 4000 100 0.777 cp 4000 1000 7.685 cp 4000 2000 15.400 cp 4000 4000 30.721

Subdirectory: **cp2**, file to edit: **cp2/cp.cc**.

Parallelise your solution to task CP1 with OpenMP.

Subdirectory: **cp3**, file to edit: **cp3/cp.cc**.

Use vector instructions to improve the performance of your solution to task CP2.

Subdirectory: **cp4**, file to edit: **cp4/cp.cc**.

Use cache blocking to improve the memory layout for your solution to task CP3. You are also encouraged to use single-precision floating point numbers to speed up calculations.

For reference, here is an example of what can be achieved with the classroom computers using double-precision floating point numbers:

cp 4000 100 0.041 cp 4000 1000 0.267 cp 4000 2000 0.546 cp 4000 4000 1.085

And here is an example of a solution that uses single-precision floating point numbers:

cp 4000 100 0.034 cp 4000 1000 0.144 cp 4000 2000 0.270 cp 4000 4000 0.541

Subdirectory: **cp5** (no template provided).

How close your solution to CP4 is to the theoretical and practical capabilities of the hardware?

- How many floating point operations (multiplications and/or additions) do you do in your code in total? How many floating point operations does your code do per clock cycle and per CPU core?
- Write benchmark code that achieves as many floating point operations per cycle as possible. Your program does not need to do anything useful. Make sure that you use vector instructions, you do not have any memory accesses that might be a bottleneck, and there are enough opportunities for instruction-level parallelism.
- Read the specifications of the CPU: how many floating point operations is it supposed to be capable of doing per clock cycle and per CPU core, under optimal conditions?

Compare the above results to each other. Is there still some room for improvements in your code? Is the bottleneck now in arithmetic operations, memory lookups, or elsewhere?

If you have access to a computer with a Haswell CPU, compile your solution to CP4 there. Study the assembly code. Is the compiler generating FMA instructions? If not, can you rewrite your code and/or adjust the compiler settings so that you benefit from the FMA instructions? Measure the performance of your code and compare it with the theoretical limitations of Haswell CPUs.

Your submission has to contain at least a written report, as a PDF file with name `report.pdf`

.

Subdirectory: **cp6** (no template provided).

Use your correlated pairs implementation (e.g., CP4) to detect similarities and dissimilarities between various written languages.

Here are four data files that you can use directly: 1gram.txt, 2gram.txt, 3gram.txt, and 4gram.txt. These files contain lines of the following form:

- LANGUAGE NGRAM COUNT

Here “LANGUAGE” is the ISO 639-3 language code, “NGRAM” is a combination of letters that has occurred at least once in the words written in this language, and “COUNT” is the total number of such occurrences in our text collection. File “*n*gram.txt” contains all *n*-grams, i.e., combinations that consist of exactly *n* letters, plus some shorter letter combinations that are a word by itself. All NGRAMs consist of plain ASCII lower-case letters a, b, …, z.

These files are derived from Tatoeba, which is an open data set of sentences written in numerous languages — the original data set is freely available for download under the CC-BY license. We have chosen 44 languages, taken all sentences written in these languages, filtered the data to remove garbage, transliterated all text to plain ASCII equivalents (throwing away e.g. all diacritical marks), and calculated the n-gram counts.

In this task you will write a program that can read any of the four data sets, as well as any other data sets that are given in the same form. Name your program “ngrams” and design it so that you can run it as follows:

./ngrams N FILE

For example:

./ngrams 3 3gram.txt

Your program has to do **at least** the following:

- Read all NGRAMs with precisely N letters from input file FILE; you can ignore shorter and longer NGRAMs.
- Construct an array in which there is one row for each LANGUAGE that occurs in FILE, and one column for each possible NGRAM. For example, if N = 3, then there should be precisely 26
^{3}= 17576 columns in the array, as there are 26 possible letters in the range a, b, …, z. - Using your correlated pairs implementation, calculate the correlation between all pairs of rows in this array. In essence, the goal is to find which written languages are (on this fairly superficial level of simplified ngrams) similar to each other, and which languages are dissimilar.
- Find all “isolated languages”, i.e., languages that have a low correlation with any other language in the data set. For example, you can output top-5 languages that are as far from any other language as possible.
- Find all “isolated language pairs”, i.e., pairs (
*i*,*j*) of languages such that for any third language*k*that is different from*i*and*j*, the correlation between*i*and*j*is (much) higher than the correlation between*i*and*k*or the correlation between*j*and*k*.

You are also encouraged (but not required) to perform further analysis on the correlation matrix and, e.g., try to find a good clustering of the languages based on their similarities.

The output of your algorithm can be in any form that is convenient for human consumption. Please also submit a report that contains a couple of words of analysis regarding how meaningful the results are. For example, are the isolated languages and isolated language pairs that your program identifies meaningful from the perspective of what you know (or can find out) about these languages?

Your submission has to contain the source code of your program, plus a written report, as a PDF file with name `report.pdf`

.

Subdirectory: **cp7** (no template provided).

Try to use Strassen's algorithm to speed up matrix multiplications. Can you improve on your solution to CP4 this way?

Your submission has to contain the source code of your implementation, plus a written report, as a PDF file with name `report.pdf`

.

Subdirectory: **cp8**, file to edit: **cp8/cp.cu**.

Using CUDA, write a working solution that solves the problem on the GPU.

The interface is identical to what we have had in the previous tasks. The only difference is that you will write your code in `cp.cu`

instead of `cp.cc`

. This file will be compiled and linked with NVCC. Thus you can use both normal C++ code and CUDA constructions in this file.

Your implementation does not need to be efficient. However, it has to work correctly and it has to do most of the computation on GPU. Use double-precision floating point numbers (type `double`

) for all intermediate results.

Subdirectory: **cp9**, file to edit: **cp9/cp.cu**.

Using CUDA, write an efficient working solution that solves the problem on the GPU. You are also encouraged to use single-precision floating point numbers to speed up calculations.

For reference, here is an example of what can be achieved with the classroom computers using GPUs:

cp 4000 100 0.052 cp 4000 100 0.020 cp 4000 1000 0.125 cp 4000 1000 0.095 cp 4000 2000 0.209 cp 4000 2000 0.179 cp 4000 4000 0.383 cp 4000 4000 0.349

Note that the benchmarking program runs all tests twice. The first invocation includes one-time overhead related to CUDA API (approx. 30ms).

One reasonable way to calculate correlations is the following:

- First normalise the input rows so that each row has the arithmetic mean of 0.
- Then normalise the input rows so that for each row the sum of the squares of the elements is 1.
- Let
*X*be the normalised input matrix. - Calculate the (upper triangle of the) matrix product
*Y*=*XX*^{T}.

Now matrix *Y* contains all pairwise correlations. The only computationally-intensive part is the computation of the matrix product; the normalisations can be done in linear time in the input size.

In your code, you probably have nested `for`

loops such that the length of the inner loop varies. With the default schedule, `#pragma omp for`

may not perform that well: one thread may execute only short loops and another thread may execute only long loops. The following directive may be helpful:

#pragma omp for schedule(static, 1)

See the OpenMP specification for a detailed description.

You can use the data type `double4_t`

defined in `common/vector.h`

; this is a vector of 4 double-precision floating point numbers. To allocate memory for an array of `double4_t`

vectors with the correct alignment, you can use the memory allocation routine `double4_alloc`

from `common/vector.cc`

.

You can first convert all of your input data (possibly after normalisation) into `double4_t`

vectors: for each row, store the first 4 elements in one `double4_t`

, the next 4 elements in another `double4_t`

etc., and add some zeroes for padding if the number of columns is not a multiple of 4. See this drawing for some help with a suitable memory layout.

To find pairwise correlations, you can first calculate a vector `double4_t s`

where `s[i]`

is the result restricted to columns `i`

, `i+4`

, `i+8`

, … Finally, calculate the sum `s[0] + s[1] + s[2] + s[3]`

.

Try to improve the arithmetic intensity of your algorithm: do more multiplications per memory access.

For example, instead of considering individual rows of the input matrix, you can consider **blocks** of 3 rows. For each pairs of such blocks, compute all 3 × 3 dot products in parallel. The number of arithmetic operations remains the same. However, the amount of data transferred from the memory is reduced: you now only need to read 6 rows of input to produce 9 elements of output, while in the naive solution you will read 2 rows of input to produce 1 element of output; this is a factor-3 improvement.

Of course if you choose blocks that are too large, you will have difficulties keeping all intermediate results in CPU registers. Experiment with different block sizes.

For numerical stability with single-precision floating point numbers, it may be a good idea to first normalise the input matrix, and after that calculate the matrix product. If you first calculate dot products and normalise afterwards, you may easily accumulate large errors.

You want to do cache blocking on two levels:

- Each
**block**copies a small part of data (e.g. 1000 values?) from the global memory to the**shared memory**, and then it does lots of useful arithmetic operations with this data. - Each
**thread**copies a tiny part of data (e.g. 10 values?) from the shared memory to the**registers**, and then it does lots of useful arithmetic operations with this data.

Here is one concrete approach that you can try to use (see this drawing for an illustration):

- The output matrix is divided in “
*large squares*” of size*mk*×*mk*. - Each large square is divided in “
*small squares*” of size*k*×*k*. - The input is divided in “
*major steps*” that consist of*m*columns. - Each major step is divided in “
*minor steps*” that consist of*1*column. - There is one thread per small square.
- There is one block per large square.
- Major steps and large squares are related to cache blocking in shared memory.
- Minor steps and small squares are related to cache blocking in registers.
- Data storage per block:
- “A”: keep 2 ×
*mk*×*m*elements of input in the shared memory (cache).

- “A”: keep 2 ×
- Data storage per thread:
- “B”: keep 2 ×
*k*elements of input in the registers (cache). - “X”: keep
*k*×*k*elements of partial output in the registers.

- “B”: keep 2 ×
- Major step, per block:
- Reads 2 ×
*mk*×*m*elements of input to A. - There are
*m*×*m*threads in the block, so each thread only needs to read 2*k*elements.

- Reads 2 ×
- Minor step, per thread:
- Read 2 ×
*k*× 1 elements of input from A to B. - Calculate
*k*×*k*pairwise products and update X.

- Read 2 ×
- Finally, each thread will have its own part of the output in X. Note that you do not need to use global memory or shared memory to store any intermediate results; everything is kept in the registers until we have got the final result ready to be written to the output.

Whenever you have problems with the performance, check the following numbers:

- The number of useful arithmetic operations per each value that you copy from the global memory to the shared memory. This should be a lot more than 10.
- The number of useful arithmetic operations per each value that you copy from the shared memory to registers. This should be a lot more than 1.
- How much shared memory is allocated for each block? How many blocks can be active simultaneously? Remember that there is only 48KB of shared memory in total per multiprocessor, and there are only 2 multiprocessor in our GPUs. If you use lots of shared memory per block, you can have few blocks active simultaneously.
- How many threads there are in each block? Remember that everything happens in warps, which consists of 32 threads, so the block size should be at least 32 threads.
- How many warps are ready for execution simultaneously? This should be a lot more than 10. This may be limited by the shared memory (only a few blocks of threads per multiprocessor active simultaneously). This may also be limited by the grid size (only a few blocks available in total).

For fine-tuning, you may also want to pay attention to the memory access patterns. If you need to copy data from the global memory to the shared memory, it is helpful to organise reading so that within a block, thread number `i`

is accessing memory slot number `x+i`

, for some value `x`

that only depends on the block index and not on the thread index.

It may be also useful to check that you are not running out of GPU registers. The following command may be helpful there:

cuobjdump -sass -arch sm_30 cp.o | grep STL

If it prints lots of lines of the assembly code, then the compiler ran out of GPU registers and some of the local variables had to be kept in the main memory.