ICS-E4020 exercise: Correlated Pairs

Overview

In this task, we are given m input vectors, each with n numbers. Our task is to calculate the correlations between all pairs of input vectors.

Examples

Input: vector i = row i of the image.

Output: red pixel at (i, j) = positive correlation between rows i and j, blue pixel = negative correlation.

[Five pairs of example images: input on the left, output on the right.]

Implementation

We will implement a command line tool pngcorrelate that you can use to find pairwise correlations between pixel rows in PNG images. It outputs the result as another PNG image, similar to the examples above. The program accepts colour images, but it automatically converts them to greyscale before analysing the correlations.

Usage

The program can be used as follows:

    ./pngcorrelate INPUT OUTPUT1 OUTPUT2

where INPUT is the input file, OUTPUT1 will be a greyscale version of the input file, and OUTPUT2 will show the pairwise correlations.

For example, if you have a PNG file called example.png in the current directory, you can find the pairwise correlations as follows:

    ./pngcorrelate example.png result1.png result2.png

Template

You can find an almost-working implementation in our shared code repository, in the subdirectories cp*. There is one directory per task; the templates are almost identical, with minor changes in e.g. benchmarks.

The only part that is missing is the implementation of the subroutine correlate() in file cp*/cp.cc. This is the function that does the actual calculations. See cp*/cp.h for the description of the interface.

Once you have implemented the function correctly, you should be able to run make to compile your code, make test to check that it works correctly (at least for some test cases) and make benchmark to see how well it performs.

Detailed specification

You need to implement the following function:

    void correlate(int ny, int nx, const float* data, float* result)

Here data is a pointer to the input matrix, with ny rows and nx columns. For all 0 <= y < ny and 0 <= x < nx, the element at row y and column x is stored in data[x + y*nx].

Correct output

The function has to solve the following task: for all i and j with 0 <= j <= i < ny, calculate the correlation coefficient between row i of the input matrix and row j of the input matrix, and store the result in result[i + j*ny].

Note that the correlations are symmetric, so we will only compute the upper triangle of the result matrix. You can leave the lower triangle i < j undefined.

The arrays data and result are already allocated by whoever calls this function; you do not need to do any memory management.
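
For concreteness, here is a minimal sketch of a function that satisfies this specification, computing each correlation directly from the textbook formula with double-precision intermediate results; it is intentionally simple and slow:

    #include <cmath>

    void correlate(int ny, int nx, const float* data, float* result) {
        for (int j = 0; j < ny; ++j) {
            for (int i = j; i < ny; ++i) {
                // Arithmetic means of rows i and j, in double precision.
                double sum_i = 0.0, sum_j = 0.0;
                for (int x = 0; x < nx; ++x) {
                    sum_i += data[x + i*nx];
                    sum_j += data[x + j*nx];
                }
                double mean_i = sum_i / nx, mean_j = sum_j / nx;
                // Pearson correlation coefficient of rows i and j.
                double num = 0.0, den_i = 0.0, den_j = 0.0;
                for (int x = 0; x < nx; ++x) {
                    double a = data[x + i*nx] - mean_i;
                    double b = data[x + j*nx] - mean_j;
                    num += a * b;
                    den_i += a * a;
                    den_j += b * b;
                }
                result[i + j*ny] = float(num / std::sqrt(den_i * den_j));
            }
        }
    }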

Numerical stability

The input and output are given as single-precision floating point numbers (type float). However, if you calculate the correlations in a naive manner using single-precision floating point numbers, it will easily produce very large rounding errors.

The test suite allows for an absolute error of at most 0.00001 (for CP9, at most 0.0001), and it tries out inputs with up to 10000 rows or columns. Your code has to pass all tests.

In tasks CP1–CP3 and CP8 you must use double-precision floating point numbers (type double) for all intermediate results, and only round the final result to single precision. This should make it easy to avoid any issues with numerical stability.

In tasks CP4 and CP9, however, you are encouraged to take advantage of single-precision floating point numbers to speed up calculations. Be careful.

Other rules

Please do not try to use std::valarray in your code.

Tasks

Task CP1

Subdirectory: cp1, file to edit: cp1/cp.cc.

Start with the template solution and complete it so that it works correctly.

Do not use multiple threads or vector instructions yet. Advanced cache blocking is not needed, either; just make sure that you process data elements in a reasonable order. Use double-precision floating point numbers (type double) for all intermediate results.

If you run make benchmark and you get running times in the following ballpark, you can be happy:

    cp      4000    100     0.777
    cp      4000    1000    7.685
    cp      4000    2000    15.400
    cp      4000    4000    30.721

Task CP2

Subdirectory: cp2, file to edit: cp2/cp.cc.

Parallelise your solution to task CP1 with OpenMP.

Task CP3

Subdirectory: cp3, file to edit: cp3/cp.cc.

Use vector instructions to improve the performance of your solution to task CP2.

Task CP4

Subdirectory: cp4, file to edit: cp4/cp.cc.

Use cache blocking to improve the memory layout for your solution to task CP3. You are also encouraged to use single-precision floating point numbers to speed up calculations.

For reference, here is an example of what can be achieved with the classroom computers using double-precision floating point numbers:

    cp      4000    100     0.041
    cp      4000    1000    0.267
    cp      4000    2000    0.546
    cp      4000    4000    1.085

And here is an example of a solution that uses single-precision floating point numbers:

    cp      4000    100     0.034
    cp      4000    1000    0.144
    cp      4000    2000    0.270
    cp      4000    4000    0.541

Task CP5 — challenging

Subdirectory: cp5 (no template provided).

How close is your solution to CP4 to the theoretical and practical capabilities of the hardware?

Compare the above results to each other. Is there still some room for improvement in your code? Is the bottleneck now in arithmetic operations, memory lookups, or elsewhere?

If you have access to a computer with a Haswell CPU, compile your solution to CP4 there. Study the assembly code (see README.md for instructions). Is the compiler generating FMA instructions? If not, can you rewrite your code and/or adjust the compiler settings so that you benefit from the FMA instructions? Measure the performance of your code and compare it with the theoretical limitations of Haswell CPUs.
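
With GCC, for example, you could experiment with flags along these lines: -march=native enables the FMA instruction set on a Haswell machine, and -ffp-contract=fast lets the compiler contract a*b + c into a single FMA (the exact compile command here is only illustrative; the course Makefiles may use different settings):

    g++ -O3 -march=native -ffp-contract=fast cp.cc

Whether this actually produces FMA instructions in the hot loop is exactly what you should verify from the assembly code.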

Your submission has to contain at least a written report, as a PDF file with name report.pdf.

Task CP6 — challenging

Subdirectory: cp6 (no template provided).

Use your correlated pairs implementation (e.g., CP4) to detect similarities and dissimilarities between various written languages.

Here are four data files that you can use directly: 1gram.txt, 2gram.txt, 3gram.txt, and 4gram.txt. These files contain lines of the following form:

    LANGUAGE NGRAM COUNT

Here “LANGUAGE” is the ISO 639-3 language code, “NGRAM” is a combination of letters that has occurred at least once in the words written in this language, and “COUNT” is the total number of such occurrences in our text collection. File “ngram.txt” contains all n-grams, i.e., combinations that consist of exactly n letters, plus some shorter letter combinations that form a word by themselves. All NGRAMs consist of plain ASCII lower-case letters a, b, …, z.

These files are derived from Tatoeba, which is an open data set of sentences written in numerous languages — the original data set is freely available for download under the CC-BY license. We have chosen 44 languages, taken all sentences written in these languages, filtered the data to remove garbage, transliterated all text to plain ASCII equivalents (throwing away e.g. all diacritical marks), and calculated the n-gram counts.

In this task you will write a program that can read any of the four data sets, as well as any other data sets that are given in the same form. Name your program “ngrams” and design it so that you can run it as follows:

    ./ngrams N FILE

For example:

    ./ngrams 3 3gram.txt

Your program has to do at least the following: read the given data file, interpret the n-gram counts of each language as an input vector, compute all pairwise correlations between the 44 languages, and report, e.g., which languages and language pairs are isolated and which are closely related.

You are also encouraged (but not required) to perform further analysis on the correlation matrix and, e.g., try to find a good clustering of the languages based on their similarities.

The output of your algorithm can be in any form that is convenient for human consumption. Please also submit a report that contains a couple of words of analysis regarding how meaningful the results are. For example, are the isolated languages and isolated language pairs that your program identifies meaningful from the perspective of what you know (or can find out) about these languages?

Your submission has to contain the source code of your program, plus a written report, as a PDF file with name report.pdf.

Task CP7 — challenging

Subdirectory: cp7 (no template provided).

Try to use Strassen's algorithm to speed up matrix multiplications. Can you improve on your solution to CP4 this way?

Your submission has to contain the source code of your implementation, plus a written report, as a PDF file with name report.pdf.

Task CP8

Subdirectory: cp8, file to edit: cp8/cp.cu.

Using CUDA, write a working solution that solves the problem on the GPU.

The interface is identical to what we have had in the previous tasks. The only difference is that you will write your code in cp.cu instead of cp.cc. This file will be compiled and linked with NVCC; thus you can use both normal C++ code and CUDA constructs in this file.

Your implementation does not need to be efficient. However, it has to work correctly, and it has to do most of the computation on the GPU. Use double-precision floating point numbers (type double) for all intermediate results.
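
For a rough idea of the expected shape of a solution, here is a sketch of a simple kernel; it assumes that the rows have already been normalised on the CPU (zero mean, unit norm, stored as doubles) and copied to the global memory of the GPU, so that one thread can compute one element of the result as a plain dot product (all names here are illustrative):

    __global__ void correlate_kernel(int ny, int nx,
                                     const double* norm, float* result) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= ny || j >= ny || j > i) return;  // upper triangle only
        double s = 0.0;
        for (int x = 0; x < nx; ++x) {
            // Dot product of normalised rows i and j = their correlation.
            s += norm[x + i*nx] * norm[x + j*nx];
        }
        result[i + j*ny] = float(s);
    }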

Task CP9

Subdirectory: cp9, file to edit: cp9/cp.cu.

Using CUDA, write an efficient working solution that solves the problem on the GPU. You are also encouraged to use single-precision floating point numbers to speed up calculations.

For reference, here is an example of what can be achieved with the classroom computers using GPUs:

    cp      4000    100     0.052
    cp      4000    100     0.020
    cp      4000    1000    0.125
    cp      4000    1000    0.095
    cp      4000    2000    0.209
    cp      4000    2000    0.179
    cp      4000    4000    0.383
    cp      4000    4000    0.349

Note that the benchmarking program runs all tests twice; that is why each input size appears twice above. The first invocation includes one-time overhead related to the CUDA API (approx. 30 ms).

Hints

Task CP1

One reasonable way to calculate correlations is the following:

- Let X be the input matrix, with one row per input vector.
- Normalise each row of X: first subtract the arithmetic mean of the row from each element, then divide each element by the Euclidean norm of the row, so that each row has mean 0 and sum of squares 1.
- Calculate the matrix product Y = XXᵀ.

Now matrix Y contains all pairwise correlations. The only computationally intensive part is the computation of the matrix product; the normalisations can be done in linear time in the input size.
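
As a sketch, the normalisation step could look like this inside correlate() (with <cmath> and <vector> included):

    // Normalise each row: zero mean, unit Euclidean norm. After this,
    // the correlation of rows i and j is simply their dot product.
    std::vector<double> normalised(ny * nx);
    for (int y = 0; y < ny; ++y) {
        double mean = 0.0;
        for (int x = 0; x < nx; ++x) mean += data[x + y*nx];
        mean /= nx;
        double ss = 0.0;
        for (int x = 0; x < nx; ++x) {
            double v = data[x + y*nx] - mean;
            normalised[x + y*nx] = v;
            ss += v * v;
        }
        double scale = 1.0 / std::sqrt(ss);
        for (int x = 0; x < nx; ++x) normalised[x + y*nx] *= scale;
    }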

Task CP2

In your code, you probably have nested for loops such that the length of the inner loop varies. With the default schedule, #pragma omp for may not perform that well: one thread may execute only short loops and another thread may execute only long loops. The following directive may be helpful:

    #pragma omp for schedule(static, 1)

See the OpenMP specification for a detailed description.
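
For example (a sketch; the inner loop body is whatever your CP1 solution does):

    #pragma omp parallel for schedule(static, 1)
    for (int j = 0; j < ny; ++j) {
        // The inner loop gets shorter as j grows. schedule(static, 1)
        // deals the outer iterations out round-robin, so every thread
        // gets a mix of long and short inner loops.
        for (int i = j; i < ny; ++i) {
            // ... compute result[i + j*ny] ...
        }
    }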

Task CP3

You can use the data type double4_t defined in common/vector.h; this is a vector of 4 double-precision floating point numbers. To allocate memory for an array of double4_t vectors with the correct alignment, you can use the memory allocation routine double4_alloc from common/vector.cc.

You can first convert all of your input data (possibly after normalisation) into double4_t vectors: for each row, store the first 4 elements in one double4_t, the next 4 elements in another double4_t etc., and add some zeroes for padding if the number of columns is not a multiple of 4. See this drawing for some help with a suitable memory layout.

To compute the dot product of two rows (and hence their correlation, if the rows are normalised), you can first calculate a vector double4_t s, where s[i] is the part of the dot product restricted to columns i, i+4, i+8, … Finally, calculate the sum s[0] + s[1] + s[2] + s[3].
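
As a sketch, assuming the rows are already normalised and stored as na padded double4_t vectors per row:

    // Dot product of two rows of na double4_t vectors each; with
    // normalised rows this equals their correlation.
    double dot(const double4_t* a, const double4_t* b, int na) {
        double4_t s = {0.0, 0.0, 0.0, 0.0};
        for (int k = 0; k < na; ++k) {
            s += a[k] * b[k];              // 4 multiply-adds per step
        }
        return s[0] + s[1] + s[2] + s[3];  // horizontal sum of the lanes
    }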

Task CP4

Try to improve the arithmetic intensity of your algorithm: do more multiplications per memory access.

For example, instead of considering individual rows of the input matrix, you can consider blocks of 3 rows. For each pair of such blocks, compute all 3 × 3 dot products in parallel. The number of arithmetic operations remains the same, but the amount of data transferred from memory is reduced: you now read 6 rows of input to produce 9 elements of output, while the naive solution reads 2 rows of input to produce 1 element of output, i.e., 18 rows per 9 elements; this is a factor-3 improvement.

Of course if you choose blocks that are too large, you will have difficulties keeping all intermediate results in CPU registers. Experiment with different block sizes.
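
As a sketch, the innermost loop for one pair of 3-row blocks (rows ia…ia+2 and ja…ja+2) might look like this, with vd and na as in the CP3 hint; the names are illustrative:

    double4_t s[3][3] = {};  // 9 vector accumulators, kept in registers
    for (int k = 0; k < na; ++k) {
        double4_t a0 = vd[na*(ia+0) + k];
        double4_t a1 = vd[na*(ia+1) + k];
        double4_t a2 = vd[na*(ia+2) + k];
        double4_t b0 = vd[na*(ja+0) + k];
        double4_t b1 = vd[na*(ja+1) + k];
        double4_t b2 = vd[na*(ja+2) + k];
        // 6 vector loads, 9 vector multiply-adds per iteration.
        s[0][0] += a0*b0; s[0][1] += a0*b1; s[0][2] += a0*b2;
        s[1][0] += a1*b0; s[1][1] += a1*b1; s[1][2] += a1*b2;
        s[2][0] += a2*b0; s[2][1] += a2*b1; s[2][2] += a2*b2;
    }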

For numerical stability with single-precision floating point numbers, it may be a good idea to first normalise the input matrix, and after that calculate the matrix product. If you first calculate dot products and normalise afterwards, you may easily accumulate large errors.

Task CP9

You want to do cache blocking on two levels:

- use the shared memory of a thread block as a cache for data in the global GPU memory, and
- use the registers (local variables) of each thread as a cache for data in the shared memory.

Here is one concrete approach that you can try to use (see this drawing for an illustration):

Whenever you have problems with the performance, check the following numbers:

For fine-tuning, you may also want to pay attention to the memory access patterns. If you need to copy data from the global memory to the shared memory, it is helpful to organise reading so that within a block, thread number i is accessing memory slot number x+i, for some value x that only depends on the block index and not on the thread index.
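
For example, a coalesced read into shared memory could look roughly like this (a sketch; the block size 64 and the names are illustrative):

    // Each of the 64 threads of the block reads one element; "base"
    // depends only on the block index, so thread i accesses slot
    // base + i and the reads coalesce into wide memory transactions.
    __shared__ double buf[64];
    int base = 64 * blockIdx.x;
    buf[threadIdx.x] = data[base + threadIdx.x];
    __syncthreads();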

It may also be useful to check that you are not running out of GPU registers. The following command may be helpful here:

    cuobjdump -sass -arch sm_30 cp.o | grep STL

If it prints many lines of assembly code, the compiler ran out of GPU registers: STL is a store to “local memory”, which means that some of the local variables had to be kept in slow off-chip memory instead of registers.