ICS-E4020 exercise: Correlated Pairs


Overview

In this task, we are given m input vectors, each with n numbers. Our task is to calculate the correlations between all pairs of input vectors.
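Here "correlation" refers to the Pearson correlation coefficient: for two vectors a and b of length n,

    r(a, b) = \frac{\sum_{k=1}^{n} (a_k - \bar{a}) (b_k - \bar{b})}
                   {\sqrt{\sum_{k=1}^{n} (a_k - \bar{a})^2} \sqrt{\sum_{k=1}^{n} (b_k - \bar{b})^2}}

where \bar{a} and \bar{b} denote the means of a and b.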

Examples

Input: vector i = row i of the image.

Output: red pixel at (i, j) = positive correlation between rows i and j; blue pixel = negative correlation.

(Five pairs of example input and output images.)

Implementation

We will implement a command line tool pngcorrelate that you can use to find pairwise correlations between pixel rows in PNG images. It outputs the result as another PNG image, similar to what you see in the examples above. The program accepts colour images, but it automatically converts them to greyscale before analysing the correlations.

Usage

The program can be used as follows:

    ./pngcorrelate INPUT OUTPUT1 OUTPUT2

where INPUT is the input file, OUTPUT1 will be a greyscale version of the input file, and OUTPUT2 will show the pairwise correlations.

For example, if you have a PNG file called example.png in the current directory, you can find the pairwise correlations as follows:

    ./pngcorrelate example.png result1.png result2.png

Template

You can find an almost-working implementation in our shared code repository, in the subdirectories cp*. There is one directory per task; the templates are almost identical, with minor changes in e.g. benchmarks.

The only part that is missing is the implementation of the subroutine correlate() in file cp*/cp.cc. This is the function that does the actual calculations. See cp*/cp.h for the description of the interface.

Once you have implemented the function correctly, you should be able to run make to compile your code, make test to check that it works correctly (at least for some test cases) and make benchmark to see how well it performs.
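For example, in the cp1 subdirectory the workflow looks like this:

    cd cp1
    make              # compile
    make test         # check correctness
    make benchmark    # measure performance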

Detailed specification

You need to implement the following function:

    void correlate(int ny, int nx, const float* data, float* result)

Here data is a pointer to the input matrix, with ny rows and nx columns. For all 0 <= y < ny and 0 <= x < nx, the element at row y and column x is stored in data[x + y*nx].

Correct output

The function has to solve the following task: for all i and j with 0 <= j <= i < ny, calculate the correlation coefficient between row i of the input matrix and row j of the input matrix, and store the result in result[i + j*ny].

Note that the correlations are symmetric, so we will only compute the upper triangle of the result matrix. You can leave the lower triangle (entries with i < j) undefined.

The arrays data and result are already allocated by whoever calls this function; you do not need to do any memory management.
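For illustration, here is a skeleton that shows only the indexing conventions; the actual computation is still missing:

    void correlate(int ny, int nx, const float* data, float* result) {
        for (int j = 0; j < ny; ++j) {
            for (int i = j; i < ny; ++i) {
                // input row i consists of data[0 + i*nx] ... data[(nx-1) + i*nx];
                // the correlation of rows i and j belongs in result[i + j*ny]
                result[i + j*ny] = 0.0f;   // placeholder only
            }
        }
    }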

Numerical stability

The input and output are given as single-precision floating point numbers (type float). However, if you calculate the correlations in a naive manner using single-precision arithmetic, you will easily produce very large rounding errors. To avoid this, you can for example do all internal computations in double precision (type double), or use a summation scheme that keeps the rounding errors small.

The test suite allows for an absolute error of at most 0.00001, and it tries out inputs with at most 10000 rows or columns. Your code has to pass all tests.

Other rules

Please do not try to use std::valarray in your code.

Tasks

Task CP1

Subdirectory: cp1, file to edit: cp1/cp.cc.

Start with the template solution and complete it so that it works correctly.

Do not use multiple threads or vector instructions yet. Advanced cache blocking is not needed, either; just make sure that you process data elements in a reasonable order.

Examples of sufficient performance on the classroom computers:

If you run make benchmark and you get running times in the following ballpark, you can be happy:

    cp      4000    100     0.777
    cp      4000    1000    7.685
    cp      4000    2000    15.400
    cp      4000    4000    30.721

Task CP2

Subdirectory: cp2, file to edit: cp2/cp.cc.

Parallelise your solution to task CP1 with OpenMP.

Examples of sufficient performance on the classroom computers:

Task CP3

Subdirectory: cp3, file to edit: cp3/cp.cc.

Use vector instructions to improve the performance of your solution to task CP2.

Examples of sufficient performance on the classroom computers:

Task CP4

Subdirectory: cp4, file to edit: cp4/cp.cc.

Use cache blocking to improve the memory layout for your solution to task CP3.

Examples of sufficient performance on the classroom computers:

For reference, here is an example of what can be achieved with the classroom computers using double-precision floating point numbers:

    cp      4000    100     0.041
    cp      4000    1000    0.267
    cp      4000    2000    0.546
    cp      4000    4000    1.085

And here is an example of a solution that uses single-precision floating point numbers:

    cp      4000    100     0.034
    cp      4000    1000    0.144
    cp      4000    2000    0.270
    cp      4000    4000    0.541

Task CP5 — optional

Subdirectory: cp5 (no template provided).

How close does your CP4 solution come to the theoretical and practical capabilities of the hardware?

Compare the above results to each other. Is there still some room for improvement in your code? Is the bottleneck now in arithmetic operations, memory lookups, or elsewhere?

Task CP6 — optional

Subdirectory: cp6 (no template provided).

If you have access to a computer with a Haswell CPU, compile your solution to CP4 there. Study the assembly code (see README.md for instructions).

Is the compiler generating FMA instructions? If not, can you rewrite your code and/or adjust the compiler settings so that you benefit from the FMA instructions?
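For example, assuming GCC, you can compile to assembly and search for fused multiply-add instructions (the exact flags used by the course Makefile may differ):

    g++ -O3 -march=native -S -o cp.s cp.cc
    grep vfmadd cp.s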

Measure the performance of your code and compare it with the theoretical limitations of Haswell CPUs.

Task CP7 — optional

Subdirectory: cp7 (no template provided).

Try to use Strassen's algorithm to speed up matrix multiplications. Can you improve on your solution to CP4 this way?

Task CP8 — worth 10 points

Subdirectory: cp8, file to edit: cp8/cp.cu.

Using CUDA, write a working solution that solves the problem on the GPU.

The interface is identical to what we had in the previous tasks. The only difference is that you will write your code in cp.cu instead of cp.cc. This file will be compiled and linked with NVCC. Thus you can use both normal C++ code and CUDA constructs in this file.

Your implementation does not need to be efficient. However, it has to work correctly, and it has to do most of the computation on the GPU.
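For example, one simple approach is to let each GPU thread compute one element of the result. The following sketch assumes that the (already normalised) input has been copied to GPU memory as dGPU and that rGPU points to GPU memory reserved for the result; the names are hypothetical:

    __global__ void mykernel(int ny, int nx, const float* dGPU, float* rGPU) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int j = threadIdx.y + blockIdx.y * blockDim.y;
        if (i >= ny || j > i)
            return;                      // only the upper triangle j <= i
        float s = 0.0f;
        for (int x = 0; x < nx; ++x)
            s += dGPU[x + i*nx] * dGPU[x + j*nx];
        rGPU[i + j*ny] = s;
    }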

Task CP9 — worth 10 points

Subdirectory: cp9, file to edit: cp9/cp.cu.

Using CUDA, write an efficient working solution that solves the problem on the GPU.

In this task, you are encouraged to use single-precision floating-point numbers (type float) in all calculations. The tester is more permissive — you can have an absolute error of at most 0.0001. As long as make test does not complain, you are doing fine in terms of the numerical accuracy.

Examples of sufficient performance on the classroom computers:

For reference, here is an example of what can be achieved with the classroom computers using GPUs:

    cp      4000    100     0.052
    cp      4000    100     0.020
    cp      4000    1000    0.125
    cp      4000    1000    0.095
    cp      4000    2000    0.209
    cp      4000    2000    0.179
    cp      4000    4000    0.383
    cp      4000    4000    0.349

Note that the benchmarking program runs each test twice. The first invocation includes one-time overhead related to the CUDA API (approx. 30 ms).

Hints

Task CP1

One reasonable way to calculate the correlations is the following:

1. Normalise the input rows so that each row has arithmetic mean 0: subtract from each element the mean of its row.
2. Normalise the input rows further so that for each row the sum of the squares of the elements is 1: divide each element by the square root of the sum of squares of its row.
3. Let X be the normalised input matrix. Calculate the (upper triangle of the) matrix product Y = XX^T.

Now matrix Y contains all pairwise correlations. The only computationally intensive part is the computation of the matrix product; the normalisations can be done in linear time in the input size.
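For illustration, here is a straightforward (and deliberately slow) sketch of this approach; it assumes that no input row is constant, since a constant row would make the divisor zero:

    #include <cmath>
    #include <vector>

    void correlate(int ny, int nx, const float* data, float* result) {
        // Normalise each row in double precision: mean 0, sum of squares 1
        std::vector<double> normal(ny * nx);
        for (int y = 0; y < ny; ++y) {
            double mean = 0.0;
            for (int x = 0; x < nx; ++x)
                mean += data[x + y*nx];
            mean /= nx;
            double ss = 0.0;
            for (int x = 0; x < nx; ++x) {
                double v = data[x + y*nx] - mean;
                normal[x + y*nx] = v;
                ss += v * v;
            }
            double scale = 1.0 / std::sqrt(ss);
            for (int x = 0; x < nx; ++x)
                normal[x + y*nx] *= scale;
        }
        // Upper triangle of the matrix product Y = X X^T
        for (int j = 0; j < ny; ++j) {
            for (int i = j; i < ny; ++i) {
                double s = 0.0;
                for (int x = 0; x < nx; ++x)
                    s += normal[x + i*nx] * normal[x + j*nx];
                result[i + j*ny] = float(s);
            }
        }
    }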

Task CP2

In your code, you probably have nested for loops where the length of the inner loop varies. With the default schedule, #pragma omp for may not balance the work well: one thread may execute only short loops while another executes only long loops. The following directive may be helpful:

    #pragma omp for schedule(static, 1)

See the OpenMP specification for a detailed description.
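For example, applied to the dot-product loops of the CP1 sketch above, the directive could be used as follows (normal is the normalised matrix from that sketch):

    #pragma omp parallel for schedule(static, 1)
    for (int j = 0; j < ny; ++j) {
        for (int i = j; i < ny; ++i) {
            // the inner loop is longest for small j, so the round-robin
            // schedule above balances the work between the threads
            double s = 0.0;
            for (int x = 0; x < nx; ++x)
                s += normal[x + i*nx] * normal[x + j*nx];
            result[i + j*ny] = float(s);
        }
    }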

Task CP3

You can use the data type double4_t defined in common/vector.h; this is a vector of 4 double-precision floating point numbers. To allocate memory for an array of double4_t vectors with the correct alignment, you can use the memory allocation routine double4_alloc from common/vector.cc.

You can first convert all of your input data (possibly after normalisation) into double4_t vectors: for each row, store the first 4 elements in one double4_t, the next 4 elements in another double4_t etc., and add some zeroes for padding if the number of columns is not a multiple of 4. See this drawing for some help with a suitable memory layout.

To find the pairwise correlations, you can then compute, for each pair of rows, a vector double4_t s where component s[k] accumulates the products from columns k, k+4, k+8, … Finally, calculate the sum s[0] + s[1] + s[2] + s[3].
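For example, the dot product of rows i and j might then look as follows. This sketch assumes that the padded, normalised rows are stored in an array vd of double4_t vectors, with na = (nx + 3) / 4 vectors per row:

    double4_t s = {0.0, 0.0, 0.0, 0.0};
    for (int k = 0; k < na; ++k)
        s += vd[na*i + k] * vd[na*j + k];   // 4 parallel multiply-adds
    result[i + j*ny] = float(s[0] + s[1] + s[2] + s[3]);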

Task CP4

Try to improve the arithmetic intensity of your algorithm: do more multiplications per memory access.

For example, instead of considering individual rows of the input matrix, you can consider blocks of 3 rows. For each pair of such blocks, compute all 3 × 3 dot products in parallel. The number of arithmetic operations remains the same, but the amount of data transferred from memory is reduced: you now read 6 rows of input to produce 9 elements of output, while the naive solution reads 2 rows of input to produce 1 element of output; this is a factor-3 improvement.
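For example, the computation for one pair of 3-row blocks might look like the following sketch (here i and j are block indices, ny is assumed to be a multiple of 3, and normal is the normalised matrix; in the diagonal blocks the sketch also fills a few lower-triangle entries, which the specification leaves undefined, so that is harmless):

    double s[3][3] = {};
    for (int x = 0; x < nx; ++x) {
        double va[3], vb[3];
        for (int a = 0; a < 3; ++a)
            va[a] = normal[x + (3*i + a)*nx];
        for (int b = 0; b < 3; ++b)
            vb[b] = normal[x + (3*j + b)*nx];
        // 9 multiply-adds per 6 memory reads
        for (int a = 0; a < 3; ++a)
            for (int b = 0; b < 3; ++b)
                s[a][b] += va[a] * vb[b];
    }
    for (int a = 0; a < 3; ++a)
        for (int b = 0; b < 3; ++b)
            result[(3*i + a) + (3*j + b)*ny] = float(s[a][b]);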

Of course if you choose blocks that are too large, you will have difficulties keeping all intermediate results in CPU registers. Experiment with different block sizes.

Task CP9

You want to do cache blocking on two levels: use the shared memory to reuse data within a block of threads, and use the registers to reuse data within a single thread.

Here is one concrete approach that you can try to use (see this drawing for an illustration):

Whenever you have problems with the performance, check the following numbers:

For fine-tuning, you may also want to pay attention to the memory access patterns. If you need to copy data from the global memory to the shared memory, it is helpful to organise the reads so that, within a block, thread number i accesses memory slot number x + i, for some value x that depends only on the block index and not on the thread index.
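For example, a coalesced copy into shared memory might look like the following sketch; the block size 64 and the names buf and x are only for illustration:

    __shared__ float buf[64];
    int i = threadIdx.x;
    // x depends only on the block index, so thread i reads slot x + i:
    // consecutive threads access consecutive memory locations
    buf[i] = data[x + i];
    __syncthreads();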

It may also be useful to check that you are not running out of GPU registers. The following command may be helpful there:

    cuobjdump -sass -arch sm_30 cp.o | grep STL

If it prints many lines of assembly code, the compiler ran out of GPU registers and some of the local variables had to be spilled to local memory, which resides in the slow off-chip memory.