In this task, we are given m input vectors, each with n numbers. Our task is to calculate the correlations between all pairs of input vectors.
Input: vector i = row i of the image.
Output: red pixel at (i, j) = positive correlation between rows i and j, blue pixel = negative correlation.
We will implement a command line tool
pngcorrelate that you can use to find pairwise correlations between pixel rows in PNG images. It outputs the result as another PNG image, similar to what you see in the above examples. The program accepts images with colours, but automatically converts them to greyscale before analysing correlations.
The program can be used as follows:
./pngcorrelate INPUT OUTPUT1 OUTPUT2
where INPUT is the input file, OUTPUT1 will be a greyscale version of the input file, and OUTPUT2 will show the pairwise correlations.
For example, if you have a PNG file called example.png in the current directory, you can find the pairwise correlations as follows:
./pngcorrelate example.png result1.png result2.png
You can find an almost-working implementation in our shared code repository, in the subdirectories
cp*. There is one directory per task; the templates are almost identical, with minor changes in e.g. benchmarks.
The only part that is missing is the implementation of the subroutine
correlate() in file
cp*/cp.cc. This is the function that does the actual calculations. See
cp*/cp.h for the description of the interface.
Once you have implemented the function correctly, you should be able to run
make to compile your code,
make test to check that it works correctly (at least for some test cases) and
make benchmark to see how well it performs.
You need to implement the following function:
void correlate(int ny, int nx, const float* data, float* result)
data is a pointer to the input matrix, with
ny rows and
nx columns. For all
0 <= y < ny and
0 <= x < nx, the element at row
y and column
x is stored in
data[x + y*nx].
The function has to solve the following task: for all
0 <= j <= i < ny, calculate the correlation coefficient between row
i of the input matrix and row
j of the input matrix, and store the result in
result[i + j*ny].
Note that the correlations are symmetric, so we will only compute the upper triangle of the result matrix. You can leave the lower triangle
i < j undefined.
result are already allocated by whoever calls this function; you do not need to do any memory management.
The input and output are given as single-precision floating point numbers (type
float). However, if you calculate the correlations in a naive manner using single-precision floating point numbers, it will easily produce very large rounding errors.
The test suite allows for an absolute error of at most 0.00001 (for CP9, at most 0.0001), and it tries out inputs with up to 10000 rows or columns. Your code has to pass all tests.
In tasks CP1–CP3 and CP8 you must use double-precision floating point numbers (type
double) for all intermediate results, and only round the final result to single precision. This should make it easy to avoid any issues with numerical stability.
In tasks CP4 and CP9, however, you are encouraged to take advantage of single-precision floating point numbers to speed up calculations. Be careful.
Please do not try to use
std::valarray in your code.
Subdirectory: cp1, file to edit: cp1/cp.cc.
Start with the template solution and complete it so that it works correctly.
Do not use multiple threads or vector instructions yet. Advanced cache blocking is not needed, either; just make sure that you process data elements in a reasonable order. Use double-precision floating point numbers (type
double) for all intermediate results.
If you run
make benchmark and you get running times in the following ballpark, you can be happy:
cp 4000 100 0.777 cp 4000 1000 7.685 cp 4000 2000 15.400 cp 4000 4000 30.721
Subdirectory: cp2, file to edit: cp2/cp.cc.
Parallelise your solution to task CP1 with OpenMP.
Subdirectory: cp3, file to edit: cp3/cp.cc.
Use vector instructions to improve the performance of your solution to task CP2.
Subdirectory: cp4, file to edit: cp4/cp.cc.
Use cache blocking to improve the memory layout for your solution to task CP3. You are also encouraged to use single-precision floating point numbers to speed up calculations.
For reference, here is an example of what can be achieved with the classroom computers using double-precision floating point numbers:
cp 4000 100 0.041 cp 4000 1000 0.267 cp 4000 2000 0.546 cp 4000 4000 1.085
And here is an example of a solution that uses single-precision floating point numbers:
cp 4000 100 0.034 cp 4000 1000 0.144 cp 4000 2000 0.270 cp 4000 4000 0.541
Subdirectory: cp5 (no template provided).
How close your solution to CP4 is to the theoretical and practical capabilities of the hardware?
Compare the above results to each other. Is there still some room for improvements in your code? Is the bottleneck now in arithmetic operations, memory lookups, or elsewhere?
If you have access to a computer with a Haswell CPU, compile your solution to CP4 there. Study the assembly code (see README.md for instructions). Is the compiler generating FMA instructions? If not, can you rewrite your code and/or adjust the compiler settings so that you benefit from the FMA instructions? Measure the performance of your code and compare it with the theoretical limitations of Haswell CPUs.
Your submission has to contain at least a written report, as a PDF file with name
Subdirectory: cp6 (no template provided).
Use your correlated pairs implementation (e.g., CP4) to detect similarities and dissimilarities between various written languages.
Here “LANGUAGE” is the ISO 639-3 language code, “NGRAM” is a combination of letters that has occurred at least once in the words written in this language, and “COUNT” is the total number of such occurrences in our text collection. File “ngram.txt” contains all n-grams, i.e., combinations that consist of exactly n letters, plus some shorter letter combinations that are a word by itself. All NGRAMs consist of plain ASCII lower-case letters a, b, …, z.
These files are derived from Tatoeba, which is an open data set of sentences written in numerous languages — the original data set is freely available for download under the CC-BY license. We have chosen 44 languages, taken all sentences written in these languages, filtered the data to remove garbage, transliterated all text to plain ASCII equivalents (throwing away e.g. all diacritical marks), and calculated the n-gram counts.
In this task you will write a program that can read any of the four data sets, as well as any other data sets that are given in the same form. Name your program “ngrams” and design it so that you can run it as follows:
./ngrams N FILE
./ngrams 3 3gram.txt
Your program has to do at least the following:
You are also encouraged (but not required) to perform further analysis on the correlation matrix and, e.g., try to find a good clustering of the languages based on their similarities.
The output of your algorithm can be in any form that is convenient for human consumption. Please also submit a report that contains a couple of words of analysis regarding how meaningful the results are. For example, are the isolated languages and isolated language pairs that your program identifies meaningful from the perspective of what you know (or can find out) about these languages?
Your submission has to contain the source code of your program, plus a written report, as a PDF file with name
Subdirectory: cp7 (no template provided).
Try to use Strassen's algorithm to speed up matrix multiplications. Can you improve on your solution to CP4 this way?
Your submission has to contain the source code of your implementation, plus a written report, as a PDF file with name
Subdirectory: cp8, file to edit: cp8/cp.cu.
Using CUDA, write a working solution that solves the problem on the GPU.
The interface is identical to what we have had in the previous tasks. The only difference is that you will write your code in
cp.cu instead of
cp.cc. This file will be compiled and linked with NVCC. Thus you can use both normal C++ code and CUDA constructions in this file.
Your implementation does not need to be efficient. However, it has to work correctly and it has to do most of the computation on GPU. Use double-precision floating point numbers (type
double) for all intermediate results.
Subdirectory: cp9, file to edit: cp9/cp.cu.
Using CUDA, write an efficient working solution that solves the problem on the GPU. You are also encouraged to use single-precision floating point numbers to speed up calculations.
For reference, here is an example of what can be achieved with the classroom computers using GPUs:
cp 4000 100 0.052 cp 4000 100 0.020 cp 4000 1000 0.125 cp 4000 1000 0.095 cp 4000 2000 0.209 cp 4000 2000 0.179 cp 4000 4000 0.383 cp 4000 4000 0.349
Note that the benchmarking program runs all tests twice. The first invocation includes one-time overhead related to CUDA API (approx. 30ms).
One reasonable way to calculate correlations is the following:
Now matrix Y contains all pairwise correlations. The only computationally-intensive part is the computation of the matrix product; the normalisations can be done in linear time in the input size.
In your code, you probably have nested
for loops such that the length of the inner loop varies. With the default schedule,
#pragma omp for may not perform that well: one thread may execute only short loops and another thread may execute only long loops. The following directive may be helpful:
#pragma omp for schedule(static, 1)
See the OpenMP specification for a detailed description.
You can use the data type
double4_t defined in
common/vector.h; this is a vector of 4 double-precision floating point numbers. To allocate memory for an array of
double4_t vectors with the correct alignment, you can use the memory allocation routine
You can first convert all of your input data (possibly after normalisation) into
double4_t vectors: for each row, store the first 4 elements in one
double4_t, the next 4 elements in another
double4_t etc., and add some zeroes for padding if the number of columns is not a multiple of 4. See this drawing for some help with a suitable memory layout.
To find pairwise correlations, you can first calculate a vector
double4_t s where
s[i] is the result restricted to columns
i+8, … Finally, calculate the sum
s + s + s + s.
Try to improve the arithmetic intensity of your algorithm: do more multiplications per memory access.
For example, instead of considering individual rows of the input matrix, you can consider blocks of 3 rows. For each pairs of such blocks, compute all 3 × 3 dot products in parallel. The number of arithmetic operations remains the same. However, the amount of data transferred from the memory is reduced: you now only need to read 6 rows of input to produce 9 elements of output, while in the naive solution you will read 2 rows of input to produce 1 element of output; this is a factor-3 improvement.
Of course if you choose blocks that are too large, you will have difficulties keeping all intermediate results in CPU registers. Experiment with different block sizes.
For numerical stability with single-precision floating point numbers, it may be a good idea to first normalise the input matrix, and after that calculate the matrix product. If you first calculate dot products and normalise afterwards, you may easily accumulate large errors.
You want to do cache blocking on two levels:
Here is one concrete approach that you can try to use (see this drawing for an illustration):
Whenever you have problems with the performance, check the following numbers:
For fine-tuning, you may also want to pay attention to the memory access patterns. If you need to copy data from the global memory to the shared memory, it is helpful to organise reading so that within a block, thread number
i is accessing memory slot number
x+i, for some value
x that only depends on the block index and not on the thread index.
It may be also useful to check that you are not running out of GPU registers. The following command may be helpful there:
cuobjdump -sass -arch sm_30 cp.o | grep STL
If it prints lots of lines of the assembly code, then the compiler ran out of GPU registers and some of the local variables had to be kept in the main memory.