CS-E4580: Debugging your code


Debug builds

The makefiles directly support the following options that may help with debugging. Note that the grading scripts also make use of some of these.

Debugging CPU code

You can disable optimisations by building with DEBUG=1, for example:

cd mf1
make clean
make DEBUG=1
make test

You can also enable AddressSanitizer, which helps to catch many memory access errors; see below for more information. To do that, build with DEBUG=2:

cd mf1
make clean
make DEBUG=2
make test

You can also enable the C++ standard library debug mode, which helps to catch e.g. out-of-bounds accesses with standard containers:

cd mf1
make clean
make DEBUG=3
make test

You can disable modern CPU instructions by building with ARCH=1:

cd mf1
make clean
make ARCH=1
make test

Debugging GPU code

In CUDA tasks, you can also use make DEBUG=1 for debug builds. This will compile with nvcc -g -G so that you can easily debug your code with cuda-gdb.

Other debugging options are not supported for CUDA code.

Remember to clean up!

After experimenting with the debug builds, please always remember to run make clean afterwards.

Correctness issues

In general, always try to isolate the problem first! Figure out the smallest, simplest code that still does something unexpected. Remember that you are not limited to using our makefiles and test scripts; you can always develop e.g. your own unit tests.

Also make sure that you are using the Maari-A computers. You do not need to be physically there; you can use ssh.

Strange bugs, segmentation faults, etc.?

First try AddressSanitizer — see below for more information.

Still unexplained segmentation faults?

It might be a stack overflow. Unfortunately, a stack overflow is typically reported as a segmentation fault. On the classroom computers, the stack size limit is approx. 8 MB. Do not allocate large arrays on the stack; if you need storage for megabytes of data, use the heap.

Different behaviour with vs. without make?

It might be a stack overflow (see above). GNU Make changes the stack size limits as a side effect, so you might see a stack overflow (segmentation fault) when you run your code directly, yet the same code might work fine when you run it with make.

Random results? Strange results?

You might be reading wrong parts of the memory. Try AddressSanitizer — see below for more information.

You might be reading memory that is not initialised. Try malloc debugging options — see below for more information.

My CUDA code does not seem to work at all?

Check for errors. Wrap all CUDA API calls in error-checking macros, and also check for errors after each kernel launch. For example, you can define a macro like this:

#define CHECK_CUDA_ERROR(call) do { \
    cudaError_t result_ = (call); \
    if (result_ != cudaSuccess) { \
        fprintf(stderr, #call " failed: %s\n", \
                cudaGetErrorString(result_)); \
        exit(1); \
    } \
} while(0)

And use it like this:

CHECK_CUDA_ERROR(cudaMalloc((void**)&x, n));
CHECK_CUDA_ERROR(cudaMalloc((void**)&y, n));
kernel<<<dimGrid, dimBlock>>>(params);
CHECK_CUDA_ERROR(cudaGetLastError());

Debugging tools

Using AddressSanitizer — and understanding its output

To enable AddressSanitizer, run make clean and make DEBUG=2.

Here is a quick tutorial that hopefully helps. I am using a buggy implementation of MF1 as an example. I have included only relevant parts of the messages:

$ make DEBUG=2 test
... ERROR: AddressSanitizer: stack-buffer-overflow on ...
READ of size 4 at ...
    #0 0x4041c4 in ... mf.cc:32
    #5 0x401bd8 in ... mf-test ...
Address ... is located ... at offset 36 ...
  This frame has 2 object(s):
    [32, 36) 'in'

So the program mf-test crashed because we were trying to read at an address that is outside the bounds of any array. Furthermore, it happened to be right after the end of the array in, so probably we were simply trying to read beyond the end of the array. This happened in the code at line 32 of mf.cc, which in my case contained:

float v = in[i + j * nx];

Now it is easy to figure out precisely what went wrong; it turns out that the index i was one too large in certain cases.

Debugging bugs related to uninitialised memory

In C and C++, memory allocation functions typically do not guarantee that memory is initialised with zeros. However, it is easy to forget to initialise newly allocated memory, and in many cases your program may accidentally work correctly, as newly allocated memory often happens to contain all zeros. To better detect bugs related to uninitialised memory on Linux, run the tests with the following command:

make testdebug

This will set the environment variable MALLOC_PERTURB_ to the value 191; this is convenient, as glibc will then fill newly allocated memory with the complement byte, 64. If you interpret such byte patterns as doubles or floats, you will get reasonable values (other choices might accidentally hide some errors if the resulting values were, e.g., very small).

GDB quick start

We will use MF1 here as an example. One part of its test suite is a small program called mf-test, which calls the mf function with some small inputs. Let us compile it with debug options and open it in the debugger, using GDB with its text user interface:

$ make clean
$ make DEBUG=1
$ gdb -tui mf-test

Then use command b to set a breakpoint in function mf and start the program with command r; it will execute until it reaches function mf:

(gdb) b mf
Breakpoint 1 at ...: file mf.cc, line ...
(gdb) r
Starting program: .../mf-test
Breakpoint 1 ...

You should have a split-screen text user interface, with the upper half of the screen showing your source code and the lower half showing the GDB console. You can scroll the source code with arrow keys; highlighting shows which line will be executed next. You can now e.g. execute code line by line with command n:

(gdb) n
(gdb) n

At any point you can also show the contents of any variable with command p:

(gdb) p ny
... = 1
(gdb) p in[0]
... = 0

We can continue execution until we reach the breakpoint again with c:

(gdb) c

Breakpoint 1 ...

Other highly useful commands include bt for showing the stack backtrace (who called us), and q for quitting GDB. See the manual for more information; the split-screen text user interface is documented in section TUI.

Performance issues

Please always read the task-specific hints first!

Always try to understand which part of the code is the bottleneck, and why. See below for some tools and techniques that can help you to answer these questions.

Poor performance, strange benchmarks?

Make sure there is no other load on the machine that you use for benchmarking. Try uptime and top to see the current load and who is running what there. To find a computer with a low load, see ppc-helpers.

Did you remember to run make clean after experimenting with debug builds?

Better code seems to run slower?

Read the assembly code of the relevant part; see below for more details.

Here is one example of a seemingly counterintuitive issue you may encounter: branch predication vs. prediction.

Performance analysis tools

perf quick start

Here we use a CP4 implementation as an example. To get a number of useful statistics related to the performance of your implementation, try to run e.g. this command:

perf stat -d ./cp-benchmark 4000 4000 10

Or, for some further details, try this:

perf stat -e task-clock,cycles,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-misses,LLC-loads,cache-references,cache-misses ./cp-benchmark 4000 4000 10

See the perf manual for more information on the usage. To get more meaningful results, it is often a good idea to switch off hyperthreading. Then, for example, the number of instructions per cycle per thread is much easier to interpret.

In the perf output, numbers such as the instructions per cycle, the branch miss rate, and the cache miss rate are often very helpful.

perf record & report

To identify the most critical parts of your code, you can also try e.g.:

perf record ./cp-benchmark 4000 4000 10

This will create a data file perf.data that you can then study with a simple text-based user interface:

perf report

Select the relevant function, select “Annotate”, and it should take you directly to the most performance-critical part in the assembly code. (You can ignore warnings related to the lack of access to kernel address maps.)

Switch off hyperthreading

You can effectively disable hyperthreading by running your code with 4 threads, one thread per physical core, e.g. as follows:

OMP_PROC_BIND=true OMP_NUM_THREADS=4 ./cp-benchmark 4000 4000 10

Naturally you can also combine this with perf:

OMP_PROC_BIND=true OMP_NUM_THREADS=4 perf stat -e task-clock,cycles,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-misses,LLC-loads,cache-references,cache-misses ./cp-benchmark 4000 4000 10

Reading assembly code

The makefiles provide two ways of outputting the assembly code produced by the compiler. For example, if you want to see the compiled version of mf.cc, try the following commands:

make mf.asm1
make mf.asm2

Then open the file mf.asm1 or mf.asm2 in your text editor. Both of these try to produce somewhat readable assembly code, but it may depend on your luck which of these is more readable in your case.

Usually the assembly code is very long, and the most challenging part is finding the relevant part of it quickly. One trick is to simply search for the relevant instruction. For example, you can search for vmulps to find all places in which you are multiplying float8_t vectors.

Once you have identified what instructions are executed in the performance-critical part of your code, refer to the Instruction tables to see what is the throughput and latency of those instructions. Recall that “Ivy Bridge” is the relevant CPU architecture in our case.


Here is a simple example that shows how you can measure how long each part of the code takes:

#include <sys/time.h>
#include <iostream>

static double get_time() {
    struct timeval tm;
    gettimeofday(&tm, NULL);
    return static_cast<double>(tm.tv_sec) + static_cast<double>(tm.tv_usec) / 1E6;
}

static double calculate1() {...}
static double calculate2() {...}

int main() {
    double t0 = get_time();
    double v1 = calculate1();
    double t1 = get_time();
    std::cout << "calculate1() took " << t1 - t0 << " seconds" << std::endl;
    double v2 = calculate2();
    double t2 = get_time();
    std::cout << "calculate2() took " << t2 - t1 << " seconds" << std::endl;

    // And we are also printing out the values that we calculated to make
    // sure the compiler can't simply optimise the relevant part away
    std::cout << v1 << " " << v2 << std::endl;
}

Please remember to remove (or comment out) all debugging printout and benchmarking code before you submit your final version for grading!

Profiling CUDA code

You can use nvprof to quickly check how long each CUDA kernel takes. For example, try this with your CP9 solution:

nvprof ./cp-benchmark 4000 4000 10

It will run the program as usual, but also print a lot of additional information; the first part of the output is relevant here:

==27273== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 94.37%  4.38112s        10  438.11ms  436.83ms  439.17ms  dotprod(int, int, float const *, float*)
  2.12%  98.621ms        10  9.8621ms  9.8375ms  9.9301ms  [CUDA memcpy HtoD]
  2.10%  97.332ms        10  9.7332ms  9.7240ms  9.7427ms  [CUDA memcpy DtoH]
  1.41%  65.331ms        10  6.5331ms  6.4901ms  6.5679ms  normalise(int, int, float*)

==27273== API calls:

In this implementation, the key operations are the dotprod kernel, the host-to-device and device-to-host memory copies, and the normalise kernel; the dotprod kernel clearly dominates. We can compare these numbers with the total running time of approx. 465 ms.