For each colour component, the value of each pixel is the median of all pixel values within a sliding window of dimensions (2*k*+1) × (2*k*+1).

We will implement a command line tool `pngmf`

that you can use to do median filtering in PNG files.

The program can be used as follows:

./pngmf WINDOW INPUT OUTPUT1 OUTPUT2

where WINDOW is the parameter *k* that determines the size of the sliding window, INPUT is the input file, OUTPUT1 will be the result of median filtering, and OUTPUT2 will be a file that shows the differences between INPUT and OUTPUT1.

For example, if you have a PNG file called *example.png* in the current directory, you can do median filtering with parameter *k* = 10 as follows:

./pngmf 10 example.png result1.png result2.png

You can find an almost-working implementation in our shared code repository, in the subdirectories `mf*`

. There is one directory per task; the templates are almost identical, with minor changes in e.g. benchmarks.

The only part that is missing is the implementation of the subroutine `mf()`

in file `mf*/mf.cc`

. This is the function that actually does median filtering. See `mf*/mf.h`

for the description of the interface.

Once you have implemented the function correctly, you should be able to run `make`

to compile your code, `make test`

to check that it works correctly (at least for some test cases) and `make benchmark`

to see how well it performs.

You need to implement the following function:

void mf(int ny, int nx, int hy, int hx, const float* in, float* out)

Here `in`

and `out`

are pointers to `float`

arrays, each of them contains `ny*nx`

elements (these will be floating point numbers between 0.0 and 1.0). The arrays are already allocated by whoever calls this function; you do not need to do any memory management.

The original value of pixel (x,y) is given in `in[x + nx*y]`

and its new value will be stored in `out[x + nx*y]`

.

In the output, pixel (x,y) will contain the median of all pixels with coordinates (i,j) where

`0 <= i < nx`

,`0 <= j < ny`

,`x - hx <= i <= x + hx`

,`y - hy <= j <= y + hy`

.

This is the *sliding window*. Note that the window will contain at most `(2*hx+1) * (2*hy+1)`

pixels, but near the boundaries there will be fewer pixels.

For our purposes, the median of the vector **a** = (*a*_{1}, *a*_{2}, …, *a*_{n}) is defined as follows:

- Let
*x*_{1},*x*_{2}, …,*x*_{n}be the values of**a**sorted in a non-decreasing order. - If
*n*= 2*k*+ 1, then the median of**a**is*x*_{k+1}. - If
*n*= 2*k*, then the median of**a**is (*x*_{k}+*x*_{k+1})/2.

Check the course home page to see which tasks you are supposed to solve each week.

Subdirectory: **mf1**, file to edit: **mf1/mf.cc**.

Start with the template solution and complete it so that it works correctly. Your implementation has to be correct, reasonably efficient, and written in C or C++.

You can use a naive algorithm that computes the median separately for each pixel, with a linear-time algorithm. The time complexity will be *O*(*nk*^{2}) for *n* pixels and for a window of size *k* × *k*.

A single-threaded solution is sufficient. Study the performance of your implementation for different input sizes and different window sizes.

If you run `make benchmark`

and you get running times in the following ballpark, you can be happy:

mf 1000 1000 1 0.115 mf 1000 1000 2 0.322 mf 1000 1000 5 1.222 mf 1000 1000 10 3.995

Subdirectory: **mf2**, file to edit: **mf2/mf.cc**.

Take your solution to task MF1 and parallelise it with OpenMP. Compare the performance with your solution MF1, for different numbers of threads.

Subdirectory: **mf3**, file to edit: **mf3/mf.cc**.

Make your median filter implementation as efficient as possible, also for large window sizes. Here is an example of what can be achieved with the classroom computers:

$ make benchmark2 ... mf 4000 4000 10 0.873 mf 4000 4000 20 1.396 mf 4000 4000 50 2.680 mf 4000 4000 100 4.747

Your code has to run on the CPU (i.e., do not use CUDA), but otherwise you are encouraged to use all techniques that you can imagine. See below for hints.

Subdirectory: **mf4**, file to edit: **mf4/mf.cu**.

Use CUDA to implement a median filter for GPU.

The interface is identical to what we have had in the previous tasks. The only difference is that you will write your code in `mf.cu`

instead of `mf.cc`

. This file will be compiled and linked with NVCC. Thus you can use both normal C++ code and CUDA constructions in this file.

The command `make test`

first runs the test program `mf-test`

. If there are errors, the test program will also print the line number of the test that failed. You can then easily find the relevant part in `mf-test.cc`

, and see which of the tests failed.

If the performance is poor, make sure that you use a fast selection algorithm to find the median. Do not sort; quickselect and other good selection algorithms are much faster than sorting, and there is already a good implementation in the C++ standard library, so you do not need to reinvent it.

Here are some examples of possible techniques that you can use to solve task MF3:

- Partition the image in smaller, partially overlapping
**blocks**. For example, if your windows size is 21 × 21, a reasonable block size might be 50 × 50. Solve the median filter problem separately for each block; place the blocks so that each output pixel comes from exactly one block. Be careful with the boundaries. - Within each block, sort the input data and replace the original values by their ordinals, breaking ties arbitrarily — for example, if the input data was (1.0, 1.0, 0.1, 0.9, 0.5, 0.2, 1.0), replace it with (4, 5, 0, 3, 2, 1, 6). Now you have a much simpler task: instead of finding a median of floating point numbers, you only need to find the median of small,
**distinct**integers. - You can now represent the sliding window as a
**bit vector**: for example, if your block size is 50 × 50, then your sliding window can be represented as a bit vector with 2500 bits. Do not recompute the bit vector from scratch for every window position — if you move the window from (x,y) to (x+1,y), you only need to reset 2*k*+1 bits and set 2*k*+1 bits. - Exploit
**bit-parallelism**and the special CPU instructions that help with bit manipulation. To find the median from the bit vector efficiently, you may want to have a look at e.g. compiler intrinsics __builtin_popcountll (POPCNT instruction), __builtin_ctzll (TZCNT instruction), and _pdep_u64 (PDEP instruction). Some caching of pre-computed values may be helpful, too. - Pay attention to the
**locality**of memory references. For example, does it make more sense to slide the window from left to right or from top to bottom? - You can use
**OpenMP**to process multiple blocks in parallel.