Find the best way to partition the given figure in two parts: a monochromatic rectangle and a monochromatic background. The objective is to minimise the sum of squared errors.

We will implement a command line tool `pngsegment`

that you can use to do image segmentation in PNG files.

The program can be used as follows:

./pngsegment INPUT OUTPUT1 OUTPUT2

where INPUT is the input file and OUTPUT1 and OUTPUT2 will show the segmentation: OUTPUT1 will draw the approximation of the image as a monochromatic rectangle and a monochromatic background, while OUTPUT2 will show the location of the rectangle overlaid with the original figure.

For example, if you have a PNG file called *example.png* in the current directory, you can do image segmentation as follows:

./pngsegment example.png result1.png result2.png

You can find an almost-working implementation in our shared code repository, in the subdirectories `is*`

. There is one directory per task; the templates are identical.

The only part that is missing is the implementation of the subroutine `segment()`

in file `is*/is.cc`

. This is the function that actually does segmentation. See `is*/is.h`

for the description of the interface.

Once you have implemented the function correctly, you should be able to run `make`

to compile your code, `make test`

to check that it works correctly (at least for some test cases) and `make benchmark`

to see how well it performs.

You need to implement the following function:

Result segment(int ny, int nx, const float* data)

Here `data`

is a colour image with `ny*nx`

pixels, and each pixel consists of three colour components, red, green, and blue. In total, there are `ny*nx*3`

floating point numbers in the array `data`

.

The colour components are numbered `0 <= c < 3`

, x coordinates are numbered `0 <= x < nx`

, y coordinates are numbered `0 <= y < ny`

, and the value of this colour component is stored in `data[c + 3 * x + 3 * nx * y]`

.

The function has to return an instance of the following structure that indicates the optimal segmentation:

struct Result { int y0; int x0; int y1; int x1; float outer[3]; float inner[3]; };

The first four fields indicate the **location** of the rectangle. The upper left corner of the rectangle is at coordinates (`x0, y0`

), and the lower right corner is at coordinates (`x1-1, y1-1`

). That is, the width of the rectangle is `x1-x0`

pixels and the height is `y1-y0`

pixels. The coordinates have to satisfy `0 <= y0 < y1 <= ny`

and `0 <= x0 < x1 <= nx`

.

The last two fields indicate the **colour** of the background and the rectangle. Field `outer`

contains the three colour components of the background and field `inner`

contains the three colour components of the rectangle.

For each pixel (`x,y`

) and colour component `c`

, we define the error `error(y,x,c)`

as follows:

- Let
`colour(y,x,c) = data[c + 3 * x + 3 * nx * y]`

. - If (
`x,y`

) is located outside the rectangle:`error(y,x,c) = outer[c] - colour(y,x,c)`

. - If (
`x,y`

) is located inside the rectangle:`error(y,x,c) = inner[c] - colour(y,x,c)`

.

The total **cost** of the segmentation is the **sum of squared errors**, that is, the sum of `error(y,x,c) * error(y,x,c)`

over all `0 <= c < 3`

and `0 <= x < nx`

and `0 <= y < ny`

.

Your task is to find a segmentation that minimises the total cost.

The input and output are given as single-precision floating point numbers (type `float`

). However, please use double-precision floating point numbers (type `double`

) for all intermediate results, and only round the final result to single precision.

Check the course home page to see which tasks you are supposed to solve each week.

Subdirectory: **is1**, file to edit: **is1/is.cc**.

Implement an algorithm for image segmentation. Use the naive algorithm that tries out all possible locations `0 <= y0 < y1 <= ny`

and `0 <= x0 < x1 <= nx`

for the rectangle and finds the best one.

However, your implementation should be reasonably efficient. In particular, make sure that for each possible location of the rectangle you only need to perform O(1) operations to evaluate how good this position is. That is, the total complexity for an *n* × *n* figure has to be O(*n*^{4}). See the hints for suggestions.

You are supposed to use both multiple threads and vector instructions to parallelise your code. You can do it, for example, as follows:

- Multiple threads with OpenMP: Different threads try out different candidate locations.
- Vector instructions: Store the colour image so that each pixel is a 4-element vector, with 3 colour components and a zero for padding. This way you can do all calculations for all colour components in parallel.

You can also try out different memory layouts and different ways to exploit vector instructions. For example, instead of storing 1 pixel per 1 vector, you can try to store 4 pixels per 3 vectors (one vector per one colour). However, please use double-precision floating point numbers in all arithmetic operations.

Subdirectory: **is2**, file to edit: **is2/is.cc**.

Design a more efficient algorithm that (at least in typical cases) does not need to try out all possible locations of the rectangle.

Subdirectory: **is3**, file to edit: **is3/is.cu**.

Use CUDA to implement an efficient algorithm that does image segmentation on GPU. You can use the naive approach from IS1, but make sure that it makes an efficient use of the GPU.

It is important that you can very efficiently calculate the cost of placing a rectangle at certain position. A little bit of math is needed to do this.

Let $P$ be the set of all $(x,y)$ coordinates. For each pixel $p \in P$, let $v_p$ be its colour in the input. That is, $v_p$ is a vector of size 3, where $v_{p,0}$ is the red component, $v_{p,1}$ is the green component, and $v_{p,2}$ is the blue component.

Let us partition $P$ in two disjoint sets, $X \cup Y = P$ and $X \cap Y = \emptyset$. In our application, $X$ is the “rectangle” and $Y$ is the “background”, but this is not yet important here.

We choose colour $a$ for set $X$ and colour $b$ for set $Y$. Again, these are vectors of size 3.

For each pixel $p \in X$ and colour component $c$ the error is $a_c - v_{p,c}$, and for each pixel $p \in Y$ and colour component $c$ the error is $b_c - v_{p,c}$. The total cost over all pixels and all colour components is therefore
$$
f(X,Y,a,b)
= \sum_c \sum_{p \in X} \left(a_c - v_{p,c}\right)^2 + \sum_c \sum_{p \in Y} \left(b_c - v_{p,c}\right)^2,
$$
and our task is to find $(X,Y,a,b)$ that **minimises the cost** $f(X,Y,a,b)$. Equally well we can **maximise the utility**
$$
g(X,Y,a,b) = - f(X,Y,a,b) + \sum_c \sum_{p \in P} v_{p,c}^2
$$

For any $Z \subseteq P$, let us introduce the shorthand notation $$ v_{Z,c} = \sum_{p \in Z} v_{p,c}. $$ We can rewrite the cost as follows: $$ \begin{split} f(X,Y,a,b) &= \sum_c \left( \sum_{p \in X} a_c^2 - \sum_{p \in X} 2a_c v_{p,c} + \sum_{p \in X} v_{p,c}^2 + \sum_{p \in Y} b_c^2 - \sum_{p \in Y} 2b_c v_{p,c} + \sum_{p \in Y} v_{p,c}^2 \right) \\ &= \sum_c \left( |X| \cdot a_c^2 + |Y| \cdot b_c^2 - 2a_c v_{X,c} - 2b_c v_{Y,c} + \sum_{p \in P} v_{p,c}^2 \right). \end{split} $$ Therefore the utility that we want to maximise is simply $$ g(X,Y,a,b) = \sum_c \left( 2a_c v_{X,c} + 2b_c v_{Y,c} - |X| \cdot a_c^2 - |Y| \cdot b_c^2 \right). $$

To maximise $g(X,Y,a,b)$, we try all possible partitions $(X,Y)$, and for each partition we find the best colours $(a^*,b^*)$ that maximise $g(X,Y,a^*,b^*)$. By differentiation, it is easy to see that the optimal colours are the arithmetic means $$ a^*_c = \frac{v_{X,c}}{|X|}, \quad b^*_c = \frac{v_{Y,c}}{|Y|}. $$ We use the shorthand $h(X,Y) = g(X,Y,a^*,b^*)$ for the utility of partition $(X,Y)$ for the best choice of colours. Note that this function takes the following very simple form: $$ h(X,Y) = \sum_c \left( \frac{v_{X,c}^2}{|X|} + \frac{v_{Y,c}^2}{|Y|} \right). $$

Therefore to find an optimal segmentation, it is sufficient to check all possible rectangles $X$ and calculate $h(X,Y)$. To do that, we only need to know $v_{X,c}$ and $v_{Y,c}$. Naturally, $v_{Y,c} = v_{P,c} - v_{X,c}$. Hence to calculate $h(X,Y)$ efficiently, we only need to know how to calculate $v_{X,c}$ efficiently, for any rectangle $X$.

Note that $v_{X,c}$ is simply the sum of colour component $c$ for all pixels in rectangle $X$. With a little bit of preprocessing, we can calculate this sum in constant time.

Let us write $s(x_0,y_0,x_1,y_1,c) = v_{R,c}$, where $R = [x_0,x_1-1] \times [y_0,y_1-1]$ is the rectangle with the corners $(x_0,y_0)$ and $(x_1,y_1)$. Then by the inclusion–exclusion principle (see the illustration below), $$ \begin{split} s(x_0,y_0,x_1,y_1,c) {}={} &s(0,0,x_1,y_1,c) \\ {}-{} &s(0,0,x_0,y_1,c) \\ {}-{} &s(0,0,x_1,y_0,c) \\ {}+{} &s(0,0,x_0,y_0,c). \end{split} $$ So we only need to precompute $s(0,0,x,y,c)$ for all $x$, $y$, and $c$. With this precomputed information, we can then calculate $h(X,Y)$ for any rectangle $X$ in constant time. Vector types will help us to do all work conveniently for all colour components in parallel.

For further performance improvements, it may be helpful to organise the loops so that the outer loop tries out different sizes of the rectangle, and the inner loop tries all possible positions of the rectangle. Then $1/|X|$ and $1/|Y|$ remain constant in the inner loop.

If you are comfortable with studying the assembly code produced by the compiler, these observations related to branch predication may also be helpful.

Here is one approach: First use a coarse grid, with e.g. 10 × 10 pixels per grid cell. Then try all possible ways to place the rectangle in this coarse grid. Each coarse location represents a set of approx. 10000 fine-grained locations (so it is much faster to try out all possible coarse locations). For each coarse location, calculate the following two estimates (efficiently, in constant time):

**An upper bound:**at most how much is the cost if I place the rectangle in some of the fine-grained locations that are represented by this coarse-grained location?**A lower bound:**at least how much is the cost if I place the rectangle in some of the fine-grained locations that are represented by this coarse-grained location?

After these calculations, you can hopefully rule out a majority of coarse-grained locations: you know that there are other locations where the cost is **at most** some value *s*, so you can then ignore all coarse-grained locations where the cost is **larger than** *s*.

Finally, zoom in to those coarse-grained locations that you have not yet ruled out.

You can also apply the grid idea recursively.