Implement a parallel sorting algorithm that outperforms single-threaded `std::sort`.

We will implement a benchmark tool `so-test` that performs the following steps:

- Generates test data.
- Sorts it with your sorting algorithm, and prints the running time.
- Sorts it with `std::sort`, and prints the running time.
- Verifies that the results are identical.

The benchmark tool can be used as follows:

```
./so-test N
```

where `N` is the number of elements in the test data. The tool will run 5 tests, all with the same number of elements but with different input data distributions: 0 = large random elements, 1 = small random elements, 2 = constant input, 3 = increasing values, and 4 = decreasing values.

If you run, for example,

```
./so-test 100000000
```

you will get an output that looks similar to the following:

```
so 0 100000000 2.135 7.957
so 1 100000000 1.574 2.325
so 2 100000000 1.357 1.735
so 3 100000000 1.079 1.713
so 4 100000000 1.136 1.233
```

Here column 2 identifies the input data distribution; the first row is the most interesting case, with large random elements. Column 4 is the running time of your implementation and column 5 is the running time of `std::sort`.

You can find a working implementation in our shared code repository, in the subdirectories `so*`. There is one directory per task.

You only need to implement the subroutine `psort()` in file `so*/so.cc`. See `so*/so.h` for the description of the interface.

Once you have implemented the function correctly, you should be able to run `make` to compile your code, `make test` to check that it works correctly (at least for some test cases), and `make benchmark` to see how well it performs.

You need to implement the following function:

```cpp
void psort(int n, data_t* data)
```

Here `n` is the size of the input array `data`. All input elements are of type `data_t`, which is `unsigned long long`, i.e., a 64-bit unsigned integer.

Your function should have the same behaviour as:

```cpp
std::sort(data, data + n);
```

That is, you need to sort in place: modify the input array so that it ends up in sorted order. You are free to allocate some additional memory to keep temporary data if needed.

In your implementation, you are permitted and encouraged to use **all single-threaded features of the C++ standard library**. In particular, you can freely use the single-threaded `std::sort` as a subroutine in your code, as well as all other features of the algorithms library.
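As a sanity check of the interface, a trivial single-threaded `psort()` that simply forwards to `std::sort` is already correct (just not fast). The `data_t` typedef is repeated here only so the sketch is self-contained; `so*/so.h` remains the authoritative definition:

```cpp
#include <algorithm>

typedef unsigned long long data_t;  // matches the description above; so*/so.h is authoritative

// Correct but not parallel: a baseline to get `make test` passing
// before any parallelisation work begins.
void psort(int n, data_t* data) {
    std::sort(data, data + n);
}
```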

For multi-threading, you can use the basic primitives of OpenMP, as usual.

Check the course home page to see which tasks you are supposed to solve each week.

Subdirectory: **so1**, file to edit: **so1/so.cc**.

Implement an efficient parallel sorting algorithm, using the basic idea of **merge sort**.

If you run `make benchmark2` and you get running times in the following ballpark, you can be happy:

```
so 0 100000000 2.135 7.957
so 1 100000000 1.574 2.325
so 2 100000000 1.357 1.735
so 3 100000000 1.079 1.713
so 4 100000000 1.136 1.233
```

Subdirectory: **so2**, file to edit: **so2/so.cc**.

Implement an efficient parallel sorting algorithm, using the basic idea of **quicksort**.

If you run `make benchmark2` and you get running times in the following ballpark, you can be happy:

```
so 0 100000000 2.161 7.973
so 1 100000000 0.772 2.313
so 2 100000000 0.118 1.805
so 3 100000000 0.686 1.750
so 4 100000000 1.483 1.273
```

Subdirectory: **so3**, file to edit: **so3/so.cu**.

Use CUDA to implement an efficient algorithm for sorting on GPU, using the basic idea of **merge sort**.

Subdirectory: **so4**, file to edit: **so4/so.cu**.

Use CUDA to implement an efficient algorithm for sorting on GPU, using the basic idea of **quicksort**.

If you have *p* threads available, you can split your input data into *p* blocks. Sort each block with e.g. `std::sort`, simultaneously in parallel. Then merge the results: for example, first merge *p/2* pairs of blocks in parallel so that you are left with *p/2* sorted blocks, and repeat this until you are left with 1 sorted block.

Helpful OpenMP features:

- `#pragma omp parallel num_threads(p)`
- `omp_get_max_threads()`
- `omp_get_thread_num()`
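The block-sort-then-merge scheme above can be sketched as follows. This is a minimal illustration, not the reference solution: rounding *p* down to a power of two, the block boundaries, and the use of `std::inplace_merge` are my assumptions. The `#ifdef _OPENMP` guard lets the sketch compile and run serially without OpenMP.

```cpp
#include <algorithm>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

typedef unsigned long long data_t;

// Sketch: sort p blocks in parallel with std::sort, then merge pairs of
// adjacent blocks until a single sorted block remains.
void psort(int n, data_t* data) {
#ifdef _OPENMP
    int p = omp_get_max_threads();
#else
    int p = 1;  // compiled without OpenMP: fall back to one block
#endif
    while (p & (p - 1)) p &= p - 1;  // round p down to a power of two
    std::vector<long long> bound(p + 1);
    for (int i = 0; i <= p; ++i)
        bound[i] = (long long)n * i / p;  // block i spans [bound[i], bound[i+1])

    // Step 1: sort each block simultaneously in parallel.
    #pragma omp parallel for
    for (int i = 0; i < p; ++i)
        std::sort(data + bound[i], data + bound[i + 1]);

    // Step 2: merge p/2 pairs of blocks in parallel, then p/4, and so on.
    for (int step = 1; step < p; step *= 2) {
        #pragma omp parallel for
        for (int i = 0; i < p; i += 2 * step)
            std::inplace_merge(data + bound[i],
                               data + bound[i + step],
                               data + bound[i + 2 * step]);
    }
}
```

Note that `std::inplace_merge` may allocate a temporary buffer internally; the task statement explicitly allows extra memory for temporary data.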

Write a fairly normal recursive quicksort implementation. Then parallelise it using the following ideas:

- In the recursion, process both halves simultaneously in parallel.
- Keep track of the number of available threads.
- When you are left with only 1 thread or few elements, resort to `std::sort`.

However, please note that the way in which you partition the array may matter a lot!

Helpful OpenMP features:

- `#pragma omp single`
- `#pragma omp task`
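A task-based quicksort along the lines above might look like this. The cutoff value, the median-of-three pivot, and the three-way partition are my tuning assumptions, not part of the task; the three-way partition is one way to address the warning that how you partition matters, since it keeps constant input from causing unbalanced recursion. Without OpenMP the pragmas are ignored and the code runs serially.

```cpp
#include <algorithm>

typedef unsigned long long data_t;

// Below this size, spawning tasks is not worth it (tuning assumption).
static const long long CUTOFF = 10000;

static void par_quicksort(data_t* first, data_t* last) {
    long long n = last - first;
    if (n <= CUTOFF) {
        std::sort(first, last);  // few elements (or 1 thread): plain std::sort
        return;
    }
    // Median-of-three pivot to avoid worst-case splits on sorted input.
    data_t a = first[0], b = first[n / 2], c = first[n - 1];
    data_t pivot = std::max(std::min(a, b), std::min(std::max(a, b), c));
    // Three-way partition into [< pivot][== pivot][> pivot]; elements equal
    // to the pivot stay in the middle, so both recursive calls shrink.
    data_t* mid1 = std::partition(first, last,
                                  [pivot](data_t x) { return x < pivot; });
    data_t* mid2 = std::partition(mid1, last,
                                  [pivot](data_t x) { return x == pivot; });
    // Process both sides simultaneously in parallel.
    #pragma omp task
    par_quicksort(first, mid1);
    #pragma omp task
    par_quicksort(mid2, last);
}

void psort(int n, data_t* data) {
    // One thread starts the recursion; tasks fan out to the whole team.
    // The implicit barrier at the end of the parallel region waits for
    // all outstanding tasks to finish.
    #pragma omp parallel
    #pragma omp single
    par_quicksort(data, data + n);
}
```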