Blue = compare, red = swap

The animation is available both as a video clip and as an interactive application.

In our lecture course Programming Parallel Computers, one of the exercises was to implement an efficient parallel version of quicksort. Here is a simple approach that works fairly well in our test environment:

- Step 1: recursively partition the array into 8 parts (using up to 4 threads).
- Step 2: sort each part in parallel with a sequential quicksort implementation (using 8 threads).
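
In code, the two-step scheme might look roughly like the following sketch. This is not the course's reference solution: for brevity it uses `std::nth_element` as a "perfect" partitioner and a fixed recursion depth of 3 (giving 2³ = 8 parts); a real solution would pick an approximate median and partition around it.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Step 1: recursively split [lo, hi) into 2^depth parts, running the two
// recursive calls in parallel threads.
// Step 2: at depth 0, sort each part with a sequential sort.
void parallel_sort(std::vector<int>& x, int lo, int hi, int depth) {
    if (depth == 0) {
        std::sort(x.begin() + lo, x.begin() + hi);   // step 2
        return;
    }
    // Step 1: std::nth_element stands in here for "pick a pivot and
    // partition"; it splits [lo, hi) around its exact median.
    int mid = lo + (hi - lo) / 2;
    std::nth_element(x.begin() + lo, x.begin() + mid, x.begin() + hi);
    std::thread left([&] { parallel_sort(x, lo, mid, depth - 1); });
    parallel_sort(x, mid, hi, depth - 1);
    left.join();
}
```

Called as `parallel_sort(x, 0, (int)x.size(), 3)`, step 1 runs at most 4 partitioning calls concurrently, and step 2 sorts the 8 parts with 8 threads.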

Our students quickly learned that it is very important to **choose the pivots carefully**. The total running time will depend on the size of the *largest* part, and sloppiness in the choice of pivot gets easily amplified in recursive partitioning. Randomly chosen pivots do not work at all — in step 1, each pivot has to be a good estimate of the median so that workload in step 2 is well balanced.
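
One cheap way to get a good estimate of the median is to take the exact median of a random sample of the part. The sketch below is one possible implementation, not the course's; the sample size 256 and the helper name are my own choices.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Estimate the median of x from a random sample. A sample of a few
// hundred elements typically lands within a percent or two of the true
// median, which is enough to keep the recursive parts well balanced.
int estimate_median(const std::vector<int>& x, int samples = 256) {
    std::mt19937 rng(1234);  // fixed seed, for reproducibility only
    std::uniform_int_distribution<std::size_t> pick(0, x.size() - 1);
    std::vector<int> s(samples);
    for (int& v : s) v = x[pick(rng)];
    std::nth_element(s.begin(), s.begin() + samples / 2, s.end());
    return s[samples / 2];
}
```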

However, there was another aspect that also mattered a lot for some input distributions: **precisely how we partition the array**. This page tries to explain why.

The key primitive that most students used is **partitioning the array into 2 parts**. There are two simple and natural strategies:

**Algorithm A**: maintain the invariant that the range [0, *a*) is smaller than the pivot and the range [*b*, *n*) is larger than the pivot.

- Increase *a* if possible.
- Otherwise, decrease *b* if possible.
- Otherwise, swap elements *a* and *b*−1.
- Repeat until done.

**Algorithm B**: maintain the invariant that the range [0, *a*) is smaller than the pivot and the range [*a*, *b*) is larger than the pivot.

- If needed, swap elements *a* and *b* and increase *a*.
- Increase *b*.
- Repeat until done.

Either of these algorithms seems equally good. Most of the time is spent in step 2 (sorting), so the details of how to implement step 1 (partitioning) do not seem that important, as long as it works correctly and the pivots are chosen carefully.
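
For concreteness, the two strategies can be sketched as follows. This is my own minimal sequential code, not the course's; the names `partitionA`/`partitionB` are mine, and the pivot is passed as a fractional value so that no element compares equal to it.

```cpp
#include <utility>
#include <vector>

// Algorithm A. Invariant: [0, a) smaller than the pivot, [b, n) larger.
// Assumes the pivot value itself does not occur in the array (e.g. a
// fractional threshold); otherwise equal elements could make it loop.
int partitionA(std::vector<int>& x, double pivot) {
    int a = 0, b = x.size();
    while (a < b) {
        if (x[a] < pivot) ++a;            // grow the "smaller" prefix
        else if (x[b - 1] > pivot) --b;   // grow the "larger" suffix
        else std::swap(x[a], x[b - 1]);   // both out of place: swap
    }
    return a;
}

// Algorithm B. Invariant: [0, a) smaller than the pivot, [a, b) larger.
int partitionB(std::vector<int>& x, double pivot) {
    int a = 0;
    for (int b = 0; b < (int)x.size(); ++b) {
        if (x[b] < pivot) std::swap(x[a++], x[b]);  // pull small element left
    }
    return a;
}
```

Both return the boundary *a*: elements below it are smaller than the pivot, elements from it onwards are larger.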

However, some of our students observed that if we use algorithm B in step 1, then step 2 may become much slower for some benign input distributions:

- If we use algorithm A to do the partitioning, then sorting is fast for both increasing and decreasing inputs.
- If we use algorithm B to do the partitioning, then sorting is **fast for increasing inputs** but **slow for decreasing inputs**.

With algorithm B, the overall performance for decreasing inputs was roughly as bad as for random inputs, and this held even if step 2 was implemented with `std::sort` from the standard library. The standard-library implementation of `std::sort` is very efficient for both increasing and decreasing inputs, so it seemed strange that this no longer holds once we do the partitioning with algorithm B.

The animation above explains the performance difference. It shows what happens if we follow these strategies and choose pivots fairly well but not perfectly — here we use roughly a **51–49 split** in each step. If we use algorithm A, minor sloppiness in the choice of the pivots does not hurt too much: monotone input implies that each part will be almost monotone after recursive partitioning. For algorithm B this does not hold: some parts will be highly non-monotone and therefore some of the parallel invocations of the sequential sorting algorithms in step 2 will take a long time.
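
The seed of this difference can be checked even in a single partitioning step. The sketch below (self-contained, restating the two partition routines from the description above; helper names are mine) partitions a decreasing 100-element array with a deliberately sloppy pivot, a 52–48 split: with algorithm A both parts come out (almost) monotone, while algorithm B already produces a non-monotone part, and the recursion in step 1 then compounds this.

```cpp
#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

// Algorithm A. Invariant: [0, a) smaller than the pivot, [b, n) larger.
int partitionA(std::vector<int>& x, double pivot) {
    int a = 0, b = x.size();
    while (a < b) {
        if (x[a] < pivot) ++a;
        else if (x[b - 1] > pivot) --b;
        else std::swap(x[a], x[b - 1]);
    }
    return a;
}

// Algorithm B. Invariant: [0, a) smaller than the pivot, [a, b) larger.
int partitionB(std::vector<int>& x, double pivot) {
    int a = 0;
    for (int b = 0; b < (int)x.size(); ++b)
        if (x[b] < pivot) std::swap(x[a++], x[b]);
    return a;
}

// A decreasing input: n-1, n-2, ..., 0.
std::vector<int> decreasing(int n) {
    std::vector<int> x(n);
    for (int i = 0; i < n; ++i) x[i] = n - 1 - i;
    return x;
}

// True if the range [lo, hi) is monotone in either direction.
bool monotone(const std::vector<int>& x, int lo, int hi) {
    return std::is_sorted(x.begin() + lo, x.begin() + hi) ||
           std::is_sorted(x.begin() + lo, x.begin() + hi,
                          std::greater<int>());
}
```

On `decreasing(100)` with pivot 51.5, algorithm A leaves the high part exactly increasing and the low part within three swaps of increasing, while algorithm B leaves a high part that is monotone in neither direction.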

Animations created by Jukka Suomela, using SDL, Emscripten, FFmpeg, and Bitstream Vera Fonts.