[Animation legend: blue = compare, red = swap]
In our lecture course Programming Parallel Computers, one of the exercises was to implement an efficient parallel version of quicksort. Here is a simple approach that works fairly well in our test environment:

1. Partition the array recursively into a large number of parts, using a pivot in each step.
2. Sort each part with an efficient sequential algorithm, all parts in parallel.
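A minimal sketch of this two-step approach, assuming one plain std::thread worker per part and, purely for brevity, an exact median as the pivot in each partitioning step (a real implementation would only estimate it):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch of the two-step approach. Step 1: recursively
// partition the array into 2^levels parts around the median of each
// range. Step 2: sort every part with a sequential algorithm, with
// all parts running in parallel.
void parallel_quicksort(std::vector<int>& v, int levels) {
    std::vector<std::pair<std::size_t, std::size_t>> parts = {{0, v.size()}};
    for (int l = 0; l < levels; ++l) {
        std::vector<std::pair<std::size_t, std::size_t>> next;
        for (auto& p : parts) {
            std::size_t a = p.first, b = p.second;
            std::size_t mid = a + (b - a) / 2;
            // Exact median as the pivot, just to keep the sketch
            // short; in practice the pivot is only an estimate.
            std::nth_element(v.begin() + a, v.begin() + mid, v.begin() + b);
            next.push_back({a, mid});
            next.push_back({mid, b});
        }
        parts = std::move(next);
    }
    std::vector<std::thread> workers;
    for (auto& p : parts)
        workers.emplace_back([&v, a = p.first, b = p.second] {
            std::sort(v.begin() + a, v.begin() + b);
        });
    for (auto& t : workers) t.join();
}
```

Step 1 here is still sequential; parallelizing the partitioning itself is part of what makes the exercise interesting, but it does not change the point made below.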
Our students quickly learned that it is very important to choose the pivots carefully. The total running time will depend on the size of the largest part, and sloppiness in the choice of pivot is easily amplified in recursive partitioning. Randomly chosen pivots do not work at all — in step 1, each pivot has to be a good estimate of the median so that the workload in step 2 is well balanced.
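One standard way to get such estimates is to take the median of a small random sample of the range. A sketch, where the sample size 64 is an arbitrary choice of mine and not something prescribed by the course:

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Sketch: estimate the median of v[a..b) from a small random sample.
// A larger sample gives a more balanced split at a higher cost per
// partitioning step.
int estimate_median(const std::vector<int>& v, std::size_t a, std::size_t b,
                    std::mt19937& rng) {
    const std::size_t k = std::min<std::size_t>(64, b - a);
    std::vector<int> sample(k);
    std::uniform_int_distribution<std::size_t> pick(a, b - 1);
    for (std::size_t i = 0; i < k; ++i) sample[i] = v[pick(rng)];
    // Median of the sample, found in linear time.
    std::nth_element(sample.begin(), sample.begin() + k / 2, sample.end());
    return sample[k / 2];
}
```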
However, there was another aspect that also mattered a lot for some input distributions: precisely how we partition the array. This page tries to explain why this is the case.
The key primitive that most students used is partitioning the array into two parts. There are two simple and natural strategies:

- Algorithm A: make an out-of-place pass that copies each element to its new position, preserving the relative order of the elements within each part.
- Algorithm B: make an in-place pass that scans from both ends of the array and swaps misplaced pairs of elements (classic Hoare-style partitioning).
At first sight, the two algorithms seem equally good. Most of the time is spent in step 2 (sorting), so the details of how to implement step 1 (partitioning) do not seem that important, as long as it works correctly and the pivots are chosen carefully.
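The two strategies might look roughly like this; I am assuming here that A is an order-preserving out-of-place partition and B is a classic Hoare-style in-place partition, which matches the behaviour discussed below, although the students' own implementations certainly varied in the details:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Algorithm A (sketch): a stable, out-of-place partition. Elements
// keep their relative order within each part. Returns the split point.
std::size_t partition_a(std::vector<int>& v, int pivot) {
    std::vector<int> small, large;
    for (int x : v) (x < pivot ? small : large).push_back(x);
    std::copy(large.begin(), large.end(),
              std::copy(small.begin(), small.end(), v.begin()));
    return small.size();
}

// Algorithm B (sketch): Hoare-style in-place partition. Scans from
// both ends and swaps misplaced pairs; relative order is not
// preserved. Returns the split point.
std::size_t partition_b(std::vector<int>& v, int pivot) {
    std::size_t i = 0, j = v.size();
    while (true) {
        while (i < j && v[i] < pivot) ++i;
        while (i < j && v[j - 1] >= pivot) --j;
        if (i >= j) return i;
        std::swap(v[i], v[j - 1]);
    }
}
```

Both put the elements smaller than the pivot in the front part and the rest in the back part; B needs no extra memory, while A needs a temporary buffer but keeps each part in its original order.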
However, some of our students observed that if we use algorithm B in step 1, then step 2 may become much slower for some benign input distributions, such as arrays that are already in decreasing order.
With algorithm B, the overall performance for decreasing inputs was roughly as bad as for random inputs, and this holds even if step 2 is implemented using std::sort from the standard library. The standard library implementation of std::sort is very efficient for both increasing and decreasing inputs — therefore it seemed strange that this no longer holds if we do partitioning with algorithm B.
The animation above explains the performance difference. It shows what happens if we follow these strategies and choose pivots fairly well but not perfectly — here we use roughly a 51–49 split in each step. If we use algorithm A, minor sloppiness in the choice of the pivots does not hurt too much: monotone input implies that each part will be almost monotone after recursive partitioning. For algorithm B this does not hold: some parts will be highly non-monotone and therefore some of the parallel invocations of the sequential sorting algorithms in step 2 will take a long time.
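The effect is easy to reproduce in code. The sketch below applies a Hoare-style in-place partition (my stand-in for algorithm B) to a strictly decreasing array, with the pivot slightly off the median; the part that receives the swapped elements is no longer monotone:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hoare-style in-place partition (stand-in for algorithm B):
// elements < pivot end up in v[0..split), the rest in v[split..n).
std::size_t hoare_partition(std::vector<int>& v, int pivot) {
    std::size_t i = 0, j = v.size();
    while (true) {
        while (i < j && v[i] < pivot) ++i;
        while (i < j && v[j - 1] >= pivot) --j;
        if (i >= j) return i;
        std::swap(v[i], v[j - 1]);
    }
}

// True if v[a..b) is non-increasing or non-decreasing.
bool monotone(const std::vector<int>& v, std::size_t a, std::size_t b) {
    bool up = true, down = true;
    for (std::size_t i = a + 1; i < b; ++i) {
        if (v[i] > v[i - 1]) down = false;
        if (v[i] < v[i - 1]) up = false;
    }
    return up || down;
}
```

On the decreasing array 99, 98, …, 0 with pivot 51 (a 51–49 split), the left part comes out as 0, 1, …, 48, 50, 49: almost increasing, but with an inversion near the split. Each further recursive round adds more such inversions, so the parts eventually handed to the sequential sorting algorithm are highly non-monotone. An order-preserving partition would instead have kept both parts decreasing.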