OTOH the single-threaded ones are interesting, especially the stable_sort and intro_sort for cases where usage of spreadsort is not possible.
I challenge you to find a sorting problem where it isn't possible to use spreadsort. string_sort can sort anything that std::sort can; you just need to convert the sort key comparison sequence into a sequence of bytes and feed string_sort the correct functors. That said, it may not be efficient to sort some problems with string_sort, and it'll probably be too much work to be worth the effort for many problems.
This is what I meant. Sorry for the lack of precision. I'll add it to my performance tests, it might be an interesting variant as the algorithm is really fast.
A short summary:
100000000 uint64_t elements already sorted: Boost parallel sort : 0.064965 secs
Asynchronous parallel sort : 0.0167134 secs (same with other algorithms)
Asynchronous provides a special optimization for this case.
If this optimization is practical to incorporate in Boost parallel sort, it is a common case and your solution seems noticeably faster, so I would like Francisco to consider it. That said, .06s isn't bad.
Sure.
I added this one: 100000000 uint64_t elements reverse sorted
Boost parallel sort : 0.581381 secs
Asynchronous parallel sort : 0.0478701 secs
Asynchronous provides a special optimization for this case. I this the library should do it too. This one is pretty common and a popular DOS attack.
This optimization would be nice to have, if it's cheap.
A parallel reverse is in O(n) so it's cheap compared to a sort.
100000000 uint64_t elements randomly filled
Asynchronous parallel quickspreadsort: 0.587432 secs Asynchronous quick_intro_sort : 0.728393 secs
This mixing of quicksort degrading into a spreadsort works here best.
I find this algorithm interesting;
Frankly, there is nothing special. It's simply a parallel partition, which degrades into a "normal" parallel merge sort (in this case spreadsort) after a while.
have you considered using MSD radix-based sorting to break up the data for the threads in parallel? For example, take T threads, have each thread take 1/T of the data and bucket it into N MSD-based buckets (just like Spreadsort), where T < N <= ~2^11, then merge the equivalent buckets, and as you merge them hand them off to another thread to sort. This will probably divide better if you find the max and min before bucketing using a parallel min/max search.
I admit I didn't. But it sounds really interesting and I think I'll have a go at it. I'm not a sort expert though. I approached this from the parallel perspective.
10000000 strings randomly filled
OMP parallel sort : 1.05803 secs Boost parallel sort : 0.933055 secs
OMP parallel stable sort : 1.12269 secs Boost sample sort : 0.889216 secs Boost parallel stable sort : 1.56564 secs
Asynchronous parallel quickspreadsort: 0.788856 secs Asynchronous quick_intro_sort : 0.893652 secs
Asynchronous parallel stable sort : 1.23495 secs Asynchronous boost::stable sort : 1.21817 secs
Similar results.
Let's move on to big objects:
1562500 elements of size 512 randomly filled H E A V Y C O M P A R I S O N
OMP parallel sort : 0.308923 secs Boost parallel sort : 0.342992 secs
Asynchronous parallel_quicksort : 0.246709 secs Asynchronous quick_intro_sort : 0.269666 secs
Both quicksorts are best, with a slight advantage to intro_sort.
The light comparison is similar.
Do we have your permission to copy and/or adapt your quick_intro_sort algorithm into the Boost.Sort library?
Please feel free to copy whatever might be useful. The library has a Boost license as it's intended for inclusion in Boost.I also took the freedom of adding the single-thread versions of Fernando's algorithms as basis for my parallel versions. The algorithm is quite simple. It's the same as quick_spreadsort, a quicksort which degrades into Francisco's intro_sort after some iterations. I just found it interesting to add a wrapper to Francisco's algorithm for the mini-review. It was just a quick test (no time for cleverer algorithms) but I'm pretty satisfied with the result and I'll keep these versions.
How is the worst-case performance relative to Francisco's library?
The sort versions have O(n log n) like any merge sort. The quicksort have O(n log n) to O(n^2) as expected from a quicksort. Cheers, Christophe