Antony,
Implementation is not tuned.
For example, take a look here https://github.com/spreadsort/sort/blob/master/include/boost/sort/detail/spr... and here https://github.com/spreadsort/sort/blob/master/include/boost/sort/detail/spr... . Memory allocations occur at those points, however they could be easily avoided. Here's a typical use case of `size_bins` function: https://github.com/spreadsort/sort/blob/master/include/boost/sort/detail/str....
Using a stack based buffer would significantly improve sort times on small datasets.
You have a good point that a stack-based buffer would eliminate memory allocation for the bin_sizes (thanks!). On large data sets, there are very few memory allocations, as it quickly hits the maximum size that can be used. My optimization has concentrated on minimizing memory overhead and optimizing speed on large data sets. Would you agree that a fixed stack-based array of bin_sizes, passed down into the recursive calls, would provide a good trade-off between speed and memory overhead? It will still need memory allocation for the bin_cache, unless it computes the worst-case size based on the size of the input data type and the maximum number of bins per iteration. That could end up being a substantial chunk of RAM to put on the stack (10s of kilobytes!) for the non-functor versions of integer_sort and float_sort. It would be tricky to add for the functor versions. string_sort can iterate to an incredible depth in the worst-case, so it just doesn't seem practical to put the bin_cache on the stack there. Do you think I should add the bin_cache to the stack where feasible for integer_sort and float_sort? It may add significant complication.
Not in its current state. And not as a separate library. This must be a part of Boost.Algorithm library.
That's how I originally planned to add it, but apparently that doesn't work so well with modular boost. I'm open to that possibility.
This library must be accepted only after some good optimization. As a result, I'd like to see this library outperforming std::sort even on small datasets (~150 elements). At the current point std::sort and sort from the subject library differ on some constant value (see graph/windows_integer_sort.htm or graph/windows_string_sort.htm), not O(n*log(n)) vs O(n*log(k/s+s)).
n * (k/s + s) is the worst-case, rarely encountered for integer_sort for normal data. What ends up happening on most data is that it tends to run around log(n)/s expensive radix-based based iterations, plus around s fast comparison-based iterations. So It works out to comparing log(n) vs C1*log(n)/s + s. This ends up looking like a constant, though it does get a bit faster as the "+s" term get proportionately smaller with large n (and the asymptote in this regime is a constant). Eventually n gets close to k and the worst-case guarantee kicks in. See the left side of graph/osx_integer_sort.htm for examples of this faster bucket sorting. Note that in the speed vs size graphs each increment in the normal number of radix-based iterations causes a jump in relative performance: this can be seen in windows_integer_sort between 2^23 and 2^24, and 2^13 to 2^14. string_sort shows similar changes in relative performance vs size as the number of radix-sorted bytes increases, though I should note that for string_sort, the worst-case is much superior to comparison-based sorting, as each comparison may run over the entire length of the string, where radix-based sorting will only pass over each character once (or twice, if you include the scan for all characters being identical). I didn't use this worst-case for comparison-based sorting when making the graphs, but I could if you'd like to see spectacular speedups.