Re: [boost] [review] Formal review period for Sort library continuing through November 19

17 Nov 2014

      Steven Ross wrote:
...
...
I’m not convinced that string_sort, float_sort and integer_sort need to
be separate interfaces. I believe that the function to extract “digits”
from the values for the MSD part of the algorithm can be captured in a
fully generic way (that should be applicable to all radix-like sorting
algorithms), coupled with two additional, optional parameters (pre-
transform and post-transform) specifying functions for doing the
temporary “treat floats as integers” trick.
string_sort is this fully generic approach.  integer_sort and float_sort
can be seen as specializations that allow bit-level granularity of the
radix operations
Isn’t that a matter of defining indexing adapters for those types, taking 
a template parameter for the bitfield size? I mean, I would expect any 
radix-like sorting algorithm to take a function of type key -> index -> 
digit as an optional parameter (defaulting to operator[] if not 
provided). For floats and integers that function could be templated by 
either the number of bits per digit or the number of desired buckets 
(which is equivalent under exponentiation). Actually, that could be done 
for strings and other bitstream-like types as well.
...
and work well for keys that fit into a machine word.
Could you clarify the above remark? Why does the fully generic 
string_sort not work well for keys that fit into a machine word?
...
If
you can think of some way to enable the sort-floats-as-ints trick as an
option without harming performance for non-floats, I'd be very interested.
Default to the trivial (pre|post) transformation, and specialize the 
algorithm that maps the transformations over the (input|output) range to 
be a no-op for the trivial transformation. The compiler will eliminate 
the function call.
...
...
1. Value type can be treated as a sequence of digits/characters.
Find me a type that you don't think this applies to,
For starters, any product type (tuple or struct) containing variable-
length subkeys. As an example, consider book records from a library 
database that need to be ordered first by author and then by title. If 
you insist on doing that with a (MSD) radix sort, you have two options:

 1. Use a stable radix sort and make multiple passes over the data. In 
    most cases, this would be less efficient than just doing a single pass 
    with a comparison sort, not to mention that spreadsort isn’t stable.
 2. Hack into the implementation details of the algorithm and recurse into 
    the next key to be sorted by, before the buckets of the previous key 
    are destroyed.

Case-agnostic string sorting is another example in which radix sorts 
don’t excel (even though you can make it work if you extend the interface 
considerably). This has been mentioned before.

In general, radix sorts only “shine” for types that can be injectively 
mapped to the scalar bitfields while preserving order (even if you can 
make it “work” for some other cases). Many common types and orders fit in 
that category, but there are infinitely many types and orders with 
different semantics.
...
and I'll show you how
it can.  If std::sort can sort it, so can string_sort if you use the
available functors properly.  Maybe I should explain this in the
documentation?
No, you should explicitly mention the opposite. Your algorithm has value, 
not because it can be used for all imaginable scenarios but because it 
has something to offer in particular use cases. No library needs to be 
for everyone (not even std::sort, in fact). I really do think your 
algorithm has value but you need to be very precise about what it can do 
and what it cannot. Users will only appreciate your library more if you 
do so.
...
I will agreed to the extent that if performance isn't very critical
relative to the code complexity of creating an efficient radix key
expansion equivalent to the comparison operation, then people probably are
better off with std::sort.
...
2. Many keys to sort and/or very long keys.
I agree, if most of your sorting runtime is going to spent on lists of <
1000 elements, this library is of little use, though it should also do
little harm due to the fallback to std::sort.  I'll edit the docs in
develop to make that more clear.
To be honest I am sceptical that the algorithm even adds much value under 
100K keys. So far, Frank Gennari has reported a 1.8x speedup for 100M 
keys. Now, the effect probably would have been more dramatic if the 
algorithm had been used for long strings rather than 32-bit integers, but 
that’s still at 100M keys. The difference between an O(n log n) algorithm 
and an O(nk) algorithm doesn’t grow particularly fast either (except for 
very small n). Introsort/quicksort cache locality is also about as good 
as MSD radix sort cache locality. I think the real added value of your 
algorithm lies in the millions of keys and beyond (which is nothing to be 
ashamed of!).

Please don’t take this as an attack, I really honestly believe that your 
spreadshort library has value. I just think that you should present it 
with a bit more modesty with regard to scope of application.

-Julian

Re: [boost] [review] Formal review period for Sort library continuing through November 19

Julian Gonggrijp