Steven Ross wrote:
integer_sort and float_sort assume they are operating on something that can efficiently be treated as a single word. They use bitshifting and subtraction to find the difference between the most significant bits of two elements. The bitshifting paradigm is what makes those algorithms difficult to generalize to very large keys.
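For illustration, here is a minimal sketch of that paradigm (not the library's actual code; the names and the bits_per_pass knob are made up for the example). Subtracting the minimum key removes the high bits that all keys share, and a shift then exposes only the most significant bits that actually differ:

    #include <cstddef>
    #include <cstdint>

    // Number of significant bits in x (a "rough log2").
    inline unsigned rough_log2(std::uint64_t x) {
        unsigned bits = 0;
        while (x > 0) {
            x >>= 1;
            ++bits;
        }
        return bits;
    }

    // Bucket index for one MSD pass: subtract min to drop the shared high
    // bits, then shift so only the top bits_per_pass differing bits remain.
    inline std::size_t bucket_index(std::uint64_t key, std::uint64_t min_key,
                                    std::uint64_t max_key,
                                    unsigned bits_per_pass) {
        unsigned range_bits = rough_log2(max_key - min_key);
        unsigned shift =
            range_bits > bits_per_pass ? range_bits - bits_per_pass : 0;
        return static_cast<std::size_t>((key - min_key) >> shift);
    }

Because the key must support this subtraction and shifting as a single word, the idiom does not carry over directly to arbitrarily long keys.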
Why is the bitshifting hard-coded into a separate version of the algorithm, rather than wrapped in the Get_char parameter function? (By the way, please forgive me for firing so many questions at you. It's just that your remarks raise questions which I think are interesting. You are not obliged to defend your design, as far as I'm concerned.)
Could you clarify the above remark? Why does the fully generic string_sort not work well for keys that fit into a machine word?
It's not quite as efficient: it works 8 bits at a time instead of the 11 that integer_sort and float_sort use, and it doesn't use subtraction to eliminate bits that are identical between the min and max.
If I understand correctly, the subtraction is a trick to reduce the number of buckets. Wouldn't that be useful for strings as well? And why couldn't the number of bits per pass be a parameter (with sensible defaults, of course)?
1. Value type can be treated as a sequence of digits/characters.
Find me a type that you don't think this applies to,
For starters, any product type (tuple or struct) containing variable-length subkeys. As an example, consider book records from a library database that need to be ordered first by author and then by title. If you insist on doing that with an (MSD) radix sort, you have two options:
1. Use a stable radix sort and make multiple passes over the data. In most cases, this would be less efficient than just doing a single pass with a comparison sort (see the sketch below), not to mention that spreadsort isn't stable.

2. Hack into the implementation details of the algorithm and recurse into the next key to be sorted by, before the buckets of the previous key are destroyed.
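For contrast, the out-of-the-box comparison-sort baseline that option 1 is measured against fits in a few lines (the Book struct here is illustrative, not from any library):

    #include <algorithm>
    #include <string>
    #include <tuple>
    #include <vector>

    struct Book {
        std::string author;  // variable-length subkey
        std::string title;   // variable-length subkey
    };

    // Single pass, lexicographic on (author, title), no radix machinery.
    void sort_books(std::vector<Book>& books) {
        std::sort(books.begin(), books.end(),
                  [](const Book& a, const Book& b) {
                      return std::tie(a.author, a.title) <
                             std::tie(b.author, b.title);
                  });
    }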
Note: I am open to writing stable versions of these algorithms, potentially with TimSort or some other efficient stable fallback sort. All that is required for stability of the MSD radix portion is a full copy of the data (just as LSD radix sort uses). That version should also be faster, as it avoids the complex cache-missing swaps the in-place version uses.
It is up to you whether you add a stable version; I would vote to accept your library without it, too. Note that while a full copy of the data may speed up the algorithm, that does not necessarily mean that it is suddenly a good idea to make multiple passes with it.
That said, here are another two options, the first of which is quite efficient:
3) If there is a character that is not allowed in the substrings (ideally the null character), then you can just return 0 when you hit the end of a substring; if the illegal character is not null, shift any values below that character up by 1 in the Get_char functor to fit this change in. Embedded nulls usually are not a design requirement, in which case you can just use null to signal the end of a substring (see the sketch below).

4) If all your possible character values are accounted for, then you can add an extra character every step to signal whether the substring is over. Example returns from Get_char with this approach on a string with 2 substrings:

    "" "a"   -> 0, 1, 'a', 0
    "bc" "d" -> 1, 'b', 1, 'c', 0, 1, 'd', 0
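Here is a minimal sketch of option 3 applied to the Book example above, assuming the Get_char/Get_length functor overload of string_sort (namespace as in the released Boost.Sort) and that authors contain no embedded nulls; everything except string_sort itself is illustrative:

    #include <boost/sort/spreadsort/string_sort.hpp>
    #include <cstddef>
    #include <string>
    #include <vector>

    struct Book {
        std::string author;
        std::string title;
    };

    // Present (author, '\0', title) as one synthetic key; the 0 returned
    // at the end of the author makes shorter authors sort first.
    struct BookGetChar {
        unsigned char operator()(const Book& b, std::size_t i) const {
            if (i < b.author.size())
                return static_cast<unsigned char>(b.author[i]);
            if (i == b.author.size())
                return 0;  // end-of-substring signal
            return static_cast<unsigned char>(
                b.title[i - b.author.size() - 1]);
        }
    };

    struct BookGetLength {
        std::size_t operator()(const Book& b) const {
            return b.author.size() + 1 + b.title.size();
        }
    };

    // Fallback comparison for small buckets; it must order records the
    // same way as the synthetic character sequence does.
    struct BookCompare {
        bool operator()(const Book& a, const Book& b) const {
            return a.author < b.author ||
                   (a.author == b.author && a.title < b.title);
        }
    };

    void sort_books_radix(std::vector<Book>& books) {
        boost::sort::spreadsort::string_sort(books.begin(), books.end(),
                                             BookGetChar(), BookGetLength(),
                                             BookCompare());
    }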
If this is a serious issue, I can add substring end signaling into the API.
To the contrary. Please don't do any such invasive thing to your algorithm just to add one class of use cases. Your algorithm works for a large class of use cases out of the box, and you can make many people happy with it as it is.
If you can define a strict weak ordering that is usable in std::sort, you can define a synthetic radix as I've just given an example of.
You can invent all kinds of workarounds and invasive additions to account for additional use cases; that doesn't mean it would be a good idea. There is no single refinement of spreadsort that will make it work for *all* use cases that every comparison sort handles out of the box. I think it would be better to appreciate the elegance of the algorithm for the use cases it easily supports than to try to approach the generality of a comparison sort.

Apart from that, at this point it seems clear to me that you agree that spreadsort (and radix sorting in general) does not work *out of the box* for product types with variable-length subkeys, contrary to (pure) comparison sorts. I must emphasise that this is still just one example that I could easily come up with; you should not assume it is the only exception. Since spreadsort does not work out of the box for all use cases, your documentation should not pretend that it does, either.
Case-agnostic string sorting is another example in which radix sorts don’t excel (even though you can make it work if you extend the interface considerably). This has been mentioned before.
The functor example works efficiently in this case, as mentioned before. It's significantly faster than std::sort using boost::algorithm::ilexicographical_compare for case-insensitive sorting: 1.21s for 40MB of strings vs 2.56s for std::sort.
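Presumably the functor approach meant here looks something like the following sketch (illustrative names; assumes single-byte characters and ignores locales):

    #include <boost/algorithm/string/predicate.hpp>
    #include <boost/sort/spreadsort/string_sort.hpp>
    #include <cctype>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Fold case inside Get_char so the radix passes themselves are
    // case-insensitive.
    struct NoCaseGetChar {
        unsigned char operator()(const std::string& s, std::size_t i) const {
            return static_cast<unsigned char>(
                std::toupper(static_cast<unsigned char>(s[i])));
        }
    };

    struct GetLength {
        std::size_t operator()(const std::string& s) const {
            return s.size();
        }
    };

    // Matching case-insensitive fallback comparison for small buckets.
    struct NoCaseCompare {
        bool operator()(const std::string& a, const std::string& b) const {
            return boost::algorithm::ilexicographical_compare(a, b);
        }
    };

    void sort_no_case(std::vector<std::string>& v) {
        boost::sort::spreadsort::string_sort(v.begin(), v.end(),
                                             NoCaseGetChar(), GetLength(),
                                             NoCaseCompare());
    }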
In that case I retract that example. As I said I only quickly scanned the previous discussion, so I hereby apologise for this oversight and any other oversights I might have made. -Julian