Vladimir,
On Mon, Nov 10, 2014 at 9:07 AM, Vladimir Prus wrote:
On 11/10/2014 04:54 PM, Steven Ross wrote:
I did an algorithm search in 2008 when preparing the Boost submission. It turned up ALR, which is fairly similar to my paper. I have not seen much attention paid in academia to hybrid radix sorting.
Okay, so assuming a third party were to implement a sorting algorithm for Boost, why would they choose your paper over the 2008 paper?
What 2008 paper? ALR came out later, in 2002. The reason is better worst-case performance. Please see the alrbreaker example for a distribution which causes conventional MSD radix sorts to perform very poorly; overhead starts to dominate as they radix sort small chunks of data. This library avoids that problem by switching to std::sort once it has broken the problem into chunks small enough that radix overhead would otherwise dominate.
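To make the switch-over concrete, here is a minimal sketch of a hybrid MSD radix sort on 32-bit unsigned integers. This is not the library's actual code; the function name hybrid_radix_sort and the threshold of 64 elements are illustrative assumptions, chosen only to show where the std::sort fallback sits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative fallback threshold (the real library tunes this constant).
constexpr std::size_t kFallbackSize = 64;

// Hybrid MSD radix sort sketch: bucket by the current 8-bit digit,
// but fall back to std::sort once a chunk is small enough that the
// per-chunk radix overhead would dominate.
void hybrid_radix_sort(std::vector<std::uint32_t>& v,
                       std::size_t lo, std::size_t hi, int shift) {
    if (hi - lo <= kFallbackSize || shift < 0) {
        std::sort(v.begin() + lo, v.begin() + hi);  // comparison fallback
        return;
    }
    // Count occurrences of the current 8-bit digit.
    std::size_t count[256] = {};
    for (std::size_t i = lo; i < hi; ++i)
        ++count[(v[i] >> shift) & 0xFF];
    // Compute the start offset of each bucket.
    std::size_t offset[256];
    std::size_t sum = lo;
    for (int d = 0; d < 256; ++d) { offset[d] = sum; sum += count[d]; }
    // Stable scatter into a temporary buffer, then copy back.
    std::vector<std::uint32_t> tmp(hi - lo);
    std::size_t pos[256];
    std::copy(offset, offset + 256, pos);
    for (std::size_t i = lo; i < hi; ++i)
        tmp[pos[(v[i] >> shift) & 0xFF]++ - lo] = v[i];
    std::copy(tmp.begin(), tmp.end(), v.begin() + lo);
    // Recurse into each bucket on the next digit.
    for (int d = 0; d < 256; ++d) {
        std::size_t b = offset[d], e = offset[d] + count[d];
        if (e - b > 1) hybrid_radix_sort(v, b, e, shift - 8);
    }
}
```

A caller would start with shift = 24 (the most significant byte); each recursion level works on a smaller chunk until the fallback takes over.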
I'm worried about "similar" and "adaptation" above. If I see that a function whatever_sort implements an algorithm from paper N, with 200 citations, then I can assume enough people have reviewed the algorithm. If that function implements an adaptation, then I'm not really sure what complexity I can expect.
Since you've read the overview, how about just looking through the code and seeing if it works as described, or testing it? Can you ever be sure a concrete implementation correctly applies an abstract concept without looking at the implementation? These implementations are fast for the same reason ALR and AFS are: they have superior worst-case performance because they fall back to std::sort at the chunk size that optimizes the worst case. The only other difference is that these algorithms scan to see if they can shrink the range on each iteration, which improves performance for non-worst-case distributions with little impact on worst-case distributions.
Which "these algorithms"? ALR and AFS? If so, wouldn't that mean they are better?
These implementations (integer_sort, float_sort, string_sort) scan to see if they can shrink the range. integer_sort and float_sort find the maximum and minimum with a fast scan across the chunk of data being sorted, then MSD radix sort only the range between the actual min and max. This reduces the size of the problem in many common cases while adding only minimal overhead in the worst case. string_sort does something similar: it checks whether all characters at the current position are identical, and if so, it skips on to the next character, only sorting once it encounters an actual difference. This is useful in cases such as hierarchical strings, where some chunks of the key are all the same. Again, the overhead of this scan is minimal.
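The two scans described above can be sketched in a few lines. This is not the library's code; bits_to_sort and common_prefix_length are hypothetical helper names, shown only to illustrate how a min/max scan bounds the bits the radix sort must process, and how a shared-prefix scan lets string sorting skip identical character positions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sketch of the integer_sort-style range shrink: one pass finds the
// actual min and max, and the radix sort then only needs the bits in
// which they differ, rather than all 32.
int bits_to_sort(const std::vector<std::uint32_t>& v) {
    if (v.size() < 2) return 0;
    auto mm = std::minmax_element(v.begin(), v.end());
    std::uint32_t diff = *mm.first ^ *mm.second;  // bit positions that differ
    int bits = 0;
    while (diff) { ++bits; diff >>= 1; }  // position of highest set bit
    return bits;
}

// Sketch of the string_sort-style scan: find how many leading character
// positions are identical across all strings, so bucketing can start at
// the first position where an actual difference appears.
std::size_t common_prefix_length(const std::vector<std::string>& v) {
    if (v.empty()) return 0;
    for (std::size_t len = 0;; ++len) {
        if (len >= v[0].size()) return len;
        char c = v[0][len];
        for (const auto& s : v)
            if (len >= s.size() || s[len] != c) return len;
    }
}
```

For example, keys clustered in [1000000, 1000255] differ only in their low bits, so the radix sort needs far fewer passes than the full 20-bit magnitude would suggest; similarly, hierarchical keys sharing a "dir/" prefix are only compared from the first differing character onward.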