Vladimir,
On Mon, Nov 10, 2014 at 9:07 AM, Vladimir Prus wrote:
On 11/10/2014 04:54 PM, Steven Ross wrote:
I did an algorithm search in 2008 when preparing the Boost submission. It turned up ALR, which is fairly similar to my paper. I have not seen much attention paid in academia to hybrid radix sorting.
Okay, so assuming a third party were to implement a sorting algorithm for Boost, why would they choose your paper over the 2008 paper?
What 2008 paper? ALR came out later, in 2002. The reason is better worst-case performance. Please see the alrbreaker example for a distribution which causes conventional MSD radix sorts to perform very poorly; overhead starts to dominate as they radix sort small chunks of data. This library avoids that problem by switching to std::sort once it has broken the problem into chunks small enough that radix overhead would otherwise dominate.
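To make the switch-over concrete, here is a minimal sketch of a hybrid MSD radix sort on 32-bit unsigned integers. This is not the library's actual code; the function name hybrid_radix_sort and the threshold of 64 elements are illustrative assumptions, chosen only to show where the std::sort fallback sits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative fallback threshold (the real library tunes this constant).
constexpr std::size_t kFallbackSize = 64;

// Hybrid MSD radix sort sketch: bucket by the current 8-bit digit,
// but fall back to std::sort once a chunk is small enough that the
// per-chunk radix overhead would dominate.
void hybrid_radix_sort(std::vector<std::uint32_t>& v,
                       std::size_t lo, std::size_t hi, int shift) {
    if (hi - lo <= kFallbackSize || shift < 0) {
        std::sort(v.begin() + lo, v.begin() + hi);  // comparison fallback
        return;
    }
    // Count occurrences of the current 8-bit digit.
    std::size_t count[256] = {};
    for (std::size_t i = lo; i < hi; ++i)
        ++count[(v[i] >> shift) & 0xFF];
    // Compute the start offset of each bucket.
    std::size_t offset[256];
    std::size_t sum = lo;
    for (int d = 0; d < 256; ++d) { offset[d] = sum; sum += count[d]; }
    // Stable scatter into a temporary buffer, then copy back.
    std::vector<std::uint32_t> tmp(hi - lo);
    std::size_t pos[256];
    std::copy(offset, offset + 256, pos);
    for (std::size_t i = lo; i < hi; ++i)
        tmp[pos[(v[i] >> shift) & 0xFF]++ - lo] = v[i];
    std::copy(tmp.begin(), tmp.end(), v.begin() + lo);
    // Recurse into each bucket on the next digit.
    for (int d = 0; d < 256; ++d) {
        std::size_t b = offset[d], e = offset[d] + count[d];
        if (e - b > 1) hybrid_radix_sort(v, b, e, shift - 8);
    }
}
```

A caller would start with shift = 24 (the most significant byte); each recursion level works on a smaller chunk until the fallback takes over.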
I'm worried about "similar" and "adaptation" above. If I see that a function whatever_sort implements an algorithm from paper N, with 200 citations, then I can assume enough people have reviewed the algorithm. If that function implements an adaptation, then I'm not really sure what complexity I can expect.
Since you've read the overview, how about just looking through the code and seeing if it works as described, or testing it? Can you ever be sure a concrete implementation correctly applies an abstract concept without looking at the implementation? These implementations are fast for the same reason ALR and AFS are: they have superior worst-case performance because they fall back to std::sort at the chunk size that optimizes the worst case. The only other difference is that these algorithms scan to see if they can shrink the range on each iteration, which improves performance for non-worst-case distributions with little impact on worst-case distributions.
Which "these algorithms"? ALR and AFS? If so, wouldn't that mean they are better?
These implementations (integer_sort, float_sort, string_sort) scan to see if they can shrink the range. integer_sort and float_sort find the maximum and minimum with a fast scan across the chunk of data being sorted, then MSD radix sort only the range between the actual min and max. This reduces the size of the problem in many common cases while adding only minimal overhead in the worst case. string_sort does something similar: it checks whether all characters at the current position are identical, and if so, it skips on to the next character, only sorting once it encounters an actual difference. This is useful in cases such as hierarchical strings, where some chunks of the key are all the same. Again, the overhead of this scan is minimal.
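The two scans described above can be sketched in a few lines. This is not the library's code; bits_to_sort and common_prefix_length are hypothetical helper names, shown only to illustrate how a min/max scan bounds the bits the radix sort must process, and how a shared-prefix scan lets string sorting skip identical character positions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sketch of the integer_sort-style range shrink: one pass finds the
// actual min and max, and the radix sort then only needs the bits in
// which they differ, rather than all 32.
int bits_to_sort(const std::vector<std::uint32_t>& v) {
    if (v.size() < 2) return 0;
    auto mm = std::minmax_element(v.begin(), v.end());
    std::uint32_t diff = *mm.first ^ *mm.second;  // bit positions that differ
    int bits = 0;
    while (diff) { ++bits; diff >>= 1; }  // position of highest set bit
    return bits;
}

// Sketch of the string_sort-style scan: find how many leading character
// positions are identical across all strings, so bucketing can start at
// the first position where an actual difference appears.
std::size_t common_prefix_length(const std::vector<std::string>& v) {
    if (v.empty()) return 0;
    for (std::size_t len = 0;; ++len) {
        if (len >= v[0].size()) return len;
        char c = v[0][len];
        for (const auto& s : v)
            if (len >= s.size() || s[len] != c) return len;
    }
}
```

For example, keys clustered in [1000000, 1000255] differ only in their low bits, so the radix sort needs far fewer passes than the full 20-bit magnitude would suggest; similarly, hierarchical keys sharing a "dir/" prefix are only compared from the first differing character onward.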