Thanks for your feedback, Phil.
Is a "faster" serial sort algorithm useful?
-------------------------------------------
In 2008, Luke Simonson said:

"Anyone interested in runtime of an application that spends a lot of time sorting large data volumes is going to do it on an up-to-date machine, which means 4, 8, 16 or 32 cores. I think std::sort is a straw man to compare yourself to, not because it isn't good at what it does, but because the target use case in question of very large data volumes is better suited to a parallel algorithm."
At that time I didn't entirely agree as I had some embedded applications that could benefit from faster sorting of perhaps tens of thousands of items. But in the intervening five years, even phones have gained multiple CPUs. Today, if I needed faster sorting my first thought would be to parallelise it.
Sure, you can gain speedup by parallelizing, but if you can use half as many cores to get the same speedup by using a faster subsort, isn't that helpful? On phones, the biggest issue is usually battery usage, which is why many of them rarely use more than one CPU core at a time. On servers, frameworks like MapReduce or other non-shared-memory techniques are commonly used to distribute the problem in many smaller pieces across separate processes. This library provides a faster subsort that will work with every standard way to parallelize a CPU-based sorting problem except LSD radix sorting.

Parallel sorting can be valuable, but because of all the potential applications, the variation in memory, runtime, and CPU-usage constraints, and the level of shared-memory parallelization, which solution is best will vary widely across applications, complicating any library that attempts to meet these various needs. I'm not fundamentally opposed to adding library calls for the specific case of shared-memory parallel sorting, but it would add substantial complexity and still not cover all parallel-sorting needs.
Code Quality
------------
Back in 2008, the code had issues like calls to malloc() and free() and a lack of exception safety. No doubt it has improved, but perhaps any keen reviewers could look for any remaining issues.
There is now just a vector::resize of the bin_cache in the develop branch. I looked at eliminating that by using (more) stack memory, but it slowed down the library.
Skipping common prefixes
------------------------
One undeniable complexity advantage of this algorithm compared to std::sort<string> is that it doesn't need to re-compare common string prefixes once it knows that they are equal. Intuitively, one would expect a substantial speedup from this change if the input contains many strings with common prefixes. But intuition is not always correct. I did some experiments using the pathnames of all the files on my computer as the test data; these certainly have many common prefixes e.g. /usr/local/include/boost/. I wrote a modified quicksort that would skip common prefixes (see http://svn.chezphil.org/libpbe/trunk/include/string_qsort.hh ) and concluded that actually the intuition is wrong; comparing bytes can be done 8-at-a-time on a 64-bit system, and the cost of constantly re-comparing strings like /usr/local/include/boost/ must be small. So the intuition that Steven's algorithm would be useful in applications where strings have long common prefixes should not be followed without first measuring it.
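For illustration, the core idea behind a prefix-skipping comparison can be sketched as follows. The less_from helper and its offset parameter are hypothetical, not taken from either library; offset is the number of leading bytes already known to be shared by both strings:

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <string>

// Sketch of prefix-skipping comparison: once every string in a partition is
// known to share its first `offset` bytes, compare from that point onward.
// Assumes offset <= min(a.size(), b.size()), which holds for a shared prefix.
bool less_from(const std::string& a, const std::string& b, std::size_t offset) {
    std::size_t n = std::min(a.size(), b.size()) - offset; // bytes left to compare
    int c = std::memcmp(a.data() + offset, b.data() + offset, n);
    return c < 0 || (c == 0 && a.size() < b.size());       // shorter string sorts first
}
```

Note that std::memcmp is exactly the kind of word-at-a-time comparison Phil describes, which is why re-comparing a shared prefix with it is so cheap in the first place.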
Highly compressible data is one such application, and the algorithm works well there in bzip, but the common prefixes there were as long as hundreds of bytes. With more conventional data, you are correct: it isn't much of an advantage. I added that feature more to minimize the disadvantage created by otherwise doing a full radix sort on each byte when nothing differs, which is a common issue with radix sorts on strings (if there are better solutions, I'm open to them). The end result was consistently faster than comparison-based sorting when fed real data, including the hierarchical strings I tested with, but it had no relative advantage on short hierarchies like those you describe versus normal strings.
Does anyone sort a vector<int>?
--------------------------------
I don't think I've ever needed to sort a vector<int>; there is normally some associated data, e.g. I'm sorting a vector<struct> (or similar) where one field of the struct is an int key. Or the key has multiple fields, like an (x,y) pair, or there is some extra feature of the comparison such as being case-insensitive. The proposed API does provide a reasonable way to sort these more complex containers, i.e. by providing additional functors that (for string_sort) return the nth character of the key string. But what we don't know is whether its performance benefits relative to introsort also apply in these cases: as far as I've seen, the performance results are all for simple vector<int>-like cases. One could easily imagine that the time taken for copying would dominate, diluting the algorithmic benefits.
If you run the tune.pl script that is in the base directory of the library, you'll see the speedup on key+value sorts. They actually do fewer swaps than a comparison-based algorithm, because there are fewer iterations, so large chunks of data sort faster relative to std::sort with this approach than small chunks of data. Some excerpts from my latest tune.pl run (with no args) on Ubuntu:

Verifying integer_sort
runtime: 87.596183
std::sort time: 165.384112
speedup: 88.80%

Verifying float_sort
runtime: 85.212854
std::sort time: 182.711132
speedup: 114.42%

Verifying string_sort
runtime: 115.193192
std::sort time: 250.349607
speedup: 117.33%

Verifying integer_sort with separate key and data
runtime: 405.510518
std::sort time: 991.96294
speedup: 144.62%

Verifying string_sort with all functors
runtime: 139.453782
std::sort time: 311.68992
speedup: 123.51%

Verifying boost::sort on its custom-built worst-case distribution
runtime: 17.710653
std::sort time: 17.606037
speedup: -0.59%

I encourage all reviewers to run this script themselves to see how fast this library is on their system (the README describes how to do this, or you can just call ./tune.pl -small [-windows] from the libs/sort directory and wait a minute). A standard trick with a large data payload is to use a pointer to the data to minimize the amount of memory that must be swapped, so sorting with a really large data payload should be fairly unusual.