Niall,
On Thu Nov 13 2014 at 6:44:50 AM, Niall Douglas wrote:
On 13 Nov 2014 at 5:50, Steven Ross wrote:
For your situation you just sort, so a combined graph showing a comparison to other algorithms makes much more sense.
I've created an integer_sort graph here (click the graph tab at the bottom): https://docs.google.com/spreadsheets/d/1fUvIJQPaAbsUv54RGgNd-hWmJY996qvocaw-ylnUDG0/edit?usp=sharing
That is indeed a very useful graph, which should definitely be near the front of your docs. At a glance one can tell whether one should be interested in your algorithm or not.
Three items. First, I'd put everything in CPU cycles instead of seconds. This removes, to some extent, the effect of fast CPUs (or indeed of Intel CPUs) and lets the reader more quickly decide whether this algorithm is worth using. I would assume you'll be mostly memory bound anyway.
I'd like to clarify one point: for all these graphs, I changed min_sort_size from 3000 to 2 in include/boost/sort/detail/constants.hpp, so that the algorithm won't just fall back to std::sort when the list is too small to sort efficiently using Spreadsort. If I had used a min_sort_size of 1000 (or 3000), the left side of the graph would just be flat, as it would fall back to std::sort.

I'm not sure how you count CPU cycles. CLOCKS_PER_SEC on my system is simply 1 million. I could assume some particular clock speed, but as cache misses probably dominate, the CPU speed is probably less important than bus latency and cache sizes. I think stating time in CPU seconds, while also stating the CPU, enables relative comparisons.

I've checked the code into the develop branch in these files:
example/parallelint.cpp
example/parallelstring.cpp

And added graphs as you suggested (except for cycle counts) in the "Graphs" tab of this document:
https://docs.google.com/spreadsheets/d/1fUvIJQPaAbsUv54RGgNd-hWmJY996qvocaw-...

I built them like this:
b2 --tune linkflags="-lboost_system -lboost_thread" parallelstring

and ran them like this:
./randomgen 16 16 1024
./parallelstring 1 100000 -std;./parallelstring 1 100000;./parallelstring 4 100000 -std;./parallelstring 4 100000

The two 16s mean use a 16-bit range on both the top and the bottom of the distribution (in other words, evenly distributed random 32-bit values). The 1024 means write 1024 32-bit values to disk (4096 bytes). The 100000 means loop 100000 times, to reduce the relative impact of thread overhead.

The integer sorting is over 32-bit ints. The string sorting uses the same random data, but reads it in as strings. This has the interesting effect of creating random-length random strings with a mean length of 43 bytes. The only reasonable way to put both on the same graph is to show the number of bytes sorted, as opposed to the number of array elements.
I was quite surprised to see that while the crossover point for integer_sort is around 1000 elements (give or take), the crossover point for string_sort is around 30 elements. I can create a separate, lower min_sort_size for string_sort, though I think a better value for the threshold would be around (1 << (8*sizeof(unsigned_char_type))), which is what the algorithm uses internally and works out to 256 for conventional char strings; creating more bins (one per possible value) than there are items to sort rapidly becomes inefficient. Does anyone have a preference on this subject?
The second thing is that I would separate the 4-threaded from the single-threaded graphs, as too much information confuses. You could use the opportunity to put string sort on the same graph as int sort for the single-threaded case.
done
And thirdly, I'd use more than 10,000 elements, as I understand that's where your algorithm really pays off. You can then say "at > 100,000 elements my algorithm is X.Y times the performance of std::sort".
Actually, the algorithm does best at intervals of 2^(C*max_splits + log_mean_bin_size) elements, which works out to 2^13 and 2^24, and in cases where it can use bucketsort because the range is limited. Since the data in this case is evenly distributed random 32-bit values, it does best at about 2^13, and would do a little better again at about 2^24 if I extended the graph that far. In other words, the vast majority of the speedup this algorithm provides for data that can't be bucketsorted is achieved by n = 10,000.
I think for me at least, Steven, my only remaining problem is the library name. Boost.Sort implies a library with at least three useful sorting algorithms not in the STL. Either add those, or rename the library. I'd much prefer the former.
What are people's preferences with regard to which algorithms to offer? Here are the main possibilities I'm willing to consider:

1) In-place Spreadsort, as already exists in the library.
2) Copy-based (2x RAM) Spreadsort, which may be faster in general, and should be faster for mostly sorted data. This would also provide a stable sort option at a modest cost in speed.
3) LSD radix sort (2x RAM). It's better in some cases.
4) boost::thread or OpenMP based parallel implementations of any included algorithm.

My preference is just #1. Doing #4 right is probably quite a bit of work, and I'm not sure how generally useful it would be, as there are many ways to run code in parallel.