Niall,
On Thu Nov 13 2014 at 6:44:50 AM, Niall Douglas wrote:
On 13 Nov 2014 at 5:50, Steven Ross wrote:
For your situation you just sort, so a combined graph showing a comparison to other algorithms makes much more sense.
I've created an integer_sort graph here (click the graph tab at the bottom): https://docs.google.com/spreadsheets/d/1fUvIJQPaAbsUv54RGgNd-hWmJY996qvocaw-ylnUDG0/edit?usp=sharing
That is indeed a very useful graph, which should definitely be near the front of your docs. At a glance one can tell whether one should be interested in your algorithm or not.
Three items. First, I'd put everything in CPU cycles instead of seconds. This removes, to some extent, the effect of fast CPUs (or indeed of Intel CPUs) and lets the reader more quickly decide whether this algorithm is worth using. I would assume you'll be mostly memory bound anyway.
I'd like to clarify one point: for all these graphs, I changed min_sort_size from 3000 to 2 in include/boost/sort/detail/constants.hpp, so that the algorithm won't just fall back to std::sort when the list is too small to sort efficiently using Spreadsort. If I had used a min_sort_size of 1000 (or 3000), the left side of the graph would just be flat, as it would fall back to std::sort.

I'm not sure how you count CPU cycles. CLOCKS_PER_SEC on my system is simply 1 million. I could assume some particular clock speed, but as cache misses probably dominate, the CPU speed is probably less important than bus latency and cache sizes. I think stating time in CPU seconds, while also stating the CPU, enables relative comparisons.

I've checked the code into the develop branch in these files:
example/parallelint.cpp
example/parallelstring.cpp

And added graphs as you suggested (except for cycle counts) in the "Graphs" tab of this document:
https://docs.google.com/spreadsheets/d/1fUvIJQPaAbsUv54RGgNd-hWmJY996qvocaw-...

I built them like this:
b2 --tune linkflags="-lboost_system -lboost_thread" parallelstring

and ran them like this:
./randomgen 16 16 1024
./parallelstring 1 100000 -std;./parallelstring 1 100000;./parallelstring 4 100000 -std;./parallelstring 4 100000

The two 16s mean use a 16-bit range on both the top and the bottom of the distribution (in other words, evenly distributed random 32-bit values). The 1024 means write 1024 32-bit values to disk (4096 bytes). The 100000 means loop 100000 times, to reduce the relative impact of thread overhead.

The integer sorting is over 32-bit ints. The string sorting uses the same random data, but reads it in as strings. This has the interesting effect of creating random-length random strings with a mean length of 43 bytes. The only reasonable way to put both on the same graph is to show the number of bytes sorted, as opposed to the number of array elements.
I was quite surprised to see that while the crossover point for integer_sort is around 1000 elements (give or take), the crossover point for string_sort is around 30 elements. I can create a separate, lower min_sort_size for string_sort, though I think a better value for the threshold would be around (1 << (8*sizeof(unsigned_char_type))), which is what the algorithm uses internally and works out to 256 for conventional char strings; creating more bins (one per possible value) than there are items to sort rapidly becomes inefficient. Does anyone have a preference on this subject?
The second thing is that I would separate the 4-threaded from the single-threaded graphs, as too much information confuses. You could use the opportunity to put string sort on the same graph as int sort for the single-threaded case.
done
And thirdly, I'd use more than 10,000 elements, as I understand that's where your algorithm really pays off. You can then say "at > 100,000 elements my algorithm is X.Y times the performance of std::sort".
Actually, the algorithm does best at intervals of 2^(C*max_splits + log_mean_bin_size) elements, which works out to 2^13 and 2^24, and in cases where it can use bucketsort because the range is limited. Since the data in this case is evenly distributed random 32-bit values, it does best at about 2^13, and would do a little better again at about 2^24 if I extended the graph that far. In other words, the vast majority of the speedup this algorithm provides for data that can't be bucketsorted is achieved by n = 10,000.
I think for me at least, Steven, my only remaining problem is the library name. Boost.Sort implies a library with at least three useful sorting algorithms not in the STL. Either add those, or rename the library. I'd much prefer the former.
What are people's preferences with regard to which algorithms to offer? Here are the main possibilities I'm willing to consider:

1) In-place Spreadsort, as already exists in the library.
2) Copy-based (2x RAM) Spreadsort, which may be faster in general, and should be faster for mostly sorted data. This would also provide a stable sort option at a modest cost in speed.
3) LSD radix sort (2x RAM). It's better in some cases.
4) boost::thread or OpenMP based parallel implementations of any included algorithm.

My preference is just #1. Doing #4 right is probably quite a bit of work, and I'm not sure how generally useful it would be, as there are many ways to run code in parallel.