Re: [boost] [SORT] Parallel Algorithms

23 Mar 2015

      On Fri, Mar 20, 2015 at 9:12 AM Francisco José Tapia <fjtapia@gmail.com>
wrote:
...
I have the final version of all the algorithms. I wrote 6 algorithms of
stable sort, and selected the fastest. On my computer is about 15% faster
than the GCC stable sort.
Could you send them please?  I'd like to verify the speedup you're seeing,
and compare vs TBB.  I'm ok if it just is comparable in speed to TBB and it
has additional advantages of some type.
...
I had problems too with an error in an algorithm mixed with an error in
the GCC dynamic memory system, Under certain sequence of request, the
system provide a non valid address and due this a segmentation fault. The
debug for to detect and correct had been a nightmare.
One thing to watch out for is undefined operations, such as type
conversions between int and float types; those aren't compiler bugs:
they're undefined (this problem hit me with float_sort, before someone
showed me how to convert safely).  I hope you've fixed the bug.
...
The algorithms provided are :
1.- sort : The algorithm is intro_sort. Don't need additional memory.
Performance similar to the GCC version and Windows version. Non stable
...
2.- parallel_sort : Decompose the global range in small ranges for to be
sorted with intro_sort. Don't need additional memory. Similar performance
than the TBB version and Microsoft version. Non Stable
Does it provide any advantage over those implementations, aside from being
open-source?
...
3.- stable_sort : The algorithm is called smart_merge_sort, and is about
10-15% faster than the GCC stable sort version, The additional memory is an
half of the memory used by the data.( The GCC stable sort use the same
memory than the data)
GCC uses no additional memory, or 100% additional memory?
Did you compare against the tbb stable sort I pointed you to?
...
4.- sample_sort : is a parallel stable sort. Have the same performance than
the TBB implementations, and much more faster than the GCC parallel
stable_sort. The additional memory is the same than the data
5.- parallel_stable_sort : This algorithm is based on the sample sort.
Increase the sample_sort time about a 5%, but the additional memory used is
an half of the memory used by the data.
That's interesting, so it's better from a memory perspective than the
competing algorithms, and not bad in speed?  That sounds like it's worth
serious consideration as an addition.
...
The functions are in the namespàce boost::sort All the algorithms are
exception safe, and use indirect sort when the size of the objects is big.
According to my benchmarks, the size limit for to change a indirect sort is
128 bytes.
The function format of the parallel algorithms are
algorithm ( first, last, compare , number_threads).
The number of threads is passed with an struct NThreads which receives an
unsigned, and return an unsigned. The default construct of this struct is
filled with the HW threads
At today I am testing on Windows with Visual Studio 2013 CTP 4, and
changing some things of the code due to the incompatibility with C++11
features
In a few days I will document all the functions of the code for generate a
doxygen documentation, and can send you a version of the code.
Then we must talk about how to design the benchmark. I think the starting
point can be your experience with your code
Please look at this code, and refactor/reuse what test code you can:
https://github.com/boostorg/sort
Specifically, look in the test directory, the example directory, and look
in tune.pl.  Are there any parameters you would want to tune for specific
systems (besides the number of threads)?
The basic idea is that randomgen writes out a randomized distribution to
disk, and then we run whatever algorithms we want to compare (writing their
results to disk), verify that the results are correct, then compare speed.
I'd prefer not to skip the disk intermediate step because having the files
on disk makes it easier to debug when something gives incorrect results.
We should compare with variable-length strings and ints, in addition to
more complicated data structures like what you've been testing with.
It's worth noting that it's quite common to use vectors for large
quantities of data in a class/struct, in which case the sorting would
already be indirect.
...
The C++ committee propose a format for to call the parallel functions
described in
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3554.pdf
it would be a good idea to examine, and decide if follow the proposal, if
we don't follow or if we provide the two interfaces for the parallel
algorithms. According with my first impression it's not very complex to
implement.
With your implementations, it looks like people have these choices:
1) how many threads to use
2) stable or not
3) memory or speed efficient
4) direct or indirect

Note that #4 may be influenced by #3, and that you may be able to determine
that automatically, but providing people a simple way to make those choices
is a good idea.  You could use the approach suggested by the committee if
it fits well, or another one if you have a better approach to this specific
problem.

Steve

Re: [boost] [SORT] Parallel Algorithms

Steven Ross