On Fri, Mar 20, 2015 at 9:12 AM Francisco José Tapia
I have the final version of all the algorithms. I wrote 6 algorithms of stable sort, and selected the fastest. On my computer is about 15% faster than the GCC stable sort.
Could you send them please? I'd like to verify the speedup you're seeing, and compare vs TBB. I'm ok if it just is comparable in speed to TBB and it has additional advantages of some type.
I had problems too with an error in an algorithm mixed with an error in the GCC dynamic memory system, Under certain sequence of request, the system provide a non valid address and due this a segmentation fault. The debug for to detect and correct had been a nightmare.
One thing to watch out for is undefined operations, such as type conversions between int and float types; those aren't compiler bugs: they're undefined (this problem hit me with float_sort, before someone showed me how to convert safely). I hope you've fixed the bug.
The algorithms provided are :
1.- sort : The algorithm is intro_sort. Don't need additional memory. Performance similar to the GCC version and Windows version. Non stable
2.- parallel_sort : Decompose the global range in small ranges for to be sorted with intro_sort. Don't need additional memory. Similar performance than the TBB version and Microsoft version. Non Stable
Does it provide any advantage over those implementations, aside from being open-source?
3.- stable_sort : The algorithm is called smart_merge_sort, and is about 10-15% faster than the GCC stable sort version, The additional memory is an half of the memory used by the data.( The GCC stable sort use the same memory than the data)
GCC uses no additional memory, or 100% additional memory? Did you compare against the tbb stable sort I pointed you to?
4.- sample_sort : is a parallel stable sort. Have the same performance than the TBB implementations, and much more faster than the GCC parallel stable_sort. The additional memory is the same than the data
5.- parallel_stable_sort : This algorithm is based on the sample sort. Increase the sample_sort time about a 5%, but the additional memory used is an half of the memory used by the data.
That's interesting, so it's better from a memory perspective than the competing algorithms, and not bad in speed? That sounds like it's worth serious consideration as an addition.
The functions are in the namespĂ ce boost::sort All the algorithms are exception safe, and use indirect sort when the size of the objects is big. According to my benchmarks, the size limit for to change a indirect sort is 128 bytes.
The function format of the parallel algorithms are
algorithm ( first, last, compare , number_threads).
The number of threads is passed with an struct NThreads which receives an unsigned, and return an unsigned. The default construct of this struct is filled with the HW threads
At today I am testing on Windows with Visual Studio 2013 CTP 4, and changing some things of the code due to the incompatibility with C++11 features
In a few days I will document all the functions of the code for generate a doxygen documentation, and can send you a version of the code.
Then we must talk about how to design the benchmark. I think the starting point can be your experience with your code
Please look at this code, and refactor/reuse what test code you can: https://github.com/boostorg/sort Specifically, look in the test directory, the example directory, and look in tune.pl. Are there any parameters you would want to tune for specific systems (besides the number of threads)? The basic idea is that randomgen writes out a randomized distribution to disk, and then we run whatever algorithms we want to compare (writing their results to disk), verify that the results are correct, then compare speed. I'd prefer not to skip the disk intermediate step because having the files on disk makes it easier to debug when something gives incorrect results. We should compare with variable-length strings and ints, in addition to more complicated data structures like what you've been testing with. It's worth noting that it's quite common to use vectors for large quantities of data in a class/struct, in which case the sorting would already be indirect.
The C++ committee propose a format for to call the parallel functions described in
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3554.pdf
it would be a good idea to examine, and decide if follow the proposal, if we don't follow or if we provide the two interfaces for the parallel algorithms. According with my first impression it's not very complex to implement.
With your implementations, it looks like people have these choices: 1) how many threads to use 2) stable or not 3) memory or speed efficient 4) direct or indirect Note that #4 may be influenced by #3, and that you may be able to determine that automatically, but providing people a simple way to make those choices is a good idea. You could use the approach suggested by the committee if it fits well, or another one if you have a better approach to this specific problem. Steve