Francisco,
On Tue, Dec 16, 2014 at 1:42:59 PM, Francisco José Tapia wrote:
Hi Steven,
I thought the whole program had been checked, but obviously not. Now in the folder you have a shell script with the command to compile and link with optimization (for example, make_benchmark_sort_03.sh), and a make_all.sh which compiles everything. Open a terminal in the folder, then compile and run.
Great! Thanks. I ran everything, and the only failure was with benchmark_sort_indirect:

./benchmark_sort_indirect
Size of element 8
Number of elements :100000000
terminate called after throwing an instance of 'std::system_error'
  what():  Enable multithreading to use std::thread: Operation not permitted
Aborted (core dumped)

That error is usually the sign of a GCC binary that uses std::thread but was not linked with -pthread, so adding -pthread to that script's compile and link command may be all it needs.
As you suggested, I changed the number generator and now use a Mersenne Twister from the standard library.
Thanks
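For anyone following along, I'd expect the new generator to look roughly like the sketch below (make_random_data and the seed are my own illustrative names, not taken from the actual file):

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Illustrative only: fill a vector with pseudo-random 64-bit keys using
// the standard-library Mersenne Twister (std::mt19937_64).
std::vector<std::uint64_t> make_random_data(std::size_t count, std::uint64_t seed)
{
    std::mt19937_64 gen(seed);
    std::uniform_int_distribution<std::uint64_t> dist;
    std::vector<std::uint64_t> data(count);
    for (std::size_t i = 0; i < count; ++i)
        data[i] = dist(gen);
    return data;
}

int main()
{
    std::vector<std::uint64_t> data = make_random_data(10, 12345);
    for (std::size_t i = 0; i < data.size(); ++i)
        std::cout << data[i] << "\n";
}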
The stable sort is used less than the non-stable sort, but sometimes it is important. I developed it because my project has a proxy which receives operations on elements identified by a key. The proxy sorts the operations and sends them to the corresponding thread, and it is not the same to read the value of a key and then delete that key as to delete the key and then try to read the value. I can't add a timestamp to know which operation arrived first.
Fair enough.
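To spell the point out for other readers: two operations on the same key compare equal, so only a stable sort is guaranteed to preserve their arrival order. A minimal sketch (Operation and its fields are hypothetical, not the actual proxy code):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical operation record: sorted by key only, so a stable sort
// must preserve the arrival order of operations on the same key.
struct Operation {
    int         key;
    std::string action; // e.g. "read" or "delete"
};

int main()
{
    // Arrival order for key 7: read first, then delete.
    std::vector<Operation> ops = {
        {9, "read"}, {7, "read"}, {7, "delete"}, {3, "read"}
    };

    // A stable sort keeps {7, "read"} before {7, "delete"};
    // an unstable sort is free to swap them, changing the meaning.
    std::stable_sort(ops.begin(), ops.end(),
                     [](const Operation& a, const Operation& b)
                     { return a.key < b.key; });

    for (const auto& op : ops)
        std::cout << op.key << " " << op.action << "\n";
}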
As I said in the previous message, the stable sort needs 2.7 secs with 1 thread and 8.49 secs with 4 HW threads, around 3 times the time of 1 thread. And countertree::merge_sort needs 2.7 secs to sort with 1 thread and 4.89 secs with 4 HW threads (see the [Original test_stress_sort.cpp]).
This shows countertree::merge_sort is a good choice for parallel software, and GCC std::stable_sort is a bad one.
with all element sizes. But in the parallel process, when the size of the elements grows, the poor scalability appears. The next lines are from the original benchmark_sort_05.cpp on my computer:
With 128-byte elements
Random elements, few repeated -----------------------
STL std::sort :2.9188 secs
countertree::intro_sort :2.87864 secs
parallel::sort :2.46113 secs
tbb::parallel_sort :1.86458 secs
countertree::parallel_sort:1.89263 secs
With 256-byte elements
Random elements, few repeated ( rand() )-----------------------
STL std::sort :3.03143 secs
countertree::intro_sort :3.02733 secs
parallel::sort :2.23231 secs
tbb::parallel_sort :1.76186 secs
countertree::parallel_sort:1.56509 secs
As we can see, parallel processing is ineffective with GCC parallel::sort on big objects.
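If I'm reading the benchmark right, the elements are presumably shaped something like the sketch below (my guess, not the actual code of benchmark_sort_05.cpp): comparison touches only a small key, so growing the element mostly grows the cost of moving data, which is where parallel scalability suffers.

#include <cstdint>

// Guess at the benchmark element: compared on a small key, so larger
// sizes increase the bytes moved per swap while the comparison cost
// stays flat.
template <unsigned Size>
struct element {
    std::uint64_t key;
    char          filler[Size - sizeof(std::uint64_t)];

    bool operator<(const element& other) const { return key < other.key; }
};

// e.g. element<128> and element<256> would match the labels above.
static_assert(sizeof(element<256>) == 256, "expected flat layout");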
Those don't match up with my results for 256 byte elements:

STL std::sort :1.04718 secs
countertree::intro_sort :1.14214 secs
parallel::sort :0.679178 secs
tbb::parallel_sort :0.600001 secs
countertree::parallel_sort:0.627144 secs

And this is what I see for size 64:

STL std::sort :1.6577 secs
countertree::intro_sort :1.73023 secs
parallel::sort :0.963199 secs
tbb::parallel_sort :0.873805 secs
countertree::parallel_sort:0.997228 secs

I'm not sure the benefits you are seeing are portable. The numbers I see are not as extreme, but your merge_sort looks good in comparison, especially with large elements. I will investigate alternatives to see if this is a truly speed-competitive offering. Have you considered Timsort (standard in Java and Python)? Timsort is a stable sort that is particularly good with sorted, mostly-sorted, and reverse-sorted data. It might be worth offering as an alternative stable sort for those concerned about mostly-sorted data. Your intro_sort looks reasonable overall, but it varies from comparable speed to 14% slower when run in parallel on random data, and is an order of magnitude slower on already-sorted data. The GCC std::sort and countertree::intro_sort have similar performance. As tbb is free, I just don't see a compelling case for your introsort implementation; tbb is a very hard competitor to beat.
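For reproducibility, the already-sorted test I mean is nothing fancier than this kind of harness (a sketch; the interesting part is substituting each implementation at the marked line, and you may want a smaller count if memory is tight):

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    // Already-sorted input: the case where run-detecting sorts (Timsort)
    // finish in roughly one pass while a plain introsort does its full
    // O(n log n) work anyway.
    std::vector<std::uint64_t> data(100000000);
    std::iota(data.begin(), data.end(), std::uint64_t(0));

    auto start = std::chrono::steady_clock::now();
    std::sort(data.begin(), data.end()); // substitute the sort under test
    auto stop = std::chrono::steady_clock::now();

    std::cout << std::chrono::duration<double>(stop - start).count()
              << " secs\n";
}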
About the merge_sort algorithm: it uses additional memory equal to half the memory used by the data. But if you don't use additional memory, the speed drops. I didn't examine in depth the algorithm used in GCC; I only coded my own algorithm. I will examine it, and I will try to reduce the memory used without losing speed and scalability.
The standard says the algorithm will lose speed when it reduces the memory used; lower memory use would be a nice feature to have, but if we can repeatably see 2X relative speedups without that feature, I think some people may find it useful.
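For concreteness, the half-memory approach I assume is behind that n/2 figure is the classic one sketched below (my own sketch of the standard technique, not Francisco's code): copy the sorted first half to a buffer, then merge it with the sorted second half back into the original array.

#include <cstddef>
#include <iostream>
#include <vector>

// Stable merge using an auxiliary buffer of only n/2 elements: copy the
// (sorted) first half out, then merge it with the (sorted) second half
// back into the original range. The write position can never overtake
// the unread right-half elements.
template <typename T>
void half_buffer_merge(std::vector<T>& v)
{
    std::size_t mid = v.size() / 2;
    // Both halves are assumed already sorted (e.g. by recursive calls).
    std::vector<T> buffer(v.begin(), v.begin() + mid); // n/2 extra memory

    std::size_t i = 0, j = mid, out = 0;
    while (i < buffer.size() && j < v.size()) {
        // <= keeps the merge stable: ties go to the left half.
        if (buffer[i] <= v[j]) v[out++] = buffer[i++];
        else                   v[out++] = v[j++];
    }
    while (i < buffer.size()) v[out++] = buffer[i++];
    // Any remaining right-half elements are already in place.
}

int main()
{
    // Halves {1, 4, 7} and {2, 3, 9} are each sorted.
    std::vector<int> v = {1, 4, 7, 2, 3, 9};
    half_buffer_merge(v);
    for (std::size_t i = 0; i < v.size(); ++i)
        std::cout << v[i] << " "; // prints: 1 2 3 4 7 9
    std::cout << "\n";
}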
About a server to test on: I don't have modern servers in my school; the budget cuts eliminated the new servers. I have an i7 Xeon, but it is a 3-year-old machine with 4 HW threads, and in the benchmarks it is slower than the i3 and i5 machines assembled this year. If someone can check and provide us the results, I will be grateful. Version 4.9.1 of GCC has better optimization than 4.8.2. If you are using Ubuntu, version 14.10 has 4.9.1 as the default.
A Boost library should still be useful for people at least one compiler version back; I feel fully justified testing on 4.8.2. Have you tested on Windows yet?