I forget to mention the Windows version of the benchmark. I supouse, they
results will be similar to the obtained in Linux. I will prepare, too
With all the information, we can decide the best solution if exist.
Thanks
Francisco Tapia
2014-12-22 22:56 GMT+01:00 Francisco José Tapia
Hi Steven,
I had been working about the memory used by the different algorithms, specially the stable algorithms. I developed a low memory algorithm circular_stable_sort, which use only a 10% more of memory, and the speed decrease around 15% with small objects and with big objects is similar to the merge_sort
I was looking too, the TimSort algorithm. I found a C++ implementation in https://github.com/gfx/cpp-TimSort .
I inserted this algorithm in the benchmark programs. With small objects, the speed is slower than the stable sort and merge_sort, and have a good performance with big objects, greater than 64 bits. It's logic the decision of Java and Python, because they must order objects, usually non contiguous and bigger than the naked number.
For the measure of the memory used by the program I use the command /usr/bin/time -v program_to_execute. I take the benchmark_algorithm_04 (100000000 uint64_t numbers), and comment all the invocations to the algorithms , except one, compile, and run in a terminal with the command /usr/bin/time -v benchmark_sort_04.
I repeat this process with all the algorithms for to know the memory used by each algorithm.
Algorithm
Memory used
std::sort
1 565 456
countertree::intro_sort
1 565 280
GCC parallel::sort
2 346 868
tbb::parallel_sort
1 565 696
countertree::parallel_sort
1 565 468
std::stable_sort
1 957 232
countertree::merge_sort
1 955 908
countertree::circular_merge_sort
1 696 316
sfx::timsort
1 996 364
parallel::stable_sort
2 742 384
countertree::parallel_merge_sort
1 956 636
I checked the time with benchmark_sort_05, with copy all the structure in the operation = and in the copy constructor. The number of elements in each sort is 800000000/size of the element in bytes
8
bytes
16
bytes
32
bytes
48
bytes
64
bytes
128
bytes
256
bytes
std::stable_sort
16.99
10.36
9.75
9.62
8.85
8.89
10.25
countertree::merge_sort
16.96
10.38
8.13
7.44
7.10
6.67
9.01
countertree::circular_merge_sort
19.77
12.80
8.81
7.61
7.01
6.48
8.62
timsort
22.89
13.43
10.12
8.33
7.70
7.00
8.62
parallel::stable_sort
9.68
8.91
8.90
8.90
8.23
8.21
7.72
countertree::parallel_merge_sort
7.04
5.96
5.74
5.70
5.64
5.54
5.94
About the stable sort of tbb, I see, but I didn't have time to examine and check in deep. But be quiet. The benchmark_sort indirect shows
Size of element 48 Number of elements :16666666
parallel_sort :2.53964 sec
indirect_parallel_sort :5.70979 sec
parallel_stable_sort :5.73186 sec
indirect_parallel_stable_sort :5.59269 sec
Size of element 64 Number of elements :12500000
parallel_sort :2.59812 sec
indirect_parallel_sort :4.1553 sec
parallel_stable_sort :5.63771 sec
indirect_parallel_stable_sort :3.95266 sec
Size of element 128 Number of elements :6250000
parallel_sort :2.28648 sec
indirect_parallel_sort :2.34792 sec
parallel_stable_sort :5.62879 sec
indirect_parallel_stable_sort :2.30188 sec
Size of element 256 Number of elements :3125000
parallel_sort :2.05017 sec
indirect_parallel_sort :1.50661 sec
parallel_stable_sort :5.85409 sec
indirect_parallel_stable_sort :1.48561 sec
As you can see with a 256 bytes of size the indirect parallel_stable_sort is 3 times faster. And the indirect_sort is a 25% faster.
The final solution, if at end is taken, will be a hybrid algorithm depending of the size of the elements
I propose you the next. Let me 2 or 3 weeks and I will prepare a big benchmark with all the algorithms, including the indirect version of the stable and unstable sort. We will run the benchmarks over different machines, and according the results, take a decision about if we propose something to Boost, and which algorithm must be used in the hybrid version.
If, at end, we decide don't propose nothing, I only can say this had been a nice and exciting experience
Mi intention is not to impose my code, I want provide to the final users the best solution, if exist. With my code or with the code of other. I am the first to clap the winner.
We are doing this because this parallel implementation don't need any other library, only the standard. There are compilers without OpenMP or TBB support, or simply, the user don't want to load them for to make a sort.
I have in mind, several improvements of the algorithms. I will codify and test, and if run well, I include in the algorithms. I want, too, simplify the interface of indirect_sort, for to be used in a easy way by any sort algorithm. If you are interested, I will prepare too a brief document with the explanation and the internal algorithm.
At end, only say Happy Christmas, and New Year to the Boost community.
Yours
Francisco Tapia
fjtapia@gmail.com