Re: [boost] [SORT] Parallel Algorithms

22 Dec 2014

      I forget  to mention the Windows version of the benchmark. I supouse, they
results will be similar to the obtained in Linux. I will prepare, too

With all the information, we can decide the best solution if exist.

Thanks

Francisco Tapia

2014-12-22 22:56 GMT+01:00 Francisco José Tapia <fjtapia@gmail.com>:
...
Hi Steven,
I had been working about the memory used by the different algorithms,
specially the stable algorithms. I developed a low memory algorithm
circular_stable_sort, which use only a 10% more of memory, and the speed
decrease around 15% with small objects and with big objects is similar to
the merge_sort
I was looking too, the TimSort algorithm. I found a C++ implementation
in https://github.com/gfx/cpp-TimSort .
I inserted this algorithm in the benchmark programs. With small objects,
the speed is slower than the stable sort and merge_sort, and have a good
performance with big objects, greater than 64 bits. It's logic the decision
of Java and Python, because they must order objects, usually non contiguous
and bigger than the naked number.
For the measure of the memory used by the program I use the command
/usr/bin/time -v program_to_execute. I take the benchmark_algorithm_04
(100000000 uint64_t numbers), and comment all the invocations to the
algorithms , except one, compile, and run in a terminal with the command
/usr/bin/time -v benchmark_sort_04.
I repeat this process with all the algorithms for to know the memory
used by each algorithm.
Algorithm
Memory used
std::sort
1 565 456
countertree::intro_sort
1 565 280
GCC parallel::sort
2 346 868
tbb::parallel_sort
1 565 696
countertree::parallel_sort
1 565 468
std::stable_sort
1 957 232
countertree::merge_sort
1 955 908
countertree::circular_merge_sort
1 696 316
sfx::timsort
1 996 364
parallel::stable_sort
2 742 384
countertree::parallel_merge_sort
1 956 636
I checked the time with benchmark_sort_05, with copy all the structure
in the operation = and in the copy constructor. The number of elements in
each sort is 800000000/size of the element in bytes
8
bytes
16
bytes
32
bytes
48
bytes
64
bytes
128
bytes
256
bytes
std::stable_sort
16.99
10.36
9.75
9.62
8.85
8.89
10.25
countertree::merge_sort
16.96
10.38
8.13
7.44
7.10
6.67
9.01
countertree::circular_merge_sort
19.77
12.80
8.81
7.61
7.01
6.48
8.62
timsort
22.89
13.43
10.12
8.33
7.70
7.00
8.62
parallel::stable_sort
9.68
8.91
8.90
8.90
8.23
8.21
7.72
countertree::parallel_merge_sort
7.04
5.96
5.74
5.70
5.64
5.54
5.94
About the stable sort of tbb, I see, but I didn't have time to examine
and check in deep. But be quiet. The benchmark_sort indirect shows
Size of element 48 Number of elements :16666666
parallel_sort :2.53964 sec
indirect_parallel_sort :5.70979 sec
parallel_stable_sort :5.73186 sec
indirect_parallel_stable_sort :5.59269 sec
Size of element 64 Number of elements :12500000
parallel_sort :2.59812 sec
indirect_parallel_sort :4.1553 sec
parallel_stable_sort :5.63771 sec
indirect_parallel_stable_sort :3.95266 sec
Size of element 128 Number of elements :6250000
parallel_sort :2.28648 sec
indirect_parallel_sort :2.34792 sec
parallel_stable_sort :5.62879 sec
indirect_parallel_stable_sort :2.30188 sec
Size of element 256 Number of elements :3125000
parallel_sort :2.05017 sec
indirect_parallel_sort :1.50661 sec
parallel_stable_sort :5.85409 sec
indirect_parallel_stable_sort :1.48561 sec
As you can see with a 256 bytes of size the indirect
parallel_stable_sort is 3 times faster. And the indirect_sort is a 25%
faster.
The final solution, if at end is taken, will be a hybrid algorithm
depending of the size of the elements
I propose you the next. Let me 2 or 3 weeks and I will prepare a big
benchmark with all the algorithms, including the indirect version of the
stable and unstable sort. We will run the benchmarks over different
machines, and according the results, take a decision about if we propose
something to Boost, and which algorithm must be used in the hybrid version.
If, at end, we decide don't propose nothing, I only can say this had
been a nice and exciting experience
Mi intention is not to impose my code, I want provide to the final users
the best solution, if exist. With my code or with the code of other. I am
the first to clap the winner.
We are doing this because this parallel implementation don't need any
other library, only the standard. There are compilers without OpenMP or TBB
support, or simply, the user don't want to load them for to make a sort.
I have in mind, several improvements of the algorithms. I will codify
and test, and if run well, I include in the algorithms. I want, too,
simplify the interface of indirect_sort, for to be used in a easy way by
any sort algorithm. If you are interested, I will prepare too a brief
document with the explanation and the internal algorithm.
At end, only say Happy Christmas, and New Year to the Boost community.
Yours
Francisco Tapia
fjtapia@gmail.com