[SORT] Parallel Algorithms
Sorry for the delay. As promised, here are the parallel sorting algorithms.

On my computer (quad core) the GCC std::sort and std::stable_sort are slightly faster than my versions of intro_sort and merge_sort (stable_sort is a synonym of merge_sort). On another computer with an Intel i3, my algorithms are slightly faster than the GCC versions.

While trying to optimize the parallel implementation, I modified the parallel versions to use std::sort and std::stable_sort. The parallel_sort version responded well, but parallel_stable_sort took around 60% more time than the version built on countertree::parallel_merge_sort. After many tests, I discovered the cause of the problem. The demonstration is in test_stress_sort_01.cpp. The program fills a vector with 100 000 000 uint64_t elements and divides it into parts of equal size, one per HW core (on my computer, 4 parts and 4 cores). It sorts the parts sequentially, timing each one, and then sorts the same parts in parallel, one thread per part, and times that. This is done with both std::stable_sort and countertree::merge_sort, on the same input data. These are the results:

max. OpenMP threads: 4 (4 processors)
Sort of 100000000 elements in a vector
--------------------- Random elements, few repeated ( rand() )-----------------------
Stable Sort
Part 0 secs :2.77955
Part 1 secs :2.75015
Part 2 secs :2.76068
Part 3 secs :2.76284
Parallel Sort secs :8.49796
Countertree stable sort
Part 0 secs :2.72802
Part 1 secs :2.71614
Part 2 secs :2.72074
Part 3 secs :2.722
Parallel Sort secs :4.80692

Surprisingly, the parallel sort with countertree::stable_sort is much faster than the one with std::stable_sort from the GCC 4.8 and 4.9 compilers. This is the secret of the speed of countertree::parallel_merge_sort. It is very important, because when the elements are small, the GCC and countertree algorithms have similar performance.
But when the size of the elements grows, std::stable_sort becomes slower than countertree::merge_sort, and the difference widens in the parallel version. The next lines are part of benchmark_sort_05.cpp:

***************************************************************
Sort of N = 6250000 elements of size 128
***************************************************************
Random elements, few repeated ( rand() )-----------------------
STL std::sort                    :2.9188 secs
countertree::intro_sort          :2.87864 secs
parallel::sort                   :2.46113 secs
tbb::parallel_sort               :1.86458 secs
countertree::parallel_sort       :1.89263 secs
std::stable_sort                 :8.77963 secs
countertree::merge_sort          :5.70694 secs
parallel::stable_sort            :8.03946 secs
countertree::parallel_merge_sort :4.90815 secs

I have in mind a new version of the stable sort that should improve the speed by reducing the number of internal copies of elements. I hope to have time at Christmas, but I can't guarantee it. If the results are as expected, I will send it to you.

You also have a version of indirect_sort, which sorts pointers to the elements and at the end rearranges the elements according to the pointers. You can replace any sort algorithm with its indirect version and obtain the same result; only the time needed differs. In benchmark_indirect_sort.cpp you can see a comparison of the time needed by several sort methods and their indirect versions with elements of different sizes.

This code is a preliminary version, done for my countertree library. If you consider it useful, I will change the namespace and document the code, the algorithms, and the ideas obtained from the benchmarks. Please feel free to ask me any question, or to modify or change anything. The goal is to provide good sort methods to the users. In the zip file you have the code and a file with a description of each program. Yours
Francisco,
On Sun, Dec 14, 2014 at 1:03 PM, Francisco José Tapia
Sorry for the delay. As promised, here are the parallel sorting algorithms.
On my computer (quad core) the GCC std::sort and std::stable_sort are slightly faster than my versions of intro_sort and merge_sort (stable_sort is a synonym of merge_sort). On another computer with an Intel i3, my algorithms are slightly faster than the GCC versions.
An i3 is low-end hardware. I recommend testing on an i7 or faster system.
This code is a preliminary version, done for my countertree library. If you consider it useful, I will change the namespace and document the code, the algorithms, and the ideas obtained from the benchmarks.
Please feel free to ask me any question, or to modify or change anything. The goal is to provide good sort methods to the users.
In the zip file you have the code and a file with a description of each program.
First, thanks for sending this in a fairly easy-to-test format.
I suggest updating your docs to recommend single-line compilation (if
appropriate), which saves a step:
Sort/Modified/benchmark/GCC/util_sort$ g++ -std=c++11 -Wall
-fexceptions -fopenmp -fomit-frame-pointer -fexpensive-optimizations
-O3 -I../../../src benchmark_sort_03.cpp -fopenmp -s -lpthread -ltbb
-o benchmark_sort_03 -pthread
I was unable to run your indirect sort benchmark:
Sort/Modified/benchmark/GCC/util_sort$ g++ -std=c++11 -Wall
-fexceptions -fopenmp -fomit-frame-pointer -fexpensive-optimizations
-O3 -I../../../src -c benchmark_sort_indirect.cpp -o
benchmark_sort_indirect.o -pthread
Sort/Modified/benchmark/GCC/util_sort$ g++ -o benchmark_sort_indirect
benchmark_sort_indirect.o -pthread -fopenmp -s -lpthread -ltbb
Sort/Modified/benchmark/GCC/util_sort$ ./benchmark_sort_indirect
Size of element 8 Number of elements :100000000
terminate called after throwing an instance of 'std::system_error'
what(): Enable multithreading to use std::thread: Operation not permitted
Aborted (core dumped)
I'm actually quite interested in the generalized indirect sort idea (I
looked at your two functions). Do you think you could package that up
so it's a simple wrapper, where the user passes in the sort function
they want to use, and it takes care of changing the input and
comparison (if possible) to be indirect, running the sort, and then
swapping into place based on the results of the indirect sort? If
that can be done without significant loss of efficiency, it would seem
quite useful (I've had to do indirect sorts multiple times, and it's
always some code). Even your two functions look useful on their own.
I suggest using the boost random number generator (or if you stick
with the c++11 requirement, which I don't recommend, the std::
equivalent):
#include
Hi Steven,

I thought all the programs were checked, but obviously not. Now in the folder you have a shell script with the command to compile and link with optimization (for example make_benchmark_sort_03.sh), and a make_all.sh which compiles everything. Open a terminal in the folder, and compile and run.

All the programs use the available HW threads. They were written to run on a Linux machine with 4 GB of RAM. The number of elements involved in the sort is defined at the beginning of each program with #define NELEM; you can change this value and compile again.

As you suggested, I changed the number generator and now use a Mersenne Twister from the standard library.

The stable sort is used less than the non-stable sort, but sometimes it is important. I developed it because my project has a proxy which receives operations on elements identified by a key. The proxy sorts the operations and sends them to the corresponding thread, and reading the value of a key and then deleting that key is not the same as deleting the key and then trying to read the value. I can't add a time mark to know which operation arrived first.

To clarify the information obtained from the benchmarks: in the folder Modified, the GCC parallel::sort and parallel::stable_sort and the countertree::parallel_sort and countertree::parallel_merge_sort all use std::sort and std::stable_sort internally. The utility of this folder is the comparison with the same programs in the folder Original; the results show the problems described in this message. In the folder Original, the GCC parallel::sort and parallel::stable_sort use std::sort and std::stable_sort, while countertree::parallel_sort uses countertree::intro_sort and countertree::parallel_merge_sort uses countertree::merge_sort.

SCALABILITY: Some algorithms don't scale well when the number of threads grows; increasing the number of threads doesn't decrease the execution time.
This lack of scalability is emphasized when the size of the elements to sort grows. As I said in the previous message, std::stable_sort needs 2.7 secs with 1 thread and 8.49 secs with 4 HW threads, around 3 times the time of 1 thread, while countertree::merge_sort needs 2.7 secs with 1 thread and 4.89 secs with 4 HW threads (see the Original test_stress_sort.cpp). This shows that countertree::merge_sort is good for use in parallel software and GCC std::stable_sort is not.

If you look at the results of countertree::parallel_merge_sort in the benchmarks in the folder Modified, they are really bad compared with the same results in the folder Original. The difference is the internal algorithm: the Modified version uses GCC std::stable_sort, and the Original version uses countertree::merge_sort.

The GCC std::sort and countertree::intro_sort have similar performance at all element sizes, but in the parallel case the poor scalability appears when the size of the elements grows. The next lines are from the Original benchmark_sort_05.cpp on my computer.

With 128 bytes elements
Random elements, few repeated -----------------------
STL std::sort             :2.9188 secs
countertree::intro_sort   :2.87864 secs
parallel::sort            :2.46113 secs
tbb::parallel_sort        :1.86458 secs
countertree::parallel_sort:1.89263 secs

With 256 bytes elements
Random elements, few repeated ( rand() )-----------------------
STL std::sort             :3.03143 secs
countertree::intro_sort   :3.02733 secs
parallel::sort            :2.23231 secs
tbb::parallel_sort        :1.76186 secs
countertree::parallel_sort:1.56509 secs

As we can see, the parallel processing of GCC parallel::sort is ineffective with big objects.

About the merge_sort algorithm: it uses additional memory equal to half the size of the data. If you don't use the additional memory, the speed drops. I didn't examine in depth the algorithm used in GCC; I only wrote my own algorithm. I will examine it, and I will try to reduce the memory used without losing speed or scalability.
About a server for testing: I don't have modern servers at my school; the budget cuts eliminated the new servers. I have an i7 Xeon, but it is a 3-year-old machine with 4 HW threads, and in the benchmarks it is slower than the i3 and i5 machines bought this year. If someone can check and provide us the results, I will be grateful. Version 4.9.1 of GCC has better optimization than 4.8.2; if you are using Ubuntu, version 14.10 has 4.9.1 as the default compiler. Yours Francisco Tapia
Francisco,
On Tue Dec 16 2014 at 1:42:59 PM Francisco José Tapia
Hi Steven,
I thought all the programs were checked, but obviously not. Now in the folder you have a shell script with the command to compile and link with optimization (for example make_benchmark_sort_03.sh), and a make_all.sh which compiles everything. Open a terminal in the folder, and compile and run.
Great! Thanks. I ran everything, and the only failure was with benchmark_sort_indirect:
./benchmark_sort_indirect
Size of element 8 Number of elements :100000000
terminate called after throwing an instance of 'std::system_error'
what(): Enable multithreading to use std::thread: Operation not permitted
Aborted (core dumped)
As you suggested, I changed the number generator and now use a Mersenne Twister from the standard library.
Thanks
The stable sort is used less than the non-stable sort, but sometimes it is important. I developed it because my project has a proxy which receives operations on elements identified by a key. The proxy sorts the operations and sends them to the corresponding thread, and reading the value of a key and then deleting that key is not the same as deleting the key and then trying to read the value. I can't add a time mark to know which operation arrived first.
Fair enough.
As I said in the previous message, std::stable_sort needs 2.7 secs with 1 thread and 8.49 secs with 4 HW threads, around 3 times the time of 1 thread, while countertree::merge_sort needs 2.7 secs with 1 thread and 4.89 secs with 4 HW threads (see the Original test_stress_sort.cpp).
This shows that countertree::merge_sort is good for use in parallel software and GCC std::stable_sort is not.
The GCC std::sort and countertree::intro_sort have similar performance at all element sizes, but in the parallel case the poor scalability appears when the size of the elements grows. The next lines are from the Original benchmark_sort_05.cpp on my computer:
With 128 bytes elements
Random elements, few repeated -----------------------
STL std::sort :2.9188 secs
countertree::intro_sort :2.87864 secs
parallel::sort :2.46113 secs
tbb::parallel_sort :1.86458 secs
countertree::parallel_sort:1.89263 secs
With 256 bytes elements
Random elements, few repeated ( rand() )-----------------------
STL std::sort :3.03143 secs
countertree::intro_sort :3.02733 secs
parallel::sort :2.23231 secs
tbb::parallel_sort :1.76186 secs
countertree::parallel_sort:1.56509 secs
As we can see, the parallel processing of GCC parallel::sort is ineffective with big objects.
Those don't match up with my results for 256 byte elements:
STL std::sort             :1.04718 secs
countertree::intro_sort   :1.14214 secs
parallel::sort            :0.679178 secs
tbb::parallel_sort        :0.600001 secs
countertree::parallel_sort:0.627144 secs

And this is what I see for size 64:
STL std::sort             :1.6577 secs
countertree::intro_sort   :1.73023 secs
parallel::sort            :0.963199 secs
tbb::parallel_sort        :0.873805 secs
countertree::parallel_sort:0.997228 secs

I'm not sure the benefits you are seeing are portable. The numbers I see are not as extreme, but your merge_sort looks good in comparison, especially with large elements. I will investigate alternatives to see if this is a truly speed-competitive offering. Have you considered Timsort (standard in Java and Python)? Timsort is a stable sort that is particularly good with sorted, mostly-sorted, and reverse-sorted data. It might be worth offering as an alternative stable sort for those concerned about mostly-sorted data. Your intro_sort looks reasonable overall, but it varies from comparable speed to 14% slower when run in parallel on random data, and is an order of magnitude slower on already-sorted data. As tbb is free, I just don't see a compelling case for your introsort implementation; tbb is a very hard competitor to beat.
About the merge_sort algorithm: it uses additional memory equal to half the size of the data. If you don't use the additional memory, the speed drops. I didn't examine in depth the algorithm used in GCC; I only wrote my own algorithm. I will examine it, and I will try to reduce the memory used without losing speed or scalability.
The standard algorithm says it will lose speed when it reduces memory used; it'd be a nice feature to have, but if we can repeatably see 2X relative speedups without that feature, I think some people may find it useful.
About a server for testing: I don't have modern servers at my school; the budget cuts eliminated the new servers. I have an i7 Xeon, but it is a 3-year-old machine with 4 HW threads, and in the benchmarks it is slower than the i3 and i5 machines bought this year. If someone can check and provide us the results, I will be grateful. Version 4.9.1 of GCC has better optimization than 4.8.2; if you are using Ubuntu, version 14.10 has 4.9.1 as the default compiler.
A boost library should still be useful for people at least one compiler
version back; I feel fully justified testing on 4.8.2. Have you tested on Windows yet?
Francisco,
Have you looked at: https://software.intel.com/en-us/articles/a-parallel-stable-sort-using-c11-f...
For comparison?
Francisco,
On Thu, Dec 18, 2014 at 10:22 PM, Steven Ross
Francisco,
Have you looked at: https://software.intel.com/en-us/articles/a-parallel-stable-sort-using-c11-f...
For comparison?
I used the free tbb parallel stable sort referenced above for comparison, and found that as soon as I randomized the contents of the entire struct and made the copy constructor copy over the entire struct, the tbb version was 37% faster on randomized data all the way up to 256 bytes relative to countertree::parallel_merge_sort. I've attached my modified "original" directory where I tested this out (see: Original/benchmark/parallel_stable_sort/build/make_tbb_benchmark.sh). Unless you can get your library close to the speed of this tbb sort, I don't see how we'd be benefiting people by pointing to it instead of the tbb library.

What I am interested in is your idea for indirect sorting: can you come up with an easy-to-use API to handle efficient indirect sorting? That would probably be worth including in the boost::sort library, especially if it is compatible with different sort functions.
Hi Steven,

I have been working on the memory used by the different algorithms, especially the stable ones. I developed a low-memory algorithm, circular_stable_sort, which uses only about 10% more memory; the speed decreases around 15% with small objects, and with big objects it is similar to merge_sort.

I also looked at the TimSort algorithm. I found a C++ implementation at https://github.com/gfx/cpp-TimSort and inserted it in the benchmark programs. With small objects the speed is slower than the stable sort and merge_sort, but it performs well with big objects, greater than 64 bits. The decision of Java and Python is logical, because they must order objects, usually non-contiguous and bigger than a naked number.

To measure the memory used by a program I use the command /usr/bin/time -v program_to_execute. I take benchmark_sort_04 (100000000 uint64_t numbers), comment out all the invocations of the algorithms except one, compile, and run it in a terminal with the command /usr/bin/time -v benchmark_sort_04. I repeat this process with all the algorithms to know the memory used by each one.

Algorithm                            Memory used
std::sort                            1 565 456
countertree::intro_sort              1 565 280
GCC parallel::sort                   2 346 868
tbb::parallel_sort                   1 565 696
countertree::parallel_sort           1 565 468
std::stable_sort                     1 957 232
countertree::merge_sort              1 955 908
countertree::circular_merge_sort     1 696 316
sfx::timsort                         1 996 364
parallel::stable_sort                2 742 384
countertree::parallel_merge_sort     1 956 636

I checked the times with benchmark_sort_05, with the operator = and the copy constructor copying the whole structure.
The number of elements in each sort is 800000000 / size of the element in bytes.

                                   8 B    16 B   32 B   48 B   64 B   128 B  256 B
std::stable_sort                   16.99  10.36   9.75   9.62   8.85   8.89  10.25
countertree::merge_sort            16.96  10.38   8.13   7.44   7.10   6.67   9.01
countertree::circular_merge_sort   19.77  12.80   8.81   7.61   7.01   6.48   8.62
timsort                            22.89  13.43  10.12   8.33   7.70   7.00   8.62
parallel::stable_sort               9.68   8.91   8.90   8.90   8.23   8.21   7.72
countertree::parallel_merge_sort    7.04   5.96   5.74   5.70   5.64   5.54   5.94

About the stable sort of tbb: I saw it, but I didn't have time to examine and check it in depth. But don't worry. The benchmark_sort_indirect shows:

Size of element 48 Number of elements :16666666
parallel_sort                 :2.53964 sec
indirect_parallel_sort        :5.70979 sec
parallel_stable_sort          :5.73186 sec
indirect_parallel_stable_sort :5.59269 sec

Size of element 64 Number of elements :12500000
parallel_sort                 :2.59812 sec
indirect_parallel_sort        :4.1553 sec
parallel_stable_sort          :5.63771 sec
indirect_parallel_stable_sort :3.95266 sec

Size of element 128 Number of elements :6250000
parallel_sort                 :2.28648 sec
indirect_parallel_sort        :2.34792 sec
parallel_stable_sort          :5.62879 sec
indirect_parallel_stable_sort :2.30188 sec

Size of element 256 Number of elements :3125000
parallel_sort                 :2.05017 sec
indirect_parallel_sort        :1.50661 sec
parallel_stable_sort          :5.85409 sec
indirect_parallel_stable_sort :1.48561 sec

As you can see, with 256 bytes of size the indirect parallel_stable_sort is 3 times faster, and the indirect_sort is 25% faster. The final solution, if one is adopted, will be a hybrid algorithm depending on the size of the elements.

I propose the following: give me 2 or 3 weeks and I will prepare a big benchmark with all the algorithms, including the indirect versions of the stable and unstable sorts. We will run the benchmarks on different machines and, according to the results, take a decision about whether we propose something to Boost, and which algorithm must be used in the hybrid version.
If, in the end, we decide not to propose anything, I can only say this has been a nice and exciting experience. My intention is not to impose my code; I want to provide the final users the best solution, if it exists, whether with my code or with the code of others. I am the first to applaud the winner.

We are doing this because this parallel implementation doesn't need any other library, only the standard one. There are compilers without OpenMP or TBB support, and sometimes the user simply doesn't want to load them just to do a sort.

I have in mind several improvements to the algorithms. I will code and test them, and if they run well, I will include them in the algorithms. I also want to simplify the interface of indirect_sort, so it can be used in an easy way by any sort algorithm. If you are interested, I will also prepare a brief document with the explanation and the internal algorithm.

To finish, I only want to wish a Happy Christmas and New Year to the Boost community. Yours Francisco Tapia fjtapia@gmail.com
I forgot to mention the Windows version of the benchmark. I suppose the results will be similar to those obtained on Linux. I will prepare it, too.
With all the information, we can decide on the best solution, if one exists.
Thanks
Francisco Tapia
Francisco,
On Mon Dec 22 2014 at 5:04:17 PM Francisco José Tapia
I have been working on the memory used by the different algorithms, especially the stable ones. I developed a low-memory algorithm, circular_stable_sort, which uses only about 10% more memory; the speed decreases around 15% with small objects, and with big objects it is similar to merge_sort.
Some people may find that useful. What is the worst-case computational complexity? Is it a well-known algorithm? The bar is going to be higher (may require a full review) for previously unknown algorithms.
I also looked at the TimSort algorithm. I found a C++ implementation at https://github.com/gfx/cpp-TimSort .
I inserted this algorithm in the benchmark programs. With small objects the speed is slower than the stable sort and merge_sort, but it performs well with big objects, greater than 64 bits. The decision of Java and Python is logical, because they must order objects, usually non-contiguous and bigger than a naked number.
To measure the memory used by a program I use the command /usr/bin/time -v program_to_execute. I take benchmark_sort_04 (100000000 uint64_t numbers), comment out all the invocations of the algorithms except one, compile, and run it in a terminal with the command /usr/bin/time -v benchmark_sort_04.
I repeat this process with all the algorithms to know the memory used by each one.
I suggest changing the struct you use for benchmarking to this:

template <int NN>
struct reburcio {
    uint64_t M[NN];
    reburcio() { init(); }
    void init() {
        for (int i = 0; i < NN; ++i) M[i] = my_rand();
    }
    bool operator<(const reburcio &R) const {
        for (int i = 0; i < NN; ++i) {
            if (M[i] != R.M[i]) return M[i] < R.M[i];
        }
        return false;
    }
};
As you can see, with 256 bytes of size the indirect parallel_stable_sort is 3 times faster, and the indirect_sort is 25% faster.
Yes, indirect sort should eventually be faster, given a large enough element, especially if indirect sort is done with the fastest underlying algorithm available.
The final solution, if one is adopted, will be a hybrid algorithm depending on the size of the elements.
That sounds reasonable for fixed-size elements. What about variable-length elements, like strings? Will this only be for fixed-size elements?
I propose the following: give me 2 or 3 weeks and I will prepare a big benchmark with all the algorithms, including the indirect versions of the stable and unstable sorts. We will run the benchmarks on different machines and, according to the results, take a decision about whether we propose something to Boost, and which algorithm must be used in the hybrid version.
If, in the end, we decide not to propose anything, I can only say this has been a nice and exciting experience.
I think indirect sorting is promising.
My intention is not to impose my code; I want to provide the final users the best solution, if it exists, whether with my code or with the code of others. I am the first to applaud the winner.
We are doing this because this parallel implementation doesn't need any other library, only the standard one. There are compilers without OpenMP or TBB support, and sometimes the user simply doesn't want to load them just to do a sort.
Agreed, but OpenMP is very common (even Android supports it), so systems without OpenMP are a very small niche in terms of total systems out there. TBB is supported on intel-compatible processors, and the vast majority of deployed processors where parallel sorting makes sense (won't drain the battery really quickly) are intel-compatible. If we're going to provide an alternative, it should cost very little in terms of performance, and ideally provide some kind of benefit besides eliminating a dependency.
I have in mind several improvements to the algorithms. I will code and test them, and if they run well, I will include them in the algorithms. I also want to simplify the interface of indirect_sort, so it can be used in an easy way by any sort algorithm. If you are interested, I will also prepare a brief document with the explanation and the internal algorithm.
I'd like to see that document. Along with testing a struct that is an array of ints, you should probably test variable-length string sorting, as string sorting is very common.
Merry Christmas, Steve
We can also do a benchmark with variable-length elements, using strings. If you prepare the operations to perform in that test, I can prepare an additional benchmark with all the algorithms and your operations. After that, we can look at the results and take a decision.
Yours
Francisco
2014-12-23 4:34 GMT+01:00 Steven Ross
I suggest changing the struct you use for benchmarking to this:

    template <int NN>
    struct reburcio {
        uint64_t M[NN];

        reburcio () { init(); }

        void init() {
            for (int i = 0; i < NN; ++i) { M[i] = my_rand(); }
        }

        bool operator < ( const reburcio &R ) const {
            for (int i = 0; i < NN; ++i) {
                if (M[i] != R.M[i]) { return M[i] < R.M[i]; }
            }
            return false;
        }
    };
That way it's using the default copy and assignment operators, which copy over the entire data structure, and it initializes and compares all elements of the array. It appears that the benchmark you sent me only copies over and compares the first element in the array, which might have caused some of the large-element differences in performance you were seeing.
With this data structure, countertree intro_sort is competitive on my linux system with std::sort on fully random data, and countertree parallel_sort is competitive for 8, 16, and 256 bytes per element, but significantly underperforms (10-15%) tbb for 32 and 64 bytes.
As you can see, with 256-byte elements the indirect parallel_stable_sort is 3 times faster, and the indirect_sort is 25% faster.
Yes, indirect sort should eventually be faster, given a large enough element, especially if indirect sort is done with the fastest underlying algorithm available.
The final solution, if one is adopted in the end, will be a hybrid algorithm depending on the size of the elements.
That sounds reasonable for fixed-size elements. What about variable-length elements, like strings? Will this only be for fixed-size elements?
I propose the following: give me 2 or 3 weeks and I will prepare a big benchmark with all the algorithms, including the indirect versions of the stable and unstable sorts. We will run the benchmarks on different machines and, according to the results, decide whether we propose something to Boost, and which algorithm must be used in the hybrid version.
If, in the end, we decide not to propose anything, I can only say this has been a nice and exciting experience.
I think indirect sorting is promising.
My intention is not to impose my code; I want to provide the best solution, if it exists, to the final users, whether with my code or with someone else's. I am the first to applaud the winner.
We are doing this because this parallel implementation doesn't need any other library, only the standard one. There are compilers without OpenMP or TBB support, and sometimes the user simply doesn't want to pull in those libraries just to sort.
Agreed, but OpenMP is very common (even Android supports it!), so systems without OpenMP are a very small niche in terms of total systems out there. TBB is supported on Intel-compatible processors, and the vast majority of deployed processors where parallel sorting makes sense (won't drain the battery really quickly) are Intel-compatible. If we're going to provide an alternative, it should cost very little in terms of performance, and ideally provide some kind of benefit besides eliminating a dependency.
I have several improvements to the algorithms in mind. I will code and test them, and if they run well, I will include them. I also want to simplify the interface of indirect_sort so that it can be used easily with any sort algorithm. If you are interested, I will also prepare a brief document explaining the internal algorithm.
I'd like to see that document. Along with testing a struct that is an array of ints, you should probably test variable-length string sorting, as string sorting is very common.
Merry Christmas,
Steve
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Francisco,
On Tue Dec 23 2014 at 3:31:32 AM Francisco José Tapia
We can also do a benchmark with variable-length elements using strings.
If you prepare the operations to perform in that test, I can prepare an additional benchmark with all the algorithms and your operations.
After all that, we can look at the results and take a decision.
Yours
Francisco
The approach I've used with the boost::sort library is to set up the benchmark tests so that they read in a file of random data generated by a single application (randomgen), or any other file that is passed in for testing, and they write out their results to file. This enables better testing and debugging in case there are any problems (I directly diff the sort results when I can expect them to be identical; tune.pl does this automatically). I would prefer if you used stringsample.cpp and int64.cpp as examples of how to do this, and downloaded the boost sort library and added any modifications you need into that code. The library has a tune.pl script that runs the benchmarks. It may be helpful to write some utility templated function to switch between different algorithms, instead of the current approach of using a "-std" switch to use std::sort. https://github.com/spreadsort/sort There is a README there on how to install it using modular boost.
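One way to realize the "utility templated function" idea mentioned above is to pass the algorithm under test as a callable, so benchmarks select algorithms at compile time instead of branching on a "-std" command-line switch. This is only a sketch of the approach; the names here are illustrative, not from the boost::sort code:

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Time one sorting algorithm on a private copy of the input data.
// The algorithm is supplied as a callable taking an iterator range.
template <class T, class Sorter>
double time_sort(std::vector<T> data, Sorter sorter) {
    auto start = std::chrono::steady_clock::now();
    sorter(data.begin(), data.end());       // sorts the local copy
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();                 // seconds
}

// usage:
//   double secs = time_sort(numbers,
//       [](auto first, auto last) { std::sort(first, last); });
```

Because the data is taken by value, every algorithm sees the same input, which mirrors the diff-the-results testing approach described above.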
Hi Steven
Thanks for your ideas. I will check them.
Now I am designing, developing and testing several improvements to the algorithms. When I finish, I will prepare a clear and easy interface to transform any sort algorithm, even the parallel ones, into an indirect sort, and prepare a brief document explaining how to do it.
Perhaps that will be a good moment to move all the code to the namespace boost::sort.
Only after these things will I begin to prepare the big benchmark. And then I would like your experience and opinion.
My idea is to make the benchmark easy to run; if we want the opinions and results of other members of the Boost community, we must try to simplify their work.
I will keep you informed of my progress, but I must travel to be with my family, so until January I won't be running at full speed.
Happy Christmas. My best wishes to you and the Boost community.
Francisco
Hi Steven,
This Christmas I worked hard. I developed a new version of the merge sort that is about 10-15% faster than the previous one, using additional memory of only half the size of the data. The GCC and TBB algorithms use additional memory equal to the size of the data.
This version is exception safe. If an exception occurs in the algorithm (it only throws exceptions in the allocation and deallocation of the additional memory), the integrity of the data is guaranteed: you still have all the original data (unordered if the exception happens in the allocation, ordered if it happens in the deallocation).
I examined the TBB version, and the secret is the sample sort algorithm. The TBB version is provisional and not exception safe. I am working on the design of a new, exception-safe version of this algorithm. The current version of the parallel stable algorithm is about 20% slower than the TBB version in the worst case, but as the size of the elements grows, the indirect version is faster than all the others: with 64 bytes it is around 10% faster than the best TBB version, with 128 bytes the best TBB version is around 90% slower, and with 256 bytes it is 240% slower.
An additional advantage of the indirect sort is that the additional memory is used for the pointers, not for the data, and it is around 10% of the memory used by the data.
I hope to have the new algorithm in two weeks, more or less. Then I will put everything in the namespace boost::sort and design all the tests and benchmarks, as mentioned in the previous message.
Yours
Francisco
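The half-memory merge idea described above can be illustrated with a minimal sketch (this is not the actual smart_merge_sort code, and the recursion is elided for brevity): sort both halves, copy only the first half into the auxiliary buffer, then merge the buffer with the second half back into the original range.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Merge sort sketch needing an auxiliary buffer of only n/2 elements.
template <class T>
void half_memory_merge_sort(std::vector<T>& v) {
    const std::size_t n = v.size();
    if (n < 2) return;
    const std::size_t mid = n / 2;
    std::sort(v.begin(), v.begin() + mid);   // a real version recurses here
    std::sort(v.begin() + mid, v.end());
    // Only the first half is copied out: n/2 extra memory.
    std::vector<T> buf(std::make_move_iterator(v.begin()),
                       std::make_move_iterator(v.begin() + mid));
    std::size_t i = 0, j = mid, out = 0;
    // The write position never catches up with the read position j,
    // so merging back into v is safe.
    while (i < mid && j < n) {
        if (v[j] < buf[i]) v[out++] = std::move(v[j++]);
        else               v[out++] = std::move(buf[i++]);  // stable on ties
    }
    while (i < mid) v[out++] = std::move(buf[i++]);
    // any remaining elements of v[j..n) are already in place
}
```

This shows where the halving of the auxiliary memory comes from; exception safety around the buffer allocation is a separate concern handled in the real implementation.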
Francisco,
On Tue Jan 13 2015 at 12:43:00 PM Francisco José Tapia
Hi Steven,
This Christmas I worked hard. I developed a new version of the merge sort that is about 10-15% faster than the previous one, using additional memory of only half the size of the data. The GCC and TBB algorithms use additional memory equal to the size of the data.
This version is exception safe. If an exception occurs in the algorithm (it only throws exceptions in the allocation and deallocation of the additional memory), the integrity of the data is guaranteed: you still have all the original data (unordered if the exception happens in the allocation, ordered if it happens in the deallocation).
Less memory, comparable speed, and exception safety sounds good.
I examined the TBB version, and the secret is the sample sort algorithm. The TBB version is provisional and not exception safe. I am working on the design of a new, exception-safe version of this algorithm. The current version of the parallel stable algorithm is about 20% slower than the TBB version in the worst case, but as the size of the elements grows, the indirect version is faster than all the others: with 64 bytes it is around 10% faster than the best TBB version, with 128 bytes the best TBB version is around 90% slower, and with 256 bytes it is 240% slower.
Yes, indirect sorting is clearly superior for large data types, and it would be good to make it as easy as possible to use. It'd be nice to also have a simple wrapper that can use the spreadsort algorithm with indirect sorting too.
I am in the process of integrating the first version of the boost sort library in https://github.com/boostorg/sort. It would be good to integrate your testing with the tests in that library.
On 14-01-2015 03:52, Steven Ross wrote:
Yes, indirect sorting is clearly superior for large data types, and it would be good to make it as easy as possible to use. It'd be nice to also have a simple wrapper that can use the spreadsort algorithm with indirect sorting too.
Larger, non-fast-movable data types? -Thorsten
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Steven Ross Sent: 14 January 2015 02:53 To: boost@lists.boost.org Subject: Re: [boost] [SORT] Parallel Algorithms
Francisco,
On Tue Jan 13 2015 at 12:43:00 PM Francisco José Tapia
wrote: Hi Steven,
This Christmas I worked hard. I developed a new version of the merge sort that is about 10-15% faster than the previous one, using additional memory of only half the size of the data. The GCC and TBB algorithms use additional memory equal to the size of the data.
<snip>
Yes, indirect sorting is clearly superior for large data types, and it would be good to make it as easy as possible to use. It'd be nice to also have a simple wrapper that can use the spreadsort algorithm with indirect sorting too.
Would Boost.Multiprecision (50 decimal digits floating-point say) provide a simple demo type?
I am in the process of integrating the first version of the boost sort library in https://github.com/boostorg/sort. It would be good to integrate your testing with the tests in that library.
Perhaps a git branch off the new /sort/develop branch would be useful?
I note that Steven Ross is making a separate namespace for spreadsort, boost::sort::spreadsort. Although this seems long (slow typists can provide an alias or a using ... ;-), I strongly suggest that you follow this example, perhaps boost::sort::tbb, though I don't like tbb at all: far too acronymic. Think of a better name, even if longer? boost::sort::parallel?
Paul
PS The new Sort/Spreadsort docs are almost ready for release. I will add Quickbook documentation (from plain text notes) for your parallel sort version if you write to me off-list.
---
Paul A. Bristow
Prizet Farmhouse
Kendal UK LA8 8AB
+44 (0) 1539 561830
Hi Thorsten,
The idea behind the indirect sort is very simple. Instead of sorting the elements, you create a vector of pointers and sort that: you move the pointers, not the elements. At the end, with the vector of pointers sorted, you move the data only once.
In a stable sort, the algorithms usually need additional memory. GCC and TBB need the same amount as the data; my algorithms only half. Imagine you want to sort 10 million elements of 128 bytes each. With the indirect sort you need 10 million pointers, plus additional memory for 5 million pointers. The memory used by the data is 1280 MB, but the pointers and the additional memory are 15 million * 8 = 120 MB, around 10% of the memory used by the data.
In the first messages of this conversation I sent a zip with code; there you can find the code of the indirect sort. If you don't understand something, please tell me and I will try to explain.
Paul,
About the namespace: if in the end the results are satisfactory and we decide to include this in the library, just tell me the namespace and I will use it; to me it is indifferent. But I need time for the new algorithm, and to design a complete test suite, the benchmarks to compare with other versions, and the documentation.
If you want a benchmark with multiprecision numbers, we can do that without problem, even checking the speed with different sizes of the numbers.
Yours
Francisco
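As a minimal illustration of the indirect sort described above (a sketch, not the countertree code), one can sort an index array by the values it refers to, then move each element into its final place exactly once by following the permutation cycles, so the only extra memory is the index array:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Indirect sort sketch: sort indices, then permute the data in place.
template <class T>
void indirect_sort(std::vector<T>& v) {
    const std::size_t n = v.size();
    std::vector<std::size_t> idx(n);
    for (std::size_t i = 0; i < n; ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(),
              [&v](std::size_t a, std::size_t b) { return v[a] < v[b]; });
    // idx[i] is now the position of the element that belongs at i.
    for (std::size_t i = 0; i < n; ++i) {
        if (idx[i] == i) continue;
        T tmp = std::move(v[i]);          // open the cycle at i
        std::size_t j = i;
        while (idx[j] != i) {
            v[j] = std::move(v[idx[j]]);  // pull the next element forward
            std::size_t k = idx[j];
            idx[j] = j;                   // mark slot as settled
            j = k;
        }
        v[j] = std::move(tmp);            // close the cycle
        idx[j] = j;
    }
}
```

Each element is moved O(1) times regardless of its size, which is why the technique pays off as elements grow.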
Hi Steven
I have the final version of all the algorithms. I wrote 6 stable sort algorithms and selected the fastest. On my computer it is about 15% faster than the GCC stable sort.
I also had problems with an error in one algorithm combined with an error in the GCC dynamic memory system: under a certain sequence of requests, the system provided an invalid address, causing a segmentation fault. Detecting and correcting it was a debugging nightmare.
The algorithms provided are:
1.- sort : The algorithm is intro_sort. It doesn't need additional memory. Performance is similar to the GCC and Windows versions. Not stable.
2.- parallel_sort : Decomposes the global range into small ranges to be sorted with intro_sort. It doesn't need additional memory. Performance is similar to the TBB and Microsoft versions. Not stable.
3.- stable_sort : The algorithm is called smart_merge_sort and is about 10-15% faster than the GCC stable sort. The additional memory is half of the memory used by the data (the GCC stable sort uses the same amount of memory as the data).
4.- sample_sort : A parallel stable sort. It has the same performance as the TBB implementation and is much faster than the GCC parallel stable_sort. The additional memory is the same as the data.
5.- parallel_stable_sort : Based on the sample sort. It increases the sample_sort time by about 5%, but the additional memory used is half of the memory used by the data.
The functions are in the namespace boost::sort. All the algorithms are exception safe, and they use an indirect sort when the size of the objects is big. According to my benchmarks, the size limit for switching to an indirect sort is 128 bytes.
The parallel algorithms have the form
algorithm ( first, last, compare, number_threads ).
The number of threads is passed with a struct NThreads, which receives an unsigned and returns an unsigned. The default constructor of this struct is filled with the number of HW threads.
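From this description, the calling convention might look like the following sketch (NThreads is the name given in the message, but its exact definition here is an assumption):

```cpp
#include <thread>

// Assumed shape of the NThreads wrapper: receives an unsigned,
// returns an unsigned, and defaults to the hardware thread count.
struct NThreads {
    unsigned nthreads;
    NThreads() : nthreads(std::thread::hardware_concurrency()) {}
    explicit NThreads(unsigned n) : nthreads(n) {}
    unsigned operator()() const { return nthreads; }
};

// The parallel algorithms would then be called as:
//   boost::sort::parallel_sort(first, last, compare, NThreads(8));
```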
Today I am testing on Windows with Visual Studio 2013 CTP 4, and changing some parts of the code due to incompatibilities with C++11 features.
In a few days I will document all the functions of the code to generate doxygen documentation, and then I can send you a version of the code.
Then we must talk about how to design the benchmark. I think the starting point can be your experience with your code.
The C++ committee proposes a format for calling the parallel functions, described in
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3554.pdf
It would be a good idea to examine it and decide whether we follow the proposal, don't follow it, or provide both interfaces for the parallel algorithms. My first impression is that it is not very complex to implement.
Yours
Francisco Tapia
On Fri, Mar 20, 2015 at 9:12 AM Francisco José Tapia
I have the final version of all the algorithms. I wrote 6 stable sort algorithms and selected the fastest. On my computer it is about 15% faster than the GCC stable sort.
Could you send them please? I'd like to verify the speedup you're seeing, and compare vs TBB. I'm ok if it just is comparable in speed to TBB and it has additional advantages of some type.
I also had problems with an error in one algorithm combined with an error in the GCC dynamic memory system: under a certain sequence of requests, the system provided an invalid address, causing a segmentation fault. Detecting and correcting it was a debugging nightmare.
One thing to watch out for is undefined operations, such as type conversions between int and float types; those aren't compiler bugs: they're undefined (this problem hit me with float_sort, before someone showed me how to convert safely). I hope you've fixed the bug.
The algorithms provided are:
1.- sort : The algorithm is intro_sort. It doesn't need additional memory. Performance is similar to the GCC and Windows versions. Not stable.
2.- parallel_sort : Decomposes the global range into small ranges to be sorted with intro_sort. It doesn't need additional memory. Performance is similar to the TBB and Microsoft versions. Not stable.
Does it provide any advantage over those implementations, aside from being open-source?
3.- stable_sort : The algorithm is called smart_merge_sort and is about 10-15% faster than the GCC stable sort. The additional memory is half of the memory used by the data (the GCC stable sort uses the same amount of memory as the data).
GCC uses no additional memory, or 100% additional memory? Did you compare against the tbb stable sort I pointed you to?
4.- sample_sort : A parallel stable sort. It has the same performance as the TBB implementation and is much faster than the GCC parallel stable_sort. The additional memory is the same as the data.
5.- parallel_stable_sort : Based on the sample sort. It increases the sample_sort time by about 5%, but the additional memory used is half of the memory used by the data.
That's interesting, so it's better from a memory perspective than the competing algorithms, and not bad in speed? That sounds like it's worth serious consideration as an addition.
The functions are in the namespace boost::sort. All the algorithms are exception safe, and they use an indirect sort when the size of the objects is big. According to my benchmarks, the size limit for switching to an indirect sort is 128 bytes.
The parallel algorithms have the form
algorithm ( first, last, compare, number_threads ).
The number of threads is passed with a struct NThreads, which receives an unsigned and returns an unsigned. The default constructor of this struct is filled with the number of HW threads.
Today I am testing on Windows with Visual Studio 2013 CTP 4, and changing some parts of the code due to incompatibilities with C++11 features.
In a few days I will document all the functions of the code to generate doxygen documentation, and then I can send you a version of the code.
Then we must talk about how to design the benchmark. I think the starting point can be your experience with your code.
Please look at this code, and refactor/reuse what test code you can: https://github.com/boostorg/sort Specifically, look in the test directory, the example directory, and look in tune.pl. Are there any parameters you would want to tune for specific systems (besides the number of threads)? The basic idea is that randomgen writes out a randomized distribution to disk, and then we run whatever algorithms we want to compare (writing their results to disk), verify that the results are correct, then compare speed. I'd prefer not to skip the disk intermediate step because having the files on disk makes it easier to debug when something gives incorrect results. We should compare with variable-length strings and ints, in addition to more complicated data structures like what you've been testing with. It's worth noting that it's quite common to use vectors for large quantities of data in a class/struct, in which case the sorting would already be indirect.
The C++ committee proposes a format for calling the parallel functions, described in
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3554.pdf
It would be a good idea to examine it and decide whether we follow the proposal, don't follow it, or provide both interfaces for the parallel algorithms. My first impression is that it is not very complex to implement.
With your implementations, it looks like people have these choices:
1) how many threads to use
2) stable or not
3) memory or speed efficient
4) direct or indirect
Note that #4 may be influenced by #3, and that you may be able to determine that automatically, but providing people a simple way to make those choices is a good idea. You could use the approach suggested by the committee if it fits well, or another one if you have a better approach to this specific problem.
Steve
Hi Francisco,
On 17 December 2014 at 05:42, Francisco José Tapia
SCALABILITY : Some algorithms don't scale well when the number of threads grows: increasing the number of threads doesn't decrease the execution time. This lack of scalability is emphasized when the size of the elements to sort grows.
Sorry about butting in so late, but on the topic of scalability and potential superiority to TBB, you might be interested to check out Hartmut Kaiser's talk on asynchronous computing ( https://www.youtube.com/watch?v=5xyztU__yys) and his lab's library, HPX ( http://stellar.cct.lsu.edu/projects/hpx/). Cheers. Jeremy
Thanks,
I will take a look. It is always interesting to hear Mr Kaiser and learn new things from his huge knowledge.
Francisco
participants (5)
- Francisco José Tapia
- Jeremy Murphy
- Paul A. Bristow
- Steven Ross
- Thorsten Ottosen