[SORT] Parallel Algorithms
Sorry for the delay. As promised, here are the parallel sorting algorithms.

On my computer (quad core) the GCC std::sort and std::stable_sort are slightly faster than my versions of intro_sort and merge_sort (stable_sort is a synonym of merge_sort). On another computer with an Intel i3, my algorithms are slightly faster than the GCC versions.

While trying to optimize the parallel implementation, I modified the parallel versions to use std::sort and std::stable_sort. The parallel_sort version responded well, but parallel_stable_sort took around 60% more time than the version built on countertree::parallel_merge_sort. After many tests, I discovered the cause of the problem. The demonstration is in test_stress_sort_01.cpp. The program fills a vector with 100 000 000 uint64_t elements and divides it into parts of equal size, one per HW core (on my computer, 4 parts and 4 cores). It sorts the parts sequentially, timing each one, and then sorts the same parts in parallel, one thread per part, and times that. This is done with both std::stable_sort and countertree::merge_sort, on the same input data. These are the results:

max. OpenMP threads: 4 (4 processors)
Sort of 100000000 elements in a vector
--------------------- Random elements, few repeated ( rand() )-----------------------
Stable Sort
Part 0 secs :2.77955
Part 1 secs :2.75015
Part 2 secs :2.76068
Part 3 secs :2.76284
Parallel Sort secs :8.49796
Countertree stable sort
Part 0 secs :2.72802
Part 1 secs :2.71614
Part 2 secs :2.72074
Part 3 secs :2.722
Parallel Sort secs :4.80692

Surprisingly, the parallel sort with countertree::stable_sort is much faster than the one with std::stable_sort from the GCC 4.8 and 4.9 compilers. This is the secret of the speed of countertree::parallel_merge_sort. It is very important, because when the elements are small, the GCC and countertree algorithms have similar performance.
But when the size of the elements grows, std::stable_sort becomes slower than countertree::merge_sort, and the difference widens in the parallel version. The next lines are part of benchmark_sort_05.cpp:

***************************************************************
Sort of N = 6250000 elements of size 128
***************************************************************
Random elements, few repeated ( rand() )-----------------------
STL std::sort                    :2.9188 secs
countertree::intro_sort          :2.87864 secs
parallel::sort                   :2.46113 secs
tbb::parallel_sort               :1.86458 secs
countertree::parallel_sort       :1.89263 secs
std::stable_sort                 :8.77963 secs
countertree::merge_sort          :5.70694 secs
parallel::stable_sort            :8.03946 secs
countertree::parallel_merge_sort :4.90815 secs

I have in mind a new version of the stable sort that should improve the speed by reducing the number of internal copies of elements. I hope to have time at Christmas, but I can't guarantee it. If the results are as expected, I will send it to you.

You also have a version of indirect_sort, which sorts pointers to the elements and at the end rearranges the elements according to the pointers. You can replace any sort algorithm with its indirect version and obtain the same result; only the time needed differs. In benchmark_indirect_sort.cpp you can see a comparison of the time needed by several sort methods and their indirect versions with elements of different sizes.

This code is a preliminary version, done for my countertree library. If you consider it useful, I will change the namespace and document the code, the algorithms, and the ideas obtained from the benchmarks. Please feel free to ask me any question, or to modify or change anything. The goal is to provide good sort methods to the users. In the zip file you have the code and a file with a description of each program. Yours
Francisco,
On Sun, Dec 14, 2014 at 1:03 PM, Francisco José Tapia
Sorry for the delay. As promised, here are the parallel sorting algorithms.
On my computer (quad core) the GCC std::sort and std::stable_sort are slightly faster than my versions of intro_sort and merge_sort (stable_sort is a synonym of merge_sort). On another computer with an Intel i3, my algorithms are slightly faster than the GCC versions.
An i3 is low-end hardware. I recommend testing on an i7 or faster system.
This code is a preliminary version, done for my countertree library. If you consider it useful, I will change the namespace and document the code, the algorithms, and the ideas obtained from the benchmarks.
Please feel free to ask me any question, or to modify or change anything. The goal is to provide good sort methods to the users.
In the zip file you have the code and a file with a description of each program.
First, thanks for sending this in a fairly easy-to-test format.
I suggest updating your docs to recommend single-line compilation (if
appropriate), which saves a step:
Sort/Modified/benchmark/GCC/util_sort$ g++ -std=c++11 -Wall
-fexceptions -fopenmp -fomit-frame-pointer -fexpensive-optimizations
-O3 -I../../../src benchmark_sort_03.cpp -fopenmp -s -lpthread -ltbb
-o benchmark_sort_03 -pthread
I was unable to run your indirect sort benchmark:
Sort/Modified/benchmark/GCC/util_sort$ g++ -std=c++11 -Wall
-fexceptions -fopenmp -fomit-frame-pointer -fexpensive-optimizations
-O3 -I../../../src -c benchmark_sort_indirect.cpp -o
benchmark_sort_indirect.o -pthread
Sort/Modified/benchmark/GCC/util_sort$ g++ -o benchmark_sort_indirect
benchmark_sort_indirect.o -pthread -fopenmp -s -lpthread -ltbb
Sort/Modified/benchmark/GCC/util_sort$ ./benchmark_sort_indirect
Size of element 8 Number of elements :100000000
terminate called after throwing an instance of 'std::system_error'
what(): Enable multithreading to use std::thread: Operation not permitted
Aborted (core dumped)
I'm actually quite interested in the generalized indirect sort idea (I
looked at your two functions). Do you think you could package that up
so it's a simple wrapper, where the user passes in the sort function
they want to use, and it takes care of changing the input and
comparison (if possible) to be indirect, running the sort, and then
swapping into place based on the results of the indirect sort? If
that can be done without significant loss of efficiency, it would seem
quite useful (I've had to do indirect sorts multiple times, and it's
always some code). Even your two functions look useful on their own.
I suggest using the boost random number generator (or if you stick
with the c++11 requirement, which I don't recommend, the std::
equivalent):
#include
Hi Steven,

I thought all the programs were checked, but obviously not. Now in the folder you have a shell script with the command to compile and link with optimization (for example make_benchmark_sort_03.sh), and a make_all.sh which compiles everything. Open a terminal in the folder, and compile and run.

All the programs use the available HW threads. They were written to run on a Linux machine with 4 GB of RAM. The number of elements involved in the sort is defined at the beginning of each program with #define NELEM; you can change this value and compile again.

As you suggested, I changed the number generator and now use a Mersenne Twister from the standard library.

The stable sort is used less than the non-stable sort, but sometimes it is important. I developed it because my project has a proxy which receives operations on elements identified by a key. The proxy sorts the operations and sends them to the corresponding thread, and reading the value of a key and then deleting that key is not the same as deleting the key and then trying to read the value. I can't add a time mark to know which operation arrived first.

To clarify the information obtained from the benchmarks: in the folder Modified, the GCC parallel::sort and parallel::stable_sort and the countertree::parallel_sort and countertree::parallel_merge_sort all use std::sort and std::stable_sort internally. The utility of this folder is the comparison with the same programs in the folder Original; the results show the problems described in this message. In the folder Original, the GCC parallel::sort and parallel::stable_sort use std::sort and std::stable_sort, while countertree::parallel_sort uses countertree::intro_sort and countertree::parallel_merge_sort uses countertree::merge_sort.

SCALABILITY: Some algorithms don't scale well when the number of threads grows; increasing the number of threads doesn't decrease the execution time.
This lack of scalability is emphasized when the size of the elements to sort grows. As I said in the previous message, std::stable_sort needs 2.7 secs with 1 thread and 8.49 secs with 4 HW threads, around 3 times the time of 1 thread, while countertree::merge_sort needs 2.7 secs with 1 thread and 4.89 secs with 4 HW threads (see the Original test_stress_sort.cpp). This shows that countertree::merge_sort is good for use in parallel software and GCC std::stable_sort is not.

If you look at the results of countertree::parallel_merge_sort in the benchmarks in the folder Modified, they are really bad compared with the same results in the folder Original. The difference is the internal algorithm: the Modified version uses GCC std::stable_sort, and the Original version uses countertree::merge_sort.

The GCC std::sort and countertree::intro_sort have similar performance at all element sizes, but in the parallel case the poor scalability appears when the size of the elements grows. The next lines are from the Original benchmark_sort_05.cpp on my computer.

With 128 bytes elements
Random elements, few repeated -----------------------
STL std::sort             :2.9188 secs
countertree::intro_sort   :2.87864 secs
parallel::sort            :2.46113 secs
tbb::parallel_sort        :1.86458 secs
countertree::parallel_sort:1.89263 secs

With 256 bytes elements
Random elements, few repeated ( rand() )-----------------------
STL std::sort             :3.03143 secs
countertree::intro_sort   :3.02733 secs
parallel::sort            :2.23231 secs
tbb::parallel_sort        :1.76186 secs
countertree::parallel_sort:1.56509 secs

As we can see, the parallel processing of GCC parallel::sort is ineffective with big objects.

About the merge_sort algorithm: it uses additional memory equal to half the size of the data. If you don't use the additional memory, the speed drops. I didn't examine in depth the algorithm used in GCC; I only wrote my own algorithm. I will examine it, and I will try to reduce the memory used without losing speed or scalability.
About a server for testing: I don't have modern servers at my school; the budget cuts eliminated the new servers. I have an i7 Xeon, but it is a 3-year-old machine with 4 HW threads, and in the benchmarks it is slower than the i3 and i5 machines bought this year. If someone can check and provide us the results, I will be grateful. Version 4.9.1 of GCC has better optimization than 4.8.2; if you are using Ubuntu, version 14.10 has 4.9.1 as the default compiler. Yours Francisco Tapia
Francisco,
On Tue Dec 16 2014 at 1:42:59 PM Francisco José Tapia
Hi Steven,
I thought all the programs were checked, but obviously not. Now in the folder you have a shell script with the command to compile and link with optimization (for example make_benchmark_sort_03.sh), and a make_all.sh which compiles everything. Open a terminal in the folder, and compile and run.
Great! Thanks. I ran everything, and the only failure was with benchmark_sort_indirect:
./benchmark_sort_indirect
Size of element 8 Number of elements :100000000
terminate called after throwing an instance of 'std::system_error'
what(): Enable multithreading to use std::thread: Operation not permitted
Aborted (core dumped)
As you suggested, I changed the number generator and now use a Mersenne Twister from the standard library.
Thanks
The stable sort is used less than the non-stable sort, but sometimes it is important. I developed it because my project has a proxy which receives operations on elements identified by a key. The proxy sorts the operations and sends them to the corresponding thread, and reading the value of a key and then deleting that key is not the same as deleting the key and then trying to read the value. I can't add a time mark to know which operation arrived first.
Fair enough.
As I said in the previous message, std::stable_sort needs 2.7 secs with 1 thread and 8.49 secs with 4 HW threads, around 3 times the time of 1 thread, while countertree::merge_sort needs 2.7 secs with 1 thread and 4.89 secs with 4 HW threads (see the Original test_stress_sort.cpp).
This shows that countertree::merge_sort is good for use in parallel software and GCC std::stable_sort is not.
The GCC std::sort and countertree::intro_sort have similar performance at all element sizes, but in the parallel case the poor scalability appears when the size of the elements grows. The next lines are from the Original benchmark_sort_05.cpp on my computer:
With 128 bytes elements
Random elements, few repeated -----------------------
STL std::sort :2.9188 secs
countertree::intro_sort :2.87864 secs
parallel::sort :2.46113 secs
tbb::parallel_sort :1.86458 secs
countertree::parallel_sort:1.89263 secs
With 256 bytes elements
Random elements, few repeated ( rand() )-----------------------
STL std::sort :3.03143 secs
countertree::intro_sort :3.02733 secs
parallel::sort :2.23231 secs
tbb::parallel_sort :1.76186 secs
countertree::parallel_sort:1.56509 secs
As we can see, the parallel processing of GCC parallel::sort is ineffective with big objects.
Those don't match up with my results for 256 byte elements:
STL std::sort             :1.04718 secs
countertree::intro_sort   :1.14214 secs
parallel::sort            :0.679178 secs
tbb::parallel_sort        :0.600001 secs
countertree::parallel_sort:0.627144 secs

And this is what I see for size 64:
STL std::sort             :1.6577 secs
countertree::intro_sort   :1.73023 secs
parallel::sort            :0.963199 secs
tbb::parallel_sort        :0.873805 secs
countertree::parallel_sort:0.997228 secs

I'm not sure the benefits you are seeing are portable. The numbers I see are not as extreme, but your merge_sort looks good in comparison, especially with large elements. I will investigate alternatives to see if this is a truly speed-competitive offering. Have you considered Timsort (standard in Java and Python)? Timsort is a stable sort that is particularly good with sorted, mostly-sorted, and reverse-sorted data. It might be worth offering as an alternative stable sort for those concerned about mostly-sorted data. Your intro_sort looks reasonable overall, but it varies from comparable speed to 14% slower when run in parallel on random data, and is an order of magnitude slower on already-sorted data. As tbb is free, I just don't see a compelling case for your introsort implementation; tbb is a very hard competitor to beat.
About the merge_sort algorithm: it uses additional memory equal to half the size of the data. If you don't use the additional memory, the speed drops. I didn't examine in depth the algorithm used in GCC; I only wrote my own algorithm. I will examine it, and I will try to reduce the memory used without losing speed or scalability.
The standard algorithm says it will lose speed when it reduces memory used; it'd be a nice feature to have, but if we can repeatably see 2X relative speedups without that feature, I think some people may find it useful.
About a server for testing: I don't have modern servers at my school; the budget cuts eliminated the new servers. I have an i7 Xeon, but it is a 3-year-old machine with 4 HW threads, and in the benchmarks it is slower than the i3 and i5 machines bought this year. If someone can check and provide us the results, I will be grateful. Version 4.9.1 of GCC has better optimization than 4.8.2; if you are using Ubuntu, version 14.10 has 4.9.1 as the default compiler.
A boost library should still be useful for people at least one compiler
version back; I feel fully justified testing on 4.8.2. Have you tested on Windows yet?
Francisco,
Have you looked at: https://software.intel.com/en-us/articles/a-parallel-stable-sort-using-c11-f...
For comparison?
Francisco,
On Thu, Dec 18, 2014 at 10:22 PM, Steven Ross
Francisco,
Have you looked at: https://software.intel.com/en-us/articles/a-parallel-stable-sort-using-c11-f...
For comparison?
I used the free tbb parallel stable sort referenced above for comparison, and found that as soon as I randomized the contents of the entire struct and made the copy constructor copy over the entire struct, the tbb version was 37% faster on randomized data all the way up to 256 bytes relative to countertree::parallel_merge_sort. I've attached my modified "original" directory where I tested this out (see: Original/benchmark/parallel_stable_sort/build/make_tbb_benchmark.sh). Unless you can get your library close to the speed of this tbb sort, I don't see how we'd be benefiting people by pointing to it instead of the tbb library.

What I am interested in is your idea for indirect sorting: can you come up with an easy-to-use API to handle efficient indirect sorting? That would probably be worth including in the boost::sort library, especially if it is compatible with different sort functions.
Hi Steven,

I have been working on the memory used by the different algorithms, especially the stable ones. I developed a low-memory algorithm, circular_stable_sort, which uses only about 10% more memory; the speed decreases around 15% with small objects, and with big objects it is similar to merge_sort.

I also looked at the TimSort algorithm. I found a C++ implementation at https://github.com/gfx/cpp-TimSort and inserted it in the benchmark programs. With small objects the speed is slower than the stable sort and merge_sort, but it performs well with big objects, greater than 64 bits. The decision of Java and Python is logical, because they must order objects, usually non-contiguous and bigger than a naked number.

To measure the memory used by a program I use the command /usr/bin/time -v program_to_execute. I take benchmark_sort_04 (100000000 uint64_t numbers), comment out all the invocations of the algorithms except one, compile, and run it in a terminal with the command /usr/bin/time -v benchmark_sort_04. I repeat this process with all the algorithms to know the memory used by each one.

Algorithm                            Memory used
std::sort                            1 565 456
countertree::intro_sort              1 565 280
GCC parallel::sort                   2 346 868
tbb::parallel_sort                   1 565 696
countertree::parallel_sort           1 565 468
std::stable_sort                     1 957 232
countertree::merge_sort              1 955 908
countertree::circular_merge_sort     1 696 316
sfx::timsort                         1 996 364
parallel::stable_sort                2 742 384
countertree::parallel_merge_sort     1 956 636

I checked the times with benchmark_sort_05, with the operator = and the copy constructor copying the whole structure.
The number of elements in each sort is 800000000 / size of the element in bytes.

                                   8 B    16 B   32 B   48 B   64 B   128 B  256 B
std::stable_sort                   16.99  10.36   9.75   9.62   8.85   8.89  10.25
countertree::merge_sort            16.96  10.38   8.13   7.44   7.10   6.67   9.01
countertree::circular_merge_sort   19.77  12.80   8.81   7.61   7.01   6.48   8.62
timsort                            22.89  13.43  10.12   8.33   7.70   7.00   8.62
parallel::stable_sort               9.68   8.91   8.90   8.90   8.23   8.21   7.72
countertree::parallel_merge_sort    7.04   5.96   5.74   5.70   5.64   5.54   5.94

About the stable sort of tbb: I saw it, but I didn't have time to examine and check it in depth. But don't worry. The benchmark_sort_indirect shows:

Size of element 48 Number of elements :16666666
parallel_sort                 :2.53964 sec
indirect_parallel_sort        :5.70979 sec
parallel_stable_sort          :5.73186 sec
indirect_parallel_stable_sort :5.59269 sec

Size of element 64 Number of elements :12500000
parallel_sort                 :2.59812 sec
indirect_parallel_sort        :4.1553 sec
parallel_stable_sort          :5.63771 sec
indirect_parallel_stable_sort :3.95266 sec

Size of element 128 Number of elements :6250000
parallel_sort                 :2.28648 sec
indirect_parallel_sort        :2.34792 sec
parallel_stable_sort          :5.62879 sec
indirect_parallel_stable_sort :2.30188 sec

Size of element 256 Number of elements :3125000
parallel_sort                 :2.05017 sec
indirect_parallel_sort        :1.50661 sec
parallel_stable_sort          :5.85409 sec
indirect_parallel_stable_sort :1.48561 sec

As you can see, with 256 bytes of size the indirect parallel_stable_sort is 3 times faster, and the indirect_sort is 25% faster. The final solution, if one is adopted, will be a hybrid algorithm depending on the size of the elements.

I propose the following: give me 2 or 3 weeks and I will prepare a big benchmark with all the algorithms, including the indirect versions of the stable and unstable sorts. We will run the benchmarks on different machines and, according to the results, take a decision about whether we propose something to Boost, and which algorithm must be used in the hybrid version.
If, in the end, we decide not to propose anything, I can only say this has been a nice and exciting experience. My intention is not to impose my code; I want to provide the final users the best solution, if it exists, whether with my code or with the code of others. I am the first to applaud the winner.

We are doing this because this parallel implementation doesn't need any other library, only the standard one. There are compilers without OpenMP or TBB support, and sometimes the user simply doesn't want to load them just to do a sort.

I have in mind several improvements to the algorithms. I will code and test them, and if they run well, I will include them in the algorithms. I also want to simplify the interface of indirect_sort, so it can be used in an easy way by any sort algorithm. If you are interested, I will also prepare a brief document with the explanation and the internal algorithm.

To finish, I only want to wish a Happy Christmas and New Year to the Boost community. Yours Francisco Tapia fjtapia@gmail.com
I forgot to mention the Windows version of the benchmark. I suppose the results will be similar to those obtained on Linux. I will prepare it, too.
With all the information, we can decide on the best solution, if one exists.
Thanks
Francisco Tapia
Francisco,
On Mon Dec 22 2014 at 5:04:17 PM Francisco José Tapia
I have been working on the memory used by the different algorithms, especially the stable ones. I developed a low-memory algorithm, circular_stable_sort, which uses only about 10% more memory; the speed decreases around 15% with small objects, and with big objects it is similar to merge_sort.
Some people may find that useful. What is the worst-case computational complexity? Is it a well-known algorithm? The bar is going to be higher (may require a full review) for previously unknown algorithms.
I also looked at the TimSort algorithm. I found a C++ implementation at https://github.com/gfx/cpp-TimSort .
I inserted this algorithm in the benchmark programs. With small objects the speed is slower than the stable sort and merge_sort, but it performs well with big objects, greater than 64 bits. The decision of Java and Python is logical, because they must order objects, usually non-contiguous and bigger than a naked number.
To measure the memory used by a program I use the command /usr/bin/time -v program_to_execute. I take benchmark_sort_04 (100000000 uint64_t numbers), comment out all the invocations of the algorithms except one, compile, and run it in a terminal with the command /usr/bin/time -v benchmark_sort_04.
I repeat this process with all the algorithms to know the memory used by each one.
I suggest changing the struct you use for benchmarking to this:

template <int NN>
struct reburcio {
    uint64_t M[NN];
    reburcio() { init(); }
    void init() {
        for (int i = 0; i < NN; ++i) M[i] = my_rand();
    }
    bool operator<(const reburcio &R) const {
        for (int i = 0; i < NN; ++i) {
            if (M[i] != R.M[i]) return M[i] < R.M[i];
        }
        return false;
    }
};
As you can see, with 256 bytes of size the indirect parallel_stable_sort is 3 times faster, and the indirect_sort is 25% faster.
Yes, indirect sort should eventually be faster, given a large enough element, especially if indirect sort is done with the fastest underlying algorithm available.
The final solution, if one is adopted, will be a hybrid algorithm depending on the size of the elements.
That sounds reasonable for fixed-size elements. What about variable-length elements, like strings? Will this only be for fixed-size elements?
I propose the following: give me 2 or 3 weeks and I will prepare a big benchmark with all the algorithms, including the indirect versions of the stable and unstable sorts. We will run the benchmarks on different machines and, according to the results, take a decision about whether we propose something to Boost, and which algorithm must be used in the hybrid version.
If, in the end, we decide not to propose anything, I can only say this has been a nice and exciting experience.
I think indirect sorting is promising.
My intention is not to impose my code; I want to provide the final users the best solution, if it exists, whether with my code or with the code of others. I am the first to applaud the winner.
We are doing this because this parallel implementation doesn't need any other library, only the standard one. There are compilers without OpenMP or TBB support, and sometimes the user simply doesn't want to load them just to do a sort.
Agreed, but OpenMP is very common (even Android supports it), so systems without OpenMP are a very small niche in terms of total systems out there. TBB is supported on intel-compatible processors, and the vast majority of deployed processors where parallel sorting makes sense (won't drain the battery really quickly) are intel-compatible. If we're going to provide an alternative, it should cost very little in terms of performance, and ideally provide some kind of benefit besides eliminating a dependency.
I have in mind several improvements to the algorithms. I will code and test them, and if they run well, I will include them in the algorithms. I also want to simplify the interface of indirect_sort, so it can be used in an easy way by any sort algorithm. If you are interested, I will also prepare a brief document with the explanation and the internal algorithm.
I'd like to see that document. Along with testing a struct that is an array of ints, you should probably test variable-length string sorting, as string sorting is very common.
Merry Christmas, Steve
We can also do a benchmark with variable-length elements, using strings. If you prepare the operations to perform in that test, I can prepare an additional benchmark with all the algorithms and your operations. After that, we can look at the results and take a decision.
Yours
Francisco
2014-12-23 4:34 GMT+01:00 Steven Ross
I suggest changing the struct you use for benchmarking to this:

    template <int NN>
    struct reburcio {
        uint64_t M[NN];

        reburcio () { init(); }

        void init() {
            for (int i = 0; i < NN; ++i) { M[i] = my_rand(); }
        }

        bool operator < ( const reburcio &R ) const {
            for (int i = 0; i < NN; ++i) {
                if (M[i] != R.M[i]) { return M[i] < R.M[i]; }
            }
            return false;
        }
    };
That way it's using the default copy and assignment operators, which copy over the entire data structure, and it initializes and compares all elements of the array. It appears that the benchmark you sent me only copies over and compares the first element in the array, which might have caused some of the large-element differences in performance you were seeing.
With this data structure, countertree intro_sort is competitive on my linux system with std::sort on fully random data, and countertree parallel_sort is competitive for 8, 16, and 256 bytes per element, but significantly underperforms (10-15%) tbb for 32 and 64 bytes.
As you can see, with 256-byte elements the indirect parallel_stable_sort is 3 times faster, and the indirect_sort is 25% faster.
Yes, indirect sort should eventually be faster, given a large enough element, especially if indirect sort is done with the fastest underlying algorithm available.
The final solution, if one is adopted in the end, will be a hybrid algorithm depending on the size of the elements.
That sounds reasonable for fixed-size elements. What about variable-length elements, like strings? Will this only be for fixed-size elements?
I propose the following: give me 2 or 3 weeks and I will prepare a big benchmark with all the algorithms, including the indirect versions of the stable and unstable sorts. We will run the benchmarks on different machines and, according to the results, decide whether we propose something to Boost, and which algorithm must be used in the hybrid version.
If, in the end, we decide not to propose anything, I can only say this has been a nice and exciting experience.
I think indirect sorting is promising.
My intention is not to impose my code; I want to provide the best solution, if it exists, to the final users, whether with my code or with someone else's. I am the first to applaud the winner.
We are doing this because this parallel implementation doesn't need any other library, only the standard one. There are compilers without OpenMP or TBB support, and sometimes the user simply doesn't want to pull in those libraries just to sort.
Agreed, but OpenMP is very common (even Android supports it!), so systems without OpenMP are a very small niche in terms of total systems out there. TBB is supported on Intel-compatible processors, and the vast majority of deployed processors where parallel sorting makes sense (won't drain the battery really quickly) are Intel-compatible. If we're going to provide an alternative, it should cost very little in terms of performance, and ideally provide some kind of benefit besides eliminating a dependency.
I have several improvements to the algorithms in mind. I will code and test them, and if they run well, I will include them. I also want to simplify the interface of indirect_sort so that it can be used easily with any sort algorithm. If you are interested, I will also prepare a brief document explaining the internal algorithm.
I'd like to see that document. Along with testing a struct that is an array of ints, you should probably test variable-length string sorting, as string sorting is very common.
Merry Christmas,
Steve
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Francisco,
On Tue Dec 23 2014 at 3:31:32 AM Francisco José Tapia
We can also do a benchmark with variable-length elements using strings.
If you prepare the operations to perform in that test, I can prepare an additional benchmark with all the algorithms and your operations.
After all that, we can look at the results and take a decision.
Yours
Francisco
The approach I've used with the boost::sort library is to set up the benchmark tests so that they read in a file of random data generated by a single application (randomgen), or any other file that is passed in for testing, and they write out their results to file. This enables better testing and debugging in case there are any problems (I directly diff the sort results when I can expect them to be identical; tune.pl does this automatically). I would prefer if you used stringsample.cpp and int64.cpp as examples of how to do this, and downloaded the boost sort library and added any modifications you need into that code. The library has a tune.pl script that runs the benchmarks. It may be helpful to write some utility templated function to switch between different algorithms, instead of the current approach of using a "-std" switch to use std::sort. https://github.com/spreadsort/sort There is a README there on how to install it using modular boost.
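One way to realize the "utility templated function" idea mentioned above is to pass the algorithm under test as a callable, so benchmarks select algorithms at compile time instead of branching on a "-std" command-line switch. This is only a sketch of the approach; the names here are illustrative, not from the boost::sort code:

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Time one sorting algorithm on a private copy of the input data.
// The algorithm is supplied as a callable taking an iterator range.
template <class T, class Sorter>
double time_sort(std::vector<T> data, Sorter sorter) {
    auto start = std::chrono::steady_clock::now();
    sorter(data.begin(), data.end());       // sorts the local copy
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();                 // seconds
}

// usage:
//   double secs = time_sort(numbers,
//       [](auto first, auto last) { std::sort(first, last); });
```

Because the data is taken by value, every algorithm sees the same input, which mirrors the diff-the-results testing approach described above.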
Hi Steven
Thanks for your ideas. I will check them.
Now I am designing, developing and testing several improvements to the algorithms. When I finish, I will prepare a clear and easy interface to transform any sort algorithm, even the parallel ones, into an indirect sort, and prepare a brief document explaining how to do it.
Perhaps that will be a good moment to move all the code to the namespace boost::sort.
Only after these things will I begin to prepare the big benchmark. And then I would like your experience and opinion.
My idea is to make the benchmark easy to run; if we want the opinions and results of other members of the Boost community, we must try to simplify their work.
I will keep you informed of my progress, but I must travel to be with my family, so until January I won't be running at full speed.
Happy Christmas. My best wishes to you and the Boost community.
Francisco
Hi Steven,
This Christmas I worked hard. I developed a new version of the merge sort that is about 10-15% faster than the previous one, using additional memory of only half the size of the data. The GCC and TBB algorithms use additional memory equal to the size of the data.
This version is exception safe. If an exception occurs in the algorithm (it only throws exceptions in the allocation and deallocation of the additional memory), the integrity of the data is guaranteed: you still have all the original data (unordered if the exception happens in the allocation, ordered if it happens in the deallocation).
I examined the TBB version, and the secret is the sample sort algorithm. The TBB version is provisional and not exception safe. I am working on the design of a new, exception-safe version of this algorithm. The current version of the parallel stable algorithm is about 20% slower than the TBB version in the worst case, but as the size of the elements grows, the indirect version is faster than all the others: with 64 bytes it is around 10% faster than the best TBB version, with 128 bytes the best TBB version is around 90% slower, and with 256 bytes it is 240% slower.
An additional advantage of the indirect sort is that the additional memory is used for the pointers, not for the data, and it is around 10% of the memory used by the data.
I hope to have the new algorithm in two weeks, more or less. Then I will put everything in the namespace boost::sort and design all the tests and benchmarks, as mentioned in the previous message.
Yours
Francisco
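The half-memory merge idea described above can be illustrated with a minimal sketch (this is not the actual smart_merge_sort code, and the recursion is elided for brevity): sort both halves, copy only the first half into the auxiliary buffer, then merge the buffer with the second half back into the original range.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Merge sort sketch needing an auxiliary buffer of only n/2 elements.
template <class T>
void half_memory_merge_sort(std::vector<T>& v) {
    const std::size_t n = v.size();
    if (n < 2) return;
    const std::size_t mid = n / 2;
    std::sort(v.begin(), v.begin() + mid);   // a real version recurses here
    std::sort(v.begin() + mid, v.end());
    // Only the first half is copied out: n/2 extra memory.
    std::vector<T> buf(std::make_move_iterator(v.begin()),
                       std::make_move_iterator(v.begin() + mid));
    std::size_t i = 0, j = mid, out = 0;
    // The write position never catches up with the read position j,
    // so merging back into v is safe.
    while (i < mid && j < n) {
        if (v[j] < buf[i]) v[out++] = std::move(v[j++]);
        else               v[out++] = std::move(buf[i++]);  // stable on ties
    }
    while (i < mid) v[out++] = std::move(buf[i++]);
    // any remaining elements of v[j..n) are already in place
}
```

This shows where the halving of the auxiliary memory comes from; exception safety around the buffer allocation is a separate concern handled in the real implementation.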
Francisco,
On Tue Jan 13 2015 at 12:43:00 PM Francisco José Tapia
Hi Steven,
This Christmas I worked hard. I developed a new version of the merge sort that is about 10-15% faster than the previous one, using additional memory of only half the size of the data. The GCC and TBB algorithms use additional memory equal to the size of the data.
This version is exception safe. If an exception occurs in the algorithm (it only throws exceptions in the allocation and deallocation of the additional memory), the integrity of the data is guaranteed: you still have all the original data (unordered if the exception happens in the allocation, ordered if it happens in the deallocation).
Less memory, comparable speed, and exception safety sounds good.
I examined the TBB version, and the secret is the sample sort algorithm. The TBB version is provisional and not exception safe. I am working on the design of a new, exception-safe version of this algorithm. The current version of the parallel stable algorithm is about 20% slower than the TBB version in the worst case, but as the size of the elements grows, the indirect version is faster than all the others: with 64 bytes it is around 10% faster than the best TBB version, with 128 bytes the best TBB version is around 90% slower, and with 256 bytes it is 240% slower.
Yes, indirect sorting is clearly superior for large data types, and it would be good to make it as easy as possible to use. It'd be nice to also have a simple wrapper that can use the spreadsort algorithm with indirect sorting too.
I am in the process of integrating the first version of the boost sort library in https://github.com/boostorg/sort. It would be good to integrate your testing with the tests in that library.
On 14-01-2015 03:52, Steven Ross wrote:
Yes, indirect sorting is clearly superior for large data types, and it would be good to make it as easy as possible to use. It'd be nice to also have a simple wrapper that can use the spreadsort algorithm with indirect sorting too.
Larger, non-fast-movable data types? -Thorsten
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Steven Ross Sent: 14 January 2015 02:53 To: boost@lists.boost.org Subject: Re: [boost] [SORT] Parallel Algorithms
Francisco,
On Tue Jan 13 2015 at 12:43:00 PM Francisco José Tapia
wrote: Hi Steven,
This Christmas I worked hard. I developed a new version of the merge sort that is about 10-15% faster than the previous one, using additional memory of only half the size of the data. The GCC and TBB algorithms use additional memory equal to the size of the data.
<snip>
Yes, indirect sorting is clearly superior for large data types, and it would be good to make it as easy as possible to use. It'd be nice to also have a simple wrapper that can use the spreadsort algorithm with indirect sorting too.
Would Boost.Multiprecision (50 decimal digits floating-point say) provide a simple demo type?
I am in the process of integrating the first version of the boost sort library in https://github.com/boostorg/sort. It would be good to integrate your testing with the tests in that library.
Perhaps a git branch off the new /sort/develop branch would be useful?
I note that Steven Ross is making a separate namespace for spreadsort, boost::sort::spreadsort. Although this seems long (slow typists can provide an alias or a using ... ;-), I strongly suggest that you follow this example, perhaps boost::sort::tbb, though I don't like tbb at all: far too acronymic. Think of a better name, even if longer? boost::sort::parallel?
Paul
PS The new Sort/Spreadsort docs are almost ready for release. I will add Quickbook documentation (from plain text notes) for your parallel sort version if you write to me off-list.
---
Paul A. Bristow
Prizet Farmhouse
Kendal UK LA8 8AB
+44 (0) 1539 561830
Hi Thorsten,
The idea behind the indirect sort is very simple. Instead of sorting the elements, you create a vector of pointers and sort that: you move the pointers, not the elements. At the end, with the vector of pointers sorted, you move the data only once.
In a stable sort, the algorithms usually need additional memory. GCC and TBB need the same amount as the data; my algorithms only half. Imagine you want to sort 10 million elements of 128 bytes each. With the indirect sort you need 10 million pointers, plus additional memory for 5 million pointers. The memory used by the data is 1280 MB, but the pointers and the additional memory are 15 million * 8 = 120 MB, around 10% of the memory used by the data.
In the first messages of this conversation I sent a zip with code; there you can find the code of the indirect sort. If you don't understand something, please tell me and I will try to explain.
Paul,
About the namespace: if in the end the results are satisfactory and we decide to include this in the library, just tell me the namespace and I will use it; to me it is indifferent. But I need time for the new algorithm, and to design a complete test suite, the benchmarks to compare with other versions, and the documentation.
If you want a benchmark with multiprecision numbers, we can do that without problem, even checking the speed with different sizes of the numbers.
Yours
Francisco
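As a minimal illustration of the indirect sort described above (a sketch, not the countertree code), one can sort an index array by the values it refers to, then move each element into its final place exactly once by following the permutation cycles, so the only extra memory is the index array:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Indirect sort sketch: sort indices, then permute the data in place.
template <class T>
void indirect_sort(std::vector<T>& v) {
    const std::size_t n = v.size();
    std::vector<std::size_t> idx(n);
    for (std::size_t i = 0; i < n; ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(),
              [&v](std::size_t a, std::size_t b) { return v[a] < v[b]; });
    // idx[i] is now the position of the element that belongs at i.
    for (std::size_t i = 0; i < n; ++i) {
        if (idx[i] == i) continue;
        T tmp = std::move(v[i]);          // open the cycle at i
        std::size_t j = i;
        while (idx[j] != i) {
            v[j] = std::move(v[idx[j]]);  // pull the next element forward
            std::size_t k = idx[j];
            idx[j] = j;                   // mark slot as settled
            j = k;
        }
        v[j] = std::move(tmp);            // close the cycle
        idx[j] = j;
    }
}
```

Each element is moved O(1) times regardless of its size, which is why the technique pays off as elements grow.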
Hi Steven
I have the final version of all the algorithms. I wrote 6 stable sort algorithms and selected the fastest. On my computer it is about 15% faster than the GCC stable sort.
I also had problems with an error in one algorithm combined with an error in the GCC dynamic memory system: under a certain sequence of requests, the system provided an invalid address, causing a segmentation fault. Detecting and correcting it was a debugging nightmare.
The algorithms provided are:
1.- sort : The algorithm is intro_sort. It doesn't need additional memory. Performance is similar to the GCC and Windows versions. Not stable.
2.- parallel_sort : Decomposes the global range into small ranges to be sorted with intro_sort. It doesn't need additional memory. Performance is similar to the TBB and Microsoft versions. Not stable.
3.- stable_sort : The algorithm is called smart_merge_sort and is about 10-15% faster than the GCC stable sort. The additional memory is half of the memory used by the data (the GCC stable sort uses the same amount of memory as the data).
4.- sample_sort : A parallel stable sort. It has the same performance as the TBB implementation and is much faster than the GCC parallel stable_sort. The additional memory is the same as the data.
5.- parallel_stable_sort : Based on the sample sort. It increases the sample_sort time by about 5%, but the additional memory used is half of the memory used by the data.
The functions are in the namespace boost::sort. All the algorithms are exception safe, and they use an indirect sort when the size of the objects is big. According to my benchmarks, the size limit for switching to an indirect sort is 128 bytes.
The parallel algorithms have the form
algorithm ( first, last, compare, number_threads ).
The number of threads is passed with a struct NThreads, which receives an unsigned and returns an unsigned. The default constructor of this struct is filled with the number of HW threads.
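From this description, the calling convention might look like the following sketch (NThreads is the name given in the message, but its exact definition here is an assumption):

```cpp
#include <thread>

// Assumed shape of the NThreads wrapper: receives an unsigned,
// returns an unsigned, and defaults to the hardware thread count.
struct NThreads {
    unsigned nthreads;
    NThreads() : nthreads(std::thread::hardware_concurrency()) {}
    explicit NThreads(unsigned n) : nthreads(n) {}
    unsigned operator()() const { return nthreads; }
};

// The parallel algorithms would then be called as:
//   boost::sort::parallel_sort(first, last, compare, NThreads(8));
```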
Today I am testing on Windows with Visual Studio 2013 CTP 4, and changing some parts of the code due to incompatibilities with C++11 features.
In a few days I will document all the functions of the code to generate doxygen documentation, and then I can send you a version of the code.
Then we must talk about how to design the benchmark. I think the starting point can be your experience with your code.
The C++ committee proposes a format for calling the parallel functions, described in
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3554.pdf
It would be a good idea to examine it and decide whether we follow the proposal, don't follow it, or provide both interfaces for the parallel algorithms. My first impression is that it is not very complex to implement.
Yours
Francisco Tapia
On Fri, Mar 20, 2015 at 9:12 AM Francisco José Tapia
I have the final version of all the algorithms. I wrote 6 stable sort algorithms and selected the fastest. On my computer it is about 15% faster than the GCC stable sort.
Could you send them please? I'd like to verify the speedup you're seeing, and compare vs TBB. I'm ok if it just is comparable in speed to TBB and it has additional advantages of some type.
I also had problems with an error in one algorithm combined with an error in the GCC dynamic memory system: under a certain sequence of requests, the system provided an invalid address, causing a segmentation fault. Detecting and correcting it was a debugging nightmare.
One thing to watch out for is undefined operations, such as type conversions between int and float types; those aren't compiler bugs: they're undefined (this problem hit me with float_sort, before someone showed me how to convert safely). I hope you've fixed the bug.
The algorithms provided are:
1.- sort : The algorithm is intro_sort. It doesn't need additional memory. Performance is similar to the GCC and Windows versions. Not stable.
2.- parallel_sort : Decomposes the global range into small ranges to be sorted with intro_sort. It doesn't need additional memory. Performance is similar to the TBB and Microsoft versions. Not stable.
Does it provide any advantage over those implementations, aside from being open-source?
3.- stable_sort : The algorithm is called smart_merge_sort and is about 10-15% faster than the GCC stable sort. The additional memory is half of the memory used by the data (the GCC stable sort uses the same amount of memory as the data).
GCC uses no additional memory, or 100% additional memory? Did you compare against the tbb stable sort I pointed you to?
4.- sample_sort : A parallel stable sort. It has the same performance as the TBB implementation and is much faster than the GCC parallel stable_sort. The additional memory is the same as the data.
5.- parallel_stable_sort : Based on the sample sort. It increases the sample_sort time by about 5%, but the additional memory used is half of the memory used by the data.
That's interesting, so it's better from a memory perspective than the competing algorithms, and not bad in speed? That sounds like it's worth serious consideration as an addition.
The functions are in the namespace boost::sort. All the algorithms are exception safe, and they use an indirect sort when the size of the objects is big. According to my benchmarks, the size limit for switching to an indirect sort is 128 bytes.
The parallel algorithms have the form
algorithm ( first, last, compare, number_threads ).
The number of threads is passed with a struct NThreads, which receives an unsigned and returns an unsigned. The default constructor of this struct is filled with the number of HW threads.
Today I am testing on Windows with Visual Studio 2013 CTP 4, and changing some parts of the code due to incompatibilities with C++11 features.
In a few days I will document all the functions of the code to generate doxygen documentation, and then I can send you a version of the code.
Then we must talk about how to design the benchmark. I think the starting point can be your experience with your code.
Please look at this code, and refactor/reuse what test code you can: https://github.com/boostorg/sort Specifically, look in the test directory, the example directory, and look in tune.pl. Are there any parameters you would want to tune for specific systems (besides the number of threads)? The basic idea is that randomgen writes out a randomized distribution to disk, and then we run whatever algorithms we want to compare (writing their results to disk), verify that the results are correct, then compare speed. I'd prefer not to skip the disk intermediate step because having the files on disk makes it easier to debug when something gives incorrect results. We should compare with variable-length strings and ints, in addition to more complicated data structures like what you've been testing with. It's worth noting that it's quite common to use vectors for large quantities of data in a class/struct, in which case the sorting would already be indirect.
The C++ committee proposes a format for calling the parallel functions, described in
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3554.pdf
It would be a good idea to examine it and decide whether we follow the proposal, don't follow it, or provide both interfaces for the parallel algorithms. My first impression is that it is not very complex to implement.
With your implementations, it looks like people have these choices:
1) how many threads to use
2) stable or not
3) memory or speed efficient
4) direct or indirect
Note that #4 may be influenced by #3, and that you may be able to determine that automatically, but providing people a simple way to make those choices is a good idea. You could use the approach suggested by the committee if it fits well, or another one if you have a better approach to this specific problem.
Steve
Hi Francisco,
On 17 December 2014 at 05:42, Francisco José Tapia
SCALABILITY : Some algorithms don't scale well when the number of threads grows: increasing the number of threads doesn't decrease the execution time. This lack of scalability is emphasized when the size of the elements to sort grows.
Sorry about butting in so late, but on the topic of scalability and potential superiority to TBB, you might be interested to check out Hartmut Kaiser's talk on asynchronous computing ( https://www.youtube.com/watch?v=5xyztU__yys) and his lab's library, HPX ( http://stellar.cct.lsu.edu/projects/hpx/). Cheers. Jeremy
Thanks,
I will take a look. It is always interesting to hear Mr Kaiser and learn new things from his huge knowledge.
Francisco
participants (5)
- Francisco José Tapia
- Jeremy Murphy
- Paul A. Bristow
- Steven Ross
- Thorsten Ottosen