Sorry for the delay. As promised, here are the parallel sorting algorithms.

On my computer (quad core), the GCC std::sort and std::stable_sort are slightly faster than my versions of intro_sort and merge_sort (stable_sort is a synonym of merge_sort). On another computer, with an Intel i3, my algorithms are slightly faster than the GCC versions.

Trying to optimize the parallel implementation, I modified the parallel versions to use std::sort and std::stable_sort. The parallel_sort version responded well, but parallel_stable_sort increased the time by around 60% compared with the version based on countertree::parallel_merge_sort.

After many tests, I discovered the cause of the problem. The demonstration is in test_stress_sort_01.cpp. In this program we have a big vector of 100 000 000 uint64_t elements. The program divides it into parts of equal size, one per hardware core (on my computer, 4 parts and 4 cores). It first sorts the parts sequentially, timing each one, and then does the same in parallel, each thread sorting one part, timing the whole run. This is done with std::stable_sort and with countertree::merge_sort, and the input data is the same for all runs. (A minimal sketch of this test is included near the end of this message.) The next lines show the results:

max. OpenMP threads: 4 (4 processors)
Sort of 100000000 elements in a vector
--------------------- Random elements, few repeated ( rand() ) -----------------------

Stable Sort
Part 0 secs :2.77955
Part 1 secs :2.75015
Part 2 secs :2.76068
Part 3 secs :2.76284
Parallel Sort secs :8.49796

Countertree stable sort
Part 0 secs :2.72802
Part 1 secs :2.71614
Part 2 secs :2.72074
Part 3 secs :2.722
Parallel Sort secs :4.80692

Surprisingly, the parallel sort with countertree::stable_sort is much faster than the sort with std::stable_sort of the GCC 4.8 and 4.9 compilers. This is the secret of the speed of countertree::parallel_merge_sort.

This is very important because, when the size of the elements is small, the GCC and the countertree algorithms have similar performance, but when the size of the elements grows, std::stable_sort is slower than countertree::merge_sort. The difference increases in the parallel version. The next lines are part of the output of benchmark_sort_05.cpp:

***************************************************************
        Sort of N = 6250000 elements of size 128
***************************************************************
--------------------- Random elements, few repeated ( rand() ) -----------------------

STL std::sort                    :2.9188 secs
countertree::intro_sort          :2.87864 secs
parallel::sort                   :2.46113 secs
tbb::parallel_sort               :1.86458 secs
countertree::parallel_sort       :1.89263 secs

std::stable_sort                 :8.77963 secs
countertree::merge_sort          :5.70694 secs
parallel::stable_sort            :8.03946 secs
countertree::parallel_merge_sort :4.90815 secs

I have in mind a new version of the stable sort to improve the speed by reducing the number of internal copies of elements. I hope to have time at Christmas, but I can't guarantee it. If the results are as expected, I will send it to you.

You also have a version of indirect_sort, which sorts pointers to the elements and, at the end, rearranges the elements according to the pointers. You can replace any sort algorithm with its indirect version and obtain the same result; the only difference is the time needed. In benchmark_indirect_sort.cpp you can see a comparison of the time needed by several sort methods and their indirect versions, with elements of different sizes. (A small sketch of the indirect idea is included below as well.)

This code is a preliminary version, done for my countertree library.
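In case it helps to reproduce the measurement, this is a minimal sketch of the chunked test described above, written with plain std::thread and std::stable_sort. It is only my assumption of the structure; the real test_stress_sort_01.cpp also runs the countertree::merge_sort variant, and its timing and reporting details differ.

// Minimal sketch (not the actual test_stress_sort_01.cpp): each hardware
// thread sorts its own slice of a large vector, first one slice at a time
// and then all slices concurrently, so both runs sort exactly the same data.
// Note: two vectors of 100 000 000 uint64_t need about 1.6 GB of RAM.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

int main()
{
    const std::size_t nelem = 100000000;                 // size used in the post
    unsigned nparts = std::thread::hardware_concurrency();
    if (nparts == 0) nparts = 4;

    std::vector<std::uint64_t> data(nelem);
    std::mt19937_64 gen(0);
    for (auto &v : data) v = gen();

    std::vector<std::uint64_t> work(data);               // same input for both runs
    const std::size_t chunk = nelem / nparts;
    using clock_type = std::chrono::steady_clock;

    // Sequential run: sort each part one after another, timing each part.
    for (unsigned i = 0; i < nparts; ++i) {
        auto first = work.begin() + i * chunk;
        auto last  = (i + 1 == nparts) ? work.end() : first + chunk;
        auto t0 = clock_type::now();
        std::stable_sort(first, last);
        std::chrono::duration<double> d = clock_type::now() - t0;
        std::cout << "Part " << i << " secs :" << d.count() << '\n';
    }

    // Parallel run: one thread per part, all parts sorted at the same time.
    work = data;
    auto t0 = clock_type::now();
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < nparts; ++i) {
        auto first = work.begin() + i * chunk;
        auto last  = (i + 1 == nparts) ? work.end() : first + chunk;
        threads.emplace_back([first, last] { std::stable_sort(first, last); });
    }
    for (auto &t : threads) t.join();
    std::chrono::duration<double> d = clock_type::now() - t0;
    std::cout << "Parallel Sort secs :" << d.count() << '\n';
    return 0;
}

With ideal scaling, the parallel run would take roughly as long as a single part; the figures above show that the run with std::stable_sort is much further from that than the run with countertree::merge_sort.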
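For reference, this is a small sketch of the indirect idea, using a hypothetical helper named indirect_sort_sketch. It is not the library's indirect_sort; it only shows one common way to do it: sort iterators to the elements with std::sort, then move each element at most once to its final place by following the cycles of the resulting permutation.

// Sketch of an indirect sort (hypothetical helper, not the library's
// indirect_sort): sort iterators to the elements, then move every element
// at most once to its final position by walking the cycles of the permutation.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

template <class RandomIt,
          class Compare = std::less<typename std::iterator_traits<RandomIt>::value_type>>
void indirect_sort_sketch(RandomIt first, RandomIt last, Compare comp = Compare())
{
    const std::size_t n = static_cast<std::size_t>(last - first);

    // 1. Sort iterators instead of elements; only iterators are moved here.
    std::vector<RandomIt> idx(n);
    for (std::size_t i = 0; i < n; ++i) idx[i] = first + i;
    std::sort(idx.begin(), idx.end(),
              [&comp](RandomIt a, RandomIt b) { return comp(*a, *b); });

    // 2. Position k must receive the element idx[k] points to. Walk each cycle
    //    of that permutation, moving every element exactly once.
    std::vector<bool> placed(n, false);
    for (std::size_t i = 0; i < n; ++i) {
        if (placed[i]) continue;
        std::size_t src = static_cast<std::size_t>(idx[i] - first);
        if (src == i) { placed[i] = true; continue; }
        auto tmp = std::move(first[i]);                  // open a hole at position i
        std::size_t cur = i;
        while (true) {
            placed[cur] = true;
            if (src == i) { first[cur] = std::move(tmp); break; }   // cycle closed
            first[cur] = std::move(first[src]);
            cur = src;
            src = static_cast<std::size_t>(idx[cur] - first);
        }
    }
}

int main()
{
    std::vector<std::string> v = {"pear", "apple", "orange", "banana"};
    indirect_sort_sketch(v.begin(), v.end());
    for (const auto &s : v) std::cout << s << ' ';       // prints: apple banana orange pear
    std::cout << '\n';
    return 0;
}

The point of the indirect version is that the sort itself only moves iterators, and each element is moved at most once afterwards, which is why it can win when the elements are large; the price is the extra memory for the array of iterators.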
If you consider it useful, I will change the namespace and document the code, the algorithms, and the ideas obtained from the benchmarks. Please feel free to ask me any question, or to modify or change anything. The goal is to provide good sort methods to the users. In the zip file you have the code and a file with a description of each source file.

Yours