Hi Steven, I had been working about the memory used by the different algorithms, specially the stable algorithms. I developed a low memory algorithm circular_stable_sort, which use only a 10% more of memory, and the speed decrease around 15% with small objects and with big objects is similar to the merge_sort I was looking too, the TimSort algorithm. I found a C++ implementation in https://github.com/gfx/cpp-TimSort . I inserted this algorithm in the benchmark programs. With small objects, the speed is slower than the stable sort and merge_sort, and have a good performance with big objects, greater than 64 bits. It's logic the decision of Java and Python, because they must order objects, usually non contiguous and bigger than the naked number. For the measure of the memory used by the program I use the command /usr/bin/time -v program_to_execute. I take the benchmark_algorithm_04 (100000000 uint64_t numbers), and comment all the invocations to the algorithms , except one, compile, and run in a terminal with the command /usr/bin/time -v benchmark_sort_04. I repeat this process with all the algorithms for to know the memory used by each algorithm. Algorithm Memory used std::sort 1 565 456 countertree::intro_sort 1 565 280 GCC parallel::sort 2 346 868 tbb::parallel_sort 1 565 696 countertree::parallel_sort 1 565 468 std::stable_sort 1 957 232 countertree::merge_sort 1 955 908 countertree::circular_merge_sort 1 696 316 sfx::timsort 1 996 364 parallel::stable_sort 2 742 384 countertree::parallel_merge_sort 1 956 636 I checked the time with benchmark_sort_05, with copy all the structure in the operation = and in the copy constructor. The number of elements in each sort is 800000000/size of the element in bytes 8 bytes 16 bytes 32 bytes 48 bytes 64 bytes 128 bytes 256 bytes std::stable_sort 16.99 10.36 9.75 9.62 8.85 8.89 10.25 countertree::merge_sort 16.96 10.38 8.13 7.44 7.10 6.67 9.01 countertree::circular_merge_sort 19.77 12.80 8.81 7.61 7.01 6.48 8.62 timsort 22.89 13.43 10.12 8.33 7.70 7.00 8.62 parallel::stable_sort 9.68 8.91 8.90 8.90 8.23 8.21 7.72 countertree::parallel_merge_sort 7.04 5.96 5.74 5.70 5.64 5.54 5.94 About the stable sort of tbb, I see, but I didn't have time to examine and check in deep. But be quiet. The benchmark_sort indirect shows Size of element 48 Number of elements :16666666 parallel_sort :2.53964 sec indirect_parallel_sort :5.70979 sec parallel_stable_sort :5.73186 sec indirect_parallel_stable_sort :5.59269 sec Size of element 64 Number of elements :12500000 parallel_sort :2.59812 sec indirect_parallel_sort :4.1553 sec parallel_stable_sort :5.63771 sec indirect_parallel_stable_sort :3.95266 sec Size of element 128 Number of elements :6250000 parallel_sort :2.28648 sec indirect_parallel_sort :2.34792 sec parallel_stable_sort :5.62879 sec indirect_parallel_stable_sort :2.30188 sec Size of element 256 Number of elements :3125000 parallel_sort :2.05017 sec indirect_parallel_sort :1.50661 sec parallel_stable_sort :5.85409 sec indirect_parallel_stable_sort :1.48561 sec As you can see with a 256 bytes of size the indirect parallel_stable_sort is 3 times faster. And the indirect_sort is a 25% faster. The final solution, if at end is taken, will be a hybrid algorithm depending of the size of the elements I propose you the next. Let me 2 or 3 weeks and I will prepare a big benchmark with all the algorithms, including the indirect version of the stable and unstable sort. We will run the benchmarks over different machines, and according the results, take a decision about if we propose something to Boost, and which algorithm must be used in the hybrid version. If, at end, we decide don't propose nothing, I only can say this had been a nice and exciting experience Mi intention is not to impose my code, I want provide to the final users the best solution, if exist. With my code or with the code of other. I am the first to clap the winner. We are doing this because this parallel implementation don't need any other library, only the standard. There are compilers without OpenMP or TBB support, or simply, the user don't want to load them for to make a sort. I have in mind, several improvements of the algorithms. I will codify and test, and if run well, I include in the algorithms. I want, too, simplify the interface of indirect_sort, for to be used in a easy way by any sort algorithm. If you are interested, I will prepare too a brief document with the explanation and the internal algorithm. At end, only say Happy Christmas, and New Year to the Boost community. Yours Francisco Tapia fjtapia@gmail.com