[move][sort] Asymptotically optimal stable and in-place merging and sorting with no auxiliary memory
[Disclaimer: long e-mail, thank you for your attention]

Hi to all,

Some months ago I tried to gain interest from sorting experts in implementing stable and in-place merge [O(N)] or sort [O(N log N)] algorithms that don't need additional memory.

My main use case for these algorithms was lowering the complexity guarantees of the range insertion functions offered by flat associative containers without any temporary allocation (the programmer does not expect allocations when capacity() is big enough to hold the new elements). On the other hand, I wanted those algorithms to be adaptive, so as to take advantage of the extra capacity already allocated by flat containers. Of course, if a good implementation is available in Boost, it could be used as existing practice to change the worst-case complexities of some STL algorithms when no additional memory (e.g. from get_temporary_buffer()) is available.

After reviewing several algorithms from papers and some open source implementations, I decided to implement one of them. The implemented algorithm is based on the GrailSort algorithm developed by Andrey Astrelin. The main idea of my adaptive_sort algorithm is explained by Andrey in an article published on the Russian collaborative blog Habrahabr (http://habrahabr.ru/post/205290/). The algorithm is based on ideas explained by Huang and Langston in their paper "Fast Stable Merging and Sorting in Constant Extra Space" (1989-1992, http://comjnl.oxfordjournals.org/content/35/6/643.full.pdf).

Since the implementation from https://github.com/Mrrl/GrailSort is macro-based, does not support non-memcopyable types and has an MIT license (which seems to be problematic for Boost), I implemented the algorithm from scratch, following the article and taking ideas from Andrey's implementation. I added some new ideas and optimizations in order to reduce comparisons and element moves.
I must say that the implementation was really hard for someone like me without experience in sorting algorithms; the complexity is huge when compared with a classic mergesort. The implementation is already in the develop and master branches, but the develop branch contains more tests and refactored, better documented code, so please check the code from the develop branch if interested. The implementation is still experimental and not production ready.

What about benchmarks? I've only experimented with the oldest Visual compilers, to make sure Boost.Move emulation is correctly supported and the complexity guarantees hold. Note that the STL algorithms tested in the benchmark might not be as fast as current implementations. In general, the new algorithms need many more element moves than algorithms that allocate O(N) memory, but they are still quite fast when data movement is cheap (e.g. when a move constructor just copies a couple of pointers). These are the numbers for A VERY LIGHTWEIGHT object (a struct holding two integers, one for the key and another one to test stability; don't take the times very seriously, just pay attention to the number of comparisons/moves).
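For reference, the benchmark element described above can be sketched like this; the names and exact layout are my own assumptions, not the ones from bench_merge.cpp:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical benchmark element: 'key' drives the comparison and
// 'origin' records the original position so stability can be verified.
struct elem_t {
    int key;     // compared
    int origin;  // ignored by the comparison, used to check stability
};

inline bool operator<(const elem_t &l, const elem_t &r)
{  return l.key < r.key;  }

// A range is stably sorted iff keys are non-decreasing and equal keys
// keep their original relative order.
bool is_stably_sorted(const std::vector<elem_t> &v)
{
    for (std::size_t i = 1; i < v.size(); ++i) {
        if (v[i].key < v[i - 1].key)
            return false;
        if (v[i].key == v[i - 1].key && v[i].origin < v[i - 1].origin)
            return false;
    }
    return true;
}
```

Any candidate stable sort can then be validated by filling a vector with scrambled keys and sequential origins, sorting it, and asserting is_stably_sorted() on the result.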
//////////////////////////
//////////////////////////
     STABLE MERGING
//////////////////////////
//////////////////////////

- N: number of elements
- NK: number of distinct keys (0 means all keys different)
- InplaceMerge: std::inplace_merge
- Sqrt2AdaptMerge: adaptive_merge with an auxiliary buffer of 2*sqrt(N) elements
- SqrtAdaptMerge: adaptive_merge with an auxiliary buffer of sqrt(N) elements
- AdaptMerge: adaptive_merge with no external buffer
- Cmp: number of comparisons / N
- Cpy: number of move constructions/assignments / N
- (ratio): time of execution / time needed by InplaceMerge

----------------------------------------------
Numbers produced by the bench_merge.cpp test in Boost.Move:
----------------------------------------------
- - N: 100001, NK: 511 - -
InplaceMerge     Cmp: 0.9989  Cpy: 1.5000  633.65us ( 1.00)
Sqrt2AdaptMerge  Cmp: 1.2429  Cpy: 4.4209  910.16us ( 1.44)
SqrtAdaptMerge   Cmp: 4.2708  Cpy: 5.6712    1.63ms ( 2.57)
AdaptMerge       Cmp: 5.6014  Cpy: 8.7573    3.11ms ( 4.91)

- - N: 100001, NK: 2047 - -
InplaceMerge     Cmp: 0.9998  Cpy: 1.5000  604.77us ( 1.00)
Sqrt2AdaptMerge  Cmp: 1.2449  Cpy: 4.4483  934.83us ( 1.55)
SqrtAdaptMerge   Cmp: 2.0131  Cpy: 4.7609    1.12ms ( 1.85)
AdaptMerge       Cmp: 2.7608  Cpy: 11.1419   2.83ms ( 4.67)

- - N: 100001, NK: 8191 - -
InplaceMerge     Cmp: 0.9999  Cpy: 1.5000  653.51us ( 1.00)
Sqrt2AdaptMerge  Cmp: 1.2443  Cpy: 4.3179  943.86us ( 1.44)
SqrtAdaptMerge   Cmp: 1.4661  Cpy: 4.4348  998.62us ( 1.53)
AdaptMerge       Cmp: 1.7692  Cpy: 10.6292   2.54ms ( 3.89)

- - N: 100001, NK: 32767 - -
InplaceMerge     Cmp: 1.0000  Cpy: 1.5000  772.06us ( 1.00)
Sqrt2AdaptMerge  Cmp: 1.2417  Cpy: 4.4317    1.03ms ( 1.33)
SqrtAdaptMerge   Cmp: 1.3320  Cpy: 4.4758    1.06ms ( 1.38)
AdaptMerge       Cmp: 1.5295  Cpy: 10.1233   2.44ms ( 3.16)

- - N: 100001, NK: 0 - -
InplaceMerge     Cmp: 1.0000  Cpy: 1.5000  881.88us ( 1.00)
Sqrt2AdaptMerge  Cmp: 1.2387  Cpy: 4.4124    1.16ms ( 1.31)
SqrtAdaptMerge   Cmp: 1.2933  Cpy: 4.4412    1.20ms ( 1.36)
AdaptMerge       Cmp: 1.4110  Cpy: 7.5794    1.93ms ( 2.19)

In general, with no memory and enough keys, the algorithm needs 7.5*N data moves and 1.5*N comparisons. Data locality is quite high and, with cheap-moving types, the overall time can be just two to three times slower. With fewer keys, the algorithm might need up to 5*N comparisons and 11*N moves. The numbers improve if an auxiliary buffer of sqrt(N) elements is available. The number of comparisons and moves is even better than in the experimental results shown in several papers, but I still find them quite high. The merge algorithm needs some initial steps (e.g. unique key extraction) that are not amortized during the merge (adaptive_sort amortizes the initial steps much better). I derived the merge algorithm from the sorting algorithm, so maybe there is room for improvement (I've invested much more time in the sorting algorithm).

//////////////////////////
//////////////////////////
     STABLE SORTING
//////////////////////////
//////////////////////////

Numbers when sorting:
- N: number of elements
- NK: number of distinct keys (0 means all keys different)
- MergeSort: a half-copying mergesort (it only needs N/2 auxiliary memory) implemented in boost/move/algo/detail/merge_sort.hpp, based on the paper "Fast mergesort implementation based on half-copying merge algorithm" by Cezary Juszczak (http://kicia.ift.uni.wroc.pl/algorytmy/mergesortpaper.pdf). Usually faster than std implementations, with less auxiliary memory.
- StableSort: std::stable_sort
- HeapSort: heapsort (std::make_heap + std::sort_heap)
- QuartAdaptSort: adaptive_sort with N/4 auxiliary memory
- Sqrt2AdaptSort: adaptive_sort with 2*sqrt(N) auxiliary memory
- SqrtAdaptSort: adaptive_sort with sqrt(N) auxiliary memory
- SqrtHAdaptSort: adaptive_sort with sqrt(N)/2 auxiliary memory
- AdaptSort: adaptive_sort with no auxiliary memory
- Cmp: number of comparisons / N
- Cpy: number of move constructions/assignments / N
- (ratio): time of execution / time needed by MergeSort

[numbers produced by the bench_sort.cpp test in Boost.Move]

- - N: 100001, NK: 511 - -
MergeSort       Cmp: 16.379  Cpy: 16.707   6.49ms ( 1.00)
StableSort      Cmp: 20.594  Cpy: 22.978  10.85ms ( 1.67)
HeapSort        Cmp: 16.992  Cpy: 22.098  10.98ms ( 1.69)
QuartAdaptSort  Cmp: 17.659  Cpy: 26.314   8.34ms ( 1.29)
Sqrt2AdaptSort  Cmp: 17.829  Cpy: 35.340   9.90ms ( 1.53)
SqrtAdaptSort   Cmp: 27.805  Cpy: 56.059  19.48ms ( 3.00)
SqrtHAdaptSort  Cmp: 27.356  Cpy: 60.375  20.35ms ( 3.14)
AdaptSort       Cmp: 27.352  Cpy: 67.889  21.83ms ( 3.36)

- - N: 100001, NK: 8191 - -
MergeSort       Cmp: 16.394  Cpy: 16.725   7.47ms ( 1.00)
StableSort      Cmp: 20.578  Cpy: 22.978  11.92ms ( 1.60)
HeapSort        Cmp: 16.992  Cpy: 22.099  11.91ms ( 1.59)
QuartAdaptSort  Cmp: 17.676  Cpy: 26.362   9.26ms ( 1.24)
Sqrt2AdaptSort  Cmp: 17.847  Cpy: 35.747  10.93ms ( 1.46)
SqrtAdaptSort   Cmp: 18.977  Cpy: 41.641  12.19ms ( 1.63)
SqrtHAdaptSort  Cmp: 17.076  Cpy: 57.799  15.47ms ( 2.07)
AdaptSort       Cmp: 17.090  Cpy: 67.179  17.37ms ( 2.33)

- - N: 100001, NK: 32767 - -
MergeSort       Cmp: 16.395  Cpy: 16.727   7.66ms ( 1.00)
StableSort      Cmp: 20.581  Cpy: 22.986  12.16ms ( 1.59)
HeapSort        Cmp: 16.994  Cpy: 22.099  11.96ms ( 1.56)
QuartAdaptSort  Cmp: 17.674  Cpy: 26.319   9.47ms ( 1.24)
Sqrt2AdaptSort  Cmp: 17.833  Cpy: 35.792  11.16ms ( 1.46)
SqrtAdaptSort   Cmp: 18.967  Cpy: 41.604  12.42ms ( 1.62)
SqrtHAdaptSort  Cmp: 17.065  Cpy: 57.756  15.71ms ( 2.05)
AdaptSort       Cmp: 17.081  Cpy: 67.131  18.32ms ( 2.39)

- - N: 100001, NK: 0 - -
MergeSort       Cmp: 16.399  Cpy: 16.731   7.76ms ( 1.00)
StableSort      Cmp: 20.521  Cpy: 22.920  12.23ms ( 1.58)
HeapSort        Cmp: 16.991  Cpy: 22.094  11.85ms ( 1.53)
QuartAdaptSort  Cmp: 17.671  Cpy: 26.380   9.50ms ( 1.22)
Sqrt2AdaptSort  Cmp: 17.825  Cpy: 35.653  11.17ms ( 1.44)
SqrtAdaptSort   Cmp: 18.954  Cpy: 41.545  12.42ms ( 1.60)
SqrtHAdaptSort  Cmp: 17.058  Cpy: 57.694  15.66ms ( 2.02)
AdaptSort       Cmp: 17.072  Cpy: 67.072  17.57ms ( 2.26)

In general, with no memory and enough keys, the algorithm needs nearly optimal comparisons and significantly more (4x) data movements. Data locality is quite high and, with cheap-moving types, the overall time can be just two to three times slower. With fewer keys the algorithm performs worse, but it's still quite competitive. When extra memory (e.g. sqrt(N)) is available, the algorithm is quite fast with cheap-moving types.

////////////////////////////////////////////

Final words: the effort to implement the algorithms was big and I couldn't fix many open bugs in my libraries, so in the following months I don't expect to invest much time on this. At least I wanted to publicly share the committed implementation so that it can be improved by sorting experts. I don't claim these algorithms are groundbreaking or even useful for many programmers, but they might be a good starting point. I'm not happy with the implementation, as the algorithm is complicated and I haven't found a way to simplify it further without noticeable slowdowns. For those interested in the implementation and the algorithm: please contribute correct benchmarks and testing, as there are very few open source implementations of similar algorithms. Any optimization, patch or bug report is welcome.

Best,

Ion
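The half-copying idea behind the MergeSort baseline rests on one observation: a mergesort never needs more than N/2 auxiliary elements if only the left run is copied out before each merge. Below is a minimal sketch of that core idea; it is not the boost/move/algo/detail/merge_sort.hpp code, and Juszczak's paper refines the scheme further to avoid redundant copies:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Merge v[lo, mid) and v[mid, hi) stably using only (mid - lo) <= N/2
// auxiliary elements: copy the left run out, then merge it with the
// right run back into place. The destination index never catches up
// with the right run's read index, so nothing is overwritten early.
void merge_with_half_buffer(std::vector<int> &v, std::size_t lo,
                            std::size_t mid, std::size_t hi,
                            std::vector<int> &aux)
{
    aux.assign(v.begin() + lo, v.begin() + mid);  // copy only the left run
    std::size_t a = 0, b = mid, d = lo;
    while (a < aux.size() && b < hi) {
        if (aux[a] <= v[b]) v[d++] = aux[a++];    // '<=' keeps it stable
        else                v[d++] = v[b++];
    }
    while (a < aux.size()) v[d++] = aux[a++];     // right-run leftovers are
}                                                 // already in place

// Bottom-up mergesort built on the half-buffer merge above.
void half_buffer_merge_sort(std::vector<int> &v)
{
    std::vector<int> aux;                         // grows to at most N/2
    for (std::size_t w = 1; w < v.size(); w *= 2)
        for (std::size_t lo = 0; lo + w < v.size(); lo += 2 * w)
            merge_with_half_buffer(v, lo, lo + w,
                                   std::min(lo + 2 * w, v.size()), aux);
}
```

The auxiliary vector peaks at N/2 elements only in the final pass; every earlier merge reuses the same storage for smaller runs.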
On Wed, Mar 23, 2016 at 6:52 PM Ion Gaztañaga wrote:
> [Disclaimer: long e-mail, thank you for your attention]
> Hi to all,
> Some months ago I tried to gain interest from sorting experts in implementing stable and in-place merge [O(N)] or sort [O(N log N)] algorithms that don't need additional memory.
> My main use case for these algorithms was lowering the complexity guarantees of the range insertion functions offered by flat associative containers without any temporary allocation (the programmer does not expect allocations when capacity() is big enough to hold the new elements). On the other hand, I wanted those algorithms to be adaptive, so as to take advantage of the extra capacity already allocated by flat containers.
I'm still confused why std::inplace_merge doesn't work for your use case; what's wrong with attempting to create a temporary buffer and falling back to O(N log N) if it fails? I still fail to see a practical advantage.
On 24/03/2016 2:02, Steven Ross wrote:
> On Wed, Mar 23, 2016 at 6:52 PM Ion Gaztañaga wrote:
>> [Disclaimer: long e-mail, thank you for your attention]
>> Hi to all,
>> Some months ago I tried to gain interest from sorting experts in implementing stable and in-place merge [O(N)] or sort [O(N log N)] algorithms that don't need additional memory.
>> My main use case for these algorithms was lowering the complexity guarantees of the range insertion functions offered by flat associative containers without any temporary allocation (the programmer does not expect allocations when capacity() is big enough to hold the new elements). On the other hand, I wanted those algorithms to be adaptive, so as to take advantage of the extra capacity already allocated by flat containers.
> I'm still confused why std::inplace_merge doesn't work for your use case; what's wrong with attempting to create a temporary buffer and falling back to O(N log N) if it fails? I still fail to see a practical advantage.
For desktop systems this is not a problem unless you want to sort big ranges (say around 1 GB on 32-bit systems, where you can run out of memory). For flat containers, the issue is more related to class invariants: a vector/flat_map should never allocate if capacity() is enough; that's what the user expects. Although I've only mentioned flat_xxx containers, my goal was to have a tool that could be useful in other contexts. Other potential users are embedded developers, who also don't like to allocate memory after initialization. We could also try to allocate temporary storage and, if that fails, the fallback could be O(N log N) to sort the input and O(N) to merge it, instead of O(N log^2 N) when sorting and O(N log N) when merging.

Best,

Ion
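The flat-container range insertion being discussed can be sketched as follows: append the new elements into the spare capacity, sort only the appended tail, then merge the two sorted runs. std::inplace_merge below is just a stand-in for the merge step (it may allocate a temporary buffer internally, which is exactly what an in-place adaptive_merge would avoid):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of flat-set style range insertion: 'v' is kept sorted.
// New elements are appended into already-reserved capacity, the tail
// is sorted, and the two sorted runs are merged. With a buffer-free
// stable merge, no step here would allocate as long as capacity()
// was already big enough.
template <class T>
void flat_insert_range(std::vector<T> &v,
                       typename std::vector<T>::const_iterator first,
                       typename std::vector<T>::const_iterator last)
{
    const std::size_t old_size = v.size();
    v.insert(v.end(), first, last);                   // append the new elements
    std::stable_sort(v.begin() + old_size, v.end());  // sort only the new tail
    std::inplace_merge(v.begin(), v.begin() + old_size, v.end());
}
```

For a flat_map-style container the comparator would compare keys only; the stable merge then guarantees that equivalent new elements keep their insertion order.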
On 24/03/2016 12:14, Ion Gaztañaga wrote:
> On 24/03/2016 2:02, Steven Ross wrote:
>> On Wed, Mar 23, 2016 at 6:52 PM Ion Gaztañaga wrote:
>>> [Disclaimer: long e-mail, thank you for your attention]
>>> Hi to all,
>>> Some months ago I tried to gain interest from sorting experts in implementing stable and in-place merge [O(N)] or sort [O(N log N)] algorithms that don't need additional memory.
>>> My main use case for these algorithms was lowering the complexity guarantees of the range insertion functions offered by flat associative containers without any temporary allocation (the programmer does not expect allocations when capacity() is big enough to hold the new elements). On the other hand, I wanted those algorithms to be adaptive, so as to take advantage of the extra capacity already allocated by flat containers.
>> I'm still confused why std::inplace_merge doesn't work for your use case; what's wrong with attempting to create a temporary buffer and falling back to O(N log N) if it fails? I still fail to see a practical advantage.
I forgot to mention that Boost.Container supports C++03, so I would need to write my own inplace_merge version (which maybe I should). You might be interested in the mergesort implementation used for the tests, as I think it's faster than many std::stable_sort implementations.

Best,

Ion
On Thu, Mar 24, 2016 at 7:14 AM Ion Gaztañaga wrote:
> On 24/03/2016 2:02, Steven Ross wrote:
>> On Wed, Mar 23, 2016 at 6:52 PM Ion Gaztañaga wrote:
>>> [Disclaimer: long e-mail, thank you for your attention]
>>> Hi to all,
>>> Some months ago I tried to gain interest from sorting experts in implementing stable and in-place merge [O(N)] or sort [O(N log N)] algorithms that don't need additional memory.
>>> My main use case for these algorithms was lowering the complexity guarantees of the range insertion functions offered by flat associative containers without any temporary allocation (the programmer does not expect allocations when capacity() is big enough to hold the new elements). On the other hand, I wanted those algorithms to be adaptive, so as to take advantage of the extra capacity already allocated by flat containers.
>> I'm still confused why std::inplace_merge doesn't work for your use case; what's wrong with attempting to create a temporary buffer and falling back to O(N log N) if it fails? I still fail to see a practical advantage.
> For desktop systems this is not a problem unless you want to sort big ranges (say around 1 GB on 32-bit systems, where you can run out of memory).
> For flat containers, the issue is more related to class invariants: a vector/flat_map should never allocate if capacity() is enough; that's what the user expects.
> Although I've only mentioned flat_xxx containers, my goal was to have a tool that could be useful in other contexts.
> Other potential users are embedded developers, who also don't like to allocate memory after initialization.
Ok, your use case seems arcane, but I see how that might be useful. Do you want to add it as a sub-library inside Boost.Sort?
On 25/03/2016 1:57, Steven Ross wrote:
> Ok, your use case seems arcane, but I see how that might be useful. Do you want to add it as a sub-library inside Boost.Sort?
My main intent was to see if there is enough interest in such an algorithm and whether sorting experts can help improve it. If it's considered useful enough, then Boost.Sort seems a good place; if it's considered too flat-container specific, it can live inside the detail namespace.

Best,

Ion
Ion Gaztañaga wrote:
> Some months ago I tried to gain interest from sorting experts in implementing stable and in-place merge [O(N)] or sort [O(N log N)] algorithms that don't need additional memory.
> After reviewing several algorithms from papers and some open source implementations, I decided to implement one of those algorithms.

Wonderful. I don't think I have any practical use for it, but I still think it's great that you've done it!

> - NK: number of keys (0 means all keys different)

Can you clarify what this means? For example, what does NK = 1 mean?

> In general, with no memory and enough keys, the algorithm needs 7.5*N data moves and 1.5*N comparisons. Data locality is quite high and with cheap-moving types the overall time can be just two to three times slower. With fewer keys, the algorithm might need up to 5*N comparisons and 11*N moves.
What happens when the number of distinct keys is very small, e.g. 1 or 2? My recollection was that these algorithms are worse than O(N) in that case, but I could be wrong. Regards, Phil.
On 24/03/2016 15:08, Phil Endecott wrote:
> Ion Gaztañaga wrote:
>> Some months ago I tried to gain interest from sorting experts in implementing stable and in-place merge [O(N)] or sort [O(N log N)] algorithms that don't need additional memory.
>> After reviewing several algorithms from papers and some open source implementations, I decided to implement one of those algorithms.
> Wonderful. I don't think I have any practical use for it, but I still think it's great that you've done it!
>> - NK: number of keys (0 means all keys different)
> Can you clarify what this means? For example, what does NK = 1 mean?
NK=1 means that the number of different keys is 1, i.e. all elements are equal (NK=0 is a special value meaning NK equals N).
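For reference, inputs following this NK convention can be generated by reducing keys modulo NK; a sketch of such a generator (my own assumption, the actual generator in the benchmark sources may differ):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build n keys with exactly nk distinct values.
// nk == 0 means all n keys are different, matching the benchmark
// convention where NK=0 stands for NK == N.
std::vector<int> make_keys(int n, int nk)
{
    std::vector<int> keys(n);
    for (int i = 0; i < n; ++i)
        keys[i] = (nk == 0) ? i : i % nk;
    return keys;
}
```

So make_keys(N, 1) yields N equal elements, and make_keys(N, 0) yields N distinct ones; the benchmark harness would then shuffle or interleave these keys before sorting or merging.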
>> In general, with no memory and enough keys, the algorithm needs 7.5*N data moves and 1.5*N comparisons. Data locality is quite high and with cheap-moving types the overall time can be just two to three times slower. With fewer keys, the algorithm might need up to 5*N comparisons and 11*N moves.
> What happens when the number of distinct keys is very small, e.g. 1 or 2? My recollection was that these algorithms are worse than O(N) in that case, but I could be wrong.
With very few keys, a rotation-based sort is used, and in these cases it's very fast, as most keys are already ordered and rotations are minimal. If I've implemented the algorithm correctly, it should still be O(N) when merging and O(N log N) when sorting.

Best,

Ion
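The classical rotation-based merge (the same scheme libstdc++ uses as the buffer-free fallback of std::inplace_merge) can be sketched as follows; this only illustrates the rotation idea, not the adaptive algorithm under discussion:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

typedef std::vector<int>::iterator It;

// Buffer-free stable merge of the sorted runs [first, mid) and
// [mid, last): split the longer run in half, binary-search the split
// key in the other run, rotate the two middle pieces past each other,
// and recurse on both sides. O(1) extra space, O(N log N) comparisons
// and moves in the worst case.
void rotation_merge(It first, It mid, It last)
{
    if (first == mid || mid == last)
        return;
    if (mid - first == 1 && last - mid == 1) {  // two single elements:
        if (*mid < *first)                      // swap if out of order
            std::iter_swap(first, mid);
        return;
    }
    It cut1, cut2;
    if (mid - first >= last - mid) {            // left run is longer
        cut1 = first + (mid - first) / 2;
        cut2 = std::lower_bound(mid, last, *cut1);
    } else {                                    // right run is longer
        cut2 = mid + (last - mid) / 2;
        cut1 = std::upper_bound(first, mid, *cut2);
    }
    // C++11 std::rotate returns the new position of the old middle.
    It new_mid = std::rotate(cut1, mid, cut2);
    rotation_merge(first, cut1, new_mid);
    rotation_merge(new_mid, cut2, last);
}
```

Using lower_bound on the right run and upper_bound on the left run keeps equal elements in their original relative order, which is what makes the merge stable.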
As a baseline, would it also be possible to show the performance of `std::inplace_merge` and `std::stable_sort` when `std::get_temporary_buffer` returns a zero-sized buffer? That is, in the case that no memory can be allocated, how much better are these new sorting algorithms compared to the std ones?
On 24/03/2016 17:37, Gonzalo BG wrote:
> As a baseline, would it also be possible to show the performance of `std::inplace_merge` and `std::stable_sort` when `std::get_temporary_buffer` returns a zero-sized buffer? That is, in the case that no memory can be allocated, how much better are these new sorting algorithms compared to the std ones?
We would need to implement the algorithm ourselves, as there is no way to stub get_temporary_buffer() to return zero unless we modify the STL headers, which does not seem very clean ;-) The worst-case complexity of the STL fallback algorithms is worse, which means that with big ranges the new algorithms will always win. However, they could have a much bigger constant factor that makes them inefficient for smaller ranges. I'll look at how STL implementations write the buffer-free inplace_merge fallback and maybe implement it and add it to the benchmarks.

Best,

Ion
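Independently of stubbing get_temporary_buffer(), the Cmp figures in the benchmarks can be reproduced with a comparator that counts its own invocations; a minimal sketch (not the actual bench_merge.cpp/bench_sort.cpp harness):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Comparator that counts how often the algorithm under test calls it,
// so figures like "Cmp = comparisons / N" can be reproduced. The
// counter lives outside the functor because sort implementations may
// copy the comparator freely.
struct counting_less {
    std::size_t *count;
    explicit counting_less(std::size_t *c) : count(c) {}
    bool operator()(int l, int r) const { ++*count; return l < r; }
};

// Sorts a copy of 'v' and reports how many comparisons were spent.
std::size_t count_sort_comparisons(std::vector<int> v)
{
    std::size_t n = 0;
    std::stable_sort(v.begin(), v.end(), counting_less(&n));
    return n;
}
```

Swapping std::stable_sort for any other stable sort (or a merge for two sorted runs) gives directly comparable Cmp numbers; a similar wrapper around the element's move operations would yield the Cpy column.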
participants (4)
- Gonzalo BG
- Ion Gaztañaga
- Phil Endecott
- Steven Ross