[sort][container] Challenge for sorting experts, an improved stable sort
Hi,

I'm trying to improve insertion times for flat associative containers. In range insertions one optimization is to insert the whole range at the end of the container and sort the whole vector instead of inserting in a loop. An additional optimization proposed in the mailing list is to have an option to create the internal vector type of the flat map/set externally, fill it, and then move it into the flat container. That requires sorting the moved vector (and removing duplicates if needed).

I could try to use std::sort/std::stable_sort but there are some problems:

-> Non-portable for C++03 compilers and Boost.Move, and Boost.Container fully supports those platforms.

-> Additional O(N) temporary memory for stable sorting. Usually this means a temporary N-element array, although there are algorithms that use only N/2 temporary elements. With big containers, this is a lot of memory.

So I started analyzing several papers and implementations to see if a custom implementation was possible, but I felt Boost.Sort could be a good place to have that implementation and the Boost community a good place to discuss design alternatives.

My initial requirements:

1) Stable sort. A non-stable sort is faster and useful for unique keys containers, but stable sort can be used initially in all flat containers until the non-stable version is implemented.

2) Externally supplied auxiliary memory, with optionally no auxiliary memory at all. It seems that there are newer algorithms like Grailsort (https://github.com/Mrrl/GrailSort) or WikiSort (https://github.com/BonzaiThePenguin/WikiSort) that claim O(1) memory, even no auxiliary buffer, and key comparisons and data exchanges with complexity O(n*log(n)). However, I don't think those implementations correctly handle the basic exception guarantee or non-trivially copyable types.

Reason: I want to avoid allocating short-lived temporary memory: if it's available I want to pass the extra capacity that flat containers' internal vector holds for future insertions.

3) Customization points (without ADL, which IMHO is not customizable enough). The sort algorithm should use a SortTraits template parameter for basic operations like "move", "swap", "uninitialized_move". E.g.:

template<class T>
struct sort_traits
{
   static? void move(T &from, T &to);
   static? void swap(T &l, T &r);
   static? void uninitialized_move(T &from, void *to);
   static? void destroy(T &from, void *to);
};

I don't know if this trait should be stateful or not (static members). I could imagine the usefulness of a stateful implementation of sort_traits that could store a reference to an allocator so that some special "construct" version is used in some operations. Or just a version that counts the number of each operation to ease debugging. In any case, a stateless version would be enough for most cases.

Reason: In Boost.Container, the idea is to use boost::move + boost::adl_move_swap in the initial implementation, maybe some destructive move implementation in a second iteration if Boost.Move implements it. In any case those operations should be customizable by the user, independently from the type to be sorted. This also avoids unnecessary dependencies in Boost.Sort.

-> Dependencies: the implementation should only rely on Boost.Core, Boost.Config, maybe Boost.TypeTraits and no heavy STL dependencies (optionally). Currently Boost.Sort depends on Serialization for BOOST_STATIC_WARNING, but I think this is easily solvable; maybe this static assert utility should go to the core/utility library.

Reason: Sort is a low-level algorithm that should avoid pulling unneeded dependencies on other Boost libraries.

* * *

I don't know if other Boost libraries could benefit from this new stable_sort implementation. Maybe it's too special and I should try to write it myself in Boost.Container.

I would love to hear comments and suggestions from sort experts; this might be a common need that could enrich the Boost.Sort library.

Best,

Ion
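For illustration, a minimal sketch of the "insert the whole range at the end, then restore the invariant" strategy described above, written against a plain std::vector as a stand-in for the flat container's internal storage. The helper name and comparator are hypothetical, not Boost.Container code, and std::stable_sort here still allocates its own temporary buffer (exactly the behaviour the proposal wants to control):

#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical helper: append [first, last) to the underlying vector,
// then restore the sorted/unique invariant with a stable sort so that
// earlier occurrences of equivalent keys survive.
template <class T, class InputIt, class Compare>
void append_then_sort_unique(std::vector<T>& storage,
                             InputIt first, InputIt last, Compare comp)
{
    storage.insert(storage.end(), first, last);           // one bulk append
    std::stable_sort(storage.begin(), storage.end(), comp);
    storage.erase(std::unique(storage.begin(), storage.end(),
                              [&](const T& a, const T& b)
                              { return !comp(a, b) && !comp(b, a); }),
                  storage.end());                          // drop later duplicates
}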
On Sun, Sep 20, 2015 at 6:09 PM Ion Gaztañaga wrote:
Hi,
I'm trying to improve insertion times for flat associative containers. In range insertions one optimization is to insert the whole range at the end of the container and sort the whole vector instead of inserting in a loop.
I could try to use std::sort/std::stable_sort but there are some problems:
-> Non-portable for C++03 compilers and Boost.Move, and Boost.Container fully supports those platforms.
Do you have examples of current users of your library who need this portability? It would help me to understand an actual use case.
So I started analyzing several papers and implementations to see if a custom implementation was possible, but I felt Boost.Sort could be a good place to have that implementation and the Boost community a good place to discuss design alternatives.
That's reasonable.
My initial requirements:
1) Stable sort. A non-stable sort is faster and useful for unique keys containers, but stable sort can be used initially in all flat containers until the non-stable version is implemented.
2) Externally supplied auxiliary memory, with optionally no auxiliary memory at all. It seems that there are newer algorithms like Grailsort (https://github.com/Mrrl/GrailSort) or WikiSort (https://github.com/BonzaiThePenguin/WikiSort) that claim O(1) memory, even no auxiliary buffer, and key comparisons and data exchanges with complexity O(n*log(n)). However, I don't think those implementations correctly handle the basic exception guarantee or non-trivially copyable types.
Reason: I want to avoid allocating short-lived temporary memory: if it's available I want to pass the extra capacity that flat containers' internal vector holds for future insertions.
Please figure out your full set of requirements, state them, and then list algorithms that fulfill them (if any). Spreadsort and other radix-based sorts are capable of being stable if you give them enough RAM and structure the computation properly, and are quite fast when used properly. Would it be ok to have a full copy of pointers to the elements?
3) Customization points (without ADL, which IMHO is not customizable enough). The sort algorithm should use a SortTraits template parameter for basic operations like "move", "swap", "uninitialized_move". E.g.:
template<class T>
struct sort_traits
{
   static? void move(T &from, T &to);
   static? void swap(T &l, T &r);
   static? void uninitialized_move(T &from, void *to);
   static? void destroy(T &from, void *to);
};
I don't know if this trait should be stateful or not (static members). I could imagine the usefulness of a stateful implementation of sort_traits that could store a reference to an allocator so that some special "construct" version is used in some operations. Or just a version that counts the number of each operation to ease debugging. In any case, a stateless version would be enough for most cases.
My inclination is to avoid adding options for users unless they are absolutely necessary, but if you're confident you'll need this flexibility in the future it seems reasonable.
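As an illustration of the stateful variant Ion describes above, a minimal sketch of an operation-counting traits class. The member names follow the quoted interface sketch but are assumptions, not an existing Boost.Sort API, and the destroy signature is simplified to take only the object:

#include <cstddef>
#include <new>
#include <utility>

// Hypothetical counting traits: same operations as the sketched
// sort_traits, but with non-static members so a sort can report how
// many moves/swaps/constructions it performed, which eases debugging.
template <class T>
struct counting_sort_traits
{
    std::size_t moves = 0, swaps = 0, constructions = 0, destructions = 0;

    void move(T& from, T& to)                  { to = std::move(from); ++moves; }
    void swap(T& l, T& r)                      { using std::swap; swap(l, r); ++swaps; }
    void uninitialized_move(T& from, void* to) { ::new (to) T(std::move(from)); ++constructions; }
    void destroy(T& obj)                       { obj.~T(); ++destructions; }
};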
Reason: In Boost.Container, the idea is to use boost::move + boost::adl_move_swap in the initial implementation, maybe some destructive move implementation in a second iteration if Boost.Move implements it. In any case those operations should be customizable by the user, independently from the type to be sorted. This also avoids unnecessary dependencies in Boost.Sort.
-> Dependencies: the implementation should only rely on Boost.Core, Boost.Config, maybe Boost.TypeTraits and no heavy STL dependencies (optionally). Currently Boost.Sort depends on Serialization for BOOST_STATIC_WARNING, but I think this is easily solvable; maybe this static assert utility should go to the core/utility library.
Reason: Sort is a low-level algorithm that should avoid pulling unneeded dependencies on other Boost libraries.
If you're willing to shift the critical dependencies of Boost.Sort into a smaller library so it doesn't include the world, I'd love that.
* * *
I don't know if other Boost libraries could benefit from this new stable_sort implementation. Maybe it's too special and I should try to write it myself in Boost.Container.
I would love to hear comments and suggestions from sort experts; this might be a common need that could enrich the Boost.Sort library.
Have you talked with Francisco? He has a promising stable sort that we
were getting close to adding, but he hasn't fixed his test class yet to eliminate some weirdness in it I've pointed out. Note: I have very little bandwidth until late December to work on anything, as I'm on a remote assignment.
Ion Gaztañaga wrote:
I'm trying to improve insertion times for flat associative containers. In range insertions one optimization is to insert the whole range at the end of the container and sort the whole vector instead of inserting in a loop.
Hi Ion,

We've discussed this before... Why do you (still) want to sort the whole container, rather than just sorting the new items and then using inplace_merge?

The idea of using the allocated-but-unused part of the buffer for temporary storage is an interesting one, which would benefit inplace_merge as well as stable_sort. Have you tried to quantify the benefits?

Cheers, Phil.
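For reference, a minimal sketch of the sort-the-new-items-then-merge approach Phil suggests, on a plain std::vector; the helper name is illustrative only and not from any existing library:

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: append the new range, sort only the appended
// tail, then merge it with the already-sorted prefix in place.
template <class T, class InputIt, class Compare>
void insert_sorted_range(std::vector<T>& v, InputIt first, InputIt last,
                         Compare comp)
{
    const std::ptrdiff_t old_size = static_cast<std::ptrdiff_t>(v.size());
    v.insert(v.end(), first, last);
    std::stable_sort(v.begin() + old_size, v.end(), comp);   // sort new tail only
    std::inplace_merge(v.begin(), v.begin() + old_size, v.end(), comp);
}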
On 21/09/2015 20:53, Phil Endecott wrote:
Ion Gaztañaga wrote:
I'm trying to improve insertion times for flat associative containers. In range insertions one optimization is to insert the whole range at the end of the container and sort the whole vector instead of inserting in a loop.
Hi Ion,
We've discussed this before...
Why do you (still) want to sort the whole container, rather than just sorting the new items and then using inplace_merge?
Well, I actually want to have both merge and sort. And yes, if I understand this correctly, sort + merge would be faster.

In any case the idea is to have both (sort and merge) as adaptable algorithms, so that the extra capacity of the vector can be used to speed up both sort and merge.

For already sorted inputs and multikey containers, when size()+std::distance(begin, end) > capacity(), then a usual O(N) merge can be done in the newly allocated memory. The latest version of flat containers (Boost 1.59) already does this.
The idea of using the allocated-but-unused part of the buffer for temporary storage is an interesting one, which would benefit inplace_merge as well as stable_sort. Have you tried to quantify the benefits?
Not yet, but if known algorithms are used, the difference between inplace_sort|merge and sort|merge is theoretically big. Merge becomes O(N) instead of O(NlogN). Sort becomes O(logN) instead of O(N(logN)^2).

When merging/sorting a range bigger than the extra capacity, steps up to the extra capacity would be O(N) and the last merge step would be N*logN. There are some mergesort implementations that only require N/2 extra memory, which really makes the extra capacity more attractive:

"Fast mergesort implementation based on half-copying merge algorithm" http://kicia.ift.uni.wroc.pl/algorytmy/mergesortpaper.pdf

When growing the vector, the extra capacity is sometimes quite big (depending on the growth factor). I think speedups will be noticeable. Obviously, nothing is real until it's measured.

There are C++ implementations of stable sort algorithms which claim they beat std::stable_sort, only require O(1) extra memory (although I guess it's really O(logN) extra memory if recursion is used), and still guarantee O(N*log(N)) complexity instead of O(N*(logN)^2). Based on these papers:

http://www.math.ntua.gr/~symvonis/publications/c_1994_S_Optimal%20Stable%20M...
https://github.com/BonzaiThePenguin/WikiSort/blob/master/tamc2008.pdf
http://comjnl.oxfordjournals.org/content/35/6/643.full.pdf

In any case those implementations are mainly tested with integers, use memcpy so they work only with PODs, and completely ignore exception safety. But fixing that seems much easier than writing a new algorithm ;-)

Best,

Ion

PS: Extra memory is also currently used by an undocumented function, called "insert_ordered_at", that calculates an array of positions where the new values should be inserted, and then moves already inserted elements only once, which minimizes copies to O(N).
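A minimal sketch of the half-copying merge idea from the paper cited above, assuming a caller-supplied buffer large enough for the left half (for example, the vector's spare capacity); this is an illustration of the technique, not the paper's code:

#include <algorithm>
#include <utility>

// Merge the sorted halves [first, mid) and [mid, last) using an external
// buffer that only needs to hold the left half (N/2 elements in the
// balanced case). Stable: on equivalent keys the left-half element wins.
template <class It, class Buf, class Compare>
void half_copying_merge(It first, It mid, It last, Buf buf, Compare comp)
{
    Buf buf_last = std::move(first, mid, buf);   // move left half out of the way
    It  out   = first;
    It  right = mid;
    Buf left  = buf;
    while (left != buf_last && right != last) {
        if (comp(*right, *left))                 // strictly smaller: take right
            *out++ = std::move(*right++);
        else                                     // ties go to the left half
            *out++ = std::move(*left++);
    }
    std::move(left, buf_last, out);              // flush what remains of the left half
    // whatever remains of the right half is already in its final position
}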
Ion Gaztañaga wrote:
On 21/09/2015 20:53, Phil Endecott wrote:
Why do you (still) want to sort the whole container, rather than just sorting the new items and then using inplace_merge?
Well, I actually want to have both merge and sort. And yes, if I understand this correctly, sort + merge would be faster.
In any case the idea is to have both (sort and merge) as adaptable algorithms, so that the extra capacity of the vector can be used to speed up both sort and merge.
Have you considered making this general-purpose by adding a feature to vector that exports an allocator, or a memory_resource, that could be used by any code that needs a temporary buffer while it can guarantee not to cause the vector to expand? You could then modify inplace_merge, stable_sort etc. to take such an allocator to use instead of get_temporary_buffer. And you could choose whether or not to use a fallback, i.e. get_temporary_buffer, if the spare space in the vector is insufficient.

(I'd want to see numbers convincing me of the practical usefulness of this, though; I've worked on plenty of memory-constrained projects and I've never been desperate enough to resort to this sort of thing!)
For already sorted inputs and multikey containers, when size()+std::distance(begin, end) > capacity(),
I think you mean < capacity(), right?
then a usual O(N) merge can be done in the newly allocated memory. The latest version of flat containers (Boost 1.59) already does this.
The idea of using the allocated-but-unused part of the buffer for temporary storage is an interesting one, which would benefit inplace_merge as well as stable_sort. Have you tried to quantify the benefits?
Not yet, but if known algorithms are used, the difference between inplace_sort|merge and sort|merge is theoretically big. Merge becomes O(N) instead of O(NlogN). Sort becomes O(logN) instead of O(N(logN)^2).
I think you're misusing the term "inplace" in that paragraph. I think you mean "with no additional memory". "inplace" just means that the result ends up back where the input was. All std:: sorts are inplace. std::sort is never O(N(logN)^2) (or O(logN), but I presume that's a typo!). std::stable_sort is allowed to be, but see below.
When merging/sorting a range bigger than the extra capacity, steps up to the extra capacity would be O(N) and the last merge step would be N*logN. There are some mergesort implementations that only require N/2 extra memory, which really makes the extra capacity more attractive:
"Fast mergesort implementation based on half-copying merge algorithm" http://kicia.ift.uni.wroc.pl/algorytmy/mergesortpaper.pdf.
When growing the vector, the extra capacity is sometimes quite big (depending on the growth factor). I think speedups will be noticeable. Obviously, nothing is real until it's measured.
There are C++ implementations of stable sort algorithms which claim they beat std::stable_sort, only require O(1) extra memory (although I guess it's really O(logN) extra memory if recursion is used), and still guarantee O(N*log(N)) complexity instead of O(N*(logN)^2). Based on these papers:
http://www.math.ntua.gr/~symvonis/publications/c_1994_S_Optimal%20Stable%20M...
https://github.com/BonzaiThePenguin/WikiSort/blob/master/tamc2008.pdf
Right, stable sorts that are "almost" O(N log N) are known; they work with blocks of size sqrt(N). I say "almost" because they have problems when more than sqrt(N) items are equal to each other. I'm not sure if that is solved. But these algorithms tend to have a constant factor of maybe 5 or 7 compared to unstable sorts or sorts that use extra memory, and I don't think I've ever seen one used anywhere. I guess it's because of this that std::stable_sort is not required to be O(N log N).
In any case those implementations are mainly tested with integers, use memcpy so they work only with PODs, and completely ignore exception safety. But fixing that seems much easier than writing a new algorithm ;-)
Yes, I don't think there should be any fundamental issues with making any of these sqrt(N) block-based algorithms more "C++". Cheers, Phil.
On 22/09/2015 18:41, Phil Endecott wrote:
Have you considered making this general-purpose by adding a feature to vector that exports an allocator, or a memory_resource, that could be used by any code that needs a temporary buffer while it can guarantee not to cause the vector to expand? You could then modify inplace_merge, stable_sort etc. to take such an allocator to use instead of get_temporary_buffer. And you could choose whether or not to use a fallback, i.e. get_temporary_buffer, if the spare space in the vector is insufficient.
Interesting. Could you please elaborate the basic ideas of this approach?

* * *

I think there are some "guarantees" or "expected behaviour" when inserting into a flat container:

-> The standard insertion function, IMHO, should guarantee that if capacity() - size() is bigger than the range to be inserted, the vector won't be reallocated (after all, capacity() should mean something ;-)).

-> About using temporary storage (except the stack) in a method of a container: I haven't seen it in any STL implementation. Even std::list::sort, which maybe could be optimized using temporary storage offering random-access iterators, is always implemented using temporary lists and merging them. Other flat container implementations like Folly or Loki don't use any temporary storage.
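A tiny usage example of the first expectation, assuming boost::container::flat_set: after the reserve, the spare capacity covers the whole range and the insertion should not reallocate.

#include <boost/container/flat_set.hpp>

int main()
{
    boost::container::flat_set<int> s;
    int data[] = {5, 1, 9, 1, 7};

    // Reserve enough room up front, so capacity() - size() covers the
    // whole range and the range insertion must not reallocate.
    s.reserve(sizeof(data) / sizeof(data[0]));
    s.insert(data, data + sizeof(data) / sizeof(data[0]));
    return 0;
}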
(I'd want to see numbers convincing me of the practical usefulness of this, though; I've worked on plenty of memory-constrained projects and I've never been desperate enough to resort to this sort of thing!)
It's a bit strange, I know. According to the standard, a container should allocate all memory from the allocator; I don't know if that should include temporary storage, and I can't find anything explicit in the standard. The issue is that we have no STL container that does something similar, so it's hard to know what a user should expect. In any case, allocating temporary storage from the allocator or new/malloc in every range insertion sounds like a very good method to fragment memory.
For already sorted inputs and multikey containers, when size()+std::distance(begin, end) > capacity(),
I think you mean < capacity(), right?
No, sorry. I don't think I explained it very well. When the current remaining capacity (capacity() - size()) is less than the size of the range to be inserted (that is, the vector has to be reallocated because the new elements don't fit in the capacity), and the input is sorted (ordered_range_t overloads), then a new buffer is allocated and std::merge() can be used to achieve an O(N) range insertion.
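A minimal sketch of that O(N) path on a plain std::vector; Boost.Container's actual ordered_range_t code path differs, and forward iterators are assumed here so the input can be measured before allocating:

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// When the sorted input range does not fit in the spare capacity,
// allocate the final buffer once and merge old + new into it: O(N).
template <class T, class FwdIt, class Compare>
void insert_ordered_range_realloc(std::vector<T>& v,
                                  FwdIt first, FwdIt last, Compare comp)
{
    std::vector<T> merged;
    merged.reserve(v.size() + static_cast<std::size_t>(std::distance(first, last)));
    std::merge(std::make_move_iterator(v.begin()),
               std::make_move_iterator(v.end()),
               first, last,
               std::back_inserter(merged), comp);
    v.swap(merged);   // the freshly allocated buffer becomes the container's storage
}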
Not yet, but if known algorithms are used, the difference between inplace_sort|merge and sort|merge is theoretically big. Merge becomes O(N) instead of O(NlogN). Sort becomes O(logN) instead of O(N(logN)^2).
I think you're misusing the term "inplace" in that paragraph. I think you mean "with no additional memory". "inplace" just means that the result ends up back where the input was. All std:: sorts are inplace.
Yes, you are right, let's call them "no additional memory" methods.
std::sort is never O(N(logN)^2) (or O(logN), but I presume that's a typo!). std::stable_sort is allowed to be, but see below.
Sorry if I didn't explain it clearly: I'm always talking about a stable sort; I think it's the only alternative for flat containers. Even with flat_set/map (unique keys), only duplicates that are inserted after the first occurrence of a key shall be deleted, as would happen if a one-by-one loop were used to insert the range.
Right, stable sorts that are "almost" O(N log N) are known; they work with blocks of size sqrt(N). I say "almost" because they have problems when more than sqrt(N) items are equal to each other. I'm not sure if that is solved. But these algorithms tend to have a constant factor of maybe 5 or 7 compared to unstable sorts or sorts that use extra memory, and I don't think I've ever seen one used anywhere. I guess it's because of this that std::stable_sort is not required to be O(N log N).
Thanks. It seems that some implementations have improved that constant factor. At least some benchmarks against std::stable_sort in the following links tell us that they might be quite competitive:

https://github.com/Mrrl/GrailSort
https://github.com/BonzaiThePenguin/WikiSort

According to "Fast Stable Merging and Sorting in Constant Extra Space" (http://comjnl.oxfordjournals.org/content/35/6/643.full.pdf), chapter "5.3 Constant of Proportionality Bounds":

"We conclude that we are guaranteed a worst-case key-comparison and record-exchange grand total not greater than 2.5N*log2N"

"This worst-case total compares favourably with average-case key-comparison and record-exchange totals for popular unstable methods: quick-sort's average-case figure is a little more than 1.4Nlog2N; heap-sort's is about 2.3Nlog2N. (These values are derived from the analysis in Ref. 14, where we count a single record movement at one third the cost of a two-record exchange.)"

Not sure if those algorithms are properly tested or benchmarked.

Best,

Ion
On Wed, Sep 23, 2015 at 12:29 PM, Ion Gaztañaga wrote:
On 22/09/2015 18:41, Phil Endecott wrote:
According to the standard, a container should allocate all memory from the allocator; I don't know if that should include temporary storage, and I can't find anything explicit in the standard.
The issue is that we have no STL container that does something similar, so it's hard to know what a user should expect. In any case, allocating temporary storage from the allocator or new/malloc in every range insertion sounds like a very good method to fragment memory.
I think it is fair to expect that any memory allocations made by a container are done through the allocator, whether these are for temporary use or not (well, except for exceptions and strings in them). This expectation is further reinforced with the addition of memory resource propagation, which allows the same allocator to be used for the contained elements as well.
On 23/09/2015 12:59, Andrey Semashev wrote:
I think it is fair to expect that any memory allocations made by a container are done through the allocator, whether these are for temporary use or not (well, except for exceptions and strings in them). This expectation is further reinforced with the addition of memory resource propagation, which allows the same allocator to be used for the contained elements as well.
This sounds reasonable. So in your opinion, get_temporary_storage() should be avoided? Ion
On Wed, Sep 23, 2015 at 2:36 PM, Ion Gaztañaga wrote:
On 23/09/2015 12:59, Andrey Semashev wrote:
I think it is fair to expect that any memory allocations made by a container are done through the allocator, whether these are for temporary use or not (well, except for exceptions and strings in them). This expectation is further reinforced with the addition of memory resource propagation, which allows the same allocator to be used for the contained elements as well.
This sounds reasonable. So in your opinion, get_temporary_storage() should be avoided?
Yes, I think so. Yet another reason to always rely on the allocator is that the user can supply his own allocator with a special error reporting method (e.g. by throwing an application-specific exception instead of bad_alloc). The application may be expecting that the container will always report memory allocation errors in this special way.
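A minimal RAII sketch of that idea: temporary storage obtained through the container's own allocator via std::allocator_traits, so allocation failures surface through the user's allocator. The class name is hypothetical, not Boost.Container code:

#include <cstddef>
#include <memory>

// Temporary buffer whose raw memory comes from the container's allocator,
// so custom error reporting and memory resources are honoured.
template <class Allocator>
class scoped_temporary_buffer
{
    using traits  = std::allocator_traits<Allocator>;
    using pointer = typename traits::pointer;

public:
    scoped_temporary_buffer(Allocator& a, std::size_t n)
        : alloc_(a), p_(traits::allocate(a, n)), n_(n) {}

    ~scoped_temporary_buffer() { traits::deallocate(alloc_, p_, n_); }

    pointer data() const { return p_; }
    std::size_t size() const { return n_; }

private:
    Allocator&  alloc_;
    pointer     p_;
    std::size_t n_;
};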
Ion Gaztañaga wrote:
On 22/09/2015 18:41, Phil Endecott wrote:
Have you considered making this general-purpose by adding a feature to vector that exports an allocator, or a memory_resource, that could be used by any code that needs a temporary buffer while it can guarantee not to cause the vector to expand? You could then modify inplace_merge, stable_sort etc. to take such an allocator to use instead of get_temporary_buffer. And you could choose whether or not to use a fallback, i.e. get_temporary_buffer, if the spare space in the vector is insufficient.
Interesting. Could you please elaborate the basic ideas of this approach?
Well there's nothing magic, but using polymorphic memory resources it would be something like this; you could do something similar with Allocators:

// Add features to vector to get access to its beyond-the-end
// storage:
class vector_unused_storage: public memory_resource { ... };
vector_unused_storage vector::get_unused_storage() const;

// Add a version of inplace_merge that takes a memory resource
// to use instead of get_temporary_buffer:
template< class BidirIt >
void inplace_merge( BidirIt first, BidirIt middle, BidirIt last,
                    memory_resource& mr );

// A vector containing two ordered subranges:
vector<int> v;
v.push_back(42);
v.push_back(123);
v.push_back(99);
v.push_back(100);

// Merge the two subranges, using the vector's own unused beyond-
// the-end storage for any temporary that is required:
inplace_merge(v.begin(), v.begin()+2, v.end(), v.get_unused_storage());

I'll just say again - I think this is an interesting thought experiment, but I'm not at all convinced that it's actually useful in practice!
I think there are some "guarantees" or "expected behaviour" when inserting into a flat container:
-> The standard insertion function, IMHO, should guarantee that if capacity() - size() is bigger than the range to be inserted, the vector won't be reallocated (after all, capacity() should mean something ;-)).
Yes, certainly.
-> About using temporary storage (except the stack) in a method of a container: I haven't seen it in any STL implementation.
Probably true. But there are some std algorithms that do use temporary storage, and I don't see any particular reason why methods of containers should have different requirements than non-member algorithms. Perhaps the committee had some rationale for not providing an allocator interface to algorithms that need temporary storage, and for the existence of get_temporary_buffer.
Even std::list::sort, which maybe could be optimized using temporary storage offering random-access iterators, is always implemented using temporary lists and merging them.
It can't have its big-O complexity improved though.
allocating temporary storage from the allocator or new/malloc in every range insertion sounds like a very good method to fragment memory.
Shrug. As is often the case, there is a tradeoff here:

- Insert in the naive way and you get O(N^2) execution time.
- Insert in the smart way and you get O(N log N) execution time, but maybe you suffer some memory fragmentation issue.
For already sorted inputs and multikey containers, when size()+std::distance(begin, end) > capacity(),
I think you mean < capacity(), right?
No, sorry. I don't think I explained it very well. When the current remaining capacity (capacity() - size()) is less than the size of the range to be inserted (that is, the vector has to be reallocated because the new elements don't fit in the capacity), and the input is sorted (ordered_range_t overloads), then a new buffer is allocated and std::merge() can be used to achieve an O(N) range insertion.
Ah, OK, I see. Yes that makes sense.
Not yet, but if known algorithms are used, the difference between inplace_sort|merge and sort|merge is theoretically big. Merge becomes O(N) instead of O(NlogN). Sort becomes O(logN) instead of O(N(logN)^2).
I think you're misusing the term "inplace" in that paragraph. I think you mean "with no additional memory". "inplace" just means that the result ends up back where the input was. All std:: sorts are inplace.
Yes, you are right, let's call them "no additional memory" methods.
std::sort is never O(N(logN)^2) (or O(logN), but I presume that's a typo!). std::stable_sort is allowed to be, but see below.
Sorry if I didn't explain it clearly: I'm always talking about a stable sort; I think it's the only alternative for flat containers.
Well that's worth thinking about for a moment. You know that keys are unique in the "old" part of the container. Can that be exploited in some useful way? Hmm, maybe not.
Right, stable sorts that are "almost" O(N log N) are known; they work with blocks of size sqrt(N). I say "almost" because they have problems when more than sqrt(N) items are equal to each other. I'm not sure if that is solved. But these algorithms tend to have a constant factor of maybe 5 or 7 compared to unstable sorts or sorts that use extra memory, and I don't think I've ever seen one used anywhere. I guess it's because of this that std::stable_sort is not required to be O(N log N).
Thanks. It seems that some implementations have improved that constant factor. At least some benchmarks against std::stable_sort in the following links tell us that they might be quite competitive:
https://github.com/Mrrl/GrailSort https://github.com/BonzaiThePenguin/WikiSort
According to "Fast Stable Merging and Sorting in Constant Extra Space'" (http://comjnl.oxfordjournals.org/content/35/6/643.full.pdf), chapter "5.3 Constant of Proportionality Bounds":
"We conclude that we are guaranteed a worst-case key-comparison and record-exchange grand total not greater than 2.5N*log2N"
"This worst-case total compares favourably with average-case key-comparison and record-exchange totals for popular unstable methods: quick-sort's average-case figure is a little more than 1.4Nlog2N; heap-sort's is about 2.3Nlog2N. (These values are derived from the analysis in Ref. 14, where we count a single record movement at one third the cost of a two-record exchange.)
Well that's great then. Someone want to implement it?! Regards, Phil.
participants (4)

- Andrey Semashev
- Ion Gaztañaga
- Phil Endecott
- Steven Ross