[flat_map] Any interest in a flat_map variant that is much faster on insertion/erase at the price of a slower lookup?
Hi

My library is not ready for a formal review process, and I'd like to ask whether there is some potential interest first, thus saving everybody a lot of time.

I have been working on a flat_map (http://www.boost.org/doc/libs/1_57_0/doc/html/boost/container/flat_map.html) variant that keeps some of the distinctive features of the original one (contiguous storage/cache friendliness, zero memory overhead/shrink_to_fit support, vector-based representation, etc.) but makes insertion and erasure of elements much faster, at the price of slower lookup and in-order traversal.

Explaining how it works takes time and isn't really necessary if there is no interest in the first place (e.g. "thanks, but we have no use for this"). Let's get straight to the benchmarks and to some conclusions.

#elems: 50000
std::map:
  Random insertion: 18.605 ms  Random find: 24.382 ms  Iteration: 5.658 ms  Random erase: 37.166 ms
experimental::flat_map:
  Random insertion: 400.141 ms  Random find: 12.352 ms  Iteration: 7.099 ms  Random erase: 467.051 ms
boost::container::flat_map:
  Random insertion: 611.678 ms  Random find: 6.293 ms  Iteration: 0.038 ms  Random erase: 615.311 ms

#elems: 200000
std::map:
  Random insertion: 146.394 ms  Random find: 163.506 ms  Iteration: 21.973 ms  Random erase: 225.314 ms
experimental::flat_map:
  Random insertion: 5068.93 ms  Random find: 65.618 ms  Iteration: 34.244 ms  Random erase: 6851.09 ms
boost::container::flat_map:
  Random insertion: 10696.9 ms  Random find: 43.896 ms  Iteration: 0.254 ms  Random erase: 10604 ms

Of course one doesn't expect random insertion to perform comparably to a tree-based map like std::map. However, my experimental::flat_map provides a better trade-off than Boost's flat_map for me, as random insertion is massively faster, and in the big picture of the application I am developing this is making a hell of a difference.

Am I making any sense? Shall I polish the implementation, publish it, and submit a proposal? I'd really appreciate your feedback.

Kind Regards
Giacomo
On 13/03/2015 at 8:51, Giacomo Drago wrote:
Hi
My library is not ready for a formal review process, and I'd like to ask whether there is some potential interest first, thus saving everybody a lot of time.
I have been working on a flat_map (http://www.boost.org/doc/libs/1_57_0/doc/html/boost/container/flat_map.html) variant that keeps some of the distinctive features of the original one (contiguous storage/cache friendliness, zero memory overhead/shrink_to_fit support, vector-based representation, etc...) but makes insertion and erasure of the elements much faster, at the price of a slower lookup and in-order traversal.
Just a question. Are your insertions one by one or by range? Best, Ion
On Fri, Mar 13, 2015 at 9:22 AM, Ion Gaztañaga
On 13/03/2015 at 8:51, Giacomo Drago wrote:
My library is not ready for a formal review process, and I'd like to ask whether there is some potential interest first, thus saving everybody a lot of time.
I have been working on a flat_map (http://www.boost.org/doc/libs/1_57_0/doc/html/boost/ container/flat_map.html) variant that keeps some of the distinctive features of the original one (contiguous storage/cache friendliness, zero memory overhead/shrink_to_fit support, vector-based representation, etc...) but makes insertion and erasure of the elements much faster, at the price of a slower lookup and in-order traversal.
Just a question. Are your insertions one by one or by range?
Might also be interesting to know if your perf numbers are based on a primitive key or a UDT, and whether the relative performance changes depending on key/value sizes. --DD
On 2015-03-13 08:27, Dominique Devienne wrote:
On Fri, Mar 13, 2015 at 9:22 AM, Ion Gaztañaga
wrote: Just a question. Are your insertions one by one or by range?
Might also be interesting to know if your perf numbers are based on a primitive key or a UDT, and whether the relative performance changes depending on key/value sizes. --DD
Insertions are one by one, and they are random with respect to the order of the keys. The same applies to lookups and erasures. No cheating.

I will try with different keys that are not PODs and have non-trivial comparators, but I expect the benchmarks to reflect my previous findings. The bottom line is that it makes fewer comparisons and swaps when inserting and removing, and more comparisons when looking up. Elements are not *quite* sorted into the storage vector, of course.

The main use case I see for this data structure - and I am already using it, though not in a production environment - is when you need a very space-efficient, sorted associative container that is not so painfully slow when it comes to inserting/erasing elements.

I'll send more benchmarks. Any interest in the general design and the source code (not "boostified" yet)?

Best
Giacomo
"Hi
My library is not ready for a formal review process, and I'd like to
ask whether there is some potential interest first, thus saving
everybody a lot of time.
I have been working on a flat_map
(http://www.boost.org/doc/libs/1_57_0/doc/html/boost/container/flat_map.html)
variant that keeps some of the distinctive features of the original
one (contiguous storage/cache friendliness, zero memory
overhead/shrink_to_fit support, vector-based representation, etc...)
but makes insertion and erasure of the elements much faster, at the
price of a slower lookup and in-order traversal."
Would it be easy for you to provide benchmarks of std::map with
boost::container::node_allocator available in latest boost release?
According to my own benchmarks it can improve performance a lot,
mainly when memory is likely to be fragmented.
Regards,
Marcelo
2015-03-13 6:11 GMT-03:00 Giacomo Drago
On 2015-03-13 08:27, Dominique Devienne wrote:
On Fri, Mar 13, 2015 at 9:22 AM, Ion Gaztañaga
wrote: Just a question. Are your insertions one by one or by range?
Might also be interesting to know if your perf numbers are based on a primitive key or a UDT, and whether the relative performance changes depending on key/value sizes. --DD
Insertions are one by one, and they are random with respect to the order of the keys. Same applies for lookups and erasures. No cheating.
I will try with different keys that are not PODs and have non-trivial comparators, but I expect the benchmarks to reflect my previous findings. The bottom line is that it makes fewer comparisons and swaps when inserting and removing, and more comparisons when looking up. Elements are not *quite* sorted into the storage vector, of course.
The main use case I see for this data structure - and I am already using it even though not in a production environment - is that you may need a very space-efficient associative, sorted container that is not so painfully slow when it comes to insert/erase elements.
I'll send more benchmarks. Any interest in the general design and the source code (not "boostified" yet)?
Best Giacomo
On 2015-03-13 13:00, Marcelo Zimbres wrote:
"Hi Would it be easy for you to provide benchmarks of std::map with boost::container::node_allocator available in latest boost release? According to my own benchmarks it can improve performance a lot, mainly when memory is likely to be fragmented.
Of course, I will do it ASAP. Best Giacomo
On 2015-03-13 15:55, Giacomo Drago wrote:
On 2015-03-13 13:00, Marcelo Zimbres wrote:
"Hi Would it be easy for you to provide benchmarks of std::map with boost::container::node_allocator available in latest boost release? According to my own benchmarks it can improve performance a lot, mainly when memory is likely to be fragmented.
Of course, I will do it ASAP.
Hi Marcelo,

I tried to run some benchmarks with boost::container::node_allocator. I found few or no examples of how to use it. I tried to provide it as the fourth template parameter of std::map and got weird compiler errors, so I fell back to boost::container::map.

This is what I get with integer keys:

Size: 50000
std::map:
  Random insertion: 16.916 ms  Random find: 24.702 ms  Iteration: 7.528 ms  Random erase: 40.328 ms
boost::container::map<..., boost::container::node_allocator<...>>:
  Random insertion: 13.085 ms  Random find: 11.189 ms  Iteration: 5.301 ms  Random erase: 34.896 ms
experimental::flat_map:
  Random insertion: 347.988 ms  Random find: 11.603 ms  Iteration: 7.723 ms  Random erase: 494.447 ms
boost::container::flat_map:
  Random insertion: 644.319 ms  Random find: 5.481 ms  Iteration: 0.106 ms  Random erase: 624.03 ms

Size: 200000
std::map:
  Random insertion: 120.064 ms  Random find: 142.521 ms  Iteration: 22.736 ms  Random erase: 212.725 ms
boost::container::map<..., boost::container::node_allocator<...>>:
  Random insertion: 106.19 ms  Random find: 86.353 ms  Iteration: 19.375 ms  Random erase: 169.475 ms
experimental::flat_map:
  Random insertion: 5207.11 ms  Random find: 72.71 ms  Iteration: 36.488 ms  Random erase: 6861.92 ms
boost::container::flat_map:
  Random insertion: 10995 ms  Random find: 50.254 ms  Iteration: 0.317 ms  Random erase: 10924.4 ms

This is what I get with a 4-integer struct as the key:

Size: 50000
std::map:
  Random insertion: 15.269 ms  Random find: 19.85 ms  Iteration: 7.1 ms  Random erase: 45.85 ms
boost::container::map<..., boost::container::node_allocator<...>>:
  Random insertion: 18.278 ms  Random find: 31.747 ms  Iteration: 4.911 ms  Random erase: 33.039 ms
experimental::flat_map:
  Random insertion: 551.777 ms  Random find: 12.955 ms  Iteration: 7.921 ms  Random erase: 688.911 ms
boost::container::flat_map:
  Random insertion: 1040.7 ms  Random find: 7.71 ms  Iteration: 0.161 ms  Random erase: 870.031 ms

Size: 200000
std::map:
  Random insertion: 128.527 ms  Random find: 137.222 ms  Iteration: 21.572 ms  Random erase: 206.251 ms
boost::container::map<..., boost::container::node_allocator<...>>:
  Random insertion: 124.658 ms  Random find: 152.009 ms  Iteration: 23.368 ms  Random erase: 193.538 ms
experimental::flat_map:
  Random insertion: 7969.37 ms  Random find: 92.236 ms  Iteration: 41.396 ms  Random erase: 10724.9 ms
boost::container::flat_map:
  Random insertion: 19883.7 ms  Random find: 57.496 ms  Iteration: 0.459 ms  Random erase: 17759.1 ms

The size of the keys seems to make quite a difference. And of course, one has to consider the overhead of the pointers in a linked structure, however the allocations are performed.

What are your thoughts?

Giacomo
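(Not from the thread - just a sketch of how I'd expect the node_allocator variant to be declared, assuming the <boost/container/node_allocator.hpp> header from Boost.Container; the loop is only a placeholder workload, not the benchmark driver:)

    #include <boost/container/map.hpp>
    #include <boost/container/node_allocator.hpp>
    #include <functional>
    #include <utility>

    using Key   = int;
    using Value = int;

    // boost::container::map with Boost.Container's pooling node allocator.
    using node_map = boost::container::map<
        Key, Value, std::less<Key>,
        boost::container::node_allocator<std::pair<const Key, Value>>>;

    int main()
    {
        node_map m;
        for (int i = 0; i < 50000; ++i)
            m[i] = 2 * i;   // placeholder workload
    }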
On 13/03/2015 at 10:11, Giacomo Drago wrote:
Insertions are one by one, and they are random with respect to the order of the keys. Same applies for lookups and erasures. No cheating.
I didn't want to suggest any cheating, just trying to understand the scenario. If insertions are by range, then flat_map can be optimized. If insertions are one by one, it's more difficult, unless there are two phases: first insertion (without lookups between insertions), and after that only searches. That scenario could also be optimized. Best, Ion
On 2015-03-13 13:37, Ion Gaztañaga wrote:
On 13/03/2015 at 10:11, Giacomo Drago wrote:
Insertions are one by one, and they are random with respect to the order of the keys. Same applies for lookups and erasures. No cheating.
I didn't want to suggest any cheating, just trying to understand the scenario. If insertions are by range, then flat_map can be optimized. If insertions are one by one, it's more difficult, unless there are two phases, first insertion (without lookups between insertions) and after that, only searches. That scenario could be also optimized.
They are one by one, and they can interleave in any possible way without affecting performance. I didn't mean to imply you were suggesting any cheating, but I know that benchmarks can be crafted to make things look better than they are - and you can do this without even knowing. Best Giacomo
On 2015-03-13 08:27, Dominique Devienne wrote:
On Fri, Mar 13, 2015 at 9:22 AM, Ion Gaztañaga
wrote: Might also be interesting to know if your perf numbers are based on a primitive key or a UDT, and whether the relative performance changes depending on key/value sizes. --DD
On 2015-03-13 09:11, Giacomo Drago wrote:
On 2015-03-13 08:27, Dominique Devienne wrote:
On Fri, Mar 13, 2015 at 9:22 AM, Ion Gaztañaga
wrote: Just a question. Are your insertions one by one or by range?
Might also be interesting to know if your perf numbers are based on a primitive key or a UDT, and whether the relative performance changes depending on key/value sizes. --DD
This is what happens with a key made of four integers rather than just an integer.

Size: 50000
std::map:
  Random insertion: 19.881 ms  Random find: 19.87 ms  Iteration: 6.241 ms  Random erase: 42.872 ms
experimental::flat_map:
  Random insertion: 509.001 ms  Random find: 14.922 ms  Iteration: 9.036 ms  Random erase: 727.366 ms
boost::container::flat_map:
  Random insertion: 1049.87 ms  Random find: 7.778 ms  Iteration: 0.082 ms  Random erase: 885.753 ms

Size: 200000
std::map:
  Random insertion: 124.174 ms  Random find: 133.111 ms  Iteration: 21.915 ms  Random erase: 253.892 ms
experimental::flat_map:
  Random insertion: 8181.72 ms  Random find: 96.38 ms  Iteration: 45.847 ms  Random erase: 10951.3 ms
boost::container::flat_map:
  Random insertion: 20271.5 ms  Random find: 53.225 ms  Iteration: 0.455 ms  Random erase: 17338.5 ms

Best
Giacomo
Hi Giacomo, Giacomo Drago wrote:
I have been working on a flat_map (http://www.boost.org/doc/libs/1_57_0/doc/html/boost/container/flat_map.html) variant that keeps some of the distinctive features of the original one (contiguous storage/cache friendliness, zero memory overhead/shrink_to_fit support, vector-based representation, etc...) but makes insertion and erasure of the elements much faster, at the price of a slower lookup and in-order traversal.
Explaining how it works requires time and it's not really necessary
No, "how it works" is the first thing I want to know! It's unlikely that you've invented something completely new...
Let's get straight to the benchmarks
Rather than quantified benchmarks, to start with I'd like to know the big-O complexity of the operations. (Personally, I have a flat_map that does batched insertions i.e. the new items are appended individually and then merged at the end of the batch. This will have significant benefits for some workloads.) Regards, Phil.
On 13/03/2015 at 14:38, Phil Endecott wrote:
(Personally, I have a flat_map that does batched insertions i.e. the new items are appended individually and then merged at the end of the batch. This will have significant benefits for some workloads.)
Yes, this should be added to flat_map. The interface won't be very pretty, but I would like to hear suggestions in this direction. Best, Ion
Ion Gaztañaga wrote:
On 13/03/2015 at 14:38, Phil Endecott wrote:
(Personally, I have a flat_map that does batched insertions i.e. the new items are appended individually and then merged at the end of the batch. This will have significant benefits for some workloads.)
Yes, this should be added to flat_map. The interface won't be very pretty, but I would like to hear suggestions in this direction.
It doesn't have to have an interface, at least in principle. flat_map can just append the new elements to the end, keeping them sorted, until it hits a threshold, then merge. Lookups will first look at the new element range (because those are likely to be in cache) and, if not found, at the old element range. You could add a "please merge now" hint, but it won't be required.

Actually, the new element range doesn't even need to be kept sorted if the merge threshold is low, although on the other hand this runs into op< vs op== issues, so maybe not.

Of course, when the underlying vector runs out of capacity, a non-inplace merge will be performed after reallocation.

Upon closer look at the subject line, did I just reinvent a wheel? :-)
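(To make the scheme concrete, here is a rough sketch of the idea as I read it - hypothetical names, an arbitrary threshold of 64, duplicate keys and exception safety ignored; this is not an existing Boost API:)

    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    template <class Key, class T>
    class buffered_flat_map   // hypothetical name, illustration only
    {
        std::vector<std::pair<Key, T>> v_;  // [0, merged_) = merged sorted body, [merged_, end) = sorted tail
        std::size_t merged_ = 0;

        static bool key_less(const std::pair<Key, T>& a, const std::pair<Key, T>& b)
        { return a.first < b.first; }

    public:
        void insert(Key k, T t)   // ignores duplicate keys for brevity
        {
            std::pair<Key, T> p(std::move(k), std::move(t));
            auto pos = std::upper_bound(v_.begin() + merged_, v_.end(), p, key_less);
            v_.insert(pos, std::move(p));
            if (v_.size() - merged_ > 64)   // arbitrary merge threshold
            {
                std::inplace_merge(v_.begin(), v_.begin() + merged_, v_.end(), key_less);
                merged_ = v_.size();
            }
        }

        const T* find(const Key& k) const
        {
            auto lookup = [&](auto first, auto last) -> const T*
            {
                auto it = std::lower_bound(first, last, k,
                    [](const std::pair<Key, T>& e, const Key& key) { return e.first < key; });
                return (it != last && !(k < it->first)) ? &it->second : nullptr;
            };
            // Recently inserted tail first (more likely in cache), then the merged body.
            if (const T* p = lookup(v_.begin() + merged_, v_.end())) return p;
            return lookup(v_.begin(), v_.begin() + merged_);
        }
    };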
Ion Gaztañaga wrote:
On 13/03/2015 at 14:38, Phil Endecott wrote:
(Personally, I have a flat_map that does batched insertions i.e. the new items are appended individually and then merged at the end of the batch.
On Fri, 13 Mar 2015 at 11:04 Peter Dimov
wrote: It doesn't have to have an interface, at least in principle. flat_map can just append the new elements to the end, keeping them sorted, until it hits a threshold, then merge. Lookups will first look at the new element range (because those are likely to be in cache), if not found, at the old element range.
It's possible to accomplish this today, but you need to do the buffering yourself outside of the flat_map using the ordered_unique_range_t insert method. http://www.boost.org/doc/libs/1_57_0/doc/html/boost/container/flat_map.html#... I use this method a lot where I figure out what I am going to be inserting before I do the insert, then I insert in one step. Of course, this assumes you don't need to know what's in the map while you're also figuring out what to put in the map. -- chris
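(A minimal sketch of the pattern Chris describes, using the ordered_unique_range tag that later messages in the thread also show; the batch container and values are made up for illustration:)

    #include <algorithm>
    #include <boost/container/flat_map.hpp>
    #include <utility>
    #include <vector>

    int main()
    {
        boost::container::flat_map<int, int> m;

        // Figure out the whole batch first, outside the flat_map...
        std::vector<std::pair<int, int>> batch = {{3, 30}, {1, 10}, {2, 20}};
        std::sort(batch.begin(), batch.end());   // the tag below requires a sorted, unique range

        // ...then insert it in one step.
        m.insert(boost::container::ordered_unique_range, batch.begin(), batch.end());
        return m.size() == 3 ? 0 : 1;
    }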
On 13/03/2015 at 16:04, Peter Dimov wrote:
It doesn't have to have an interface, at least in principle. flat_map can just append the new elements to the end, keeping them sorted, until it hits a threshold, then merge. Lookups will first look at the new element range (because those are likely to be in cache), if not found, at the old element range.
That would complicate the container and break some invariants, like "equal_range". flat_map is designed to cover a scenario with more searches than insertions; maybe another container could be designed for a different scenario.
Actually, the new element range doesn't even need to be kept sorted, if the merge threshold is low, although on the other hand this runs into op< vs op== issues, so maybe not.
You've touched an interesting point. Sort+merging could improve flat_map performance in some other scenarios. The option to do a merge could be added to a range insertion; the current implementation has poor performance due to excessive data movement. In this scenario we could back-insert, sort and merge. The problems are:

-> It requires additional memory (a stable_sort is needed for flat_multimap) to achieve O(N·log(N)); inplace_merge requires additional memory to obtain O(N). A user might not want to pay for this (you might need 2x temporary memory). One option is to use the remaining capacity as the additional memory by default, and optionally add an insertion option that allocates temporary storage to achieve maximum performance.

-> We need to reimplement std::sort/std::stable_sort/std::inplace_merge in a Boost.Move-compatible way (using boost::adl_move_swap). It should allow customizing the temporary storage source. Maybe sorting experts would like to implement it in Boost.Move.
Of course, when the underlying vector runs out of capacity, a non-inplace merge will be performed after reallocation.
Upon closer look on the subject line, did I just reinvent a wheel? :-)
Maybe, but everyone should reinvent the wheel at least once in life. It's a great pleasure! Best, Ion
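(For reference, the "back-insert, sort and merge" step Ion describes above might look roughly like this on a plain vector - a sketch only; it ignores uniqueness, stability, and the temporary buffer that std::inplace_merge may allocate, which is exactly the cost being discussed:)

    #include <algorithm>
    #include <vector>

    // Append [first, last) to the already-sorted vector v, sort the appended
    // tail, then merge the two sorted runs in place.
    template <class T, class It, class Compare>
    void append_sort_merge(std::vector<T>& v, It first, It last, Compare cmp)
    {
        const auto mid = v.end() - v.begin();          // old size, as a difference_type
        v.insert(v.end(), first, last);                // back-insert the new range
        std::sort(v.begin() + mid, v.end(), cmp);      // sort only the new tail
        std::inplace_merge(v.begin(), v.begin() + mid, v.end(), cmp);
    }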
Peter Dimov wrote:
Ion Gaztañaga wrote:
On 13/03/2015 at 14:38, Phil Endecott wrote:
(Personally, I have a flat_map that does batched insertions i.e. the new items are appended individually and then merged at the end of the batch. This will have significant benefits for some workloads.)
Yes, this should be added to flat_map. The interface won't be very pretty, but I would like to hear suggestions in this direction.
I think we discussed this before. My preferred approach is

    flat_set S;
    {
        flat_set::batch b(S);
        b.insert(xxx);
        b.insert(yyy);
    } // b's dtor does the merge

Or, if you don't like doing things in destructors, give batch a "commit" or "merge" method.

One question is: should lookups while a batch is in progress (a) look only at pre-batch items, or (b) crash, or (c) work correctly (with a linear scan of the new items)? I prefer (a).
It doesn't have to have an interface, at least in principle. flat_map can just append the new elements to the end, keeping them sorted, until it hits a threshold, then merge. Lookups will first look at the new element range (because those are likely to be in cache), if not found, at the old element range.
I once tried to work out the complexity of operations on this structure, and I don't think the results were very impressive.

Say you have N elements in the "main" container, and M elements in the "recently added" portion. Insertion will be O(M) due to moving the older "new" elements, plus some amortised amount for the eventual merge into the main container. Lookup will be O(log N + log M). Traversal is O(N + M) but with a significant constant factor, as you have to compare main and new elements as you go.

I think a useful maximum M is sqrt(N). If M is smaller than sqrt(N), then you are doing the merge often enough that the amortised merge complexity exceeds the insertion complexity.

It is also possible to make this a recursive data structure, i.e. the "recently added" portion is the same sort of container with its own "very recently added" portion, ad infinitum. I don't think I managed to work out the complexity of that structure. I suspect that there is a provable lower bound on what can be achieved with "flat" data structures, though.

Regards, Phil.
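(A back-of-the-envelope version of the sqrt(N) threshold argument above - my own working, not from the thread: inserting into a sorted "new" portion of size M costs O(M) moves, and a merge of M new items into N old ones costs O(N + M), amortised over the M insertions of a batch:)

    \[
      \text{cost per insert} \;\approx\;
        \underbrace{O(M)}_{\text{shift within new portion}}
        \;+\;
        \underbrace{O\!\left(\tfrac{N+M}{M}\right)}_{\text{amortised merge}},
      \qquad
      M = \tfrac{N}{M} \;\Longleftrightarrow\; M = \sqrt{N}.
    \]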
Phil Endecott wrote:
I think we discussed this before. My preferred approach is
flat_set S; { flat_set::batch b(S); b.insert(xxx); b.insert(yyy); } // b's dtor does the merge
What do you do when b.insert(yyy) throws?
Peter Dimov wrote:
Phil Endecott wrote:
I think we discussed this before. My preferred approach is
flat_set S; { flat_set::batch b(S); b.insert(xxx); b.insert(yyy); } // b's dtor does the merge
What do you do when b.insert(yyy) throws?
You have at least a couple of choices, e.g. the dtor merges the successfully-inserted values or it discards everything. Of course the next question is whether either merging or discarding can also throw. I would hope that discarding (i.e. truncating the vector back to its original size) could reasonably be expected never to throw, right? Merging would require that the comparisons cannot throw. Regards, Phil.
On 13/03/15 17:41, Phil Endecott wrote:
Peter Dimov wrote:
Ion Gaztañaga wrote:
On 13/03/2015 at 14:38, Phil Endecott wrote:
(Personally, I have a flat_map that does batched insertions i.e. the new items are appended individually and then merged at the end of the batch. This will have significant benefits for some workloads.)
Yes, this should be added to flat_map. The interface won't be very pretty, but I would like to hear suggestions in this direction.
I think we discussed this before. My preferred approach is
flat_set S; { flat_set::batch b(S); b.insert(xxx); b.insert(yyy); } // b's dtor does the merge
How is that any different from the following?

    flat_set S;
    flat_set b;
    b.insert(xxx);
    b.insert(yyy);
    S.insert_sorted(b.begin(), b.end());
On 18/03/2015 at 22:37, Mathias Gaunard wrote:
How is that any different from the following?
flat_set S; flat_set b; b.insert(xxx); b.insert(yyy); S.insert_sorted(b.begin(), b.end());
Currently you can do this:

    S.insert(ordered_unique_range, b.begin(), b.end());

It's more efficient than a loop inserting one by one, but it can still be improved. If:

- distance(b.begin(), b.end()) == M, and
- S.size() == N

then currently (Boost 1.58) it does M·log(N) comparisons and at most N + M moves/copies for bidirectional iterators. It allocates extra memory for that (memory for the insertion positions/indexes that allow linear moves/copies when inserting the M objects in those positions).

Currently I'm working on vector::merge/merge_unique functions that need N + M comparisons and moves while trying to minimize extra memory (if the vector memory needs to be reallocated because S.capacity() < N + M, it uses a std::merge variant to achieve O(N)). I plan this improvement for Boost 1.59. If there is enough room in the current vector memory to store N elements plus indexes, it also achieves O(N) without allocating a new buffer. An optional refinement would be to fall back to M·log(N) comparisons (using binary search) instead of N + M when N >> M.

I have plans to improve non-ordered range insertion to achieve optimal complexity, but that requires storing and sorting (stable sorting for non-unique containers) the range [b.begin(), b.end()) in auxiliary memory, and then merging the newly ordered M elements with the already stored N elements. Using std::stable_sort/sort + std::merge/inplace_merge is not optimal, as they require additional data movements and memory allocations (e.g. memory from get_temporary_storage() is not reused). Since this optimal solution requires reimplementing stable_sort/sort/inplace_merge with support for external temporary storage and Boost.Move, I don't think it's feasible in the short term.

Best,

Ion
Mathias Gaunard
On 13/03/15 17:41, Phil Endecott wrote:
flat_set S; { flat_set::batch b(S); b.insert(xxx); b.insert(yyy); } // b's dtor does the merge
How is that any different from the following?
flat_set S; flat_set b; b.insert(xxx); b.insert(yyy); S.insert_sorted(b.begin(), b.end());
Your code uses more memory (temporarily) and does more copying.

To be clear, my "batch" is a small object that stores only a reference to the flat_set and some iterators/indexes/pointers. Here's a sketch of one way of doing it:

    class batch {
        flat_set& f;
        size_t orig_size;
    public:
        batch(flat_set& f_) : f(f_), orig_size(f.size()) {}
        void insert(const T& val) { f.push_back(val); }
        void rollback() { f.truncate(orig_size); }
        void commit() {
            sort(f.begin() + orig_size, f.end());
            inplace_merge(f.begin(), f.begin() + orig_size, f.end());
            orig_size = f.size();
        }
        ~batch() { commit(); }
    };

Regards, Phil.
On 2015-03-13 13:38, Phil Endecott wrote:
Hi Giacomo,
No, "how it works" is the first thing I want to know! It's unlikely that you've invented something completely new... Granted... I don't think I have invented anything new, but if I had started describing the data structure (and I am not very good at it) I would have got no answers. Raw milliseconds usually do the trick, especially with C++ enthusiasts. I am writing a PDF that explains how this thing works, but basically it's a heap with a few tweaks.
Let's get straight to the benchmarks
Rather than quantified benchmarks, to start with I'd like to know the big-O complexity of the operations.
The Big-O complexity of the operations is the same or worse. Lookup is O((log(n))^2), insertion/erasure is O(n), and a full in-order traversal is O(n·(log(n))^2), as unfortunately you can't conveniently move to the next element in the storage vector.
(Personally, I have a flat_map that does batched insertions i.e. the new items are appended individually and then merged at the end of the batch. This will have significant benefits for some workloads.)
Indeed, that's the first thing I would do but I wanted to get better insertions even when adding one element at a time, and interleaving insertions with lookups, because that's what I do in my actual usage scenario. Thank you for your time! Giacomo
OK, it is time for me to decide whether to submit this or not. Can this flat_map reasonably make it into Boost? If not, I'll push it on GitHub after polishing the source code. I thank you all for your kind feedback! Giacomo On 2015-03-13 07:51, Giacomo Drago wrote:
Hi
My library is not ready for a formal review process, and I'd like to ask whether there is some potential interest first, thus saving everybody a lot of time.
I have been working on a flat_map (http://www.boost.org/doc/libs/1_57_0/doc/html/boost/container/flat_map.html) variant that keeps some of the distinctive features of the original one (contiguous storage/cache friendliness, zero memory overhead/shrink_to_fit support, vector-based representation, etc...) but makes insertion and erasure of the elements much faster, at the price of a slower lookup and in-order traversal.
Explaining how it works requires time and it's not really necessary if there is no interest in the first place (e.g. "thanks, but we have no use for this"). Let's get straight to the benchmarks and to some conclusions.
#elems: 50000
std::map:
  Random insertion: 18.605 ms  Random find: 24.382 ms  Iteration: 5.658 ms  Random erase: 37.166 ms
experimental::flat_map:
  Random insertion: 400.141 ms  Random find: 12.352 ms  Iteration: 7.099 ms  Random erase: 467.051 ms
boost::container::flat_map:
  Random insertion: 611.678 ms  Random find: 6.293 ms  Iteration: 0.038 ms  Random erase: 615.311 ms

#elems: 200000
std::map:
  Random insertion: 146.394 ms  Random find: 163.506 ms  Iteration: 21.973 ms  Random erase: 225.314 ms
experimental::flat_map:
  Random insertion: 5068.93 ms  Random find: 65.618 ms  Iteration: 34.244 ms  Random erase: 6851.09 ms
boost::container::flat_map:
  Random insertion: 10696.9 ms  Random find: 43.896 ms  Iteration: 0.254 ms  Random erase: 10604 ms
Of course one doesn't expect the random insertion to have a performance that is comparable to the one of a tree-based map, like std::map. However, my experimental::flat_map provides a better trade-off than boost's flat_map for me, as random insertion is massively faster and in the big picture of the application I am developing this is making a hell of a difference.
Am I making any sense? Shall I polish the implementation, publish it, and submit a proposal? I'd really appreciate your feedback.
Kind Regards Giacomo
Giacomo Drago wrote:
OK, it is time for me to decide whether to submit this or not.
Tell us how it works! (If you don't want to tell us how it works, then no, this is not suitable for Boost.) Regards, Phil.
On 2015-03-15 16:15, Phil Endecott wrote:
Giacomo Drago wrote:
OK, it is time for me to decide whether to submit this or not.
Tell us how it works!
(If you don't want to tell us how it works, then no, this is not suitable for Boost.)
Of course I want to - it is no secret - but usually people can express a (lack of) interest in something without knowing "how it works", like: "No, we don't need a time machine that brings you to the same time you are at when you use it." By saying this, you save some time for the self-deluded inventor, and for yourselves.

Anyway, I finally managed to find some time to publish the source code with a brief explanation: https://github.com/giacomodrago/flat_map

Feel free to challenge whatever part of it. ;) Please don't remark (at least not now) on the absence of proper iterators or exception safety, or any other implementation detail, because at the moment I still have to figure out whether this data structure is a reasonable idea in the first place. There's time for polishing it. Of course, if a bug is affecting the benchmarks, please tell me: I am sorry, it was not done on purpose.

Best
Giacomo
"The size of the keys seems to make quite a difference. And of course, one has
to consider the overhead given by the pointers in a linked structure, however
allocations are performed.
What are your thoughts?"
Hi Giacomo,
thanks for providing these benchmarks with boost::container::node_allocator.
My point is that the performance of std::map depends so much on the allocation strategy used that it is almost unfair to benchmark it with the standard allocator. One can draw conclusions about its bad performance when the real problem may be in the allocator used. I see, for example, that for size 50000 the boost::container::map with the node allocator performs better than your experimental map in all cases (especially on random insertion and erase).

You may want to have a look at this plot [1] of my own benchmarks of std::set. It will give you an idea of how much the allocator influences the performance.
Regards,
Marcelo
[1] https://github.com/mzimbres/rtcpp/blob/master/fig/std_set_insertion.png
2015-03-15 15:14 GMT-03:00 Giacomo Drago
On 2015-03-15 16:15, Phil Endecott wrote:
Giacomo Drago wrote:
OK, it is time for me to decide whether to submit this or not.
Tell us how it works!
(If you don't want to tell us how it works, then no, this is not suitable for Boost.)
Of course I want, it is not secret, but usually people can express a (lack of) interest in something without knowing "how it works," like: "No, we don't need a time machine that brings you to the same time you are when you use it." By telling this, you save some time to the self-deluded inventor, and to yourselves.
Anyway, I finally managed to find some time to publish the source code with a brief explanation: https://github.com/giacomodrago/flat_map
Feel free to challenge whatever part of it. ;) Please don't remark (at least not now) the absence of proper iterators or exception safety, and any other implementation detail, because at the moment I still have to figure out whether this data structure is a reasonable idea in the first place. There's time for polishing it. Of course, if a bug is affecting the benchmarks, please tell me: I am sorry, it is not done on purpose.
Best Giacomo
On 2015-03-15 22:33, Marcelo Zimbres wrote:
"The size of the keys seems to make quite a difference. And of course, one has to consider the overhead given by the pointers in a linked structure, however allocations are performed.
What are your thoughts?"
Hi Giacomo,
thanks for providing these benchmarks with boost::container::node_allocator. My point is that the performance of std::map depends so much on the allocation strategy used that it is almost unfair to compare its benchmark when using the standard allocator. One can draw conclusions about its bad performance when the real problem may be in the allocator used. I see for example that for size 50000 the boost::container::map with the node allocator performs better than your experimental map in all cases (especially on random insertion and erase).
You may want to have a look on this plot [1] of my own benchmarks of std::set. It will give you an idea of how much the allocator influences the performance.
Thank you for your feedback. While I see your point, I think one does not choose a flat_map over a linked structure only to have faster lookups. We are talking about an associative container with no memory overhead and, possibly, faster lookups. When I am in need of an associative, sorted container, in 95% of the cases I would opt for the vanilla std::map, with or without a better allocator. But we are addressing specific needs here.

The comparison I am making is between Boost's flat_map and my "experimental" flat_map, with a linked-structure map providing just the baseline to understand the trade-offs involved. Furthermore, comparing insertion/erasure performance of std::map (with or without boost::container::node_allocator) with insertion/erasure in whatever flat map is... uhm... pointless. It's... well... another Big-O. Everybody knows the linked structure has blazingly fast insertions and erasures compared to a vector where you have to move elements around. I also point you to the size 50000 benchmark where the key is a 4-integer POD.

That said, thank you for talking about the importance of allocators: I'll keep that in mind for future benchmarks.

Best
Giacomo
Giacomo Drago
Anyway, I finally managed to find some time to publish the source code with a brief explanation: https://github.com/giacomodrago/flat_map
I think your "brief explanation" is this sentence: [The elements] are organised in a heap where each level is kept sorted. OK. If there is more that I've missed, please let me know. That almost sounds like an implicit binary tree, except that I guess you're storing the min item in the parent, not the mid. I.e. to store the values 0 to 6, an implicit binary tree would store: 3 1 5 0 2 4 6 The children of item i are at locations 2i+1 and 2i+2. To iterate the items in order, you do a mid-order traversal. But a (min-) heap with sorted levels would be more like this: 0 1 4 2 3 5 6 In this case you can do a pre-order traversal to get the items in-order. Have I understood correctly? An implicit tree can clearly do lookup in O(log N). Insertion at a leaf is also O(log N), but I'm not sure about rebalancing; I suspect that each time you move a node you need to move all of its children, which obviously gets expensive. You say that you think your lookup is O(log^2 N), so maybe I've not correctly understood your design. The last time I looked at implicit binary trees, it was with the aim of improving locality of reference: a binary search in a flat_map, although theoretically O(log N), has terribly unfriendly VM and cache behaviour for large N. This is the same argument as for B-trees. Regards, Phil. P.S. Have you looked at Boost.Heap?
Thanks for your kind reply, Phil. I will answer your questions the best I can. On 2015-03-19 17:05, Phil Endecott wrote:
That almost sounds like an implicit binary tree, except that I guess you're storing the min item in the parent, not the mid. I.e. to store the values 0 to 6, an implicit binary tree would store:
3 1 5 0 2 4 6
The children of item i are at locations 2i+1 and 2i+2. To iterate the items in order, you do a mid-order traversal.
But a (min-) heap with sorted levels would be more like this:
0 1 4 2 3 5 6
In this case you can do a pre-order traversal to get the items in-order.
A simple pre-order traversal? Does it work? In your example it clearly does, but you can also have

  [ 0 ]
  [ 1 4 ]
  [ 2 5 6 7 ]

In that case, how do you 'jump' from 2 up to 4 and then down again to 5? How do you know "who's next" (as it can be someone ten levels above)? Or am I just confused and talking nonsense?
You say that you think your lookup is O(log^2 N), so maybe I've not correctly understood your design.

I basically do a per-level binary search with some minor optimizations, easily understandable by reading the code (but I can elaborate, if I am making some sense so far and you wish me to continue). How can you search in O(log N) in a heap where the elements of each level are sorted (not antagonizing, genuinely asking out of curiosity)? If I have convinced you so far, then we can get into how insertion and erasure work.

P.S. Have you looked at Boost.Heap?

I used Boost.Heap with great profit in the past. I didn't use any Boost library for this project because the implementation is a throwaway prototype, and hacking up a modified heap by means of a std::vector was, for me, the fastest way to get the job done - for now. In the unlikely case this strange and hardly useful data structure made it into Boost, it would be rewritten and properly Boostified, without any reinvention of the wheel of course, and stress-tested to death.

Best
Giacomo
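(To make the lookup concrete, a minimal sketch of the per-level binary search as I understand it - keys stored level by level in a std::vector, every level fully sorted; an illustration only, not Giacomo's actual implementation:)

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Level k occupies indices [2^k - 1, 2^(k+1) - 1) of the vector.
    // Binary-searching each level independently gives O((log N)^2) lookups.
    template <class Key>
    bool contains(const std::vector<Key>& v, const Key& key)
    {
        for (std::size_t first = 0, width = 1; first < v.size();
             first = 2 * first + 1, width *= 2)
        {
            const std::size_t last = std::min(first + width, v.size());
            if (std::binary_search(v.begin() + first, v.begin() + last, key))
                return true;
        }
        return false;
    }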
Giacomo Drago wrote:
On 2015-03-19 17:05, Phil Endecott wrote:
That almost sounds like an implicit binary tree, except that I guess you're storing the min item in the parent, not the mid. I.e. to store the values 0 to 6, an implicit binary tree would store:
3 1 5 0 2 4 6
The children of item i are at locations 2i+1 and 2i+2. To iterate the items in order, you do a mid-order traversal.
But a (min-) heap with sorted levels would be more like this:
0 1 4 2 3 5 6
In this case you can do a pre-order traversal to get the items in-order.
A simple pre-order traversal? Does it work? In your example it clearly does, but you can also have
[ 0 ] [ 1 4 ] [ 2 5 6 7 ]
Well that doesn't look like "each level is kept sorted" to me, but thanks for the further explanation of what you're actually doing. It is possible that this is a novel structure. I'm not convinced that it is useful. If you want a reasonably compact representation, i.e. without the overhead of per-item pointers, but need fast insertion and lookup (both in terms of computational complexity and absolute speed), then an in-memory B tree is hard to beat. Regards, Phil.
On 2015-03-21 17:00, Phil Endecott wrote:
Giacomo Drago wrote:

Well that doesn't look like "each level is kept sorted" to me, but thanks for the further explanation of what you're actually doing.

Well, it is *exactly* "each level is kept sorted", literally. :)
It is possible that this is a novel structure. I'm not convinced that it is useful. If you want a reasonably compact representation, i.e. without the overhead of per-item pointers, but need fast insertion and lookup (both in terms of computational complexity and absolute speed), then an in-memory B tree is hard to beat.
I see. Can you provide me with an implementation of such a data structure having the same memory footprint as a flat_map (no per-item pointers) and better performance characteristics? I look forward to seeing it. Best Giacomo
Giacomo Drago wrote:
On 2015-03-21 17:00, Phil Endecott wrote:
Giacomo Drago wrote: Well that doesn't look like "each level is kept sorted" to me, but thanks for the further explanation of what you're actually doing. Well, it is *exactly* "each level is kept sorted", literally. :)
To be clear: in your structure, each pair of siblings is ordered. But those elements are not ordered with respect to the other items at the same level (the "cousins").
It is possible that this is a novel structure. I'm not convinced that it is useful. If you want a reasonably compact representation, i.e. without the overhead of per-item pointers, but need fast insertion and lookup (both in terms of computational complexity and absolute speed), then an in-memory B tree is hard to beat.
I see. Can you provide me with an implementation of such data structure having the same memory footprint of a flat_map (no per-item pointers) and better performance characteristics? I look forward to seeing it.
There is an in-memory B-tree at google code that looks decent, though I've not personally used it:

http://code.google.com/p/cpp-btree/

If you click "Wiki pages - UsageInstructions" at the bottom left of that page you'll find tables giving the memory overhead. A B-tree has no per-item pointers, but it does have per-page pointers; you can asymptotically approach zero overhead by increasing the page size. With an infinite page size you effectively have a flat_map; with a page size of one you have a binary tree.

In practice, a more important influence on memory footprint than the per-page pointers is the empty space in the pages that results from splitting after insertion, and after deletion but before merging. But a flat_map implemented on a std::vector has a similar issue, i.e. that std::vector doubles its size when it re-allocates. So a std::vector will be wasting up to half of the memory that it has allocated.

Regards, Phil.
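(For comparison, the Google B-tree mentioned above aims to be close to a drop-in replacement for std::map. Assuming its btree::btree_map class and btree_map.h header - I have not re-checked the exact spelling against the project - usage would look something like this:)

    #include "btree_map.h"   // from http://code.google.com/p/cpp-btree/ (assumed header name)

    int main()
    {
        btree::btree_map<int, int> m;   // the node ("page") size is a further template parameter
        m[1] = 10;
        m[2] = 20;
        return m.count(1) == 1 ? 0 : 1;
    }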
Hi Phil, thanks for following up. On 2015-03-22 14:41, Phil Endecott wrote:
To be clear: in your structure, each pair of siblings is ordered. But those elements are not ordered with respect to the other items at the same level (the "cousins").
Well... To be clear: they are. For each level of the binary heap (if we agree on the definition of level) the elements are sorted. Indeed, that's basically the whole point of it, otherwise I couldn't make a level-wise binary search when looking up, and I would need some sort of sorcery to get those benchmarks. Are we... on the same page? Am I missing something?
There is an in-memory B-tree at google code that looks decent, though I've not personally used it:
http://code.google.com/p/cpp-btree/
If you click "Wiki pages - UsageInstructions" at the bottom left of that page you'll find tables giving the memory overhead. A B-tree has no per-item pointers, but it does have per-page pointers; you can asymptotically approach zero overhead by increasing the page size. With an infinite page size you effectively have a flat_map, with a page size of one you have a binary tree.
In practice, a more important influence on memory footprint than the per-page pointers is the empty space in the pages that results from splitting after insertion, and after deletion but before merging. But a flat_map implemented on a std::vector has a similar issue, i.e. that std::vector doubles its size when it re-allocates. So a std::vector will be wasting up to half of the memory that it has allocated.
On a vector you can easily shrink_to_fit thus bringing the overhead to zero (if you neglect the vector size and a pointer to its buffer). I'll study the B-tree and its trade-offs - I declare my ignorance in this particular area - but I still think a flat_map (that does not mean this flat_map in particular) has some reason to be. Best Giacomo
Giacomo Drago wrote:
On 2015-03-22 14:41, Phil Endecott wrote:
There is an in-memory B-tree at google code that looks decent, though I've not personally used it:
http://code.google.com/p/cpp-btree/
If you click "Wiki pages - UsageInstructions" at the bottom left of that page you'll find tables giving the memory overhead. A B-tree has no per-item pointers, but it does have per-page pointers; you can asymptotically approach zero overhead by increasing the page size. With an infinite page size you effectively have a flat_map, with a page size of one you have a binary tree.
In practice, a more important influence on memory footprint than the per-page pointers is the empty space in the pages that results from splitting after insertion, and after deletion but before merging. But a flat_map implemented on a std::vector has a similar issue, i.e. that std::vector doubles its size when it re-allocates. So a std::vector will be wasting up to half of the memory that it has allocated.
On a vector you can easily shrink_to_fit thus bringing the overhead to zero (if you neglect the vector size and a pointer to its buffer).
And you can compact ("vacuum") a B-tree, if you want.
I still think a flat_map (that does not mean this flat_map in particular) has some reason to be.
For me, the main uses of flat data structures have been: (a) When it's important to have no pointers at all, so that I can store the data in a memory-mapped file or perhaps in interprocess shared memory. (Though most recently I decided to use offset pointers from Boost.Interprocess along with a Boost.Intrusive map, and that worked well.) (b) When the data has a "life cycle" with different phases, e.g. an initial phase of bulk creation, followed by a phase of random lookup. In this case you only sort once so the computational complexity is optimal. I would not want to use them when either you have fine-grained mixing of insertion and lookup, or when the amount of data is large enough that the poor locality of reference starts to dominate. Regards, Phil.
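(As an illustration of use case (a), a flat container can be placed directly in Boost.Interprocess shared memory; a sketch along these lines, with an arbitrary segment name and size and no error handling:)

    #include <boost/container/flat_map.hpp>
    #include <boost/interprocess/allocators/allocator.hpp>
    #include <boost/interprocess/managed_shared_memory.hpp>
    #include <functional>
    #include <utility>

    namespace bip = boost::interprocess;

    using shm_allocator = bip::allocator<std::pair<int, int>,
                                         bip::managed_shared_memory::segment_manager>;
    using shm_flat_map  = boost::container::flat_map<int, int, std::less<int>, shm_allocator>;

    int main()
    {
        bip::managed_shared_memory segment(bip::open_or_create, "flat_map_demo", 64 * 1024);
        shm_flat_map* m = segment.find_or_construct<shm_flat_map>("map")(
            shm_allocator(segment.get_segment_manager()));
        (*m)[1] = 42;   // contiguous storage inside the segment, no raw pointers stored
    }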
On 2015-03-22 16:40, Phil Endecott wrote:
And you can compact ("vacuum") a B-tree, if you want.
I see.
(a) When it's important to have no pointers at all, so that I can store the data in a memory-mapped file or perhaps in interprocess shared memory. (Though most recently I decided to use offset pointers from Boost.Interprocess along with a Boost.Intrusive map, and that worked well.)
(b) When the data has a "life cycle" with different phases, e.g. an initial phase of bulk creation, followed by a phase of random lookup. In this case you only sort once so the computational complexity is optimal.
I would not want to use them when either you have fine-grained mixing of insertion and lookup, or when the amount of data is large enough that the poor locality of reference starts to dominate.
You make a good point here! From your silence on the "sorted vs non-sorted" issue, I am positive we are on the same page about that, regardless of its relevance. Best Giacomo
Giacomo Drago wrote:
From your silence on the "sorted vs non sorted" issue, I am positive we are on the same page about that, regardless its relevance.
Yes, I have finally worked out what you're doing there.

It seems to me that for an insertion, you'll potentially have to move all of the bottom level; that will be up to half of the container. If my maths is right, you'll do on average one third of the number of moves that a flat_map will do.

Another way to get the same saving would be to have N normal flat_maps (where N is e.g. 2 or 3). Insert into whichever is currently smallest, and look up in each of them. That has the same typical and worst-case complexity as a flat_map, but you can trade off (by constant factors) the cost of insertion vs. the cost of lookup and iteration by adjusting N.

Looking at some of your benchmark numbers, you seem to be getting about 2x for insertion and erasing, vs. 0.66x for find. So I would expect a "flat map pair" to get comparable performance, without losing the worst-case complexity of flat_map.

Regards, Phil.
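(A rough sketch of the "flat map pair" idea with N = 2 - my own illustration with hypothetical names; in-order iteration would need an extra merge step that is omitted here:)

    #include <boost/container/flat_map.hpp>
    #include <cstddef>

    template <class Key, class T>
    class flat_map_pair   // hypothetical name
    {
        boost::container::flat_map<Key, T> a_, b_;

    public:
        void insert_or_assign(const Key& k, const T& v)
        {
            // Update in place if the key is already present in either half...
            auto it = a_.find(k);
            if (it != a_.end()) { it->second = v; return; }
            it = b_.find(k);
            if (it != b_.end()) { it->second = v; return; }
            // ...otherwise insert into whichever half is currently smaller,
            // so each half stays roughly half the size of a single flat_map.
            (a_.size() <= b_.size() ? a_ : b_)[k] = v;
        }

        const T* find(const Key& k) const
        {
            auto it = a_.find(k);
            if (it != a_.end()) return &it->second;
            it = b_.find(k);
            return it != b_.end() ? &it->second : nullptr;
        }

        std::size_t size() const { return a_.size() + b_.size(); }
    };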
On 2015-03-24 19:32, Phil Endecott wrote:
It seems to me that for an insertion, you'll potentially have to move all of the bottom level; that will be up to half of the container. If my maths is right, you'll do on average one third of the number of moves that a flat_map will do.
I concur with you.
Another way to get the same saving would be to have N normal flat_maps (where N is e.g. 2 or 3). Insert into whichever is currently smallest, and lookup in both. That has the same typical and worst-case complexity as a flat_map, but you can trade off (by constant factors) the cost of insertion vs. the cost of lookup and iteration by adjusting N.
This is simple and brilliant, and can help me with the actual scenario I am dealing with. Do you mind if I use this idea, with an appropriate mention of you? At this point I think there is no reason to submit this library to Boost. Best Giacomo
Giacomo Drago wrote:
On 2015-03-24 19:32, Phil Endecott wrote:
Another way to get the same saving would be to have N normal flat_maps (where N is e.g. 2 or 3). Insert into whichever is currently smallest, and lookup in both. That has the same typical and worst-case complexity as a flat_map, but you can trade off (by constant factors) the cost of insertion vs. the cost of lookup and iteration by adjusting N.
This is simple and brilliant, and can help me with the actual scenario I am dealing with. Do you mind if I use this idea, and I appropriately mention you?
You're welcome, and there is no need for any attribution. Regards, Phil.
participants (8): Chris Glover, Dominique Devienne, Giacomo Drago, Ion Gaztañaga, Marcelo Zimbres, Mathias Gaunard, Peter Dimov, Phil Endecott