[poly_collection] Memory allocation and allocator propagation
On 14/05/2017 at 17:01, Ion Gaztañaga wrote:
Hi Joaquín,
Some additional comments about the design:
/////////////////////////
// Memory indirections
/////////////////////////
If I understand code correctly, the internal structure of a poly collection is approximately:
1) poly_collection stores an unordered_map
2) segment_type stores a unique_ptr<segment_backend>, and segment_backend is an abstract class, so this is roughly equivalent to a unique_ptr to the base class (both packed_segment/split_segment derive from segment_backend).
3) packed_segment/split_segment stores a vector of concrete types.
So we need 3 memory hops to navigate from the "this" pointer of a PolyCollection to a pointer to a concrete type: poly_collection -> unordered node -> segment in unique_ptr -> vector value.
I think the data structure could be optimized to save one indirection. segment_type could directly hold the class derived from segment_backend, as packed_segment/split_segment will need the same size and alignment requirements regardless of the ConcreteType they hold (as long as they use std::vector). This could improve insertion times and other operations. This optimization could also make it easier to fix the following section (allocator propagation).
*Main question 1*: Do you think PolyCollection should implement this optimization?
How do you envision this could be done? Replacing the current std::unique_ptr pimpl with a std::variant (which would restrict the set of implementations of segment_backend to only these two)? Or do you have something else in mind?
/////////////////////////
// Allocator propagation
/////////////////////////
As mentioned in Adam Wulkiewicz's review, PolyCollection currently does not propagate the allocator from the constructor argument of poly_collection to the concrete type. From a practical point of view, the key requirement is that all intermediate memory down to the concrete type must be allocated using the same root allocator passed when the poly_collection was constructed, and that allocator must also be propagated to the concrete type.
We have a problem here: your key requirement can be split in two:
A. The allocator is propagated down the data structure for memory allocation.
B. allocator_traits::construct/destroy is used for construction/destruction of the "container's element type".
[...]
I think that a unique memory source for all internal structures and the correct propagation of the allocator using allocator_traits is a very interesting feature for PolyCollection.
Hopefully answered above.
[...]
*Main question 2*: Do you think PolyCollection should support stateful allocator using the scoped allocator model?
Now, the scoped allocator model is an even stronger requirement than we have discussed so far. I don't have the necessary expertise in this, but I fail to see the applicability to PolyCollection from a cursory glance at N2554.
Some specific notes that might show places where the scoped allocator model has problems:
[...]
- The "emplace" operation uses a placement new wrapped in a lambda which is type-erased with a captureless lambda. It looks to me that instead of delegating the work into the value_holder, the type erasure arguments should be treated by the segment so that internal vectors propagate the allocator when needed.
I don't see how this could be done: type erasure must happen before the segment is used, and the actual argument types are then unknown to the segment... Care to elaborate?

Joaquín M López Muñoz
On 15/05/2017 11:57, Joaquin M López Muñoz wrote:
I think the data structure could be optimized to save one indirection. segment_type could directly hold the class derived from segment_backend, as packed_segment/split_segment will need the same size and alignment requirements regardless of the ConcreteType they hold (as long as they use std::vector). This could improve insertion times and other operations. This optimization could also make it easier to fix the following section (allocator propagation).
*Main question 1*: Do you think PolyCollection should implement this optimization?
How do you envision this could be done? Replacing the current std::unique_ptr
pimpl with a std::variant (which would restrict the set of implementations of segment_backend to only these two)? Or do you have something else in mind?
In theory each Model can specify the alignment and size to hold every possible segment type for any ConcreteType. I think any ConcreteType will have the same size and alignment requirements (as a vector stores pointers, so the size or alignment of T does not participate in the size or alignment of segment_type).
So the requirement is that there is an upper bound on the size and alignment for segment types holding any ConcreteType. Class segment holds aligned storage big enough for any concrete segment type:
class MyModel
{
private:
typedef packed_segment
We have a problem here: your key requirement can be split in two:
A. Allocator is propagated down the data structure for memory allocation.
B. allocator_traits::construct/destroy is used for construction/destruction of the "container's element type".

As for A, in fact the root allocator is copied and used for all dynamic memory handling from the top std::unordered_map down to the private std::vectors in packed_segment/split_segment *except* at the std::unique_ptr pimpl part. This could be remedied either by suppressing the unique_ptr altogether as you suggest at the beginning of the post, or by providing a custom deleter to unique_ptr or something.
Right. I noticed that value_holder also does not propagate the allocator argument when the concrete type is constructed. value_holder could be defined so that uses_allocator_v == true (defining an internal allocator_type and having extended constructors). Then it could use allocator_traits to construct the concrete type and forward the allocator if that type supports it. IMHO, even if the standard seems to say the opposite (as it only requires allocator_traits::construct for the element_type), I think the allocator should also be propagated to the vector holding the index in split_segment. The idea is that all memory needed by the container should come from the same source.
Now as for B: this really depends on what the "container's element type" is supposed to be. A natural possibility would be to equate this with the container's value_type, in which case the situation is:
I think the standard is written to support only std containers but not generic containers. IMHO requirement B is not applicable to PolyCollection.
Summing up: I see A (all dynamic memory handled by the passed allocator) as a useful and desirable feature. I don't see the point of doing contortions to achieve B (by passing allocators to value_holder etc.) just for the sake of compliance, much more so when in one collection (namely base_collection) it is impossible to comply in a literal sense (it is derived classes that get constructed/destroyed).
You need to pass the allocator to value_holder so it forwards it to the real concrete type, which could be a container. Just use allocator_traits each time you need to construct an object in place.
- The "emplace" operation uses a placement new wrapped in a lambda which is type-erased with a captureless lambda. It looks to me that instead of delegating the work into the value_holder, the type erasure arguments should be treated by the segment so that internal vectors propagate the allocator when needed.
I don't see how this could be done: type erasure must happen before the segment is used, and the actual argument types are then unknown to the segment... Care to elaborate?
I haven't thought about it carefully and it might not be possible at all, but maybe the lambda could execute:

s.emplace_back(forward<Args>(args...));

where s is the concrete segment type that can be obtained via Model (the address of the aligned storage inside "class storage" is cast to Model::get_segment_type<T>::type*), because the ConcreteType we are going to construct is known at compile time in emplace_back_impl.

I also wonder if insert<T> and other functions should avoid slicing detection so that they can also avoid the virtual call overhead. If T is assumed to be the same type as the final ConcreteType, you can use typeid(T) instead of typeid(x) (which can require a trip through the vptr) when searching in the unordered map, and obtain a pointer to the concrete segment type by just casting. Insertion times could be improved; especially range insertions could shine (as the vector could be resized just once). Or am I missing something?

Best,
Ion
On 16/05/2017 at 1:00, Ion Gaztañaga wrote:
[discussion about avoiding unique_ptr in segment_backend and scoped allocators]
I'll definitely try to implement this along your suggestions --might need your offlist help, though :-)
- The "emplace" operation uses a placement new wrapped in a lambda which is type-erased with a captureless lambda. It looks to me that instead of delegating the work into the value_holder, the type erasure arguments should be treated by the segment so that internal vectors propagate the allocator when needed.
I don't see how this could be done: type erasure must happen before the segment is used, and the actual argument types are then unknown to the segment... Care to elaborate?
I haven't thought about it carefully and it might not be possible at all, but maybe the lambda could execute:
s.emplace_back(forward<Args>(args...));
where s is the concrete segment type that can be obtained via Model (the address of the aligned storage inside "class storage" is cast to Model::get_segment_type<T>::type*), because the ConcreteType we are going to construct is known at compile time in emplace_back_impl.
I also wonder if insert<T> and other functions should avoid slicing detection so that they can also avoid the virtual call overhead. If T is assumed to be the same type as the final ConcreteType, you can use typeid(T) instead of typeid(x) (which can require a trip through the vptr) when searching in the unordered map, and obtain a pointer to the concrete segment type by just casting. Insertion times could be improved; especially range insertions could shine (as the vector could be resized just once). Or am I missing something?
Ion, I think you're on to something here. As it happens, poly_collection can statically determine when T is acceptable and dynamically detect the cases where x's subtype is actually T, in which case virtual calls can be omitted altogether --a reference to the internal packed_segment/split_segment can be statically obtained from the segment.
On 16/05/2017 9:23, Joaquin M López Muñoz wrote:
where s is the concrete segment type that can be obtained via Model (the address of the aligned storage inside "class storage" is cast to Model::get_segment_type<T>::type*), because the ConcreteType we are going to construct is known at compile time in emplace_back_impl.
I also wonder if insert<T> and other functions should avoid slicing detection so that they can also avoid the virtual call overhead. If T is assumed to be the same type as the final ConcreteType, you can use typeid(T) instead of typeid(x) (which can require a trip through the vptr) when searching in the unordered map, and obtain a pointer to the concrete segment type by just casting. Insertion times could be improved; especially range insertions could shine (as the vector could be resized just once). Or am I missing something?
For the emplace function, as the ConcreteType is always statically known (type T is the type to be constructed, whereas in "insert" T is the type used to construct the concrete type), I think no lambda is needed; it could be directly translated to a simple call on the statically known segment type:

s.emplace_back(forward<Args>(args...));
Ion, I think you're on to something here. As it happens, poly_collection can statically determine when T is acceptable and dynamically detect the cases where x's subtype is actually T, in which case virtual calls can be omitted altogether --a reference to the internal packed_segment/split_segment can be statically obtained from the segment.
Ok, I see it. If typeid(x) == typeid(T), the static path (let's call it statically_known_insert(x)) is taken. Otherwise the virtual function is used.

In any case, the runtime slicing detection will slow down ranges, as you'll need to check the typeid of each *it reference in the range. If all the range is assumed to be the same runtime type (or slicing is assumed) then the operation is a simple:

s.insert(s.end(), first, last);

which will use distance(first, last) for forward iterators, allocate all the storage once and avoid a lot of object moving/copying. Maybe a new overload for insert taking a range that the user does not care to slice could be a good idea for performance reasons:

base_collection.insert<warrior>(warriors_first, warriors_last);

For absolute zero overhead (no repeated typeid, no repeated hash search), a segment handle utility comes to my mind:

typedef base_collection<sprite> bc_t;
bc_t bc;

//register type
bc.register_types<warrior>();

//Obtain segment handle (maybe just a struct holding the pointer to
//the concrete segment)
bc_t::segment_handle<warrior> sh = bc.segment_handle<warrior>();

//Insert directly in the segment (just vector insertion)
//and updating bc's member data if any.
bc.insert(sh, pos, warrior);

Additional comments:

1) I also noticed that bc.size() and bc.empty() are not constant-time. It seems a bit surprising, it might be noticeable when the number of segments grows. This avoids common data in poly_collection but I imagine most users assume these functions are free.

2) The number of additional operations that could take advantage of the "I know your static type" optimization is important, although the speedup will depend on the cost of each operation (eg. size()/reserve() vs. a virtual function call):

template<typename T> void reserve(size_type n);
template<typename T> void shrink_to_fit();
template<typename T> size_type capacity() const;
template<typename T> size_type max_size() const;
template<typename T> size_type size() const;
template<typename T> bool empty() const;
template<typename T> void clear();

Additional possible "zero cost" calls using the segment handle:

template<typename SegmentHandle> void reserve(SegmentHandle sh, size_type n);
template<typename SegmentHandle> void shrink_to_fit(SegmentHandle sh);
template<typename SegmentHandle> size_type capacity(SegmentHandle sh) const;
template<typename SegmentHandle> size_type max_size(SegmentHandle sh) const;
template<typename SegmentHandle> size_type size(SegmentHandle sh) const;
template<typename SegmentHandle> bool empty(SegmentHandle sh) const;
template<typename SegmentHandle> void clear(SegmentHandle sh);

Best,
Ion
On 16/05/2017 at 13:35, Ion Gaztañaga wrote:
On 16/05/2017 9:23, Joaquin M López Muñoz wrote:
where s is the concrete segment type that can be obtained via Model (the address of the aligned storage inside "class storage" is cast to Model::get_segment_type<T>::type*), because the ConcreteType we are going to construct is known at compile time in emplace_back_impl.
I also wonder if insert<T> and other functions should avoid slicing detection so that they can also avoid the virtual call overhead. If T is assumed to be the same type as the final ConcreteType, you can use typeid(T) instead of typeid(x) (which can require a trip through the vptr) when searching in the unordered map, and obtain a pointer to the concrete segment type by just casting. Insertion times could be improved; especially range insertions could shine (as the vector could be resized just once). Or am I missing something?
For the emplace function, as the ConcreteType is always statically known (type T is the type to be constructed, whereas in "insert" T is the type used to construct the concrete type), I think no lambda is needed; it could be directly translated to a simple call on the statically known segment type:
s.emplace_back(forward<Args>(args...));
Correct.
Ion, I think you're on to something here. As it happens, poly_collection can statically determine when T is acceptable and dynamically detect the cases where x's subtype is actually T, in which case virtual calls can be omitted altogether --A reference to the internal packed_segment/split_segment
can be statically obtained from the segment.

Ok, I see it. If typeid(x) == typeid(T), the static path (let's call it statically_known_insert(x)) is taken. Otherwise the virtual function is used.
In any case, the runtime slicing detection will slow down ranges, as you'll need to check the typeid of each *it reference in the range.
Not in every case, see for instance the two insert overloads at: https://github.com/joaquintides/poly_collection/blob/master/include/boost/po... where you can notice that when the type is terminal we don't repeat typeid checking. But in general we need to typeid individually, and I don't see this as something we can dispense with: otherwise there'd be no way to do type-erased insertions as simple as:

boost::ptr_vector<Base> v;
...
c.insert(p.begin(),p.end());
If all the range is assumed to be the same runtime type (or slicing is assumed) then the operation is a simple:
s.insert(s.end(), first, last);
which will use distance(first, last) for forward iterators, allocate all the storage once and avoid a lot of object moving/copying. Maybe a new overload for insert taking a range that the user does not care to slice could be a good idea for performance reasons:
base_collection.insert<warrior>(warriors_first, warriors_last);
We sort of already have that:

bc.insert(bc.end<warrior>(),warriors_first, warriors_last);

See: https://github.com/joaquintides/poly_collection/blob/master/include/boost/po... This is efficient typeid-wise, not so with respect to preallocating based on distance. If we go ahead with your proposal of static-fying the interface when the concrete type is known, we can get much better --actually, as fast as possible.
For absolute zero overhead (no repeated typeid, no repeated hash search), a segment handle utility comes to my mind:
[...]
I think this is not needed in light of my previous comment.
1) I also noticed that bc.size() and bc.empty() are not constant-time. It seems a bit surprising, it might be noticeable when the number of segments grows. This avoids common data in poly_collection but I imagine most users assume these functions are free.
Strictly speaking, they're constant time wrt the number of elements, assuming (correctly, I'd say) that the number of registered types is bounded and doesn't increase linearly with size(). I'm very against caching the size, for consistency reasons and because we would be adding a little fat to every operation of the container.
2) The number of additional operations that could take advantage of the "I know your static type" optimization is important, although the speedup will depend on the cost of each operation (eg. size()/ reserve() vs. a virtual function call):
template<typename T> void reserve(size_type n);
template<typename T> void shrink_to_fit();
template<typename T> size_type capacity() const;
template<typename T> size_type max_size() const;
template<typename T> size_type size() const;
template<typename T> bool empty() const;
template<typename T> void clear();
If/when we staticfy the interface of course we'll go all the way and improve as much as can be improved. The functions you list above, though, are not expected to be a performance problem in any sensible program.

Joaquín M López Muñoz
On 16/05/2017 20:28, Joaquin M López Muñoz wrote:
In any case, the runtime slicing detection will slow down ranges, as you'll need to check the typeid of each *it reference in the range.
Not in every case, see for instance the two insert overloads at:
https://github.com/joaquintides/poly_collection/blob/master/include/boost/po...
where you can notice that when the type is terminal we don't repeat typeid checking.
Is the iterator's "value_type" a guarantee of the dynamic type of the referenced object? I can imagine ptr_vector or Boost.Intrusive iterators, where value_type is the base class, but the dynamic type of the object is a derived class. If the implementation wants to avoid any slicing and support these "special" iterators then it should check the typeid of every value in the range.
We sort of already have that:
bc.insert(bc.end<warrior>(),warriors_first, warriors_last);
See:
https://github.com/joaquintides/poly_collection/blob/master/include/boost/po...
I was not talking about a range taking local iterators, but a range of iterators of a vector holding warriors. Something like:

https://github.com/joaquintides/poly_collection/blob/master/include/boost/po...

which also assumes iterator::value_type is the dynamic type of the whole range.
This is efficient typeid-wise, not so with respect to preallocating based on distance. If we go ahead with your proposal of static-fying the interface when the concrete type is known, we can get much better --actually, as fast as possible.
For absolute zero overhead (no repeteated typeid, no repeated hash search), a segment handle utility comes to my mind:
[...]
I think this is not needed in light of my previous comment.
As a suggestion, I would include in the insertion benchmarks a vector push_back. At least for packed_segment, vector push_back is the upper bound of any performance improvement. If the performance difference between the new "staticfied" single value insertion function:

while(....) pc.insert(warrior());

and the vector push_back code

while(...) warrior_vector.push_back(warrior());

is significant, then maybe an uglier but near-zero-overhead alternative (like the segment handle methods) might be a good idea. Especially for power programmers that want to create a bunch of warriors when the new game level starts ;-)
1) I also noticed that bc.size() and bc.empty() are not constant-time. It seems a bit surprising, it might be noticeable when the number of segments grows. This avoids common data in poly_collection but I imagine most users assume these functions are free.
Strictly speaking, they're constant time wrt the number of elements, assuming (correctly, I'd say) that the number of registered types is bounded and doesn't increase linearly with size(). I'm very against caching the size, for consistency reasons and because we would be adding a little fat to every operation of the container.
I don't know if anyone will use poly_collection with many different types and few elements per segment. As always, users will surprise us ;-) In any case I would recommend documenting this visibly somewhere (ideally in the method description as Complexity) with a complexity similar to (I hope I don't get it wrong):

O(distance(segment_traversal().begin(), segment_traversal().end()))

as typically most programmers would think of those as extremely cheap operations and might call them in a loop.
2) The number of additional operations that could take advantage of the "I know your static type" optimization is important, although the speedup will depend on the cost of each operation (eg. size()/ reserve() vs. a virtual function call):
template<typename T> void reserve(size_type n);
template<typename T> void shrink_to_fit();
template<typename T> size_type capacity() const;
template<typename T> size_type max_size() const;
template<typename T> size_type size() const;
template<typename T> bool empty() const;
template<typename T> void clear();
If/when we staticfy the interface of course we'll go all the way and improve as much as can be improved. The functions you list above, though, are not expected to be a performance problem in any sensible program.
Right. My "the root of all evil" side is also reviewing the library ;-)

And hopefully these are my last comments ;-) I think none of them are needed for the acceptance of the library, as they could be added as future improvements.

Best,
Ion
On 16/05/2017 at 23:41, Ion Gaztañaga wrote:
On 16/05/2017 20:28, Joaquin M López Muñoz wrote:
In any case, the runtime slicing detection will slow down ranges, as you'll need to check the typeid of each *it reference in the range.
Not in every case, see for instance the two insert overloads at:
https://github.com/joaquintides/poly_collection/blob/master/include/boost/po...
where you can notice that when the type is terminal we don't repeat typeid checking.
Is the iterator's "value_type" a guarantee of the dynamic type of the referenced object? I can imagine ptr_vector or Boost.Intrusive iterators, where value_type is the base class, but the dynamic type of the object is a derived class. If the implementation wants to avoid any slicing and support these "special" iterators then it should check the typeid of every value in the range.
Exactly, this is what it does. At the end of the day, this is about convenience vs. performance. I'd have to find a very convincing argument to give slicing prevention away. Furthermore, you can have the best of both worlds if you keep reading below :-)
We sort of already have that:
bc.insert(bc.end<warrior>(),warriors_first, warriors_last);
See:
https://github.com/joaquintides/poly_collection/blob/master/include/boost/po...
I was not talking about a range taking local iterators, but a range of iterators of a vector holding warriors. Something like:
https://github.com/joaquintides/poly_collection/blob/master/include/boost/po...
which also assumes iterator::value_type is the dynamic type of the whole range.
I know, and this is what you got:

https://github.com/joaquintides/poly_collection/blob/master/include/boost/po...

Note that in bc.insert(bc.end<warrior>(),warriors_first, warriors_last) it's only the first arg that's actually a local iterator: warriors_first and warriors_last can be any InputIterator. No typeid checking is done (except in debug mode, see the BOOST_ASSERTs) and when this is staticfied we can get the preallocation performance boost etc. In a sense, bc.end<warrior>() is the indication to Boost.PolyCollection that the inserted range *must* be warriors and not something else.
[...]
As a suggestion, I would include in the insertion benchmarks a vector push_back. At least for packed_segment, vector push_back is the upper bound of any performance improvement. If the performance difference between the new "staticfied" single value insertion function:
while(....) pc.insert(warrior());
and the vector push_back code
while(...) warrior_vector.push_back(warrior());
is significant, then maybe an uglier but near-zero-overhead alternative (like the segment handle methods) might be a good idea. Especially for power programmers that want to create a bunch of warriors when the new game level starts ;-)
Hintless insertion resolves to pushing back to the corresponding segment. The difference in performance would solely consist of looking up the segment. Again, you can already have your second version with the existing interface:

while(...) pc.insert(bc.end<warrior>(),warrior());

(Insertion at the end is, of course, the same as pushing back).
1) I also noticed that bc.size() and bc.empty() are not constant-time. It seems a bit surprising, it might be noticeable when the number of segments grows. This avoids common data in poly_collection but I imagine most users assume these functions are free.
Strictly speaking, they're constant time wrt the number of elements, assuming (correctly, I'd say) that the number of registered types is bounded and doesn't increase linearly with size(). I'm very against caching the size, for consistency reasons and because we would be adding a little fat to every operation of the container.
I don't know if anyone will use poly_collection with many different types and few elements per segment. As always users will surprise us ;-)
In any case I would recommend documenting this visibly somewhere (ideally in the method description as Complexity) with a complexity similar to (I hope I don't get it wrong):
O(distance(segment_traversal().begin(), segment_traversal().end()))
as typically most programmers would think about those as extremely cheap operations and might call them in a loop.
Noted.
[...]
And hopefully these are my last comments ;-) I think none of them are needed for the acceptance of the library, as they could be added as future improvements.
Thank you for your very interesting input. Let's see if some latecomer joins the party before the review is over.

Joaquín M López Muñoz
On 16-05-2017 at 23:59, Joaquin M López Muñoz via Boost wrote:
El 16/05/2017 a las 23:41, Ion Gaztañaga escribió:
I don't know if anyone will use poly_collection with many different types and few elements per segment. As always users will surprise us ;-)
I agree with Joaquin that we should not store the size.
In any case I would recommend documenting this visibly somewhere (ideally in the method description as Complexity) with a complexity similar to (I hope I don't get it wrong):
O(distance(segment_traversal().begin(), segment_traversal().end()))
as typically most programmers would think about those as extremely cheap operations and might call them in a loop.
Noted.
I think if your documentation defines n = # of elements in container, k = # of segments in container, then simply saying Complexity: O(k) or Complexity: linear in k is nicer to read than all the range stuff.

kind regards

-Thorsten
participants (3)
- Ion Gaztañaga
- Joaquin M López Muñoz
- Thorsten Ottosen