Query regarding interest in my submitting plf::colony to Boost
Hi all, I've had some expressions of support outside of the Boost community for submission of the plf::colony container and its associated container plf::stack, mainly from the review judges for my CppCon submission on the same topic (the topic didn't make it through to the finals but got a positive response). I'd like to gauge reactions here to see whether it is worth spending the time necessary to port it to Boost, with all the associated reviews etc.

plf::colony is a container for unordered data, where erasure and addition do not cause iterator/pointer invalidation, with superior performance characteristics to all std:: containers when either (a) pointers and iterators which point to container elements must stay valid regardless of container additions and deletions, and/or (b) additions to or deletions from the container will be occurring in performance-critical code.

Erasure performance in particular can be upwards of 6000x that of vector, while colony remains faster than lists, maps, deques etc. for addition and iteration. Full details and benchmarks across four compilers are here: http://www.plflib.org/colony.htm#benchmarks

It has been thoroughly tested, including integrating it into the game engine I constructed last year as well as the usual unit tests. It supports both C++03 and C++11, and has been tested on the following compilers so far: Clang 3.4; GCC 4.6, 4.7, 4.8, 4.9 and 5.1 (both 32-bit and 64-bit); MS Visual C++ 2010 & 2013 (2015 supported but untested as yet).

Internally a colony combines a linked list of increasingly-larger memory blocks with per-element boolean erasure fields and a custom stack of erased memory locations, which are recycled on insertion. The most common comment I get is "that sounds kinda like a deque", so to save some time here are the differences (from the FAQ): "A double-ended queue requires a very different internal framework. It may or may not use a linked-list of memory blocks, dependent on the compiler, and has no comparable performance characteristics. Deque element erasure is very slow, similar to a vector, and it does not free unused memory blocks to the OS once they are empty, unlike a colony. In addition a deque invalidates references to subsequent container elements when erasing elements, which a colony does not. A colony never invalidates pointers/iterators to elements unless the element being referred to is erased."

The rest of the FAQ is here, and I would appreciate it if people read it before casting judgement: http://www.plflib.org/colony.htm

I am aware that it will be a lot of work to port colony to Boost, so I will only undertake this if there is genuine interest. In addition, plf::stack - which colony must use internally - is a std::stack-equivalent container but with much better performance characteristics, and can be used outside of plf::colony.

Aside from the performance characteristics, colonies have the following positive and negative properties. Positive: the lack of pointer/iterator invalidation makes interrelating between collections of data easier, without the cache misses associated with vectors of pointers to dynamically-allocated objects. Negative: due to the additional overhead, performance can be worse for scalar types.

Thanks in advance - Matt B
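To make the block/flag/recycle-stack description above concrete, here is a heavily simplified sketch of the scheme, with invented names - an illustration of the idea only, not the actual plf::colony internals:

    #include <cstddef>
    #include <vector>

    // Heavily simplified sketch (invented names; not the real plf::colony):
    // a linked chain of increasingly-larger memory blocks, a boolean
    // erasure flag per slot, and a stack of erased locations that are
    // recycled on the next add.
    template <typename T>
    class colony_sketch
    {
        struct block
        {
            std::vector<T>    slots;   // capacity reserved up front, so
            std::vector<char> erased;  // pointers into it stay stable
            block            *next;
            explicit block(std::size_t capacity) : next(0)
            {
                slots.reserve(capacity);
                erased.reserve(capacity);
            }
        };

        block           *head;
        block           *tail;
        std::size_t      next_capacity;
        std::vector<T *> free_slots;   // the "recycle stack"

    public:
        colony_sketch() : head(0), tail(0), next_capacity(8) {}

        ~colony_sketch()
        {
            while (head) { block *n = head->next; delete head; head = n; }
        }

        // Reuse an erased slot if one exists; otherwise append to the
        // final block, growing the chain geometrically when it is full.
        T *add(const T &value)
        {
            if (!free_slots.empty()) {
                T *slot = free_slots.back();
                free_slots.pop_back();
                *slot = value;  // (clearing the matching erasure flag omitted)
                return slot;
            }
            if (tail == 0 || tail->slots.size() == tail->slots.capacity()) {
                block *b = new block(next_capacity);
                next_capacity *= 2;
                if (tail) tail->next = b; else head = b;
                tail = b;
            }
            tail->slots.push_back(value);  // never reallocates: capacity reserved
            tail->erased.push_back(0);
            return &tail->slots.back();
        }

        // Erasure just flags the slot and pushes it onto the recycle
        // stack; iteration would skip flagged slots.
        void erase(T *slot)
        {
            // (locating and setting the matching erasure flag omitted)
            free_slots.push_back(slot);
        }
    };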
This looks interesting. I am trying to understand what colony does and how it works from the docs, but I have some questions. I haven't looked at the code yet, but I hope these questions and this feedback can help you improve the docs for those who might come after me.

- What concept does the colony iterator model? This is never stated, but I would assume from the context that they are RandomAccessIterators. Still, point 10 in the FAQ ("Will a [ ] operator be added?") suggests that I'm wrong: you cannot subtract the value of the .begin() iterator from the current position iterator to get a colony "index" number like you would with vector. This needs to be stated explicitly; otherwise I don't know which algorithms I can use on colony iterators.
- What is the algorithmic complexity of the operations? I am interested in add, erase, find, find_all and size.
- What are the exception-safety guarantees of the operations mentioned above?
- What is the exact data structure that the colony uses? You specify it up to a point, and add somewhere later that it grows like a vector. From the docs I don't know the details, and I really would like to: what is the growth factor, and how much memory per element does it need (at least a flag specifying whether each element is in use or not, right)?
- Why does colony need a find member function? (Or in other words, why doesn't std::find work?)
- Why is find_all a member function? (Why can't/shouldn't it be implemented as a non-member non-friend using colony::find?)

In general, I cannot interpret any single benchmark since they are unspecified. I would like to know exactly what each benchmark does, that is, which steps are timed, and perhaps how each benchmark looks (a bit of pseudocode or C++ would be OK here, or a link to the benchmark file). Even without this information, a couple of benchmarks look "strange":

- How come colony is 1.5x faster than vector in "test 1: iterating over all elements 100 times and adding a random unsigned int member to a total"? I would expect this to be the best possible case for vector, and from the description of colony's data structure, I don't see how iteration can be faster for colony.
- Clearing a colony is 20x faster than clearing a vector. Since clearing a vector just calls the destructors of the elements, I really don't know how a colony can do less work than this. Is it destroying the elements? What is vector doing that colony isn't?

And that's it for the questions; I'm going to dive into the code next.

P.S.: I think there is a typo in the docs here:
iterator add(const the_type &&element) Moves the element supplied to the colony, using the object's move-constructor.
The "const" should not be there, or otherwise one cannot move "element" into the colony without const_cast tricks.
After going through the code and the benchmarks, I can make sense of the results now: the benchmarks are for a vector<T*>! One thing that comes to mind: it would be interesting to compare your container against `boost::stable_vector` and against a plain `vector<T>`.
On 13 Jul 2015 at 20:22, Soul Studios wrote:
Hi all, I've had some expressions of support outside of the Boost community for submission of the plf::colony container and its associated container plf::stack, mainly from the review judges for my CppCon submission on the same topic (the topic didn't make it through to the finals but got a positive response). I'd like to gauge reactions here to see whether it is worth spending the time necessary to port it to Boost, with all the associated reviews etc.
Firstly, you don't invest the multi-year effort to get a library into Boost because it's superior, as it probably isn't. You do it because of what you will become after you succeed: a Boost library author who has passed the judgement of his or her peers.
plf::colony is a container for unordered data, where erasure and addition do not cause iterator/pointer invalidation, with superior performance characteristics to all std:: containers when either (a) Pointers and iterators which point to container elements must stay valid regardless of container additions and deletions and/or (b) Additions to or deletions from the container will be occurring in performance-critical code.
I am struggling to see the gain over other combinations of STL containers. If you need very fast deletion and never-invalidated iterators, unordered_map is very hard to beat - the cost is insertion performance, but even that goes away with Hinnant's node_ptr proposed extensions to the standard. Dense hash maps probably will be very competitive, and we sorely need one of those in Boost and the STL.

I'd support a bitwise trie map (equal and near-constant-time insert/find/erase), then a dense hash map (really just a vector). Both of those I've seen a profound need for where no easy combination of STL containers solved the problem.

Your particular solution I've not found a need for in my own coding yet. That's not to say there isn't one, just that I haven't encountered your problem space where an application of std::vector combined with trivial customisation didn't solve the scaling problem as needed.
I am aware that it will be a lot of work to port colony to Boost, so I will only undertake this if there is genuine interest.
Don't do it because of that. Do it for yourself personally, or not at all. Trust me: my library is up for review this Friday. I started the process of getting it in in 2012. That's the commitment you need to make.
Aside from the performance characteristics, colonies have the following positive and negative properties. Positive: the lack of pointer/iterator invalidation makes interrelating between collections of data easier, without the cache misses associated with vectors of pointers to dynamically-allocated objects. Negative: due to the additional overhead, performance can be worse for scalar types.
I'd have thought there is surely a space wastage going on here too if I understand the algorithm. That can have cache spillage effects under multithreaded use. I'd benchmark your algorithm when running four benchmarks in parallel. A lot of implementations which look faster single-threaded look very slow when multithreaded due to last-level (LL) cache pressure. The STL tries to balance those two off, and it generally succeeds, sacrificing some single-threaded performance for much improved cache load.

FYI dense hash maps and bitwise trie maps have *excellent* multithreaded performance, almost linear speedup with cores, thanks to their low LL cache loading.

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
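A sketch of the methodology Niall describes: run the identical single-threaded workload on four threads at once, so the copies contend for the shared last-level cache. The run_benchmark body below is a stand-in placeholder to keep the example self-contained; substitute the real container benchmark loops.

    #include <cstddef>
    #include <thread>
    #include <vector>

    // Placeholder workload; substitute the real single-threaded benchmark
    // body (e.g. the colony insert/erase/iterate loops) here.
    void run_benchmark()
    {
        volatile long sink = 0;
        for (long i = 0; i != 10000000L; ++i)
            sink += i;
    }

    int main()
    {
        // Run the same workload on four threads simultaneously; containers
        // with heavy cache footprints fall behind under this contention.
        std::vector<std::thread> workers;
        for (int i = 0; i != 4; ++i)
            workers.push_back(std::thread(run_benchmark));
        for (std::size_t i = 0; i != workers.size(); ++i)
            workers[i].join();
        return 0;
    }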
Firstly, you don't invest the multi-year effort to get a library into Boost because it's superior, as it probably isn't.
You do it because of what you will become after you succeed: a Boost library author who has passed the judgement of his or her peers.
Personally, I actually want to do it to help others, as that is the focal point of my project. I'm not greatly looking for personal gain, but I know that this algorithm gives much greater efficiency under the circumstances described - and I believe efficiency is important. Also, it solves some problem scenarios that cannot be solved with current algorithms - at least not easily.
I am struggling to see the gain over other combinations of STL containers. If you need very fast deletion and never invalidated iterators, unordered_map is very hard to beat - the cost is insertion performance, but even that goes away with Hinnant's node_ptr proposed extensions to the standard. Dense hash maps probably will be very competitive, and we sorely need one of those in Boost and the STL.
The problem with map and unordered_map, which you can see from the first benchmark on the colony page (map currently outperforms unordered_map under GCC, so I didn't include it), is that they have terrible iteration and search performance. A colony has better erase and insert performance, and the best iteration speed of all the std:: containers, in the context of requiring valid pointers/iterators. http://www.plflib.org/colony.htm#benchmarks
Don't do it because of that. Do it for yourself personally, or not at all. Trust me: my library is up for review this Friday. I started the process of getting it in in 2012. That's the commitment you need to make.
Gotcha, that is good to know- thanks.
I'd have thought there is surely a space wastage going on here too if I understand the algorithm. That can have cache spillage effects under multithreaded use.
If you take a look at the diagram at the top of the colony page, those gaps are filled in when any new 'add' occurs - basically, erased locations get pushed onto a 'recycle' stack and popped off at the next insertion. All empty slots are reused before adding to the end of the colony - hence the unordered schema. However, you are correct in that if many elements are deleted and no inserts performed, iteration will suffer - slightly. It doesn't affect cache performance, but obviously skipping over many empty element slots takes more time than if those slots didn't exist at all! Also, when a memory block is 100% empty, i.e. all its elements erased, the memory block is removed from the chain, so you don't have that wastage any more. Block sizes can be tuned in the constructor.
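A short usage sketch of the behaviour described above, using the add/erase member names from the docs quoted in this thread (the header name is assumed from plflib.org; whether the new element actually lands in the vacated slot is an internal detail, so that is only noted in a comment):

    #include <cassert>
    #include "plf_colony.h" // header name assumed from the plflib.org docs

    int main()
    {
        plf::colony<int> c;

        plf::colony<int>::iterator i10 = c.add(10);
        plf::colony<int>::iterator i20 = c.add(20);
        plf::colony<int>::iterator i30 = c.add(30);

        int *p30 = &*i30;  // raw pointer into the colony

        c.erase(i20);      // 20's slot goes onto the recycle stack;
                           // i10, i30 and p30 all remain valid
        c.add(40);         // reuses the vacated slot before growing

        assert(*i10 == 10);
        assert(*p30 == 30); // still valid after the erase and the add
        return 0;
    }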
I'd benchmark your algorithm when running four benchmarks in parallel. A lot of implementations which look faster single threaded look very slow when multithreaded due to LL cache pressure. The STL tries to balance those two off, and it generally succeeds, sacrificing some single threaded performance for much improved cache load.
To be fair, I have not looked into multithreaded performance. I pretty much followed the STL guidelines regarding the conditions under which containers can be used concurrently. I am potentially looking at implementing colonies on CUDA, however.
FYI dense hash maps and bitwise trie maps have *excellent* multithreaded performance, almost linear speedup with cores, thanks to their low LL cache loading.
You may correct me here, but my understanding is that vector pretty much outperforms the other std:: containers in terms of iteration, as cache performance is better due to the contiguous data location. According to Chandler Carruth, this also holds true for key-value searches vs map, as the same rules apply (400ns for L1-cache-search beats multiple 200ns main memory fetches). www.youtube.com/watch?v=fHNmRkzxHWs If I'm missing something do let me know here. Genuine thanks, M
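As a rough illustration of the cache effect being discussed - a toy timing of my own, not the plflib benchmark: summing through a contiguous vector<T> versus chasing one heap pointer per element in a vector of pointers.

    #include <chrono>
    #include <cstdio>
    #include <memory>
    #include <vector>

    // Toy comparison: contiguous iteration vs pointer-chasing iteration.
    // Each dereference in the second loop risks a cache miss when the
    // pointees are scattered on the heap; the first loop streams linearly.
    // (A sterner test would also shuffle the pointers, since pointees
    // allocated in order may land mostly contiguously.)
    int main()
    {
        const std::size_t n = 1000000;

        std::vector<int> direct(n, 1);

        std::vector<std::unique_ptr<int> > indirect;
        indirect.reserve(n);
        for (std::size_t i = 0; i != n; ++i)
            indirect.push_back(std::unique_ptr<int>(new int(1)));

        typedef std::chrono::steady_clock clock;
        typedef std::chrono::microseconds us;

        clock::time_point t0 = clock::now();
        long sum1 = 0;
        for (std::size_t i = 0; i != n; ++i) sum1 += direct[i];
        clock::time_point t1 = clock::now();

        long sum2 = 0;
        for (std::size_t i = 0; i != n; ++i) sum2 += *indirect[i];
        clock::time_point t2 = clock::now();

        std::printf("direct: %ld us, indirect: %ld us (sums: %ld %ld)\n",
                    (long)std::chrono::duration_cast<us>(t1 - t0).count(),
                    (long)std::chrono::duration_cast<us>(t2 - t1).count(),
                    sum1, sum2);
        return 0;
    }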
On 15 July 2015 at 01:07, Soul Studios wrote:
I'd benchmark your algorithm when running four benchmarks in parallel. A lot of implementations which look faster single-threaded look very slow when multithreaded due to LL cache pressure. The STL tries to balance those two off, and it generally succeeds, sacrificing some single-threaded performance for much improved cache load.
To be fair, I have not looked into multithreaded performance. I pretty much followed the STL guidelines regarding the conditions under which containers can be used concurrently. I am potentially looking at implementing colonies on CUDA, however.
I lack time right now, but I'm interested in trying your container in the context of implementations (mostly experimental) of component systems. I don't think anything related to multithreading should be done by this kind of container. Most high-performance usage will be going through all the elements to transform them (if "alive"), and as long as you provide iterators this can be parallelised however the user wants.
FYI dense hash maps and bitwise trie maps have *excellent* multithreaded performance, almost linear speedup with cores, thanks to their low LL cache loading.
You may correct me here, but my understanding is that vector pretty much outperforms the other std:: containers in terms of iteration, as cache performance is better due to the contiguous data location. According to Chandler Carruth, this also holds true for key-value searches vs map, as the same rules apply (400ns for L1-cache-search beats multiple 200ns main memory fetches). www.youtube.com/watch?v=fHNmRkzxHWs If I'm missing something do let me know here. Genuine thanks,
There are cases where vector is not the right solution, but yeah, in my experience too it fits most use cases. Also: http://bannalia.blogspot.fr/2015/06/cache-friendly-binary-search.html I remember Stroustrup making the same statement at the Going Native conference too, and it's also a common observation in game-dev circles (several other CppCon talks discussed it as well). By the way, it seems that you might also be interested in recent discussions taking place in SG14: https://groups.google.com/a/isocpp.org/forum/?fromgroups#!topic/sg14/csLjKbL...
M
I lack time right now, but I'm interested in trying your container in the context of implementations (mostly experimental) of component systems. I don't think anything related to multithreading should be done by this kind of container. Most high-performance usage will be going through all the elements to transform them (if "alive"), and as long as you provide iterators this can be parallelised however the user wants.
Yes, the iterators simply 'jump' over the erased element areas. It's very fast. I am glad for your support, but I'm not sure why you think it shouldn't be used with multithreading - it follows the same rules for multithreading as the std:: containers, i.e. concurrent reads are fine, writes must be serialized.
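For anyone curious what such 'jumping' might look like, here is a simplified sketch of an increment that skips flagged slots - an invented structure for illustration, not colony's real iterator:

    #include <cstddef>

    // Hypothetical simplified view of one colony-style memory block:
    // element slots plus a parallel "erased" flag per slot.
    template <typename T>
    struct block
    {
        T          *elements;  // slot storage
        bool       *erased;    // one flag per slot
        std::size_t size;      // slots constructed so far in this block
        block      *next;      // next (larger) block in the chain
    };

    template <typename T>
    struct skipping_iterator
    {
        block<T>   *b;
        std::size_t index;

        // Hop over erased slots, moving to the next block whenever the
        // current block is exhausted; b == nullptr acts as "end".
        void advance_past_erased()
        {
            while (b != nullptr) {
                while (index < b->size && b->erased[index]) ++index;
                if (index < b->size) return;  // landed on a live element
                b = b->next;
                index = 0;
            }
        }

        skipping_iterator &operator++()
        {
            ++index;
            advance_past_erased();
            return *this;
        }

        T &operator*() const { return b->elements[index]; }
    };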
There are cases where vector is not the right solution, but yeah, in my experience too it fits most use cases. Also: http://bannalia.blogspot.fr/2015/06/cache-friendly-binary-search.html I remember Stroustrup making the same statement at the Going Native conference too, and it's also a common observation in game-dev circles (several other CppCon talks discussed it as well).
Yes, I feel it's a real shame that more developers don't understand these basic principles. In fact, one of the reviewers for the cppcon talk felt I didn't understand the "performance guarantees" of std:: containers - which is ironic, as what he means is "complexity guarantees", which in no way relate to performance, because of the aforementioned problem of cache saturation.
On 16 July 2015 at 02:36, Soul Studios wrote:
Yes, the iterators simply 'jump' over the erased element areas. It's very fast. I am glad for your support, but I'm not sure why you think it shouldn't be used with multithreading - it follows the same rules for multithreading as the std:: containers, i.e. concurrent reads are fine, writes must be serialized.
That's not what I meant: I'm saying that you are right to follow the way the STL containers work and not add any multithreading support _inside_ the containers themselves. Basically, I believe that you did it the right way.
That's not what I meant: I'm saying that you are right to follow the way the STL containers work and not add any multithreading support _inside_ the containers themselves. Basically, I believe that you did it the right way.
Ah, I understand what you mean now - Thank you. Matt
Hi there - the Visual Studio compiler information appears to be out of date or inaccurate in 1.59: Visual Studio 2012 and upwards supports <type_traits>, anything less does not. The appropriate macro defines for this should be added to visualc.hpp.
On Thu, Aug 13, 2015 at 7:58 PM, Soul Studios wrote:
Visual Studio compiler information appears to be out of date or inaccurate in 1.59: Visual Studio 2012 and upwards supports <type_traits>, anything less does not. The appropriate macro defines for this should be added to visualc.hpp.
Which macro should be in config/compiler/visualc.hpp? Right now boost/config/stdlib/dinkumware.hpp has BOOST_NO_CXX11_HDR_TYPE_TRAITS defined for Dinkumware versions 520 and lower - i.e. that macro would not be defined for VC11 and higher. Glen
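For context, this is the usual pattern for consuming that Boost.Config macro (a sketch; my_is_integral is an invented name): include <boost/config.hpp> and fall back to Boost.TypeTraits wherever the standard header is reported missing.

    #include <boost/config.hpp> // pulls in the compiler *and* stdlib config

    // Fall back to Boost.TypeTraits when <type_traits> is reported
    // missing (e.g. Dinkumware 520 and lower, i.e. pre-VC11):
    #ifdef BOOST_NO_CXX11_HDR_TYPE_TRAITS
        #include <boost/type_traits/is_integral.hpp>
        template <typename T>
        struct my_is_integral : boost::is_integral<T> {};
    #else
        #include <type_traits>
        template <typename T>
        struct my_is_integral : std::is_integral<T> {};
    #endif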
On 14/08/2015 1:15 p.m., Glen Fernandes wrote:
Which macro should be in config/compiler/visualc.hpp?
Right now boost/config/stdlib/dinkumware.hpp has BOOST_NO_CXX11_HDR_TYPE_TRAITS defined for Dinkumware versions 520 and lower - i.e. that macro would not be defined for VC11 and higher.
Okay, I'm no good at understanding the sprawl of dependencies which constitute boost's macro defines, but I'll take your word for it. Why that particular C++11 macro define is in another file, while almost all the others are defined in the compiler-specific hpp, is beyond comprehension.
On 14/08/2015 9:47 p.m., Soul Studios wrote:
On 14/08/2015 1:15 p.m., Glen Fernandes wrote:
Right now boost/config/stdlib/dinkumware.hpp has BOOST_NO_CXX11_HDR_TYPE_TRAITS defined for Dinkumware versions 520 and lower - i.e. that macro would not be defined for VC11 and higher.
Okay, I'm no good at understanding the sprawl of dependencies which constitute boost's macro defines, but I'll take your word for it. Why that particular C++11 macro define is in another file, while almost all the others are defined in the compiler-specific hpp, is beyond comprehension.
Standard-library-related macros are defined in the standard library configuration header (i.e. config/stdlib, not config/compiler). For VC, this is dinkumware.hpp. That particular macro is not special; many library-specific macros are defined in that header:
- BOOST_NO_CXX11_HDR_INITIALIZER_LIST
- BOOST_NO_CXX11_HDR_ATOMIC
- BOOST_NO_CXX11_STD_ALIGN
etc.
Language-specific macros are defined in the config/compiler headers - such as:
- BOOST_NO_CXX11_DELETED_FUNCTIONS
- BOOST_NO_CXX11_UNIFIED_INITIALIZATION_SYNTAX
- BOOST_NO_CXX11_TEMPLATE_ALIASES
etc.
You should only ever include <boost/config.hpp>, which pulls in the correct compiler and standard library configuration headers for you.
On 14/08/2015 1:15 p.m., Glen Fernandes wrote:
Which macro should be in config/compiler/visualc.hpp?
Right now boost/config/stdlib/dinkumware.hpp has BOOST_NO_CXX11_HDR_TYPE_TRAITS defined for Dinkumware versions 520 and lower - i.e. that macro would not be defined for VC11 and higher.
Glen
Thanks BTW
participants (5)
- Glen Fernandes
- Gonzalo BG
- Klaim - Joël Lamotte
- Niall Douglas
- Soul Studios