Final benchmark graphs for Colony vs std:: containers now available
Re-wrote my benchmark code from scratch a month or two ago with some help from Baptiste Wiche, and have run it over std::list, std::map, std::multiset, std::deque, std::vector and, for GCC, Chris's segmented_tree implementation. To be honest there's not much point comparing against segmented_tree, as the goals are totally different (efficient random-access insertion for segmented_tree, versus efficient unordered erasure/insertion/iteration without pointer invalidation for colony), but there was some interest last time I wrote here. In addition I compared against other methods for ensuring pointer/index validity using deques and vectors, to see how those pan out.

Graphs are available here: http://www.plflib.org/colony.htm#benchmarks and here: http://www.plflib.org/stack.htm There are also MSVC 2013 results linked to in the first paragraph of the benchmarks.

Comments are welcome, but please read the page first. Stack performs particularly well, outperforming all alternatives in both GCC and MSVC.
I think that too much information is missing from the benchmark setup which makes the results hard to interpret (see below for some of my questions). A good rule of thumb for how much information to include when describing the setup of a benchmark is that anybody should, based solely on the information provided, be able to re-implement the benchmarks and get the same (or very similar) results.
The first test measures time to insert N elements into a given container, the second measures the time taken to erase 25% of those same elements from the container, and the third test measures iteration performance after the second test has taken place.
Note: because plf::colony is an unordered container and subsequently we are not concerned about insert order, push_front has been used with std::list instead of push_back, in order to provide a fair performance comparison.
Are you using "push_back"/ "insert(end, N, T())" on vector and deque? Reserving memory upfront? Is the final size greater than the capacity before the insertion?
For example, if the containers initially have zero size and capacity prior to the insertion, I cannot understand how anything can be faster than "std::vector::vector(N, T())". If the initial capacity is larger than the container size after the insertion, I cannot understand how anything can be faster than "std::vector::insert(end, N, T())". Both of these cases are very relevant to the game development application you mention, since there one typically has an upper bound on the number of elements within a container (like a maximum number of entities).
Erase Performance
The curve for vector and deque looks like O(N^2): are you using erase-remove-if?
Hi Gonzalo, I take your points, but the source code for the tests is listed on the site: http://www.plflib.org/colony.htm#download You can look there if you have further questions.

Regards vector insertion, I think you need to check your assumptions there. It is not a fast insertor, due to the fact that it reallocates to a new memory block upon expansion past capacity. Deque does not do this. Nor does colony.

Regards insertion, it is all the fastest method available to that class, so push_back for most, push_front for list. As noted at the top of the benchmarks, "I have not included results involving 'reserve()' functions as the differences to overall insertion performance were not adequate." This is true, and what's also true is that I find reserve irrelevant. If you know the size you need, use an array, that's my philosophy.
Hi Gonzalo, I take your points, but the source code for the tests is listed on the site: http://www.plflib.org/colony.htm#download You can look there if you have further questions.

Downloading a zip is a higher barrier of entry than what is usually expected. It would help the feedback if the source code were browsable online. Boost uses github, and other libraries being submitted do as well, but that is not the only alternative. This would probably increase the amount of feedback.

Regards vector insertion I think you need to check your assumptions there. It is not a fast insertor due to the fact that it reallocates to a new memory block upon expansion past capacity. Deque does not do this. Nor does colony.

If the vector is initially empty, my assumption is "nothing can beat vector's sized constructor". The text just says that N elements are inserted into a container. Is the container empty before the insertion? If not, how many elements does it have? Is it the same number of elements for all N? Important information for understanding the benchmarks is missing _in the text_. IMO "download this zip file and take a look at the source code" is not a good solution to this problem. A better solution would be to detail in the text exactly what the status of the containers is size-wise (and capacity-wise where appropriate) before and after insertion and erasure, as well as where the range of N elements is being inserted for each container and how, how the elements are being erased, etc. This could be done in a table and would _significantly_ help understanding of the results.

I explicitly asked in my previous email what the status of the size and capacity is before/after insertion. The text doesn't mention this, and it is useful information to know. I am not arguing about whether one should consider or ignore the capacity for, e.g., std::vector. I am just arguing that this information is _required_ to help the reader properly understand the benchmark results.

Regards insertion, it is all the fastest method available to that class, so push_back for most, push_front for list. As noted at the top of the benchmarks, "I have not included results involving 'reserve()' functions as the differences to overall insertion performance were not adequate." This is true, and what's also true is that I find reserve irrelevant. If you know the size you need, use an array, that's my philosophy.

No, it is not. From the source code it looks like you use push_back for vector when inserting a range of N elements at the end instead of vector::insert(end, N, T()) or vector::insert(end, begin, end). This is why I would like to see a table in the text indicating which method is used, because this is relevant performance-wise.

Since the text is omitting relevant details, to me as a reader it feels like it is trying to hide something. It is not about whether reserve is relevant or not. There are many use cases where one doesn't have a meaningful value for reserve. It is more about giving the reader all the information it needs to make sense of the results and decide for itself whether colony is appropriate for its use case.

You haven't answered my question about erasure, but from the source code it seems that the benchmarking code is erasing single elements of a vector and a deque, which is N^2 and something nobody would do in a real performance-sensitive application. In my opinion a fair (and realistic) erasure benchmark should use erase-remove_if, which is O(N) (and can also be used for all containers).

In a nutshell, the thing that I think would benefit the benchmark text the most is including more details about what exactly the benchmarks are doing and how (a table would probably suffice). Making the code browsable online would certainly help with getting more feedback and improving the benchmarks to further pinpoint the best use cases of colony. Choosing the range insert methods for the containers that offer them, as well as erase-remove_if for the erasure benchmarks, is something that is worth exploring (since in the use cases considered nobody would use vector and deque that way).
Downloading a zip is a higher barrier of entry than what is usually expected. It would help the feedback if the source code would be browsable online. Boost uses github and other libraries being submitted do as well but that is not the only alternative. This would probably increase the amount of feedback.
You said you were concerned about people's ability to replicate the tests. I pointed out they have that ability. I don't consider a zip file a barrier but I accept it may be easier for some people.
The text just says that N elements are inserted into a container. Is the container empty before the insertion? If not, how many elements does it have? Is it the same number of elements for all N?
You say you've read the source code, so I take it you're using this as a way of pointing out what you think needs to be communicated, and that's fine. But for the record, we are starting with empty containers and inserting singly on the fly. This replicates typical game engine code and the scenario which colony is made for, described throughout the benchmarks as fast insertion/erasure in realtime without invalidating pointers/iterators/indexes. Single insertion is also the only insertion criterion for any insertion benchmark I've seen, so I personally feel it's a logical leap to say it needs a large amount of explanation, but I accept that you feel it should be made clearer.
Regards insertion, it is all the fastest method available to that class,
No, it is not. From the source code it looks like you use push_back for vector when inserting a range of N elements at the end instead of vector::insert(end, N, T()) or vector::insert(end, begin, end). This is why I would like to see a table in the text indicating which method it is used because this is relevant performance-wise.
That doesn't typically replicate game engine code, as it assumes you know the number of elements in advance and have all the data in advance, in which case, all the AAA devs I have talked to or listened to have said, if you know the size in advance and already have all the data, you would use a C-style array for performance/vectorising/compatibility reasons. Watch the SG14 and Mike Acton CppCon talks for more info.
Since the text is omitting relevant details to me as a reader it feels like it is trying to hide something.
I don't feel this is a polite or appropriate thing to say. There is a lot of information and a LOT of writing in between those benchmarks; I'm sorry that it doesn't include the specific information you are interested in, but that's not a reason to go making assumptions.
You haven't answered my question about erasure, but from the source code it seems that the benchmarking code is erasing single elements of a vector and a deque which is N^2 and something nobody would do in a real performance-sensitive application.
Actually this happens in game engines a lot, but a raw unmodified vector or deque wouldn't be used in this context because of the performance hit you've mentioned. What I'm doing in those first tests is measuring raw performance for comparison in later situations where the vectors etc. are modified to avoid pointer/index invalidation, as described in the text.

In the situations where colony would be used - where pointer/index validity matters, i.e. the bulk of situations for game engine code - remove_if will invalidate pointers and indexes to elements within the container, and is not an appropriate solution by itself. In the context of modified containers, a remove_if-*like* command might be used, but only when the modification prevents invalidation via the erase command, more than a few erasures are happening per frame, the data set is reasonably large, and the erase command is expensive. In that case, an additional flag would have to be added to the stored structure or the modified container, as conditions for erasure are not likely to be the same between elements. Then a remove_if-like command would fire at end-of-frame.

As mentioned in detail, the first, raw benchmarks are ONLY to measure which containers might be useful, modified or otherwise, in comparative tests. But I grant that the above is probably something most people would not know, and should possibly be included. At some point in the future, if I have the time, I may construct a variant of one of the modification tests featuring extremely high modification and compare using a remove_if-like command. That is the only context within which it might be useful.
In a nutshell, the thing that I think would benefit the benchmark text the most is including more details about what exactly are the benchmarks doing and how (a table would probably suffice).
I think this is a reasonable point, but I think for the most part this is pretty clear. I take your point that a differentiation as to which erasure method is being used and why is appropriate, but I don't agree with you about the insertion details. There's one benchmark where data is re-inserted back in, and that is clearly labelled, and toward the end of all tests. Regardless, if I update the page, which I might, I will include the specification that we are inserting into empty containers, though I fail to see how the initial insertion benchmarks would be useful or rational if we weren't.

Making the code browsable online would certainly help with getting more feedback and improving the benchmarks to further pinpoint the best use cases of colony. Choosing the range insert methods for the containers that offer them as well as erase-remove_if for the erasure benchmarks is something that is worth exploring (since in the use cases considered nobody would use vector and deque that way).
In a nutshell, the thing that I think would benefit the benchmark text the most is including more details about what exactly are the benchmarks doing and how (a table would probably suffice). Making the code browsable online would certainly help with getting more feedback and improving the benchmarks to further pinpoint the best use cases of colony. Choosing the range insert methods for the containers that offer them as well as erase_remove_if for the erasure benchmarks is something that is worth exploring (since in the use cases considered nobody would use vector and deque that way).
In case you're interested, there are now remove_if erasures included in comparative benchmarks, and the modification benchmarks. M
On 8 May 2016 at 03:58, Soul Studios
In case you're interested, there are now remove_if erasures included in comparative benchmarks, and the modification benchmarks. M
What would be interesting to see is a benchmark against TBB's (Threading Building Blocks) concurrent_vector (and maybe some of the other containers in that library), as this has a similar growth strategy (and other characteristics) but allows for (some) concurrent access...

Benchmarking anything against std::deque on Windows is a rather futile exercise, as the implementation is broken: the result of a maximum chunk-size of 16 bytes (no, no typo), or the size of the (one) object if larger. Changing this is on the M$ to-do list, but will not feature until a major version change (source: STL)...

degski
What would be interesting to see is a benchmark against TBB's (Threading Building Blocks) concurrent_vector (and maybe some of the other containers in that library), as this has a similar growth strategy (and other characteristics) but allows for (some) concurrent access...
Actually tbb concurrent_vector doesn't allow erasures other than clear(), so it wouldn't be suitable for comparisons in the situations where you'd use a colony. Interesting structure though. A colony should be able to have more concurrent accesses than an equivalently-multithreaded vector, as the block-based approach means you can have mutexes on individual blocks rather than the whole thing. Also some reads and writes can occur at the same time.
Benchmarking anything against std::deque on Windows is a rather futile exercise, as the implementation is broken: the result of a maximum chunk-size of 16 bytes (no, no typo), or the size of the (one) object if larger. Changing this is on the M$ to-do list, but will not feature until a major version change (source: STL)...
Yes, I noticed the MSVC deque performance results were weak - that's a pretty extraordinarily bad implementation! Wow. Hence the main benchmarks are in GCC. I've noticed deque is actually better than vector under GCC in most circumstances, though I don't know how that holds up under vectorisation. M@
On 10 May 2016 at 02:01, Soul Studios
Actually tbb concurrent_vector doesn't allow erasures other than clear(), so it wouldn't be suitable for comparisons in the situations where you'd use a colony. Interesting structure though.
You're obviously right.
...you can have mutexes on individual blocks rather than the whole thing. Also some reads and writes can occur at the same time.
Is there currently, in your implementation, a way to know which block an item belongs to (or to get the size of its block, which I guess comes down to the same), i.e. to know which mutex should be considered? If not, could that be a future feature?
Yes, I noticed the MSVC deque performance results were weak, that's a pretty extraordinarily bad implementation! Wow.
I would think it's historical and hasn't been changed since the beginning of M$-time to preserve binary backward compatibility (which, from what I was made to understand, a change of chunk-size entails).

It's a shame actually that the chunk-size is not user-settable (as per the standard, from what I understand) for a std::deque or boost::deque, the latter with a chunk size of 512b last time I looked. I always use the boost one... maybe on Windows (with VC) you should test against boost::deque instead, in order to get some sensible data...

I see your implementation gives control over the first block on construction (and the rest then follows the growth policy), NICE! I like it!

degski
...you can have mutexes on individual blocks rather than the whole thing. Also some reads and writes can occur at the same time.
Is there currently, in your implementation, a way to know which block an item belongs to (or to get the size of its block, which I guess comes down to the same), i.e. to know which mutex should be considered? If not, could that be a future feature?
There's no way currently. I'd be more inclined to implement a separate, 100% threadsafe version specifically, for both performance reasons and to not expose unnecessary details to the end user.
It's a shame actually that the chunk-size is not user-settable (as per the standard from what I understand) for a std::deque or boost::deque, the latter with a chunk size of 512b last time I looked. I always use the boost one... maybe on Windows (with VC) you should test against the boost::deque instead, in order to get some sensible data...
Might be a good idea for a future test, but I think the GCC results cover it more-or-less for now- plus it's good for people to know how terrible some of MS/Dinkumware's implementations are :)
I see your implementation gives control over the first block on construction (and the rest then follows the growth policy), NICE! I like it!
Yes, it also gives control over maximum block size, which might be of use in some streaming applications where the cache size is known, such as games, or in memory-constrained environments. Plus I've just finished creating a proper reserve function, though it shouldn't be online for another few days.
On 12 May 2016 at 03:59, Soul Studios
...threadsafe version specifically, for both performance reasons and to not expose unnecessary details to the end user.
'Unnecessary' is a qualification that is a function of one's objectives...
Might be a good idea for a future test, but I think the GCC results cover it more-or-less for now- plus it's good for people to know how terrible some of MS/Dinkumware's implementations are :)
Yes, bashing Dinkumware is fun, but they might have been given rules about not breaking binary compatibility. I do agree, though, that some of their stuff is horrible. For this one they might have an excuse. The std::random lib is another really horrible one, and since it's new, this time there's no excuse. boost::random (a drop-in replacement) is much, much faster (and more extensive - for games, taus88 is the one in my view).

The git repository you reference in another post seems to be behind the download referenced on your site, 3.05 vs 3.07.

The web-site states plf::colony is a vector-like container; I'm missing the indexes - iterating is good in some cases, but not all. I've implemented a (simple) Concurrent Graph Library, based on ideas (the base structure) borrowed from the LEMON Graph Library, using tbb::concurrent_vector instead of std::vector. This comes with some limitations, as you correctly pointed out. You did give me some good ideas to actually make that CGL better and more flexible. I've read your paper on jump-counting skipfields; there are some fun ideas in there that I would like to explore as well.

degski
'Unnecessary' is a qualification that is a function of ones' objectives...
Well, it's unnecessary to the vast majority of users then, much as exposing deque's inner workings would be for the vast majority of users.
The git repository you reference in another post seems to be behind the download referenced on your site, 3.05 vs 3.07.
Yes, aware of that.
The web-site states plf::colony is a vector-like container, I'm missing the indexes, iterating is good in some cases but not all.
Yes, as per the graphs, compared to a completely consolidated memory space iteration will be slower, but in many cases the time savings in insertion and erasure make up for it. Some of the text on that website needs to be updated :)
I've implemented a (simple) Concurrent Graph Library, based on ideas (the base structure) borrowed from the LEMON Graph Library, using tbb::concurrent_vector, instead of std::vector. This comes with some limitations as you correctly pointed out. You did give me some good ideas to actually make that CGL better and more flexible. I've read your paper on jump-counting-skipfields, there are some fun ideas in there that I would like to explore as well.
Good to hear and good to know- Thank you- Matt
On 13 May 2016 at 07:30, Soul Studios
...iteration will be slower, but in many cases the time savings in insertion and erasure make up for it. Some of the text on that website needs to be updated :)
I've implemented a (simple) Concurrent Graph Library, based on ideas (the base structure) borrowed from the LEMON Graph Library, using tbb::concurrent_vector instead of std::vector. This comes with some limitations, as you correctly pointed out. You did give me some good ideas to actually make that CGL better and more flexible. I've read your paper on jump-counting skipfields; there are some fun ideas in there that I would like to explore as well.
I wasn't going on about indexes because of speed concerns regarding iterators. Node/arc/edge IDs ARE the indexes into the node/arc vectors. This (indexes) idea makes the LEMON Graph Library much faster than BGL (it's also easier to use, by the way - no need to read a book). Sure, one could store iterators, but given the iterators' size, this "solution" would make the whole thing rather cache-unfriendly as compared to just storing uint32_t's. degski
I wasn't going on about indexes because of speed concerns regarding iterators. Node/arc/edge IDs ARE the indexes into the node/arc vectors. This (indexes) idea makes the LEMON Graph Library much faster than BGL (it's also easier to use, by the way - no need to read a book). Sure, one could store iterators, but given the iterators' size, this "solution" would make the whole thing rather cache-unfriendly as compared to just storing uint32_t's.
Why would you store iterators? Just store a pointer. And no, indexes aren't faster for referencing than pointers, even with std::vector - I've done the benchmarks.
On 17 May 2016 at 04:25, Soul Studios
And no, indexes aren't faster for referencing than pointers, even with std::vector - I've done the benchmarks.
My point was, you don't have an index interface in plf::colony, and iterating (which is what plf::colony provides) over the entire plf::colony is not very useful, not to say almost meaningless. The LEMON Graphs resemble a doubly linked list implemented on top of a std::vector. The range of uint32_t is large enough to cater for any DAG imaginable. A 64-bit app would need twice the storage using pointers, which would affect cache locality. degski
My point was, you don't have an index interface in plf::colony, and iterating (which is what plf::colony provides) over the entire plf::colony is not very useful, not to say almost meaningless.
I'm sorry, but that statement has no basis in reality.
The LEMON Graphs resemble a doubly linked list implemented on top of a std::vector. The range of uint32_t is large enough to cater for any DAG imaginable. A 64-bit app would need twice the storage using pointers, which would affect cache locality.
I'm not actually interested in your specific use-case, but as I said, in actual benchmarks, despite the doubling of necessary storage for a pointer, the reference is faster than indexing, and cache locality is not adversely affected. I've already said this, so there was no need for me to repeat it. And to be honest, you seem to be interested in a one-upmanship competition, and I don't do those, so I'm going to bow out. Best of luck with your life.
ps. This stuff is also online as part of the SG14 repository: https://github.com/WG21-SG14/SG14
participants (3)
-
degski
-
Gonzalo BG
-
Soul Studios