[Block Pointer] Up to 600% faster than Shared Pointer
- I made the proxy explicit. Now it can be used 3 different ways:
1) Clothesline style:
block_proxy x;
block_ptr<int> v = block_ptr<int>(x, new block<int>(11));
block_ptr<int> w = block_ptr<int>(x, new block<int>(12));
2) Root style:
proxy_ptr<int> u(new block<int>());
3) Container style (very easy to write containers now):
struct list : block_proxy
{
    public:
        list() : root(*this) {}
        [...]
    private:
        block_ptr<node> root;
};
On 03/11/2016 11:12 PM, Glen Fernandes wrote:
On Fri, Mar 11, 2016 at 9:38 PM, Phil Bouchard
wrote: new: auto_ptr: 23174753 ns shared_ptr: 49726615 ns block_ptr: 7812659 ns
Hi Phil,
What exactly are you measuring here? The time taken to construct an std::auto_ptr<T>/std::shared_ptr<T>/block_ptr<T> inclusive of the time taken for ::operator new(std::size_t) and inclusive of the value-initialization T::T() ?
Yes but it also calls ::operator delete(void *) and T::~T().

If I don't use fastblock<>() then it's a simple 150% speedup compared to shared_ptr<>:

new:
auto_ptr:   22279670 ns
shared_ptr: 47428316 ns
block_ptr:  31287440 ns

My point with fastblock<>() is that it can be used very easily because it is already integrated.
On 03/12/2016 01:06 PM, Glen Fernandes wrote:
On Sat, Mar 12, 2016 at 12:37 PM, Phil Bouchard wrote:
On 03/11/2016 11:12 PM, Glen Fernandes wrote:
Hi Phil,
What exactly are you measuring here? The time taken to construct an std::auto_ptr<T>/std::shared_ptr<T>/block_ptr<T> inclusive of the time taken for ::operator new(std::size_t) and inclusive of the value-initialization T::T() ?
Yes but it also calls ::operator delete(void *) and T::~T().
If I don't use fastblock<>() then it's a simple 150% speedup compared to shared_ptr<>:
new: auto_ptr: 22279670 ns shared_ptr: 47428316 ns block_ptr: 31287440 ns
My point with fastblock<>() is that it can be used very easily because it is already integrated.
Since everyone prefers make_shared<T>(args...) when possible instead of shared_ptr<T>(new T(args...)), I'm not sure that the benchmark of the latter is very useful.
Rather, given that the preferred use of shared_ptr<T> is with make_shared() or allocate_shared(), I think you should at least include that benchmark in this list.
Otherwise you're essentially measuring the cost of 2 ::operator new(std::size_t) calls (that most people wouldn't write) instead of the 1 ::operator new(std::size_t) call (that most people would write).
Good point. If I compare make_shared<>() with new block<>() then the speeds are almost the same:

make:
auto_ptr:   25614443 ns
shared_ptr: 26603216 ns <-

new:
auto_ptr:   23529804 ns
shared_ptr: 49876800 ns
block_ptr:  29975876 ns <-

If I compare make_shared<>() with new fastblock<>() I have a 300% speedup:

make:
auto_ptr:   25672479 ns
shared_ptr: 27077638 ns <-

new:
auto_ptr:   22647604 ns
shared_ptr: 53317702 ns
block_ptr:  8911600 ns <-
You might as well start replacing auto_ptr with unique_ptr, just to honor the fact that the former is deprecated. :-)
I quickly tried to, but for some reason auto_ptr<> is still the one dangling around in my GCC API.

Regards,
-Phil
On 03/12/2016 05:18 AM, Rob Stewart wrote:
On March 11, 2016 9:38:46 PM EST, Phil Bouchard
wrote: - I made the proxy explicit. Now it can be used 3 different ways:
1) Clothesline style:
block_proxy x;
block_ptr<int> v(x, new block<int>(11));
block_ptr<int> w(x, new block<int>(12));
2) Root style:
proxy_ptr<int> u(new block<int>());
3) Container style (very easy to write containers now):
struct list : block_proxy
{
    public:
        list() : root(*this) {}
        [...]
    private:
        block_ptr<node> root;
};
I simplified the syntax of the first two examples, but this still seems complicated given the various components and ways to combine and use them. I don't even know the point.
1) and 3) are slightly more complicated, but they'll be used by library writers (to explicitly assign a proxy to a pointer).
- The end user will use simple assignment operations like:
t100.sub_[1].second = new block
- If we use fastblock<>() then the speedup can go up to 600% compared to shared_ptr<>:
That's yet another piece to the complexity equation.
If people write game engines then I think they will understand this part easily ;)
- Once again all the examples can be found here: https://github.com/philippeb8/block_ptr/tree/master/example
benchmark.cpp seems flawed for numerous reasons. You measure usage of the make function through a function pointer. You don't eliminate cold cache effects on timing each variation. You are timing free store allocations and deallocations, which can have unknown side effects. You aren't accounting for code elision by the optimizer.
I am not accounting for any optimization, especially for block_ptr<>, because I don't even use move semantics yet, etc.
- And the pointer itself here: https://github.com/philippeb8/block_ptr/blob/master/include/boost/smart_ptr/...
block_proxy provides little or no safety. First, destroying can be changed through the public interface, so why bother with the accessor/mutator pair? Second, init() can be called with any pointer of the required type. Does that type enforce the right semantics? I don't know. (The comment refers to stack pointers by which, again, I think you mean pointers that may belong to an object and can be on the free store as well, so the comment is misleading.)
block_proxy is still in beta and currently in a "just works" state. But eventually it should be protected like "type_info".
The loop in block_proxy::release() seems odd. First, are you purposely avoiding C++11? Range-based for would be better. You could also use for_each() with a lambda. As it is, why both m and n?
Thanks for pointing this out but like I was saying it's still in a beta state. It's possible for me to re-add the unification of sets but I don't think it's going to be worth it.
I'd make more comments, but your comments often don't match the functions they adorn and the code is too hard to follow on this small screen. (Long lines and excess whitespace don't help.)
Thanks a lot for your comments; I'd rather know this right now than after the whole revision process.

Honestly, I've been analyzing pretty much all the possibilities and I don't see any other efficient option out there that fits well with the C++ standard.

Regards,
-Phil
On 03/12/2016 02:03 PM, Rob Stewart wrote:
My point is that there are many components and techniques needed to use your library, which makes it complex and, I imagine, will lead to confusion.
I guess the "container style" one could be replaced easily by the "root style":

struct list : block_proxy
{
    public:
        list() : root(*this) {}
        [...]
    private:
        block_ptr<node> root;
};

With:

struct list
{
    public:
        list() {}
        [...]
    private:
        proxy_ptr<node> root;
};

Please forget about the "container style" then.
I am not accounting for any optimization, specially for block_ptr<> because I don't even use move semantics yet, etc.
If that's the case, don't bother reporting comparisons between the different components and approaches. Unless you're measuring the right things correctly, the measurements are mostly useless.
Ok.
On 03/11/2016 09:38 PM, Phil Bouchard wrote:
- Once again all the examples can be found here: https://github.com/philippeb8/block_ptr/tree/master/example
- And the pointer itself here: https://github.com/philippeb8/block_ptr/blob/master/include/boost/smart_ptr/...
I just did a heavy code cleanup and I took the stack / heap detection
code out because it is now based on explicit proxies.
Also block_ptr<> is now distinct from the allocator used.
Finally the class fastblock<> is properly abstracted and it is easy to
add user defined block types just like this one:
template <typename T>
class fastblock : public block<T>
On 03/12/2016 09:22 PM, Glen Fernandes wrote:
On Sat, Mar 12, 2016 at 9:08 PM, Phil Bouchard wrote:
Also block_ptr<> is now distinct from the allocator used.
Finally the class fastblock<> is properly abstracted and it is easy to add user defined block types just like this one:
template <typename T>
class fastblock : public block<T>
Hi Phil,
To add to our earlier discussion, you should also include the benchmark for allocate_shared() with fast_pool_allocator.
1. shared_ptr (new)
2. shared_ptr (make_shared)
3. shared_ptr (allocate_shared with fast_pool_allocator)
4. block_ptr
5. block_ptr (fastblock)
6. unique_ptr/auto_ptr (new/make_unique)
Good point, I was not aware it could be used this way. I just added fast_pool_allocator<>() to the benchmark, so now the comparison is fair and block_ptr<> is still faster by 125%:

make:
auto_ptr:   27539691 ns
shared_ptr: 31368593 ns

alloc:
shared_ptr: 10995612 ns <-

new:
auto_ptr:   22648835 ns
shared_ptr: 48179069 ns
block_ptr:  8775982 ns <-

Not to mention the code is more elegant using operator new.

-Phil
On 03/13/2016 12:00 AM, Glen Fernandes wrote:
On Sat, Mar 12, 2016 at 10:12 PM, Phil Bouchard wrote:
On 03/12/2016 09:22 PM, Glen Fernandes wrote:
To add to our earlier discussion, you should also include the benchmark for allocate_shared() with fast_pool_allocator.
Good point, I was not aware it could be used this way. I just added fast_pool_allocator<>() to the benchmark so now the comparison is fair and block_ptr<> still is faster by 125%:
Hi Phil,
The other thing I observed in your benchmarks is the distinction between 'new T' and 'new T()'. i.e. make_shared() and allocate_shared() (and make_unique() too) all value-initialize. 'new T' (and T is 'int' in your example) is default-initialization (while 'new T()' would be value-initialization).
You can use boost::make_shared_noinit() and boost::allocate_shared_noinit() (and boost::make_unique_noinit() also) if you want default-initialization. Or you can use 'new T()' instead of 'new T' and have value-initialization.
There is not much difference; block_ptr<> is still 132% faster than shared_ptr<>:

make:
auto_ptr:   25564043 ns
shared_ptr: 26421429 ns

make alloc:
shared_ptr: 11725542 ns

make alloc noinit:
shared_ptr: 11665724 ns <-

new:
auto_ptr:   23546857 ns
shared_ptr: 49000954 ns
block_ptr:  8791379 ns <-
(I haven't looked at the implementation of block_ptr yet, or the design, or the purpose of it, I just saw the thread on benchmarks and examined your benchmark program source).
I keep cleaning up the code and now I think it is very easy to understand:

proxy_ptr<node> x;
block_ptr<node> v = block_ptr<node>(x, new block<node>(x));
v->next = v;

You have 2 types of pointers: proxy_ptr<> and block_ptr<>. proxy_ptr<> is the root of the tree block_ptr<> will build on. When a proxy_ptr<> gets deleted, all block_ptr<>s associated with it will be wiped out as well. The purpose of block_ptr<> is to delete cyclic pointers.
On Sun, Mar 13, 2016 at 11:14 AM, Phil Bouchard wrote:
I keep cleaning up the code and now I think it is very easy to understand:
I quickly threw together a more tidy basis for your benchmark; you can find the code at: https://github.com/philippeb8/block_ptr/issues/1

I tested it on g++ 5.3. You can compile and run with:

g++ -std=c++14 -O2 -o benchmark benchmark.cpp && ./benchmark

Results on one machine:

unique_ptr (new):         81.2868
unique_ptr (make_unique): 83.2583
shared_ptr (new):         139.141
shared_ptr (make_shared): 84.8653

It might address one of the concerns Rob had with your benchmark (e.g. your use of function pointers). It should be trivial for you to add the additional cases: shared_ptr (allocate_shared with fast_pool_allocator) and block_ptr.

Glen
On 03/13/2016 11:24 AM, Glen Fernandes wrote:
Results on one machine:
unique_ptr (new):         81.2868
unique_ptr (make_unique): 83.2583
shared_ptr (new):         139.141
shared_ptr (make_shared): 84.8653
It might address one of the concerns Rob had with your benchmark (e.g. your use of function pointers).
I just overwrote the benchmark and now I have:

unique_ptr (new):                      47.7686
unique_ptr (make_unique):              46.8545
shared_ptr (new):                      77.8261
shared_ptr (make_shared):              50.8072
shared_ptr (make_shared_alloc_noinit): 33.021
block_ptr (new):                       69.6554

-Phil
On 03/13/2016 12:51 PM, Glen Fernandes wrote:
On Sun, Mar 13, 2016 at 12:44 PM, Phil Bouchard wrote:
unique_ptr (new):                      47.7686
unique_ptr (make_unique):              46.8545
shared_ptr (new):                      77.8261
shared_ptr (make_shared):              50.8072
shared_ptr (make_shared_alloc_noinit): 33.021
block_ptr (new):                       69.6554
Excellent.
The wording you want to use inside the parenthesis is better expressed as "allocate_shared_noinit" (instead of "make_shared_alloc_noinit").
Thanks for your help.

So block_ptr<> wouldn't be a drop-in replacement for shared_ptr<> but an add-on for:
- complex containers,
- neural networks,
- C# / Java / Javascript engines,
- etc.

Regards,
-Phil
participants (3)
- Glen Fernandes
- Phil Bouchard
- Rob Stewart