On 27/01/2015 15:08, Niall Douglas wrote:
On 27 Jan 2015 at 10:58, Gavin Lambert wrote:
There are negative performance consequences to copying a shared_ptr (i.e. incrementing and later decrementing its refcount). *Most* applications don't need to care about this (the cost is very small), but sometimes it's worth noting, and there's no harm in avoiding copies in silly places (which is why I thwack people who pass a shared_ptr as a value parameter).
As food for thought: AFIO, which uses shared_ptr very heavily indeed to avoid any locking at all, passes them around by value everywhere. It was bugging me whether this was costing me performance, so I tried replacing the lot with reference semantics.
Total effect on performance: ~0.1%.
As I said, it's not a big difference (atomic ops are typically ~1us, and that was on the previous CPU generation), but it's still one of my pet peeves: while there are many places where a shared_ptr does need to be copied for correctness, parameter passing is not one of them. (Performance gets worse if you end up passing the object through many layers as part of keeping methods short or similar "tidiness" or abstraction guidelines, and it wastes more stack too.) That said, you're going to have to make lots of copies anyway in an asynchronous library like AFIO, because binding an asynchronous callback is one of the places where you *do* need to copy a shared_ptr. So if a high percentage of your code is async (which is what I would expect from that sort of library), it's not going to make much difference either way.
The key is that AFIO very, very rarely has more than one thread touch a given shared_ptr at once. On Intel at least, that makes the atomic reference counting almost as cheap as non-atomic reference counting. Combine that with the compiler judiciously eliding copies where it can, and the overhead is irrelevant next to the benefits for debugging and maintenance.
Writing to a single shared_ptr instance from multiple threads requires even more overhead, from the extra spinlock behind the atomic_*(&sp, ...) family of functions. Though an uncontended spinlock basically only costs two atomic ops, so it's usually not too bad. (Those functions do mildly irritate me in that they also pass by value, but at least in that case they're inlined template functions, so the compiler will almost certainly elide the parameter copy. Another case where generic library code may "win" over application code.)

Multiple writers is one case where it may be better to create separate per-thread copies up front from some "safe" context, if you can (assuming you're OK with operating on stale data until some sync point). But again, to a certain extent async code patterns may already be doing these copies "for you". And if you're limiting yourself to WORM (write-once, read-many) access, you can skip the spinlock if you're careful.
Of course, I'm currently seeing an average of ~300k CPU cycles per op. shared_ptr overhead is tiny compared to that. At 10k CPU cycles per op I might care a bit more.
I'm probably biased the other way, because about half of the code I work on has sub-millisecond budgets. :)