On 3 Dec 2014 at 20:48, Benedek Thaler wrote:
I was reading this Intel paper [0], and this section grabbed my attention:
"One common mistake made by developers developing their own spin-wait loops is attempting to spin on an atomic instruction instead of spinning on a volatile read. Spinning on a dirty read instead of attempting to acquire a lock consumes less time and resources. This allows an application to only attempt to acquire a lock only when it is free."
As far as I can tell by looking at the source code, spinlock spins on an atomic consume load.
Spinlock does a speculative consume load to check whether the lock is locked, and if it is, it spins on that load. If the consume load says the lock is unlocked, it then tries a compare-exchange with acquire on success and consume on failure.
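The scheme described above is the classic test-and-test-and-set (TTAS) pattern. A minimal sketch, assuming a plain bool lock word (this is illustrative, not Boost.Spinlock's actual implementation):

```cpp
#include <atomic>
#include <thread>

// Hypothetical TTAS spinlock sketch: spin on a plain load first, and only
// attempt the compare-exchange once the lock looks free, as described above.
class ttas_spinlock
{
  std::atomic<bool> locked_{false};

public:
  void lock() noexcept
  {
    for (;;)
    {
      // Speculative precheck: spin on a consume load (mirroring the
      // behaviour described above) until the lock looks free.
      while (locked_.load(std::memory_order_consume))
        std::this_thread::yield();
      // Lock looks free: try to claim it. Acquire on success, consume on
      // failure, matching the compare-exchange described above.
      bool expected = false;
      if (locked_.compare_exchange_weak(expected, true,
                                        std::memory_order_acquire,
                                        std::memory_order_consume))
        return;
    }
  }

  void unlock() noexcept
  {
    locked_.store(false, std::memory_order_release);
  }
};
```

The point of the precheck is that the spin itself touches the cache line in shared mode only; the read-for-ownership traffic of the compare-exchange happens just once per acquisition attempt.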
I wonder if a volatile read would produce better performance characteristics?
I think the Intel paper was referring to MSVC only, which is an unusual compiler in that all of its atomics turn into InterlockedXXX functions irrespective of the memory ordering you ask for; in other words, everything is seq_cst. One way of working around that is to use the volatile read = acquire and volatile write = release semantics MSVC added in, I think, VS2005. Now, I did benchmark the difference originally, and found no benefit either way on VS2013, so I left it with the well-defined variant, since the alternative requires a cast from the atomic to a volatile T *, which is undefined behaviour. However, I went ahead and put the volatile read back if the BOOST_SPINLOCK_USE_VOLATILE_READ_FOR_AVOIDING_CMPXCHG macro is defined, just in case you'd like to see for yourself.
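For concreteness, a hedged sketch of what such a macro-guarded precheck could look like (the function name and lock-word type below are illustrative, not Boost.Spinlock's actual internals). On MSVC, volatile reads carry acquire semantics, so reading the lock word through a volatile pointer avoids emitting an InterlockedXXX instruction for the precheck; but note the cast from the atomic's storage is formally undefined behaviour, which is why the portable variant stayed the default:

```cpp
#include <atomic>

// Hypothetical precheck used before attempting the compare-exchange.
// Assumption: std::atomic<unsigned> is layout-compatible with unsigned,
// which is true in practice on MSVC/GCC/Clang but not guaranteed by C++11.
inline bool looks_locked(std::atomic<unsigned> &lock_word) noexcept
{
#ifdef BOOST_SPINLOCK_USE_VOLATILE_READ_FOR_AVOIDING_CMPXCHG
  // Dirty read through a volatile alias of the atomic's storage.
  // Formally undefined behaviour; relies on MSVC's volatile = acquire.
  return *reinterpret_cast<volatile unsigned *>(&lock_word) != 0;
#else
  // Portable, well-defined precheck.
  return lock_word.load(std::memory_order_consume) != 0;
#endif
}
```

Either way, the precheck is only a hint: the actual lock acquisition still goes through the compare-exchange, so a stale answer costs at most one wasted CAS attempt.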
2) AFAIK spinlocking is not necessarily fair on a NUMA architecture. Is there something already implemented or planned in Boost.Spinlock to ensure fairness? I'm thinking of something like this: [1]
If you want fairness, use the forthcoming C11 permit object, which is effectively a fair CAS lock and will be the base kernel wait object in the forthcoming non-allocating constexpr basic_future. That object has been tuned to back off and create fairness when heavily contended. Unfortunately, such fairness tuning is very much not free.

On 3 Dec 2014 at 23:29, Andrey Semashev wrote:
Generally speaking, things are more complicated than that. First, you would probably be spinning with a relaxed read, not consume, which is promoted to acquire on most, if not all, platforms.
Currently all platforms, I believe. Consume semantics have not proven themselves worth compiler vendor effort in their present design, so a consume is currently equal to an acquire.
Acquire memory ordering is not required for spinning, and on architectures where it is not free it can be much more expensive than relaxed. Second, even a relaxed atomic read is formally not equivalent to a volatile read: the latter is not guaranteed to be atomic. Lastly, on x86 all this is mostly moot, because compilers typically generate small volatile reads as a single instruction, which is equivalent to an acquire or relaxed atomic read on this architecture, as long as alignment is correct.
I'll be honest: benchmarking whether I can drop that precheck to relaxed is on my todo list. As Intel CPUs can't do relaxed loads any differently from acquire loads, I had been waiting for my ARM board, which actually arrived some months ago. I'm also pretty conservative when it comes to memory ordering: I default to stronger atomic semantics rather than weaker until I see a compelling reason not to.
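The relaxed-precheck variant under discussion would look roughly like the following sketch (illustrative, not the shipped code). The spin itself needs no ordering: correctness comes from the acquire on the successful compare-exchange, so the precheck can legally be relaxed:

```cpp
#include <atomic>

// Hypothetical lock routine with a relaxed precheck. Mutual exclusion and
// the happens-before edge for the critical section are established solely
// by the acquire on the successful compare-exchange; the relaxed spin only
// decides when it is worth attempting that compare-exchange.
inline void lock_with_relaxed_precheck(std::atomic<bool> &locked) noexcept
{
  for (;;)
  {
    // Relaxed spin: cheapest possible load, which matters on weakly
    // ordered architectures such as ARM where acquire is a real barrier.
    while (locked.load(std::memory_order_relaxed))
      ; // spin
    bool expected = false;
    if (locked.compare_exchange_weak(expected, true,
                                     std::memory_order_acquire,
                                     std::memory_order_relaxed))
      return;
  }
}
```

On x86 this compiles to the same instructions as the consume/acquire precheck, which is why the benchmark needs a weakly ordered machine to show any difference.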
2) AFAIK spinlocking is not necessarily fair on a NUMA architecture. Is there something already implemented or planned in Boost.Spinlock to ensure fairness? I'm thinking of something like this: [1]
I can't speak for Boost.Spinlock (do we have that library?), but IMHO when you need fairness, spinlocks are not the best choice.
It's forthcoming. It contains the proposed concurrent_unordered_map and will contain the non-allocating constexpr basic_future.

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/