[spinlock] Spin on volatile read, NUMA fairness?
Hi All, Niall,

1) I was reading this Intel paper [0], and this section grabbed my attention:

"One common mistake made by developers developing their own spin-wait loops is attempting to spin on an atomic instruction instead of spinning on a volatile read. Spinning on a dirty read instead of attempting to acquire a lock consumes less time and resources. This allows an application to attempt to acquire a lock only when it is free."

As far as I can tell by looking at the source code, spinlock spins on atomic consume. I wonder if a volatile read would produce better performance characteristics?

2) AFAIK spinlocking is not necessarily fair on a NUMA architecture. Is there something already implemented or planned in Boost.Spinlock to ensure fairness? I'm thinking of something like this: [1]

Thanks,
Benedek

[0]: https://software.intel.com/en-us/articles/implementing-scalable-atomic-locks...
[1]: http://www.google.com/patents/US7334102
On Wed, Dec 3, 2014 at 10:48 PM, Benedek Thaler wrote:
Hi All, Niall,
1) I was reading this Intel paper [0], and this section grabbed my attention:
"One common mistake made by developers developing their own spin-wait loops is attempting to spin on an atomic instruction instead of spinning on a volatile read. Spinning on a dirty read instead of attempting to acquire a lock consumes less time and resources. This allows an application to only attempt to acquire a lock only when it is free."
As far as I can tell by looking at the source code, spinlock spins on atomic consume. I wonder if a volatile read would produce better performance characteristics?
Generally speaking, things are more complicated than that. First, you would probably be spinning with a relaxed read, not consume, which is promoted to acquire on most, if not all, platforms. Acquire memory ordering is not required for spinning, and on architectures that support it, it can be much more expensive than relaxed. Second, even a relaxed atomic read is formally not equivalent to a volatile read: the latter is not guaranteed to be atomic. Lastly, on x86 all this is mostly moot because compilers typically generate small volatile reads as a single instruction, which is equivalent to an acquire or relaxed atomic read on this architecture, as long as alignment is correct.
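To make Andrey's second point concrete, here is a minimal sketch (the variable names are mine, purely illustrative). On x86 both probes typically compile to the same single MOV when the object is aligned, but only the atomic one is guaranteed atomic by the standard.

#include <atomic>

std::atomic<bool> locked{false};
volatile bool locked_v = false;

// Formally atomic; relaxed requests no ordering beyond the load itself.
bool probe_atomic()
{
    return locked.load(std::memory_order_relaxed);
}

// Volatile only forbids the compiler from caching or eliding the read;
// the C++ standard does not make it atomic.
bool probe_volatile()
{
    return locked_v;
}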
2) AFAIK spinlocking is not necessarily fair on a NUMA architecture. Is there something already implemented or planned in Boost.Spinlock to ensure fairness? I'm thinking of something like this: [1]
I can't speak for Boost.Spinlock (do we have that library?), but IMHO when you need fairness, spinlocks are not the best choice.
On 3 Dec 2014 at 20:48, Benedek Thaler wrote:
I was reading this Intel paper [0], and this section grabbed my attention:
"One common mistake made by developers developing their own spin-wait loops is attempting to spin on an atomic instruction instead of spinning on a volatile read. Spinning on a dirty read instead of attempting to acquire a lock consumes less time and resources. This allows an application to only attempt to acquire a lock only when it is free."
As far as I can tell by looking at the source code, spinlock spins on atomic consume.
Spinlock does a speculative consume load to check if the lock is locked, and if it is, it spins on that. If the consume load says the lock is unlocked, it then tries a compare-exchange, with acquire on success and consume on failure.
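For readers without the source to hand, a minimal sketch of the scheme described above (assumed names, not Boost.Spinlock's actual internals):

#include <atomic>

std::atomic<bool> locked{false};

void lock()
{
    for (;;)
    {
        // Speculative consume load: spin here while the lock is held.
        if (!locked.load(std::memory_order_consume))
        {
            bool expected = false;
            // Compare-exchange with acquire on success, consume on failure.
            if (locked.compare_exchange_weak(expected, true,
                    std::memory_order_acquire, std::memory_order_consume))
                return;
        }
    }
}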
I wonder if a volatile read would produce better performance characteristics?
I think the Intel paper was referring to MSVC only, which is an unusual compiler in that its atomics all turn into InterlockedXXX functions irrespective of what you ask for. In other words, all seq_cst. One way of working around that is to use the volatile read = acquire and volatile write = release semantics MSVC added in, I think, VS2005. Now, I did benchmark the difference originally and found no benefit to one or the other on VS2013, so I left it with the non-undefined-behaviour variant, since the cast from atomic to a volatile T * is formally undefined behaviour. However, I went ahead and put it back if the BOOST_SPINLOCK_USE_VOLATILE_READ_FOR_AVOIDING_CMPXCHG macro is defined, just in case you'd like to see for yourself.
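A hedged sketch of what that opt-in volatile-read precheck might look like (the function name is mine; the cast is the formally undefined part mentioned above):

#include <atomic>

// Under MSVC's volatile semantics (acquire on read, release on write),
// this precheck avoids any InterlockedXXX call. The cast from
// std::atomic<bool> to volatile bool is formally undefined behaviour,
// which is why it is opt-in behind the macro.
inline bool is_locked_volatile_read(std::atomic<bool> &f)
{
    return *reinterpret_cast<volatile bool *>(&f);
}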
2) AFAIK spinlocking is not necessarily fair on a NUMA architecture. Is there something already implemented or planned in Boost.Spinlock to ensure fairness? I'm thinking of something like this: [1]
If you want fairness, use the forthcoming C11 permit object, which is effectively a fair CAS lock and will be the base kernel wait object in the forthcoming non-allocating constexpr basic_future. That object has been tuned to back off and create fairness when heavily contended. Such fairness tuning is very much not free, unfortunately.

On 3 Dec 2014 at 23:29, Andrey Semashev wrote:
Generally speaking, things are more complicated than that. First, you would probably be spinning with a relaxed read, not consume, which is promoted to acquire on most, if not all, platforms.
Currently all platforms, I believe. Consume semantics have not proven themselves worth compiler vendor effort in their present design. Therefore a consume is currently equal to an acquire.
Acquire memory ordering is not required for spinning, and on architectures that support it, it can be much more expensive than relaxed. Second, even a relaxed atomic read is formally not equivalent to a volatile read: the latter is not guaranteed to be atomic. Lastly, on x86 all this is mostly moot because compilers typically generate small volatile reads as a single instruction, which is equivalent to an acquire or relaxed atomic read on this architecture, as long as alignment is correct.
I'll be honest: benchmarking whether I can drop that precheck to relaxed is on my todo list. As Intel can't do loads weaker than acquire, I had been waiting for my ARM board, which actually arrived some months ago. I'm also pretty conservative when it comes to memory ordering, and I would default to stronger atomic semantics rather than weaker until I see a compelling reason why not.
2) AFAIK spinlocking is not necessarily fair on a NUMA architecture. Is there something already implemented or planned in Boost.Spinlock to ensure fairness? I'm thinking of something like this: [1]
I can't speak for Boost.Spinlock (do we have that library?), but IMHO when you need fairness, spinlocks are not the best choice.
It's forthcoming. It contains the proposed concurrent_unordered_map and will contain the non-allocating constexpr basic_future.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
[Niall Douglas]
I think the Intel paper was referring to MSVC only, which is an unusual compiler in that its atomics all turn into InterlockedXXX functions irrespective of what you ask for. In other words, all seq_cst.
Absolutely untrue for VC-ARM. Also untrue for VC 2015 x86/x64, where <atomic> avoids Interlocked machinery when an ordinary load/store with a compiler barrier will suffice due to the strength of the architecture's guarantees. (The versions are blurring together, so I forget when we made this change.)

STL
Stephan T. Lavavej wrote:
[Niall Douglas]
I think the Intel paper was referring to MSVC only, which is an unusual compiler in that its atomics all turn into InterlockedXXX functions irrespective of what you ask for. In other words, all seq_cst.
Absolutely untrue for VC-ARM. Also untrue for VC 2015 x86/x64, where <atomic> avoids Interlocked machinery when an ordinary load/store with a compiler barrier will suffice due to the strength of the architecture's guarantees. (The versions are blurring together, so I forget when we made this change.)
If I remember correctly, on x86/x64 all atomic loads can be a plain MOV, even seq_cst. The acquire is implicit, and the sequential consistency is guaranteed by seq_cst stores being XCHG. (Relaxed/release stores are also a plain MOV.)

The paper doesn't mean that though. It says that the typical spinlock acquisition is:

// atomic_flag f_;

while( f_.test_and_set( std::memory_order_acquire ) ) { /* maybe yield */ }

and that it's better to do this:

// atomic_bool f_;

do { while( f_.load( std::memory_order_relaxed ) ) { /* maybe yield */ } } while( f_.exchange( true, std::memory_order_acquire ) );

instead.
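As a concrete rendering of that mapping (typical x86/x64 codegen, not a guarantee; the exact instruction choice varies by compiler):

#include <atomic>

std::atomic<int> a;

int  load_seq_cst()  { return a.load(std::memory_order_seq_cst); } // plain MOV; acquire is implicit
void store_release() { a.store(1, std::memory_order_release); }    // plain MOV
void store_seq_cst() { a.store(1, std::memory_order_seq_cst); }    // XCHG (or MOV plus MFENCE)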
On 4/12/2014 14:39, Peter Dimov wrote:
The paper doesn't mean that though. It says that the typical spinlock acquisition is:
// atomic_flag f_;
while( f_.test_and_set( std::memory_order_acquire ) ) { /* maybe yield */ }
and that it's better to do this:
// atomic_bool f_;
do { while( f_.load( std::memory_order_relaxed ) ) { /* maybe yield */ } } while( f_.exchange( true, std::memory_order_acquire ) );
instead.
Out of curiosity, while I agree that the latter is better than the former, how would this compare:

while( f_.exchange( true, std::memory_order_acquire ) ) { while( f_.load( std::memory_order_relaxed ) ) { /* maybe yield */ } }

? The only difference is that this assumes that acquiring the lock should succeed most of the time, so it skips the initial speculative relaxed load. It still avoids spinning directly on the exchange.

(Also please correct me if I'm wrong but I thought on x86 at least relaxed and acquire have similar performance anyway, so there's no benefit to doing an initial relaxed read.)
Gavin Lambert wrote:
Out of curiosity, while I agree that the latter is better than the former, how would this compare:
while( f_.exchange( true, std::memory_order_acquire ) ) { while( f_.load( std::memory_order_relaxed ) ) { /* maybe yield */ } }
I think that you're right, this should indeed be better.
On 4 Dec 2014 at 17:08, Gavin Lambert wrote:
? The only difference is that this assumes that acquiring the lock should succeed most of the time, so it skips the initial speculative relaxed load. It still avoids spinning directly on the exchange.
That form has a guaranteed cache line invalidation on every lock attempt. It might be a win on a dual-core CPU, but I doubt it would be on a heavily contended eight-core CPU. As far as NUMA goes, though, it might indeed be more fair.
(Also please correct me if I'm wrong but I thought on x86 at least relaxed and acquire have similar performance anyway, so there's no benefit to doing an initial relaxed read.)
On Intel the least strong read you can do is an acquire. It makes me very wary of ever using atomics with relaxed because you simply can't test them on Intel. It was a big reason I invested in that ARM board. I just tested the initial relaxed read there now, and when paired with an exchange I see an 8.7% speed bump over a straight exchange with no relaxed precheck. It does depend on how much work you do inside the spinlock, though.

Niall
On 5/12/2014 02:32, Niall Douglas wrote:
On Intel the least strong read you can do is an acquire. It makes me very wary of ever using atomics with relaxed because you simply can't test them on Intel. It was a big reason I invested in that ARM board.
I use Relacy Race Detector whenever I want to test anything to do with atomics, particularly the weaker orderings. It does require extracting the code in question into a little self-contained example, along with a bit of extra decoration, but it works quite well.
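For anyone curious, a minimal sketch of such a test, adapted from Relacy's documented test_suite/simulate pattern (the details here are from memory and illustrative):

#include <relacy/relacy_std.hpp>

// Two threads contend on a spinlock; Relacy explores the interleavings
// and verifies that the relaxed/acquire/release orderings protect 'data'.
struct spinlock_test : rl::test_suite<spinlock_test, 2>
{
    std::atomic<bool> f;
    rl::var<int> data; // non-atomic variable Relacy checks for races

    void before()
    {
        f($).store(false, rl::memory_order_relaxed);
        data($) = 0;
    }

    void thread(unsigned)
    {
        while (f($).exchange(true, rl::memory_order_acquire)) // lock
            while (f($).load(rl::memory_order_relaxed))
                ; // spin; each atomic op is a scheduling point for Relacy
        int v = data($);                                      // protected section
        data($) = v + 1;
        f($).store(false, rl::memory_order_release);          // unlock
    }
};

int main()
{
    rl::simulate<spinlock_test>();
}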
On 5 Dec 2014 at 11:41, Gavin Lambert wrote:
On 5/12/2014 02:32, Niall Douglas wrote:
On Intel the least strong read you can do is an acquire. It makes me very wary of ever using atomics with relaxed because you simply can't test them on Intel. It was a big reason I invested in that ARM board.
I use Relacy Race Detector whenever I want to test anything to do with atomics, particularly the weaker orderings. It does require extracting the code in question into a little self-contained example, along with a bit of extra decoration, but it works quite well.
Never used that one myself. I generally use valgrind drd, which requires annotation, plus clang ThreadSanitizer, which usually does not. My biggest problem is with Windows-only code paths; I have been known to run valgrind on wine :(

Niall
On 4 Dec 2014 at 3:39, Peter Dimov wrote:
and that it's better to do this:
// atomic_bool f_;
do { while( f_.load( std::memory_order_relaxed ) ) { /* maybe yield */ } } while( f_.exchange( true, std::memory_order_acquire ) );
instead.
I just tested this form instead, and on VS2013 dual core x86 I saw a 45% performance increase. Very nice. Thank you.

Niall
On 4 Dec 2014 at 13:25, Niall Douglas wrote:
On 4 Dec 2014 at 3:39, Peter Dimov wrote:
and that it's better to do this:
// atomic_bool f_;
do { while( f_.load( std::memory_order_relaxed ) ) { /* maybe yield */ } } while( f_.exchange( true, std::memory_order_acquire ) );
instead.
I just tested this form instead, and on VS2013 dual core x86 I saw a 45% performance increase. Very nice. Thank you.
FYI on ARMv7 I saw a catastrophic performance loss with the same measure. Spinlock performance dropped from 17m ops/sec to 2.3m ops/sec. It would seem that ARM really hates unconditional atomic exchange. The forthcoming Boost.Spinlock now does unconditional exchange on Intel only, and conditional exchange everywhere else. Sadly, without more CPUs to test, it is hard to do much better.

Niall
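A rough sketch of that platform split (illustrative only, not the actual Boost.Spinlock source):

#include <atomic>

std::atomic<bool> f_{false};

void lock()
{
#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)
    // Intel: an unconditional XCHG is cheap once the relaxed precheck passes.
    do {
        while (f_.load(std::memory_order_relaxed)) { /* maybe yield */ }
    } while (f_.exchange(true, std::memory_order_acquire));
#else
    // ARM and others: LL/SC makes an unconditional exchange costly, so only
    // attempt a compare-exchange when the precheck sees the lock unlocked.
    for (;;)
    {
        if (!f_.load(std::memory_order_relaxed))
        {
            bool expected = false;
            if (f_.compare_exchange_weak(expected, true, std::memory_order_acquire))
                return;
        }
        /* maybe yield */
    }
#endif
}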
On 3 Dec 2014 at 23:52, Stephan T. Lavavej wrote:
[Niall Douglas]
I think the Intel paper was referring to MSVC only, which is an unusual compiler in that its atomics all turn into InterlockedXXX functions irrespective of what you ask for. In other words, all seq_cst.
Absolutely untrue for VC-ARM. Also untrue for VC 2015 x86/x64, where <atomic> avoids Interlocked machinery when an ordinary load/store with a compiler barrier will suffice due to the strength of the architecture's guarantees. (The versions are blurring together, so I forget when we made this change.)
Which would explain why a volatile read shows no difference from an atomic consume read, which is an obvious duh on my part. Thanks for the catch, Stephan.

Niall
On Wednesday 03 December 2014 23:41:47 Niall Douglas wrote:
On 3 Dec 2014 at 23:29, Andrey Semashev wrote:
Acquire memory ordering is not required for spinning, and on architectures that support it, it can be much more expensive than relaxed. Second, even a relaxed atomic read is formally not equivalent to a volatile read: the latter is not guaranteed to be atomic. Lastly, on x86 all this is mostly moot because compilers typically generate small volatile reads as a single instruction, which is equivalent to an acquire or relaxed atomic read on this architecture, as long as alignment is correct.
I'll be honest: benchmarking whether I can drop that precheck to relaxed is on my todo list. As Intel can't do loads weaker than acquire, I had been waiting for my ARM board, which actually arrived some months ago.
I'm also pretty conservative when it comes to memory ordering, and I would default to stronger atomic semantics rather than weaker until I see a compelling reason why not.
There's no point in issuing an acquire barrier on every read while spinning. You don't accomplish anything with it. What matters is the acquire barrier when you actually lock the mutex, which happens unconditionally after spinning.
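A minimal sketch of the pairing Andrey describes, reusing the atomic_bool f_ from Peter's example:

#include <atomic>

std::atomic<bool> f_{false};

void lock()
{
    while (f_.exchange(true, std::memory_order_acquire)) // the one acquire that matters
        while (f_.load(std::memory_order_relaxed))       // relaxed is enough while spinning
            ; /* maybe yield */
}

void unlock()
{
    f_.store(false, std::memory_order_release); // pairs with the acquire in lock()
}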
participants (6)
- Andrey Semashev
- Benedek Thaler
- Gavin Lambert
- Niall Douglas
- Peter Dimov
- Stephan T. Lavavej