[atomic] Generalized read-modify-write operation
Hi, I was wondering is there would be interest in a generalized read-modify-write operation in Boost.Atomic. The idea is to have a atomic<T>::read_modify_write method that would take a function that would perform the modify on the atomic value. The interface could be roughly this: template< typename T > class atomic { public: // Returns the original value T read_modify_write(F modify, memory_order order = memory_order_seq_cst); }; Usage example: // Saturating increment atomic< int > a; a.read_modify_write([](int n) { return n < std::numeric_limits< int >::max() ? n + 1 : n; }); The implementation could synthesize a CAS loop around the modify function or (ideally) wrap it inside the LL/SC instructions, also in a loop. I believe, in order for this to work the modify function should meet certain requirements: 1. It must not use atomic operations (on this or other variables) or thread synchronization primitives. 2. It must not call functions that might involve context switching (e.g. syscalls, i/o). 3. It should be as small as possible and have the least possible working set (that is the number of accessed variables). 4. It must be prepared to be called multiple times. 5. It must be prepared to receive a value that has never been stored in the atomic variable. The first restriction is essential to avoid deadlocking in case of the emulated (lock-based) atomic<>. Thread synchronization primitives may possibly be using atomic ops internally, so are prohibited as well. The 2 and 3, besides just being something rational one would strive for anyway, are mostly targeted to allow to inject the function call inside the LL/SC pair. It is known that some architectures are very sensitive to what is happening between the LL and SC instructions, so accessing some memory from the modify function or switching the thread context will result in SC always failing and an infinite loop. The 5 would allow to make the first load non-atomic, which can be faster. The CAS or SC instruction will fail if the loaded value was inconsistent. I know there's probably no way to ensure that the function meets these requirements, and it is possible to get an infinite loop in runtime, so maybe always using a CAS loop would be a safer approach, even on an architecture with LL/SC instructions. My hope is that compilers will someday provide intrinsics for this and will be able to generate optimal code, when possible. Does this look interesting to anyone? Comments?
On 1 Sep 2015 10:35 am, "Andrey Semashev"
Hi,
I was wondering is there would be interest in a generalized read-modify-write operation in Boost.Atomic.
Yes. At the very least it can cleanup some idioms if not make them faster. [...]
I know there's probably no way to ensure that the function meets these requirements, and it is possible to get an infinite loop in runtime, so maybe always using a CAS loop would be a safer approach, even on an architecture with LL/SC instructions.
The implementation could fall back on the CAS loop after the first failed LL/SC. You might want to do it even in the case if contention to reduce the window the cachelines must be held. It is not optimal in case that the reservation always fail, but is always correct and the performance can be corrected with profiling and some sort of CAS hint. This is always possible because CAS must always be a valid implementation strategy. -- gpd
On Tue, Sep 1, 2015 at 2:14 PM, Giovanni Piero Deretta
On 1 Sep 2015 10:35 am, "Andrey Semashev"
wrote: [...] I know there's probably no way to ensure that the function meets these requirements, and it is possible to get an infinite loop in runtime, so maybe always using a CAS loop would be a safer approach, even on an architecture with LL/SC instructions.
The implementation could fall back on the CAS loop after the first failed LL/SC. You might want to do it even in the case if contention to reduce the window the cachelines must be held. It is not optimal in case that the reservation always fail, but is always correct and the performance can be corrected with profiling and some sort of CAS hint.
Thanks for the suggestion. Yes, that's probably the best way to implement it. I can provide two functions in the API - one that always implements a CAS loop right away and the other that might try injecting modify function between LL and SC and do the CAS loop if that fails. On a CAS-only architecture the two functions would be equivalent.
On 09/01/2015 04:35 AM, Andrey Semashev wrote:
Hi,
I was wondering is there would be interest in a generalized read-modify-write operation in Boost.Atomic. The idea is to have a atomic<T>::read_modify_write method that would take a function that would perform the modify on the atomic value. The interface could be roughly this:
template< typename T > class atomic { public: // Returns the original value T read_modify_write(F modify, memory_order order = memory_order_seq_cst); };
Usage example:
// Saturating increment atomic< int > a; a.read_modify_write([](int n) { return n < std::numeric_limits< int >::max() ? n + 1 : n; });
The implementation could synthesize a CAS loop around the modify function or (ideally) wrap it inside the LL/SC instructions, also in a loop.
I believe, in order for this to work the modify function should meet certain requirements:
1. It must not use atomic operations (on this or other variables) or thread synchronization primitives. 2. It must not call functions that might involve context switching (e.g. syscalls, i/o). 3. It should be as small as possible and have the least possible working set (that is the number of accessed variables). 4. It must be prepared to be called multiple times. 5. It must be prepared to receive a value that has never been stored in the atomic variable.
The first restriction is essential to avoid deadlocking in case of the emulated (lock-based) atomic<>. Thread synchronization primitives may possibly be using atomic ops internally, so are prohibited as well.
The 2 and 3, besides just being something rational one would strive for anyway, are mostly targeted to allow to inject the function call inside the LL/SC pair. It is known that some architectures are very sensitive to what is happening between the LL and SC instructions, so accessing some memory from the modify function or switching the thread context will result in SC always failing and an infinite loop.
The 5 would allow to make the first load non-atomic, which can be faster. The CAS or SC instruction will fail if the loaded value was inconsistent.
I know there's probably no way to ensure that the function meets these requirements, and it is possible to get an infinite loop in runtime, so maybe always using a CAS loop would be a safer approach, even on an architecture with LL/SC instructions. My hope is that compilers will someday provide intrinsics for this and will be able to generate optimal code, when possible.
Does this look interesting to anyone? Comments?
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
It's interesting although ensuring that all the requirements are met for the function is impossible. How about also a stand-alone version so that it can be used with std::atomic<T>?
On 02.09.2015 01:17, Yiannis Papadopoulos wrote:
How about also a stand-alone version so that it can be used with std::atomic<T>?
I wasn't planning that. With a third-party atomic<> implementation it would be impossible to achieve any performance benefits, so the only improvement would be a cleaner code. Boost.Atomic currently doesn't support atomic functions, so providing just one looks awkward and implementing all of them requires more time.
On 1/09/2015 21:35, Andrey Semashev wrote:
I was wondering is there would be interest in a generalized read-modify-write operation in Boost.Atomic. The idea is to have a atomic<T>::read_modify_write method that would take a function that would perform the modify on the atomic value. The interface could be roughly this: [...] Does this look interesting to anyone? Comments?
It's not strictly necessary since it's equivalent to a do { r = modify(n); } while (!var.compare_exchange_weak(n, r, order)) loop, which isn't that much more typing. (Although I elided the initial load, so it's a bit more typing than it appears here.) (And the above is also supposed to use LL/SC on architectures where this is cheaper than CAS, although I'm not sure if this is the case.) But provided that the code generated is no worse than this (in particular that the compiler can inline the function call in optimised builds, at least where the function is relatively simple) then it would still be useful, particularly for people who aren't used to using weak exchange loops.
On 02.09.2015 02:20, Gavin Lambert wrote:
On 1/09/2015 21:35, Andrey Semashev wrote:
I was wondering is there would be interest in a generalized read-modify-write operation in Boost.Atomic. The idea is to have a atomic<T>::read_modify_write method that would take a function that would perform the modify on the atomic value. The interface could be roughly this: [...] Does this look interesting to anyone? Comments?
It's not strictly necessary since it's equivalent to a do { r = modify(n); } while (!var.compare_exchange_weak(n, r, order)) loop, which isn't that much more typing. (Although I elided the initial load, so it's a bit more typing than it appears here.)
Yes, correct. r and n also have to be declared outside the loop, which is kind of annoying.
(And the above is also supposed to use LL/SC on architectures where this is cheaper than CAS, although I'm not sure if this is the case.)
I'm not sure there are architectures that implement both CAS and LL/SC instructions, at least I'm not aware of such. On the architectures that support LL/SC, the instructions will be used to implement compare_exchange_weak. The modify function in this CAS loop will not be executed within the LL/SC region, which is why the additional load before the loop is required. There is also a probability of CAS failure. There is another point to consider. compare_exchange_weak/strong opereation on an LL/SC architecure is more complex than what is required to implement a simple RMW operation. For example, let's see it in Boost.Atomic code for ARM. Here is fetch_add, which can be used as a prototype of what could be done with a generic RMW operation: "1:\n" "ldrex %[original], %[storage]\n" "add %[result], %[original], %[value]\n" // modify "strex %[tmp], %[result], %[storage]\n" "teq %[tmp], #0\n" "bne 1b\n" Frankly, I'd like to be able to generate code like this for operations other than those defined by the standard atomic<> interface. And here is CAS (weak): "mov %[success], #0\n" "ldrex %[original], %[storage]\n" "cmp %[original], %[expected]\n" "itt eq\n" "strexeq %[success], %[desired], %[storage]\n" "eoreq %[success], %[success], #1\n" To this we will also have to add the load, the modify and the loop.
But provided that the code generated is no worse than this (in particular that the compiler can inline the function call in optimised builds, at least where the function is relatively simple) then it would still be useful, particularly for people who aren't used to using weak exchange loops.
I certainly hope the compiler will be able to inline the modify function. If it doesn't, for relatively simple functions, then it probably isn't worth it.
On 2/09/2015 12:11, Andrey Semashev wrote:
(And the above is also supposed to use LL/SC on architectures where this is cheaper than CAS, although I'm not sure if this is the case.)
I'm not sure there are architectures that implement both CAS and LL/SC instructions, at least I'm not aware of such. On the architectures that support LL/SC, the instructions will be used to implement compare_exchange_weak. The modify function in this CAS loop will not be executed within the LL/SC region, which is why the additional load before the loop is required. There is also a probability of CAS failure.
Right. I meant that on LL/SC architectures, compare_exchange_weak is "more native", while compare_exchange_strong is "more native" on CAS architectures.
There is another point to consider. compare_exchange_weak/strong opereation on an LL/SC architecure is more complex than what is required to implement a simple RMW operation. For example, let's see it in Boost.Atomic code for ARM. Here is fetch_add, which can be used as a prototype of what could be done with a generic RMW operation:
"1:\n" "ldrex %[original], %[storage]\n" "add %[result], %[original], %[value]\n" // modify "strex %[tmp], %[result], %[storage]\n" "teq %[tmp], #0\n" "bne 1b\n"
Frankly, I'd like to be able to generate code like this for operations other than those defined by the standard atomic<> interface.
That's true, and so would I. The danger is that what you can do while holding the X bit is quite limited, sometimes extremely so. If the function call is *not* inlined then even that might be enough to tip it to always fail. So you'd need a fallback to basic compare_exchange_weak (which should eventually succeed unless the machine is really hammered) in such cases. Of course, that's basically what you said in the OP. :)
participants (4)
-
Andrey Semashev
-
Gavin Lambert
-
Giovanni Piero Deretta
-
Yiannis Papadopoulos