[atomic] Help understanding consume order
Hi,

I'm reviewing (again) the Boost.Atomic code and struggling to understand the consume memory order, in particular what it should mean on architectures other than DEC Alpha. I read the explanation here:

http://en.cppreference.com/w/cpp/atomic/memory_order

but the point eludes me. Take ARM for example and the explanation in the "Release-Consume ordering" section. The producer thread allocates the string and stores the pointer with a release operation, so that the pointer, the string contents and the 'data' integer are visible to other threads.

Now the consumer thread reads the pointer with a consume operation. According to the explanation in the article, on ARM the consume operation need not issue any specific fences for the consumer to be able to use the pointer and the string body. In that case the consume operation becomes equivalent to relaxed (plus prohibiting compiler optimizations). But is there a guarantee that the string body will be visible to the consumer? Shouldn't the consume operation be promoted to acquire instead?

I guess that's the ultimate question: how should consume ordering be handled on conventional architectures?
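For reference, the example from that page looks roughly like this (reproduced from memory, so treat it as a sketch rather than an exact copy):

#include <atomic>
#include <cassert>
#include <string>

std::atomic<std::string*> ptr;
int data;

void producer()
{
    std::string* p = new std::string("Hello");
    data = 42;
    ptr.store(p, std::memory_order_release);
}

void consumer()
{
    std::string* p2;
    while (!(p2 = ptr.load(std::memory_order_consume)))
        ;
    assert(*p2 == "Hello"); // never fires: *p2 carries a dependency from the consume load
    assert(data == 42);     // may fire: 'data' does not carry a dependency from the load of ptr
}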
hi andrey,
the alpha can speculatively load a memory region. consume semantics prevent this by issuing a memory barrier to make sure that the memory is read only after the atomic pointer has been read. so from my understanding, on arm and x86, relaxed gives the same guarantee as consume ...

would be great if someone could confirm this, though.

tim
On Monday 02 June 2014 09:45:46 Tim Blechmann wrote:
> the alpha can speculatively load a memory region. consume semantics prevent this by issuing a memory barrier to make sure that the memory is read only after the atomic pointer has been read. so from my understanding, on arm and x86, relaxed gives the same guarantee as consume ...
> would be great if someone could confirm this, though.
There is some discussion in this proposal:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2664.htm

In particular, it mentions that ARM and PPC reorder data-dependent operations, so I assume it means a fence is required?

BTW, gcc 4.8 generates fences for consume loads, similarly to acquire, for both ARM and PPC. It may be that gcc devs just went the simplest and safest route when implementing consume, though.
On Sun, Jun 1, 2014 at 11:58 AM, Andrey Semashev wrote:
> Now the consumer thread reads the pointer with a consume operation. According to the explanation in the article, on ARM the consume operation need not issue any specific fences for the consumer to be able to use the pointer and the string body. In that case the consume operation becomes equivalent to relaxed (plus prohibiting compiler optimizations). But is there a guarantee that the string body will be visible to the consumer? Shouldn't the consume operation be promoted to acquire instead?
ARM and many other RMO architectures (like PPC, and unlike Alpha) guarantee that a load and the load it depends on won't be reordered, so, together with the release operation on the writer side, the load_consume guarantees the visibility of the string body.

The exact definition of a load dependency (basically, the address of the dependent load is computed as a function of the value returned by the depending load) is defined at the instruction level and is quite tricky to recover in a high-level language like C++. C++11 tried to do it, but according to a few people the current wording is very hard to implement and at the same time both not strong enough and too strict in some cases. In the meantime GCC (and a few other compilers) punts on load_consume and simply promotes it to load_acquire.

Note that x86, being a TSO machine, has even stronger guarantees: any load is a load_acquire.
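As a rough illustration of what carrying a dependency means at this level (hypothetical names, just to show the "address of the second load is computed from the value of the first" pattern):

#include <atomic>

int table[64];
std::atomic<int> idx;

int reader()
{
    int i = idx.load(std::memory_order_consume);
    // The address of this load is computed from i, so on ARM/PPC the hardware
    // itself keeps it ordered after the load of 'idx'; no fence is needed.
    return table[i];
}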
> I guess that's the ultimate question: how should consume ordering be handled on conventional architectures?
That's hard to do without compiler help, unfortunately. Compilers have started doing some quite aggressive optimisations (like value speculation and PGO) that can break load dependencies. The Linux kernel, for example, gets by by explicitly disabling those optimisations, not doing PGO and targeting a specific compiler.

See n2664 and the recent epic thread on gcc-dev.

HTH,
-- gpd
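A sketch of the kind of dependency-breaking transformation meant here (hypothetical types; this only illustrates what a value-speculating compiler might effectively produce, it is not actual compiler output):

#include <atomic>

struct node { int payload; };
std::atomic<node*> ptr;   // assumed to always point at a published node
node cached_node;         // a node the compiler has profiled as the most likely target of ptr

int reader_as_written()
{
    node* p = ptr.load(std::memory_order_consume);
    return p->payload;                 // address depends on p: ordered on ARM/PPC
}

int reader_after_value_speculation()
{
    int v = cached_node.payload;       // speculative load, no address dependency on ptr
    node* p = ptr.load(std::memory_order_consume);
    if (p != &cached_node)
        v = p->payload;                // on the fast path the dependency chain is gone,
                                       // so the hardware may order the payload read
                                       // before the pointer read
    return v;
}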
On Monday 02 June 2014 12:42:04 Giovanni Piero Deretta wrote:
> ARM and many other RMO architectures (like PPC, and unlike Alpha) guarantee that a load and the load it depends on won't be reordered, so, together with the release operation on the writer side, the load_consume guarantees the visibility of the string body.
Ok, so basically, on ARM and PPC a consume operation should be equivalent to relaxed + a compiler barrier to prevent possible code reordering, am I right?
> In the meantime GCC (and a few other compilers) punts on load_consume and simply promotes it to load_acquire.
Yes, I noticed that. MSVC seems to follow the same route as well.
> That's hard to do without compiler help, unfortunately. Compilers have started doing some quite aggressive optimisations (like value speculation and PGO) that can break load dependencies. The Linux kernel, for example, gets by by explicitly disabling those optimisations, not doing PGO and targeting a specific compiler.
I hope speculative loads and stores can be prevented with compiler barriers. But it looks like PGO/IPO/LTO may ruin the code involving this memory ordering. I suppose that even if I put a compiler barrier (which in the case of gcc is an empty asm statement with "memory" in the clobber list), it will not be in effect during the optimization phase?
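For reference, the compiler barrier in question is something like the following, and the "relaxed + compiler barrier" idea from above would then look roughly like this (reader side only; 'node', 'ptr' and 'use' are made-up names):

#include <atomic>

// gcc-style compiler-only barrier: no hardware fence is emitted, it merely stops
// the compiler from moving or caching memory accesses across this point.
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

struct node { int payload; };
std::atomic<node*> ptr;
void use(int);

void reader()
{
    node* p = ptr.load(std::memory_order_relaxed);
    COMPILER_BARRIER();
    if (p)
        use(p->payload);   // the barrier stops compiler reordering across it, but it does
                           // not by itself stop dependency-breaking optimizations
}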
> See n2664 and the recent epic thread on gcc-dev.
Is it this thread? https://gcc.gnu.org/ml/gcc/2014-02/msg00052.html I'm reading it now, very interesting, thank you for the tip.
On Mon, Jun 2, 2014 at 6:11 PM, Andrey Semashev wrote:
> Ok, so basically, on ARM and PPC a consume operation should be equivalent to relaxed + a compiler barrier to prevent possible code reordering, am I right?
I'm afraid things are a bit more complex. It is not just a problem of reordering, so memory barriers are not enough. See below.
> I hope speculative loads and stores can be prevented with compiler barriers. But it looks like PGO/IPO/LTO may ruin the code involving this memory ordering. I suppose that even if I put a compiler barrier (which in the case of gcc is an empty asm statement with "memory" in the clobber list), it will not be in effect during the optimization phase?
Modulo bugs, GCC will respect memory barriers for all optimizations. Lemme show an example that can break.

Suppose you have a pair of intrusive single-producer/single-consumer queues. Two threads use the two queues for bidirectional access and most of the time can enqueue nodes popped from one queue to the other queue; if the threads are able to run in lock step, the queue size will always be 1 and the same node is always reused. Occasionally, when there is some contention, one thread might be slowed down and the other thread might need to allocate a new node.
struct queue
{
void push_back(node * p); // release
node * pop_front(); // consume
};
The head and tail pointers of the queue are atomic.
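A very rough sketch of what such a queue might look like (the names and layout are my own, and empty-queue/node-recycling details are glossed over); the point is only to show where the release and consume operations sit:

#include <atomic>

struct node
{
    std::atomic<node*> next{nullptr};
    int payload{0};
};

struct queue
{
    node stub;                          // permanently allocated dummy node
    std::atomic<node*> head{&stub};     // advanced by the consumer only
    std::atomic<node*> tail{&stub};     // advanced by the producer only

    void push_back(node* p)             // producer thread
    {
        p->next.store(nullptr, std::memory_order_relaxed);
        node* prev = tail.load(std::memory_order_relaxed);
        tail.store(p, std::memory_order_relaxed);
        prev->next.store(p, std::memory_order_release);   // publish p and its payload
    }

    node* pop_front()                   // consumer thread
    {
        node* front = head.load(std::memory_order_relaxed);
        node* next = front->next.load(std::memory_order_consume);
        if (!next)
            return nullptr;             // empty
        head.store(next, std::memory_order_relaxed);
        // Reads such as next->payload are address-dependent on the consume load above,
        // which is exactly the dependency that consume ordering is supposed to preserve.
        // (In a real intrusive queue the returned node now also acts as the head sentinel,
        // so it cannot be recycled immediately; that detail is ignored here.)
        return next;
    }
};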
> Is it this thread?
yes.

HTH,
-- gpd
On Tue, Jun 3, 2014 at 1:38 AM, Giovanni Piero Deretta wrote:
> This is of course only one (the one that I'm familiar with) of the optimizations that can break seemingly correct code, but there are others (like aggressive expression reduction) that don't even require PGO.
I see. So there's basically no way I can protect myself from such code transformations. The problem is further complicated by the fact that not all of the code is "mine" in the case of the atomic<> implementation.
> The authors of the C++ memory model knew about this issue and added quite complex rules to prevent the compiler from breaking dependencies. Compilers that may perform dependency-breaking optimizations are required to preserve consume ordering at function call boundaries by issuing explicit fences and upgrading the consumes to acquires. The alternative would be to require whole-program compilation, which is untenable. Alternatively, carefully sprinkling the code with [[carries_dependency]] supposedly can prevent the compiler from issuing the fence.
> In practice, implementing the required dependency-tracking machinery in existing compilers has proven too hard, so they issue the fence immediately when performing the initial load. It is therefore possible that the committee will change this part of the standard: see N4036 in the last mailing.
N4036 looks interesting, although I suspect that adding another type qualifier to the language has the potential to break existing code, which currently only accounts for const and volatile. This is more of a C++-specific concern. Also, it's not quite clear how atomic.load(consume) could return a value of a dependency-preserving-qualified type while atomic.load(other_memory_order) does not.
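For reference, the [[carries_dependency]] attribute mentioned above is meant to be used roughly like this (a hypothetical example): it tells the compiler that the dependency from a consume load is carried across the function call boundary, so no fence has to be emitted at the call.

#include <atomic>

struct node { int value; };

[[carries_dependency]] node* load_ptr(std::atomic<node*>& p)
{
    return p.load(std::memory_order_consume);   // the dependency chain starts here
}

int read_value([[carries_dependency]] node* n)
{
    return n->value;   // still dependency-ordered after the consume load in load_ptr()
}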
> For the time being we are stuck with either load_acquire or very fragile workarounds.
Ok, that's what I did in Boost.Atomic as well. Thank you again for the most helpful comments.
participants (3)
- Andrey Semashev
- Giovanni Piero Deretta
- Tim Blechmann