On 01/26/18 16:32, Peter Dimov via Boost wrote:
Andrey Semashev wrote:
While we're on the subject, on what architectures would opaque_sub be more efficient than sub_and_test?
On x86 with gcc < 7, opaque_sub allows the use of "lock sub" or "lock dec" without setting the bool according to the zero flag, i.e. it saves a register and an instruction.
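To make the difference between the two operations concrete, here is a minimal sketch written against std::atomic. The free functions below are hypothetical stand-ins, not the Boost.Atomic implementation, and the convention that sub_and_test returns true when the new value is zero is an assumption made for illustration only.

```cpp
#include <atomic>

// Hypothetical stand-in: subtract and report nothing back to the caller.
// Because no result is produced, a backend is free to use a plain
// "lock sub" with no flag materialization.
inline void opaque_sub(std::atomic<unsigned>& a, unsigned v) noexcept {
    (void)a.fetch_sub(v);
}

// Hypothetical stand-in: subtract and report whether the new value is zero
// (assumed convention). fetch_sub returns the old value, so the new value
// is old - v; producing the bool is the extra "setz" discussed above.
inline bool sub_and_test(std::atomic<unsigned>& a, unsigned v) noexcept {
    return a.fetch_sub(v) - v == 0u;
}
```

The point is only the shape of the interface: the first function has nothing to return, so no register is needed for a result.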
Right, thanks. I was thinking that testing for zero comes for free, but it's not (entirely) free for the reason you give. Does this actually matter in practice? I would expect the atomic to dominate the `set(n)z al`.
Latency-wise, I expect that to be mostly true. But wasting a register may be undesirable if it causes a spill on the stack somewhere in the surrounding code, especially if this is a tight loop. In any case, I just want to be able to generate the best possible code with the interface atomic<> provides.
Gcc 7 introduced the ability to return flags from the asm statement, so the code can be written the same way. Although I noticed that the compiler tends to save the flag into a register early unless it is tested immediately, so in some cases opaque_sub might still be preferable where it suits.
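For readers unfamiliar with the feature, the following is a rough sketch of how an asm statement can hand the zero flag back to the compiler through a flag output operand ("=@ccz"). The function name sub_and_test_zero and the plain unsigned storage are hypothetical; the availability check uses the __GCC_ASM_FLAG_OUTPUTS__ macro rather than a compiler version, and other targets fall back to the __atomic_sub_fetch builtin.

```cpp
// Hypothetical helper: atomically subtract v from storage and return true
// if the result is zero.
inline bool sub_and_test_zero(unsigned& storage, unsigned v) noexcept {
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && (defined(__x86_64__) || defined(__i386__))
    bool zero;
    __asm__ __volatile__(
        "lock; subl %[v], %[s]"
        : [s] "+m" (storage), "=@ccz" (zero)  // ZF is returned as a bool
        : [v] "ir" (v)
        : "memory");
    return zero;
#else
    // Fallback for targets without flag outputs: let the builtin compute
    // the new value and compare it against zero.
    return __atomic_sub_fetch(&storage, v, __ATOMIC_SEQ_CST) == 0u;
#endif
}
```

With the flag output, the compiler decides whether to branch on the flag directly or to save it into a register, which is where the early "setz" mentioned above can appear.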
Don't see how opaque_sub could be preferable if you need to test the flag later. :-)
Of course. :) I meant, in the case where you don't need the result, opaque_sub is still preferable to fetch_sub or sub_and_test.
Presumably, if you just call the function and discard the return value - the equivalent of opaque_ - the compiler would be smart enough to not save the flag.
Hopefully, but I wouldn't bet on it. I've seen gcc generate "setz" then a couple of "movs" which were moved from god knows where and then "test" and a conditional jump. Clearly, "movs" don't alter flags, so the spill and the test are useless. Admittedly, dropping "setz" when the result is unused is a different kind of optimization. But my point is that optimizations like these are generally unreliable, and if you really want the best possible code then you had better write it so that the compiler has less opportunity to screw up.
I remember some compilers being smart enough to notice that you don't use the result of the atomic fetch_op intrinsic and generating the `lock op` themselves, without a separate opaque_op being needed. We can't do that on the library level, of course.
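The pattern in question looks roughly like the sketch below: a fetch_sub whose return value is never used. The function drain and the counter name are hypothetical. Whether a given compiler turns the call into "lock sub"/"lock dec" instead of "lock xadd" is not guaranteed and is best checked by inspecting the generated assembly (e.g. with -S).

```cpp
#include <atomic>

// Hypothetical example: decrement a counter n times without ever looking
// at the value fetch_sub returns.
inline void drain(std::atomic<int>& pending, int n) noexcept {
    for (int i = 0; i < n; ++i) {
        // Result intentionally discarded; the compiler may or may not
        // exploit this to pick a cheaper instruction.
        (void)pending.fetch_sub(1, std::memory_order_relaxed);
    }
}
```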
Yes, I've seen gcc 7 (and maybe 6?) do that on occasion, but it seemed that it didn't always do that either. I didn't investigate that closely to find out why it didn't always optimize.