On Sat, Mar 21, 2015 at 2:33 PM, Niall Douglas wrote:
On 20 Mar 2015 at 19:32, Giovanni Piero Deretta wrote:
What's special about memory allocation here? Intuitively, sharing futures on the stack might actually be worse, especially if you have multiple futures being serviced by different threads.
I think memory allocation is entirely wise for shared_future. I think auto allocation is wise for future, as exactly one of those can exist at any time per promise.
I wasn't talking about shared_future. Think about something like this, assuming that the promise has a pointer to the future:

    future<X> a = async(...);
    future<X> b = async(...);
    ... do something which touches the stack ...

a and b are served by two different threads. If sizeof(future) is less than a cache line, every time their corresponding threads move or signal the promise, they will interfere with each other and with this thread doing something. The cache line starts bouncing around like crazy. And remember that with non-allocating futures, the promise and future are touched by the other end on every move, not just on signal.
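To make the false sharing concrete, here is a minimal compilable sketch. It uses a toy stand-in for a non-allocating future (a hypothetical type of mine, not std::future, whose shared state is heap-allocated and so doesn't exhibit this):

    #include <atomic>
    #include <thread>

    // Toy stand-in for a non-allocating future: the shared state
    // lives inside the future object itself, on the consumer's stack.
    struct toy_future {
        std::atomic<int> ready{0};  // written by the producer thread
        int payload{0};             // result storage
    };

    int main() {
        // Two futures adjacent on the stack: with sizeof(toy_future)
        // well under 64 bytes, a, b and local_work below will very
        // likely share one cache line.
        toy_future a, b;
        long local_work = 0;

        std::thread t1([&] {
            a.payload = 1;
            a.ready.store(1, std::memory_order_release);
        });
        std::thread t2([&] {
            b.payload = 2;
            b.ready.store(1, std::memory_order_release);
        });

        // Meanwhile this thread "does something which touches the
        // stack"; all three threads now contend for the same line.
        for (int i = 0; i < 1000000; ++i) ++local_work;

        t1.join(); t2.join();
    }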
On Intel, RMW is the same speed as non-atomic ops unless the cache line is Owned or Shared.
Yes, if the thread does not own the cache line the communication cost dwarfs everything else, but in the normal case of an exclusive cache line, MFENCE, XCHG, CMPXCHG and friends cost 30-50 cycles and stall the CPU. That is significantly more than the cost of non-serialising instructions. Not something I want to do in a move constructor.
You're right and I'm wrong on this - I based the claim above on empirical testing where I found no difference from the use of the LOCK prefix. It would appear I had an inefficiency in my testing code: Agner says that for Haswell:
    XADD:          5 uops,  7 latency
    LOCK XADD:     9 uops, 19 latency
    CMPXCHG:       6 uops,  8 latency
    LOCK CMPXCHG: 10 uops, 19 latency
(Source: http://www.agner.org/optimize/instruction_tables.pdf)
So the LOCK prefix approximately halves the throughput and triples the latency, irrespective of the state of the cache line.
That's already pretty good actually. On Sandy Bridge and Ivy Bridge it was above 20 clocks. Also, the comparison shouldn't be with their non-locked counterparts (which are hardly ever used), but with plain operations. Finally, there is the hidden cost of preventing any out-of-order (OoO) execution, which won't show up in a synthetic benchmark.
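As a rough illustration of the kind of comparison being discussed - plain RMW versus LOCKed RMW on a line the thread owns exclusively - here is a minimal sketch of mine (and, per the caveat above, it won't capture the cost of the lost OoO execution):

    #include <atomic>
    #include <chrono>
    #include <cstdio>

    int main() {
        constexpr long N = 100000000;
        volatile long plain = 0;      // plain load/add/store each iteration
        std::atomic<long> locked{0};  // LOCK XADD each iteration (on x86)

        auto now = [] { return std::chrono::steady_clock::now(); };
        auto ms = [](auto d) {
            return std::chrono::duration<double, std::milli>(d).count();
        };

        auto t0 = now();
        for (long i = 0; i < N; ++i) plain = plain + 1;
        auto t1 = now();
        for (long i = 0; i < N; ++i)
            locked.fetch_add(1, std::memory_order_relaxed);
        auto t2 = now();

        std::printf("plain: %.0f ms, locked: %.0f ms\n",
                    ms(t1 - t0), ms(t2 - t1));
    }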
Additionally, as I reported on this list maybe a year ago, first-generation Intel TSX provides no benefit and indeed a hefty penalty over simple atomic RMW.
Hopefully it will get better in the future. I haven't had the chance to try it yet, but TSX might help with implementing wait_all and wait_any, whose setup and teardown require a large number of atomic ops.
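To see why, here is a toy sketch of the setup/teardown pattern (hypothetical types and names of mine, not any real library's API): registering one waiter with N future states costs at least one CAS per state, and deregistering costs another - exactly the kind of batch a TSX transaction might fold into a single commit.

    #include <atomic>

    struct waiter { std::atomic<int> signalled{0}; };

    struct toy_state {
        std::atomic<waiter*> w{nullptr};  // waiter registered by wait_any
    };

    // One LOCK CMPXCHG per future just to register...
    bool try_attach(toy_state& s, waiter* me) {
        waiter* expected = nullptr;
        return s.w.compare_exchange_strong(expected, me);
    }

    // ...and one more per future to deregister, so ~2N locked RMWs
    // for a wait_any over N futures before any real work happens.
    void detach(toy_state& s, waiter* me) {
        waiter* expected = me;
        s.w.compare_exchange_strong(expected, nullptr);
    }

    int main() {
        toy_state states[4];
        waiter me;
        for (auto& s : states) try_attach(s, &me);  // setup: N CASes
        for (auto& s : states) detach(s, &me);      // teardown: N more
    }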
ARM and other CPUs provide load-linked/store-conditional, so RMW with those is indeed close to penalty-free if the cache line is exclusive to the CPU doing those ops. It's just that Intel is still incapable of low latency lock gets, though it's enormously better than the Pentium 4.
The reason for the high cost is that RMW instructions have sequential consistency semantics. On the other hand, on Intel plain loads and stores already have the desirable load_acquire and store_release semantics, so you do not need extra memory barriers.
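A small sketch of that last point (a toy example of mine; the comments describe typical x86-64 code generation, easily checked in a disassembler):

    #include <atomic>

    std::atomic<int> flag{0};

    // On x86-64 this compiles to a plain MOV: ordinary stores already
    // give store_release ordering, so no extra barrier is emitted.
    void store_release() { flag.store(1, std::memory_order_release); }

    // A seq_cst store needs the full barrier, so the compiler typically
    // emits XCHG (an implicitly LOCKed RMW) - this is where the cost of
    // sequential consistency shows up.
    void store_seq_cst() { flag.store(1, std::memory_order_seq_cst); }

    int main() { store_release(); store_seq_cst(); }

-- gpd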