On Sat, Mar 21, 2015 at 2:33 PM, Niall Douglas wrote:
On 20 Mar 2015 at 19:32, Giovanni Piero Deretta wrote:
What's special about memory allocation here? Intuitively, sharing futures on the stack might actually be worse, especially if you have multiple futures being serviced by different threads.
I think memory allocation is entirely wise for shared_future. I think auto allocation is wise for future, as exactly one of those can exist at any time per promise.
I wasn't talking about shared_future. Think about something like this, assuming that the promise has a pointer to the future:

    future<X> a = async(...);
    future<X> b = async(...);
    ... do something which touches the stack ...

a and b are served by two different threads. If sizeof(future) is less than a cache line, every time their corresponding threads move or signal the promise, they will interfere with each other and with this thread doing something. The cache line starts bouncing around like crazy. And remember that with non-allocating futures, the promise and future are touched by the other end on every move, not just on signal.
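To make the false sharing concrete, here is a minimal compilable sketch. It uses a toy stand-in for a non-allocating future (a hypothetical type of mine, not std::future, whose shared state is heap-allocated and so doesn't exhibit this):

    #include <atomic>
    #include <thread>

    // Toy stand-in for a non-allocating future: the shared state
    // lives inside the future object itself, on the consumer's stack.
    struct toy_future {
        std::atomic<int> ready{0};  // written by the producer thread
        int payload{0};             // result storage
    };

    int main() {
        // Two futures adjacent on the stack: with sizeof(toy_future)
        // well under 64 bytes, a, b and local_work below will very
        // likely share one cache line.
        toy_future a, b;
        long local_work = 0;

        std::thread t1([&] {
            a.payload = 1;
            a.ready.store(1, std::memory_order_release);
        });
        std::thread t2([&] {
            b.payload = 2;
            b.ready.store(1, std::memory_order_release);
        });

        // Meanwhile this thread "does something which touches the
        // stack"; all three threads now contend for the same line.
        for (int i = 0; i < 1000000; ++i) ++local_work;

        t1.join(); t2.join();
    }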
On Intel, RMW is the same speed as non-atomic ops unless the cache line is Owned or Shared.
Yes, if the thread does not own the cache line the communication cost dwarfs everything else, but in the normal case of an exclusive cache line, MFENCE, XCHG, CMPXCHG and friends cost 30-50 cycles and stall the CPU. That is significantly more than the cost of non-serialising instructions. Not something I want to do in a move constructor.
You're right and I'm wrong on this - I based the claim above on empirical testing where I found no difference from the use of the LOCK prefix. It would appear I had an inefficiency in my testing code: Agner says that for Haswell:
    XADD:          5 uops,  7 latency
    LOCK XADD:     9 uops, 19 latency
    CMPXCHG:       6 uops,  8 latency
    LOCK CMPXCHG: 10 uops, 19 latency
(Source: http://www.agner.org/optimize/instruction_tables.pdf)
So the LOCK prefix approximately halves the throughput and triples the latency, irrespective of the state of the cache line.
That's already pretty good actually. On Sandy Bridge and Ivy Bridge it was above 20 clocks. Also, the comparison shouldn't be with their non-locked counterparts (which are hardly ever used), but with plain operations. Finally, there is the hidden cost of preventing any out-of-order (OoO) execution, which won't show up in a synthetic benchmark.
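As a rough illustration of the kind of comparison being discussed - plain RMW versus LOCKed RMW on a line the thread owns exclusively - here is a minimal sketch of mine (and, per the caveat above, it won't capture the cost of the lost OoO execution):

    #include <atomic>
    #include <chrono>
    #include <cstdio>

    int main() {
        constexpr long N = 100000000;
        volatile long plain = 0;      // plain load/add/store each iteration
        std::atomic<long> locked{0};  // LOCK XADD each iteration (on x86)

        auto now = [] { return std::chrono::steady_clock::now(); };
        auto ms = [](auto d) {
            return std::chrono::duration<double, std::milli>(d).count();
        };

        auto t0 = now();
        for (long i = 0; i < N; ++i) plain = plain + 1;
        auto t1 = now();
        for (long i = 0; i < N; ++i)
            locked.fetch_add(1, std::memory_order_relaxed);
        auto t2 = now();

        std::printf("plain: %.0f ms, locked: %.0f ms\n",
                    ms(t1 - t0), ms(t2 - t1));
    }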
Additionally, as I reported on this list maybe a year ago, first-generation Intel TSX provides no benefit and indeed a hefty penalty over simple atomic RMW.
Hopefully it will get better in the future. I haven't had the chance to try it yet, but TSX might help with implementing wait_all and wait_any, whose setup and teardown require a large number of atomic ops.
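To see why, here is a toy sketch of the setup/teardown pattern (hypothetical types and names of mine, not any real library's API): registering one waiter with N future states costs at least one CAS per state, and deregistering costs another - exactly the kind of batch a TSX transaction might fold into a single commit.

    #include <atomic>

    struct waiter { std::atomic<int> signalled{0}; };

    struct toy_state {
        std::atomic<waiter*> w{nullptr};  // waiter registered by wait_any
    };

    // One LOCK CMPXCHG per future just to register...
    bool try_attach(toy_state& s, waiter* me) {
        waiter* expected = nullptr;
        return s.w.compare_exchange_strong(expected, me);
    }

    // ...and one more per future to deregister, so ~2N locked RMWs
    // for a wait_any over N futures before any real work happens.
    void detach(toy_state& s, waiter* me) {
        waiter* expected = me;
        s.w.compare_exchange_strong(expected, nullptr);
    }

    int main() {
        toy_state states[4];
        waiter me;
        for (auto& s : states) try_attach(s, &me);  // setup: N CASes
        for (auto& s : states) detach(s, &me);      // teardown: N more
    }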
ARM and other CPUs provide load-linked/store-conditional, so RMW with those is indeed close to penalty-free if the cache line is exclusive to the CPU doing those ops. It's just that Intel is still incapable of low latency lock gets, though it's enormously better than the Pentium 4.
The reason for the high cost is that RMW instructions have sequential consistency semantics. On the other hand, on Intel plain loads and stores already have the desirable load_acquire and store_release semantics, so you do not need extra memory barriers.
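A small sketch of that last point (a toy example of mine; the comments describe typical x86-64 code generation, easily checked in a disassembler):

    #include <atomic>

    std::atomic<int> flag{0};

    // On x86-64 this compiles to a plain MOV: ordinary stores already
    // give store_release ordering, so no extra barrier is emitted.
    void store_release() { flag.store(1, std::memory_order_release); }

    // A seq_cst store needs the full barrier, so the compiler typically
    // emits XCHG (an implicitly LOCKed RMW) - this is where the cost of
    // sequential consistency shows up.
    void store_seq_cst() { flag.store(1, std::memory_order_seq_cst); }

    int main() { store_release(); store_seq_cst(); }

-- gpd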