On 21 Mar 2015 at 18:36, Giovanni Piero Deretta wrote:
a and b are served by two different threads. If sizeof(future) is less than a cacheline, then every time their corresponding threads move or signal the promise, they will interfere with each other and with this thread while it is doing something else. The cacheline starts bouncing around like crazy. And remember that with non-allocating futures, promise and futures are touched by the other end on every move, not just on signal.
I hadn't even considered a non-allocating future which isn't aligned to a 64-byte boundary.
Additionally, as I reported on this list maybe a year ago, first-gen Intel TSX provides no benefits and indeed a hefty penalty over a simple atomic RMW.
Hopefully it will get better in the future. I haven't had the chance to try it yet, but TSX might help with implementing wait_all and wait_any, whose setup and teardown require a large number of atomic ops.
I posted comprehensive benchmarks on this list a year or so ago; see http://boost.2283326.n4.nabble.com/Request-for-feedback-on-design-of-concurrent-unordered-map-plus-notes-on-use-of-memory-transactions-td4665594.html. My conclusion was that first-gen Intel TSX isn't ready for real-world usage: in real-world code, the only benefits show up in heavily multithreaded, 90-95% read-only scenarios. I tried a TSX-based hash table and was surprised at just how much slower the single- and dual-threaded scenarios were, sufficiently so that I concluded it would be a bad idea to have TSX turned on by default for almost all general-purpose algorithm implementations. Where TSX does prove useful, though, is for transactional GCC, which produces nothing like as penalised code as it does on non-TSX hardware.
ARM and other CPUs provide load-linked/store-conditional, so RMW on those is indeed close to penalty-free if the cache line is exclusive to the CPU doing the ops. It's just that Intel is still incapable of low-latency lock acquisition, though it's enormously better than it was on the Pentium 4.
The reason for the high cost is that RMWs have sequential-consistency semantics. On the other hand, on Intel plain loads and stores have the desirable load-acquire and store-release semantics, so you do not need extra memory barriers.
Still, if Intel could do a no-extra-cost load-linked/store-conditional, _especially_ if those could be nested two or four levels deep, it would be a big win. That 19-cycle latency and pipeline flush is expensive, plus there is unnecessary cache coherency traffic when you don't update the cache line. Plus, two- or four-level nesting would allow the atomic update of a pair of pointers or two pairs of pointers, which is 98% of the use case for Intel TSX anyway. That said, if Intel TSX v2 didn't have the same overheads as a lock xchg, that's even better again.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/