On 21 Mar 2015 at 18:36, Giovanni Piero Deretta wrote:
a and b are served by two different threads. If sizeof(future) is less than a cacheline, then every time their corresponding threads move or signal the promise, they will interfere with each other and with this thread while it is doing something else. The cacheline starts bouncing around like crazy. And remember that with non-allocating futures, promise and futures are touched by the other end on every move, not just on signal.
I hadn't even considered a non-allocating future which isn't aligned to a 64-byte boundary.
Additionally, as I reported on this list maybe a year ago, first-gen Intel TSX provides no benefits and indeed a hefty penalty over a simple atomic RMW.
Hopefully it will get better in the future. I haven't had the chance to try it yet, but TSX might help with implementing wait_all and wait_any, whose setup and teardown require a large number of atomic ops.
I posted comprehensive benchmarks on this list a year or so ago; see http://boost.2283326.n4.nabble.com/Request-for-feedback-on-design-of-concurrent-unordered-map-plus-notes-on-use-of-memory-transactions-td4665594.html. My conclusion was that first-gen Intel TSX isn't ready for real-world usage: in real-world code, the only benefits show up in heavily multithreaded, 90-95% read-only scenarios. I tried a TSX-based hash table and was surprised at just how much slower the single- and dual-threaded scenarios were, sufficiently so that I concluded it would be a bad idea to have TSX turned on by default for almost all general-purpose algorithm implementations. Where TSX does prove useful, though, is for transactional GCC, which produces nothing like as penalised code as it does on non-TSX hardware.
ARM and other CPUs provide load-linked/store-conditional, so RMW on those is indeed close to penalty-free if the cache line is exclusive to the CPU doing the ops. It's just that Intel is still incapable of low-latency lock acquisition, though it's enormously better than it was on the Pentium 4.
The reason for the high cost is that RMWs have sequential-consistency semantics. On the other hand, on Intel plain loads and stores have the desirable load-acquire and store-release semantics, so you do not need extra memory barriers.
Still, if Intel could do a no-extra-cost load-linked/store-conditional, _especially_ if those could be nested two or four levels deep, it would be a big win. That 19-cycle latency and pipeline flush is expensive, plus there is unnecessary cache coherency traffic when you don't update the cache line. Plus, two- or four-level nesting would allow the atomic update of a pair of pointers or two pairs of pointers, which is 98% of the use case for Intel TSX anyway. That said, if Intel TSX v2 didn't have the same overheads as a lock xchg, that's even better again.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/