On 20 Mar 2015 13:07, "Niall Douglas" wrote:
On 20 Mar 2015 at 9:19, Giovanni Piero Deretta wrote:
Your future still allocates memory, and is therefore costing about 1000 CPU cycles.
1000 clock cycles seems excessive with a good malloc implementation.
Going to main memory due to a cache line miss costs 250 clock cycles, so no it isn't. Obviously slower processors spin fewer cycles for a cache line miss.
Why would a memory allocation necessarily imply a cache miss? Eh, you are even assuming an L3 miss; that must be a poor allocator!
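[To sanity-check the 250-cycle figure on a given machine, one can chase pointers through a buffer far larger than L3 so that nearly every load misses. A minimal sketch; every number it produces is machine-dependent, and ns/load times the clock rate in GHz gives approximate cycles per miss:]

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
        // 2^24 indices of 8 bytes = 128 MiB, far larger than any L3.
        constexpr std::size_t N = std::size_t(1) << 24;
        std::vector<std::size_t> next(N);
        for (std::size_t i = 0; i < N; ++i) next[i] = i;
        // Sattolo's algorithm: a single-cycle random permutation, so
        // the chase visits every slot once and defeats the prefetcher.
        std::mt19937_64 rng{42};
        for (std::size_t i = N - 1; i > 0; --i) {
            std::uniform_int_distribution<std::size_t> d(0, i - 1);
            std::swap(next[i], next[d(rng)]);
        }
        std::size_t idx = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < N; ++i) idx = next[idx];  // dependent loads
        std::chrono::duration<double, std::nano> dt =
            std::chrono::steady_clock::now() - t0;
        std::printf("%.1f ns per load (idx=%zu)\n", dt.count() / N, idx);
    }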
[Snip]
Newer architectures can be quite good at exchanging the S and O states if exactly the right two caches have that line (e.g. two CPUs on the same die). As soon as you go past two caches sharing a line though, all bets are off and everything gets synced through main memory.
All the above is why memory allocation is bad here: any shared state usually ends up hitting worst-case performance for MOESI.
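[The ping-pong described above is easy to reproduce: two threads doing atomic RMWs on fields sharing a cache line force that line to migrate between caches on every operation. A hypothetical micro-benchmark, timings machine-dependent; compare the result after swapping in the padded declaration of b:]

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    struct Shared {
        std::atomic<long> a{0};
        std::atomic<long> b{0};  // deliberately on the same cache line as a
        // alignas(64) std::atomic<long> b{0};  // padded variant: no ping-pong
    };

    int main() {
        Shared s;
        constexpr long N = 20'000'000;
        auto t0 = std::chrono::steady_clock::now();
        std::thread t([&] { for (long i = 0; i < N; ++i) s.a.fetch_add(1); });
        for (long i = 0; i < N; ++i) s.b.fetch_add(1);
        t.join();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("%lld ms\n", static_cast<long long>(ms));
    }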
What's special about memory allocation here? Intuitively sharing futures on the stack might actually be worse especially if you have multiple futures being serviced by different threads.
Modern memory allocators are very good at handling the same thread freeing a block it itself allocated, when only that thread ever saw that allocation. They are poor at handling a different thread freeing, or even having visibility of, a block allocated by another thread.
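[A sketch of the asymmetry being described, assuming a thread-caching allocator (tcmalloc/jemalloc style) behind malloc; absolute numbers vary widely by allocator and platform, and the cross-thread timing includes thread startup:]

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <thread>
    #include <vector>

    int main() {
        constexpr int N = 1'000'000;
        std::vector<void*> blocks(N);
        using clock = std::chrono::steady_clock;

        // Fast path: the same thread allocates and frees, so a caching
        // allocator serves everything from a thread-local free list.
        auto t0 = clock::now();
        for (auto& p : blocks) p = std::malloc(64);
        for (auto p : blocks) std::free(p);
        auto t1 = clock::now();

        // Slow path: another thread frees blocks it never allocated,
        // forcing cross-thread bookkeeping and cache line transfers.
        for (auto& p : blocks) p = std::malloc(64);
        std::thread other([&] { for (auto p : blocks) std::free(p); });
        other.join();
        auto t2 = clock::now();

        auto ms = [](auto d) {
            return (long long)std::chrono::duration_cast<
                std::chrono::milliseconds>(d).count();
        };
        std::printf("same-thread: %lld ms, cross-thread: %lld ms\n",
                    ms(t1 - t0), ms(t2 - t1));
    }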
Because of the way ownership is relinquished, in my implementation the consumer usually both allocates and frees the state. Unless you use shared futures, but then that's the last of your problems.

[Snip]

I am aware of that solution. My issue with that design is that it requires an expensive RMW for every move. Do a few moves and it will quickly dwarf the cost of an allocation, especially considering that an out-of-order CPU will happily overlap computation with a cache miss, while the required memory barrier will stall the pipeline on current CPUs (I'm thinking of x86 of course). That might change in the near future though.
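[To make the objection concrete: a deliberately naive sketch, not Niall's actual design and not race-free as written, of a non-allocating future where promise and future point at each other, so every move has to re-point the peer with atomic RMWs:]

    #include <atomic>
    #include <utility>

    struct future;

    struct promise {
        std::atomic<future*> peer{nullptr};  // where the producer delivers
    };

    struct future {
        std::atomic<promise*> peer{nullptr};

        future() = default;

        future(future&& o) noexcept {
            // RMW #1: steal the link from the source future.
            promise* p = o.peer.exchange(nullptr, std::memory_order_acq_rel);
            peer.store(p, std::memory_order_relaxed);
            // RMW #2: tell the promise its consumer has moved, since the
            // producer may be about to write through its back-pointer.
            // This per-move synchronisation is the cost objected to above.
            if (p) p->peer.exchange(this, std::memory_order_acq_rel);
        }
    };

    int main() {
        promise p;
        future f1;
        f1.peer.store(&p);
        p.peer.store(&f1);
        future f2(std::move(f1));  // two RMWs even with zero contention
    }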
On Intel, an RMW is the same speed as non-atomic ops unless the cache line is Owned or Shared.
Yes, if the thread does not own the cache line the communication cost dwarfs everything else; but in the normal case of an exclusive cache line, mfence, xchg, cmpxchg and friends cost 30-50 cycles and stall the CPU. That is significantly more than the cost of non-serialising instructions, and not something I want to do in a move constructor. [Snip]
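[The stall is visible even with zero contention, i.e. with the line Exclusive to one thread. A rough micro-benchmark sketch; the exact gap is machine-dependent:]

    #include <atomic>
    #include <chrono>
    #include <cstdio>

    int main() {
        constexpr int N = 100'000'000;
        std::atomic<int> a{0};
        volatile int plain = 0;  // volatile so the store loop isn't folded away

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i) plain = i;      // plain stores
        auto t1 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i) a.exchange(i);  // xchg, implicitly locked
        auto t2 = std::chrono::steady_clock::now();

        auto ms = [](auto d) {
            return (long long)std::chrono::duration_cast<
                std::chrono::milliseconds>(d).count();
        };
        std::printf("plain store: %lld ms, xchg: %lld ms (x=%d)\n",
                    ms(t1 - t0), ms(t2 - t1), int(plain));
    }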
A couple of months ago I was arguing with Gor Nishanov (author of the MS resumable functions paper) that heap-allocating the resumable function by default is unacceptable. And here I am arguing the other side :).
It is unacceptable. Chris's big point in his Concurrency alternative paper before WG21 is that future-promise is useless for 100k-1M socket scalability, because to reach that you must have much lower latency than future-promise is capable of. ASIO's async_result system can achieve 100k-1M scalability. Future-promise (as currently implemented) cannot.
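[For contrast, a minimal sketch of the callback style ASIO's async_result model enables (Boost 1.57-era API; connection setup elided). The completion handler runs straight off the event loop, with no allocated shared state and no synchronisation on the result:]

    #include <boost/asio.hpp>
    #include <array>
    #include <cstddef>

    void start_read(boost::asio::ip::tcp::socket& sock,
                    std::array<char, 4096>& buf) {
        sock.async_read_some(boost::asio::buffer(buf),
            [&sock, &buf](boost::system::error_code ec, std::size_t n) {
                if (!ec) {
                    // ... consume n bytes ..., then re-arm the read
                    start_read(sock, buf);
                }
            });
    }

    int main() {
        boost::asio::io_service io;            // io_context in later Boost
        boost::asio::ip::tcp::socket sock(io);
        std::array<char, 4096> buf;
        // connect/accept elided; on an unconnected socket the handler
        // just fires once with an error and run() returns.
        start_read(sock, buf);
        io.run();
    }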
Thankfully WG21 appears to have accepted this about resumable functions, insofar as I am aware.
I'm a big fan of Chris' proposal as well. I haven't seen any new papers on resumable functions; I would love to know where the committee is heading.

-- gpd