On 20 Mar 2015 at 9:19, Giovanni Piero Deretta wrote:
Your future still allocates memory, and therefore costs about 1000 CPU cycles.
1000 clock cycles seems excessive with a good malloc implementation.
Going to main memory due to a cache line miss costs around 250 clock cycles, so no, it isn't excessive. Obviously slower processors spin fewer cycles per cache line miss.
Why would a memory allocation necessarily imply a cache miss? Eh, you are even assuming an L3 miss; that must be a poor allocator!
Last-level caching is irrelevant; the problem here is the MOESI protocol, which renders caching void. What you ideally want is unidirectional, forward-only cache line movement between threads: thread A modifies a line (M) and signals thread B, who then shares that line (S), at which point thread A's line becomes owned (O). If neither thread A nor thread B ever modifies that line after thread B has shared it, performance will be excellent even under very heavy load, because main memory is never involved. If, on the other hand, thread A or thread B does modify a shared or owned line, it gets more complicated. Newer architectures can be quite good at exchanging the S and O states if exactly the right two caches have that line (e.g. two CPUs on the same die). As soon as you go past two caches sharing a line though, all bets are off and everything gets synced through main memory.

All the above is why memory allocation is bad here: any shared state usually ends up being worst-case performance for MOESI. Modern memory allocators are very good at handling the same thread freeing a block it itself allocated, where only that thread ever saw that allocation. They are poor at handling a different thread freeing, or even having visibility of, a block allocated by another thread. One brutally effective way of making memory allocation MOESI friendly is forcing 64 byte alignment on all blocks, so no two allocations ever share a cache line.

So you're right that a custom allocator could be useful here, though the problem with shared states is that they very easily leak visibility to more than two threads. And because future-promise is extremely likely to have a different thread free a block than the thread which allocated it, you're going to have to introduce quite a bit of latency-inducing machinery in your custom allocator to handle that. Even if you get the average low, you're going to see big latency spikes from time to time. One technique is to have a dedicated worker thread which never sleeps, only spins, and whose sole purpose is to free allocations made by all the other threads (see the sketch below). You can really get memory allocation latencies down with that technique, but it's unacceptable on anything mobile or consumer focused.
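To make that last technique concrete, here is a minimal sketch assuming C++11 atomics; all names (deferred_free, freer_loop, free_queue) are hypothetical and this is not AFIO code. Every thread pushes blocks onto a lock-free multi-producer stack instead of calling free(), and one dedicated spinning thread drains it, so a block's cache lines are only ever touched by the allocating thread and the freer:

    #include <atomic>
    #include <cstdlib>

    // Each freed block's own storage is reused as the list node, so blocks
    // must be at least sizeof(void *) bytes (trivially true if you force
    // 64 byte alignment/sizing as suggested above).
    struct free_node { free_node *next; };

    // Lock-free multi-producer stack of blocks awaiting free().
    std::atomic<free_node *> free_queue{nullptr};
    std::atomic<bool> freer_quit{false};

    // Called from any thread instead of std::free(): one atomic CAS.
    void deferred_free(void *p) {
        auto *n = static_cast<free_node *>(p);
        n->next = free_queue.load(std::memory_order_relaxed);
        while (!free_queue.compare_exchange_weak(
            n->next, n, std::memory_order_release, std::memory_order_relaxed))
            ;
    }

    // The dedicated worker: never sleeps, only spins, draining the stack.
    void freer_loop() {
        while (!freer_quit.load(std::memory_order_relaxed)) {
            free_node *batch =
                free_queue.exchange(nullptr, std::memory_order_acquire);
            while (batch) {
                free_node *next = batch->next;
                std::free(batch);
                batch = next;
            }
            // No sleeping: the spin keeps free() latency off the hot
            // threads, at the cost of permanently burning a core.
        }
    }

Run freer_loop() on its own std::thread; the permanently burnt core is exactly why this is unacceptable on mobile or consumer hardware.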
I am aware of that solution. My issue with that design is that it requires an expensive atomic RMW for every move. Do a few moves and they will quickly dwarf the cost of an allocation, especially considering that an out-of-order core will happily overlap computation with a cache miss, while the required memory barrier will stall the pipeline on current CPUs (I'm thinking of x86 of course). That might change in the near future though.
On Intel an atomic RMW is the same speed as a non-atomic op unless the cache line is Owned or Shared. There is admittedly a small penalty on ARM, but on the Cortex-A15 I have here it is not much.
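For readers following along, here is a minimal sketch of the kind of design being discussed, assuming the shared state lives inside the future object itself and the promise tracks its address; all names are hypothetical, and the full handshake with a concurrently completing promise is elided:

    #include <atomic>
    #include <utility>

    template <class T> struct inline_future;

    template <class T>
    struct inline_promise {
        // Points at wherever the matching future's inline state lives now.
        std::atomic<inline_future<T> *> target{nullptr};
    };

    template <class T>
    struct inline_future {
        T value{};
        inline_promise<T> *src = nullptr;

        inline_future() = default;
        inline_future(inline_future &&o) noexcept
            : value(std::move(o.value)), src(o.src) {
            o.src = nullptr;
            if (src) {
                inline_future *expected = &o;
                // The expensive part: every move must repoint the promise
                // with an atomic RMW (lock cmpxchg on x86, a full barrier)
                // so a promise firing concurrently never writes through a
                // stale pointer. A failed CAS means the promise already
                // delivered into the old location (recovery elided here).
                src->target.compare_exchange_strong(
                    expected, this, std::memory_order_acq_rel);
            }
        }
    };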
I understand what you are aiming at, but I think the elidability is orthogonal. Right now I'm focusing on making the actual synchronisation fast and composable in the scenario where the program has committed to making a computation async.
This is fine until your compiler supports resumable functions.
This is funny :). A couple of months ago I was arguing with Gor Nishanov (author of the MS resumable functions paper) that heap-allocating the resumable function by default is unacceptable. And here I am arguing the other side :).
It is unacceptable. Chris's big point in his Concurrency alternative paper before WG21 is that future-promise is useless for 100k-1M socket scalability, because to reach that you must have much lower latency than future-promise is capable of. ASIO's async_result system can achieve 100k-1M scalability; future-promise (as currently implemented) cannot. Thankfully WG21 appear to have accepted this about resumable functions, in so far as I am aware. My hope is that we can demonstrate future-promise is easily 1M scalable using C++ 14 constexpr and lazy or no shared state allocation.
OK, my compromise is to not allocate while the async operation is merely deferred but can still be executed synchronously, and to lazily convert to heap allocation only when the operation needs to be executed truly asynchronously, basically not until you actually create the promise (at that point the cost of the async setup will provably dwarf the allocation; and even in this case the allocation can be skipped if we know we will sync before a move, since then it is safe to allocate the shared state on the stack). This should allow the compiler to remove the abstraction completely if it can prove it safe (see the sketch below). Still working on it; I should have something in a few days.
I guess I'm converging on your design.
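A minimal sketch of that lazy conversion, assuming C++11/14; the names (lazy_future, make_shared_state) are hypothetical, not the actual code. The result lives in inline storage while the operation is merely deferred, and heap shared state is only created at the point of true asynchrony:

    #include <cassert>
    #include <memory>
    #include <new>
    #include <type_traits>
    #include <utility>

    template <class T>
    class lazy_future {
        typename std::aligned_storage<sizeof(T), alignof(T)>::type inline_;
        bool inline_set_ = false;
        std::shared_ptr<T> shared_;  // empty until we commit to async

    public:
        lazy_future() = default;
        ~lazy_future() {
            if (inline_set_) reinterpret_cast<T *>(&inline_)->~T();
        }

        // Synchronous completion path: no heap allocation at all.
        void set_value_sync(T v) {
            ::new (&inline_) T(std::move(v));
            inline_set_ = true;
        }

        // Called only when the operation must become truly asynchronous;
        // only now do we pay for the allocation, by which point the cost
        // of the async setup dwarfs it anyway.
        std::shared_ptr<T> make_shared_state() {
            if (!shared_) shared_ = std::make_shared<T>();
            return shared_;
        }

        T get() {
            if (inline_set_)
                return std::move(*reinterpret_cast<T *>(&inline_));
            assert(shared_);  // real code would wait on the shared state
            return std::move(*shared_);
        }
    };

Note that on the synchronous path both allocation and deallocation stay in one thread, which keeps the MOESI concerns above entirely out of the picture.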
I think the lazy conversion to shared state is a very wise design goal. Simple single-future waits shouldn't need it, but for simplicity of implementation it's probably easier, and you're deallocating in the same thread as the allocation, so on that front you're good.
It provides a hook API with filter C functions which can, to a limited extent, provide some custom functionality. Indeed, the file descriptor/HANDLE toggling is implemented that way. There is only so much genericity which can be done in C.
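For anyone unfamiliar with the shape of such APIs, here is a purely hypothetical illustration (none of these names come from the library being discussed) of a C filter-hook mechanism: a fixed function signature, a type-erased void * for user state, and a dispatch chain, which is about as far as genericity in C goes:

    #include <cstddef>

    extern "C" {

    typedef enum filter_result {
        FILTER_PASS = 0,    /* let the next filter see the event */
        FILTER_HANDLED = 1  /* claim the event, stop dispatch */
    } filter_result;

    /* Each filter sees the raw event plus its own opaque state. */
    typedef filter_result (*event_filter_fn)(void *event, void *user_state);

    struct filter_entry { event_filter_fn fn; void *user_state; };

    static filter_entry g_filters[16];
    static size_t g_filter_count = 0;

    /* Registration: genericity in C stops at function pointers + void *. */
    int register_event_filter(event_filter_fn fn, void *user_state) {
        if (g_filter_count >= 16) return -1;
        g_filters[g_filter_count].fn = fn;
        g_filters[g_filter_count].user_state = user_state;
        ++g_filter_count;
        return 0;
    }

    /* Run the chain; the first filter to claim the event stops dispatch. */
    void dispatch_event(void *event) {
        for (size_t i = 0; i < g_filter_count; ++i)
            if (g_filters[i].fn(event, g_filters[i].user_state) == FILTER_HANDLED)
                return;
    }

    } // extern "C"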
I believe my design is much simpler and more flexible; then again, it is trying to solve a different and narrower problem than full-scale synchronization of arbitrary threads.
I think compatibility with C is highly important. It is, after all, the most popular systems programming language, with excellent support from almost all other programming languages. As AFIO is expected to come with fully featured C bindings in the future, my need for a C-compatible event/result object is obviously unavoidable. If C had one, of course, we'd be good, but it does not, and POSIX did not want to add the one I proposed.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/