[thread] Alternate future implementation and future islands.
[cc'ing a few of the participants of the old thread]

Hi All,

Back in January, there was an interesting long thread about the problem of interoperation between futures of different libraries and the ability to wait for heterogeneous futures. Different solutions were discussed, like standardizing the future shared state or having a unified kernel wait object. I just wanted to share my attempt at a solution to the problem.

The code can be found at https://github.com/gpderetta/libtask/blob/master/future.hpp (other interesting files are shared_future.hpp, event.hpp and tests/future_test.cpp). It is an implementation of a subset of the current boost and std future interface. In particular it has promise, future, shared_future, future::then, wait_any, wait_all. The most important missing piece is timed waits (for lack of, ahem, time, but they should be easy to implement). The implementation requires C++14 and is only lightly tested; it should be treated as a proof of concept, not production-ready code.

A few key design points:

* The wait strategy is not part of the future implementation (although a default is provided). future::get, future::wait, wait_all and wait_any are parametrized by the wait strategy.

* The design aims to streamline future and promise and make them as fast as possible, at the cost of a slower shared_future (although there is room for improvement).

* The wait strategy only deals with 'event' objects, which act as a bridge between the future and the promise. The event object is really the core of my contribution; it can be thought of as the essence of future<void>::then, or alternatively as a pure user-space synchronization primitive.

Other features of this implementation:

* Other than in the blocking strategy itself, the future and promise implementation have no sources of blocking (no mutexes, not even spin locks).

* The shared state is not reference counted.

To demonstrate the generality of the waiting concept, I have implemented a few waiter objects:

* cv_waiter: simply a waiter on top of a std::mutex + condition variable.

* futex_waiter (Linux): an atomic counter + a futex; possibly more efficient than the cv_waiter.

* sem_waiter (POSIX): a waiter implemented on top of a POSIX semaphore. More portable than the futex_waiter and possibly more efficient than the cv_waiter.

* fd_waiter (Linux, possibly POSIX): a waiter implemented on top of Linux eventfd (for portability it can also be implemented on top of a pipe); you can use select with futures!

* task_waiter: a completely userspace coroutine waiter which switches to another coroutine on wait and resumes the original coroutine on signal.

* scheduler_waiter: another coroutine-based waiter, but on top of a userspace task scheduler. On wait, switch to the next ready task; on signal, enqueue the waiting task at the back of the ready queue.

I know that there is a plan to reimplement a new version of boost::threads; hopefully this implementation can contribute a few ideas.

HTH,

-- gpd
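P.S. To make the event idea concrete, here is a minimal sketch of what an event plus a pluggable waiter could look like. This is purely illustrative: the real event.hpp is lock-free, while this sketch hides behind a mutex to stay short, and cv_waiter_sketch is a made-up name.

#include <condition_variable>
#include <mutex>

// Illustrative sketch only: an 'event' bridges the producer (promise) side,
// which signals completion, and the consumer (future) side, which decides
// *how* to block via a pluggable waiter strategy.
class event {
public:
    // Producer side: mark the event ready and run the registered callback, if any.
    void signal() {
        void (*cb)(void*) = nullptr;
        void* ctx = nullptr;
        {
            std::lock_guard<std::mutex> lk(m_); // the real code is lock-free
            ready_ = true;
            cb = callback_;
            ctx = context_;
        }
        if (cb) cb(ctx);
    }

    // Consumer side: register a callback to run on signal; returns false if the
    // event was already signalled, so the caller need not block at all.
    bool wait_with(void (*cb)(void*), void* ctx) {
        std::lock_guard<std::mutex> lk(m_);
        if (ready_) return false;
        callback_ = cb;
        context_ = ctx;
        return true;
    }

private:
    std::mutex m_;
    bool ready_ = false;
    void (*callback_)(void*) = nullptr;
    void* context_ = nullptr;
};

// Example waiter strategy in the spirit of cv_waiter: block on a condition
// variable until the event fires.
struct cv_waiter_sketch {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void wait(event& e) {
        auto notify = [](void* self) {
            auto* w = static_cast<cv_waiter_sketch*>(self);
            std::lock_guard<std::mutex> lk(w->m);
            w->done = true;
            w->cv.notify_one();
        };
        if (!e.wait_with(notify, this)) return; // already signalled
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return done; });
    }
};

The point is that the future/promise code only ever talks to the event; whether the consumer parks on a condvar, a futex, a file descriptor or a coroutine switch is decided entirely by the waiter passed to get/wait.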
Very nice work! This is pretty much _the_ way to implement real composable futures, right? A few comments:

1. Since shared_state_union<T> is basically expected<T>, it could be nice to unify them explicitly. The atomic option would also be useful for expected<T>, I would think.

2. I noticed a number of bugs in the implementation of future::then and future::next:

a. future::then throws if the state is ready and f throws, but instead it should just return a ready future with the exception. More generally, calling a function with some arguments and setting a promise/shared_state_union<T>/expected<T> with either the result or the exception is a useful operation to expose directly to users. This could then be used by both future::then as well as async, for instance.

b. future::then and future::next should probably consistently release ownership of their state even if the state is already ready. future::then also has to be fixed to call f with an rvalue future representing this, rather than an lvalue; currently, calling future::then with a standard "then" function declared as taking a future by value wouldn't compile. To ensure that the state is consistently released, you could use f(future(std::move(*this))), or perhaps better yet use a local variable to be robust to move elision potentially allowed by a future C++ standard. Additionally, the return type needs to be the appropriate future type based on the return type of the function, rather than assuming it is the same as the current future type.

3. Since invoking future::then and future::next releases ownership, it is worth considering making them only work on rvalues. Whether the extra compile-time error checking this buys is worth the added verbosity of having to add std::move in some cases is not clear, though.

4. In some contexts a thread-unsafe future/event that saves the cost of atomic operations would be useful (unless it turns out that this cost is negligible on modern CPUs, which I don't think is the case). However, it would be important for the thread-safe and thread-unsafe futures/events to interoperate in a reasonable way. (I'm not sure exactly what reasonable semantics would be.) Obviously there would have to be some caveats, but we would want to avoid another future-island situation.

I strongly encourage you to continue this work, and see about getting this incorporated into boost::future and possibly std::future, as it seems to be a cleaner and leaner abstraction than what is currently done. Benchmarks (and unit tests) would also be particularly helpful.
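The helper described in point 2a - call a function and store either its result or the thrown exception - could be exposed roughly like this (a sketch against std::promise for concreteness; a void specialisation is omitted):

#include <exception>
#include <future>
#include <utility>

// Sketch: invoke f(args...) and fulfil the promise with either the return value
// or the thrown exception, instead of letting the exception escape. future::then
// and async could both be built on top of an operation like this.
template <class T, class F, class... Args>
void fulfill_from_call(std::promise<T>& p, F&& f, Args&&... args) {
    try {
        p.set_value(std::forward<F>(f)(std::forward<Args>(args)...));
    } catch (...) {
        p.set_exception(std::current_exception());
    }
}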
On Wed, Mar 18, 2015 at 4:08 AM, Jeremy Maitin-Shepard
Very nice work!
Thanks!
This is pretty much _the_ way to implement real composable futures, right?
well, at least I think it is.
A few comments:
1. Since shared_state_union<T> is basically expected<T>, it could be nice to unify them explicitly. The atomic option would also be useful for expected<T>, I would think.
That's a great insight. I need to read the expected papers again. Regarding the 'atomic option', I was actually thinking of removing it and instead switching to specifying the memory order in get()/set_{value,exception}.
2. I noticed a number of bugs in the implementation of future::then and future::next:
a. future::then throws if the state is ready and f throws, but instead should just return a ready future with the exception. More generally, calling a function with some arguments, and setting a promise/shared_state_union<T>/expected<T> either with the result or the exception is a useful operation to expose directly to users. This could then be used by both future::then as well has async, for instance.
b. future::then and future::next should probably consistently release ownership of their state even if the state is already ready. future::then also has to be fixed to call f with an rvalue future representing this, rather than an lvalue; currently, calling future::then with a standard "then" function declared as taking a future by value wouldn't compile. To ensure that the state is consistently released, you could use f(future(std::move(*this))), or perhaps better yet use a local variable to be robust to move elision potentially allowed by a future c++ standard. Additionally the return type needs to be the appropriate future type based on the return type of the function, rather than assuming it is the same as the current future type.
I completely removed the current then/next implementation in favor of a generic then that works with any waitable. This makes it explicit that you can compose future types (and in fact any waitable) as long as they provide a get_event. I removed 'next' completely to simplify the implementation. The optimization opportunity to save an allocation on the exceptional case is probably not worth the complexity.
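The rough shape of such a free-function then is sketched below. This is only an illustration of the idea: promise, future and get_event stand in for the library's own types, on_signal (run the callback when the event fires, or immediately if it already has) is an assumed hook rather than the actual interface, and void-returning continuations are ignored.

#include <exception>
#include <memory>
#include <utility>

// Sketch: compose any waitable exposing get_event() with a continuation.
template <class Waitable, class F>
auto then(Waitable w, F f) -> future<decltype(f(std::declval<Waitable>()))> {
    using R = decltype(f(std::declval<Waitable>()));
    promise<R> p;
    auto result = p.get_future();
    // Keep the waitable alive until the continuation runs so the event we
    // register on stays valid (a real implementation would avoid this allocation).
    auto state = std::make_shared<Waitable>(std::move(w));
    state->get_event().on_signal(
        [state, f = std::move(f), p = std::move(p)]() mutable {
            try {
                p.set_value(f(std::move(*state)));
            } catch (...) {
                p.set_exception(std::current_exception());
            }
        });
    return result;
}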
3. Since invoking future::then and future::next release ownership, it is worth considering making them only work on rvalues. Whether the extra compile-time error checking this buys is worth the added verbosity of having to add std::move in some cases is not clear, though.
the new free function then takes its waitable by value (so you have to move it). I'll leave the current signature of then as is for now for compatibility with the standard proposal and boost::future.
4. In some contexts a thread-unsafe future/event that saves the cost of atomic operations would be useful (unless it turns out that this cost is negligible on modern CPUs, which I don't think is the case). However, it would be important for the thread-safe and thread-unsafe futures/events to interoperate in a reasonable way. (I'm not sure exactly what reasonable semantics would be.) Obviously there would have to be some caveats, but we would want to avoid another future-island situation.
The required RMWs are definitely non-negligible. I tried to keep them to a minimum but they still have a cost. My original future implementation actually had a thread-safety switch, but it would be terribly error-prone. Instead I'm studying a way for a future to start as simply a deferred synchronous computation that can become asynchronous on request of another thread (via work stealing or work requesting, for example).
I strongly encourage you to continue this work, and see about getting this incorporated into boost::future and possibly std::future, as it seems to be a cleaner and leaner abstraction than what is currently done.
Thanks!
Benchmarks (and unit tests) would also be particularly helpful.
I have yet to find a good, fair, small and realistic benchmark for futures. Ideas are welcome. -- gpd
Giovanni Piero Deretta
On Wed, Mar 18, 2015 at 4:08 AM, Jeremy Maitin-Shepard
wrote:
[snip]
I completely removed the current then/next implementation in favor of a generic then that works with any waitable. This makes it explicit that you can compose future types (and in fact any waitable) as long as they provide a get_event.

I removed 'next' completely to simplify the implementation. The optimization opportunity to save an allocation on the exceptional case is probably not worth the complexity.
I think it is good to have, even if it is implemented using `then', since it has the advantage that the function passed does not have to deal in futures. It seems like implementing it more efficiently as you did could make sense in a more polished version, though, since futures are a basic primitive and it is best to minimize the overhead as much as possible.
3. Since invoking future::then and future::next release ownership, it is worth considering making them only work on rvalues. Whether the extra compile-time error checking this buys is worth the added verbosity of having to add std::move in some cases is not clear, though.
the new free function then takes its waitable by value (so you have to move it). I'll leave the current signature of then as is for now for compatibility with the standard proposal and boost::future.
The required RMWs are definitely non-negligible. I tried to keep them to a minimum but they still have a cost. My original future implementation actually had a thread-safety switch, but it would be terribly error-prone.
Certainly it would be more error-prone than the thread-safe future, since there would be no compile-time way to verify that the user isn't incorrectly accessing it from multiple threads without external locking, but I'm not sure that is a reason not to provide it. It seems it would be possible to upgrade a thread-unsafe future to a thread-safe future, provided that there is no race between the upgrade and marking the future ready. I think the implementation would be pretty simple, in fact.
Instead I'm studying a way for future to start as simply a deferred synchronous computation, but that can become asynchronous on request of another thread (via work stealing or work requesting for example).
This would be pretty useful functionality to add support for; I guess it would complicate your approach a bit, though. One other issue that came to mind: your current approach of using a singly-linked list of waiters seems to pose a problem for a timed wait or other "revocable" waits, since you can only remove a waiter if it is at the front of the list. With only unique-ownership futures this might be okay, if you prohibit waiting on a future from multiple threads at once, since then the waiter would be guaranteed to remain at the front of the list. However, with shared-ownership futures the problem would be unavoidable. Therefore it might make sense to switch to a doubly-linked list; I guess this will make the lock-free algorithms trickier. [snip]
Benchmarks (and unit tests) would also be particularly helpful.
I have yet to find a good, fair, small and realistic benchmark for futures. Ideas are welcome.
I was thinking microbenchmarks, which would at least show the cost of individual operations and allow comparing different design choices/implementations. For instance: the cost of getting a value from a ready future, the cost of waiting on a ready future, the cost of calling then on a ready or not-yet-ready future, etc.
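For example, a trivial microbenchmark of the first of those operations - getting a value out of an already-ready future - written here against std::future just for concreteness:

#include <chrono>
#include <cstdio>
#include <future>

// Sketch: measure the ready-future fast path (set_value before get).
int main() {
    constexpr int iterations = 1'000'000;
    long sum = 0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::promise<int> p;
        std::future<int> f = p.get_future();
        p.set_value(i);   // the future is ready before we ever touch it
        sum += f.get();   // this is the path being measured
    }
    auto stop = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::printf("avg %.1f ns per make-ready-and-get (checksum %ld)\n",
                double(ns) / iterations, sum);
}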
[On a phone right now, just wanted to comment on something]
On 19 Mar 2015 00:23, "Jeremy Maitin-Shepard"
One other issue that came to mind: your current approach of using a singly-linked list of waiters seems to pose a problem for a timed wait or other "revocable" waits, since you can only remove a waiter if it is at the front of the list. With only unique-ownership futures this might be okay, if you prohibit waiting on a future from multiple threads at once, since then the waiter would be guaranteed to remain at the front of the list. However, with shared-ownership futures the problem would be unavoidable. Therefore it might make sense to switch to a doubly-linked list; I guess this will make the lock-free algorithms trickier.
There is actually no lock-free wait list. The event object only allows one waiter. shared_future is implemented with a boring lock-protected deque of N hidden future/promise pairs used only for signaling. The singly-linked list you are seeing is just the ready queue for the coroutine scheduler, which is unrelated to the future implementation. I have considered adding multiple-waiter support to event, but if I do it, it will be via a spinlock-protected queue. A lock-free list which supports removal in the middle (which, BTW, is also required for wait_any, not just timed waits) requires too much infrastructure (hazard pointers or some other deferred reclamation scheme) and wouldn't be competitive with a plain spinlock. I have in mind a hybrid design that allows wait-free signal and wait and pushes the complexity to dismiss_wait. -- gpd
On 18 Mar 2015 at 22:07, Giovanni Piero Deretta wrote:
Regarding the 'atomic option', I was actually thinking of removing it, instead switching to specifying the memory order in get()/set_{value,exception}.
That's a good idea. Default heuristics for avoiding atomic are necessarily going to have to be too conservative.
I completely removed the current then/next implementation in favor of a generic then that works with any waitable. This makes it explicit that you can compose future types (and in fact any waitable) as long as they provide a get_event.

I removed 'next' completely to simplify the implementation. The optimization opportunity to save an allocation on the exceptional case is probably not worth the complexity.
I think the ship has sailed on this at WG21. The present Concurrency TS delights no one, but it is the product of many years of negotiation and compromise. What you see now is highly likely to persist unless you can demonstrate a showstopping technical reason why not (e.g. like std::async).
4. In some contexts a thread-unsafe future/event that saves the cost of atomic operations would be useful (unless it turns out that this cost is negligible on modern CPUs, which I don't think is the case). However, it would be important for the thread-safe and thread-unsafe futures/events to interoperate in a reasonable way. (I'm not sure exactly what reasonable semantics would be.) Obviously there would have to be some caveats, but we would want to avoid another future-island situation.
The required RMWs are definitely non-negligible. I tried to keep them to a minimum but they still have a cost. My original future implementation actually had a thread-safety switch, but it would be terribly error-prone.
The cost is in the lack of optimisation options, not at the hardware level. Atomic operations, even on weakly ordered systems, are not expensive when no other CPU core has visibility of that cache line.
Instead I'm studying a way for future to start as simply a deferred synchronous computation, but that can become asynchronous on request of another thread (via work stealing or work requesting for example).
I think that has the maximum chance of being useful to functional programming frameworks too.
Benchmarks (and unit tests) would also be particularly helpful.
I have yet to find a good, fair, small and realistic benchmark for futures. Ideas are welcome.
https://github.com/ned14/c11-permit-object/blob/master/pthread_permit_speedtest.cpp might have some ideas for you. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
On 17 Mar 2015 at 21:42, Giovanni Piero Deretta wrote:
I just wanted to share my attempt at a solution to the problem. The code can be found at https://github.com/gpderetta/libtask/blob/master/future.hpp (other interesting files are shared_future.hpp, event.hpp and tests/future_test.cpp). It is an implementation of a subset of the current boost and std future interface. In particular it has promise, future, shared_future, future::then, wait_any, wait_all. The most important missing piece is timed waits (for lack of, ahem, time, but they should be easy to implement). The implementation requires C++14 and is only lightly tested; it should be treated as a proof of concept, not production-ready code.
Firstly, thanks for your input, especially one supported with code.
* The wait strategy is not part of the future implementation (although a default is provided). future::get, future::wait, wait_all and wait_any are parametrized by the wait strategy.
* The design aims to streamline and make as fast as possible future and promise at the cost of a slower shared_future (although there is room for improvement).
Your future still allocates memory, and is therefore costing about 1000 CPU cycles. That is not "as fast as possible". The use of std::atomic also prevents the compiler optimiser from eliding the future implementation entirely, because atomics always force code to be generated. As much as futures today are big heavy things, tomorrow's C++17 futures, especially resumable-function-integrated ones, must be completely elidable, so that the compiler can completely eliminate and/or collapse a resumable function call or sequence of such calls where appropriate. If you have an atomic in there, it has no choice but to do the atomic calls needlessly.

I think memory allocation is unavoidable for shared_future, or at least for anything realistically close to the STL implementation of one. But a future, I think, can be both STL-compliant and never allocate memory, and be optimisable out of existence.
* The wait strategy only deals with 'event' objects, which act as a bridge between the future and the promise.
The event object is really the core of my contribution; it can be thought of as the essence of future<void>::then; alternatively it can be seen as a pure user-space synchronization primitive.
Exactly as my C11 permit object is. Except mine allows C code and C++ code to interoperate and compose waits together.
Other features of this implementation:
* Other than in the blocking strategy itself, the future and promise implementation have no sources of blocking (no mutexes, not even spin locks).
* The shared state is not reference counted.
To demonstrate the generality of the waiting concept, I have implemented a few waiter objects.
* cv_waiter: this is simply a waiter on top of an std::mutex + cond var.
* futex_waiter (linux): an atomic counter + a futex, possibly more efficient than the cv_waiter
* sem_waiter (posix): a waiter implemented on top of a posix semaphore. More portable than the futex waiter and possibly more efficient than the cv_waiter
* fd_waiter(linux, possibly posix): a waiter implemented on top of linux eventfd (for portability it can also be implemented on top of a pipe); you can use select with futures!
* task_waiter: a completely userspace based coroutine waiter which switches to another coroutine on wait and resumes the original coroutine on signal.
* scheduler_waiter: another coroutine based waiter, but on top of an userspace task scheduler. On wait switch to the next ready task, on signal enqueue the waiting task to the back of the ready queue.
I know that there is a plan to reimplement a new version of boost::threads, hopefully this implementation can contribute a few ideas.
My current best understanding of Vicente's plans is that each thread has a thread local condvar. The sole cause of a thread sleeping, apart from i/o, is on that thread local condvar. One therefore has a runtime which keeps a registry of all thread local condvars, and can therefore deduce the correct thread local condvars to wake when implementing a unified wait system also compatible with Fibers/resumable functions. I haven't played yet with proposed Boost.Fiber (this summer after C++ Now I will), but I expect it surely uses a similar runtime. Ideally I'd like Thread v5's runtime to be equal to the Fiber runtime if this is true, but as I mentioned I haven't played with it yet. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
On 19 Mar 2015 11:09, "Niall Douglas"
On 17 Mar 2015 at 21:42, Giovanni Piero Deretta wrote:.
* The wait strategy is not part of the future implementation (although a default is provided). future::get, future::wait, wait_all and wait_any are parametrized by the wait strategy.
* The design aims to streamline and make as fast as possible future and promise at the cost of a slower shared_future (although there is room for improvement).
Your future still allocates memory, and is therefore costing about 1000 CPU cycles.
1000 clock cycles seems excessive with a good malloc implementation. Anyway, the plan is to add support for a custom allocator. I do not think you can realistically have a non-allocating future *in the general case* (you might optimise some cases, of course).
That is not "as fast as possible". The use of std::atomic also prevents the compiler optimiser from eliding the future implementation entirely, because atomic always forces code to be generated. As much as futures today are big heavy things, tomorrow's C++17 futures especially resumable function integrated ones must be completely elidable, then the compiler can completely eliminate and/or collapse a resumable function call or sequence of such calls where appropriate. If you have an atomic in there, it has no choice but to do the atomic calls needlessly.
I understand what you are aiming at, but I think that the elidability is orthogonal. Right now I'm focusing on making the actual synchronisation fast and composable in the scenario where the program has committed to make a computation async.
I think memory allocation is unavoidable for shared_future, or at least any realistically close to the STL implementation of one. But a future I think can be both STL compliant and never allocate memory and be optimisable out of existence.
* The wait strategy only deals with 'event' objects, which act as a bridge between the future and the promise.
The event object is really my the core of my contribution; it can be thought of as the essence of future<void>::then; alternatively it can be seen as a pure user space synchronization primitive.
Exactly as my C11 permit object is. Except mine allows C code and C++ code to interoperate and compose waits together.
Not at all. I admit not having studied permit in detail (the doc size is pretty daunting), but as far as I can tell the waiting thread will block in the kernel. It provides a variety of ways to block; the user can't add more. With event, the decision of how to block (or even whether to block at all) is made by the consumer and, most importantly, it can be delayed till the last moment. You can think of event as the bridge between the signal side and the wait side of permit, or alternatively as a future<void> which only supports then (no get or wait). Regarding interfacing with a C event source, it can be done trivially right now as long as it provides a callback (function pointer + context pointer) API, which is very common already. No need to wait for existing libraries to embrace a new synchronisation object (cue the xkcd comic about standards :) ). Allowing C to wait for events would require replacing the waiter vtable with a C equivalent, but that is not something I care much about right now.
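For concreteness, that kind of bridge could look roughly like this (a sketch only: c_lib_read_async is a made-up C API, and std::promise stands in for the libtask promise/event):

#include <future>
#include <memory>

// Hypothetical C API: starts an operation and invokes cb(ctx, result) when done.
extern "C" void c_lib_read_async(void (*cb)(void* ctx, int result), void* ctx);

// Sketch: adapt a callback-plus-context C event source to a future by completing
// a promise from the callback; nothing new is needed on the C side.
std::future<int> read_async_as_future() {
    auto p = std::make_unique<std::promise<int>>();
    std::future<int> f = p->get_future();
    c_lib_read_async(
        [](void* ctx, int result) {
            std::unique_ptr<std::promise<int>> owned(
                static_cast<std::promise<int>*>(ctx));
            owned->set_value(result); // signals whatever waiter is attached
        },
        p.release()); // ownership of the promise passes to the callback
    return f;
}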
Other features of this implementation:
* Other than in the blocking strategy itself, the future and promise implementation have no sources of blocking (no mutexes, not even spin locks).
* The shared state is not reference counted.
To demonstrate the generality of the waiting concept, I have implemented a few waiter objects.
* cv_waiter: this is simply a waiter on top of an std::mutex + cond var.
* futex_waiter (linux): an atomic counter + a futex, possibly more efficient than the cv_waiter
* sem_waiter (posix): a waiter implemented on top of a posix semaphore. More portable than the futex waiter and possibly more efficient than the cv_waiter
* fd_waiter(linux, possibly posix): a waiter implemented on top of linux eventfd (for portability it can also be implemented on top of a pipe); you can use select with futures!
* task_waiter: a completely userspace based coroutine waiter which switches to another coroutine on wait and resumes the original coroutine on signal.
* scheduler_waiter: another coroutine based waiter, but on top of an userspace task scheduler. On wait switch to the next ready task, on signal enqueue the waiting task to the back of the ready queue.
I know that there is a plan to reimplement a new version of boost::threads, hopefully this implementation can contribute a few ideas.
My current best understanding of Vicente's plans is that each thread has a thread local condvar. The sole cause of a thread sleeping, apart from i/o, is on that thread local condvar. One therefore has a runtime which keeps a registry of all thread local condvars, and can therefore deduce the correct thread local condvars to wake when implementing a unified wait system also compatible with Fibers/resumable functions.
That doesn't work if a program wants to block, for example, in select, or spin on a memory location or a hardware register, or wait for a signal, or interoperate with a different userspace thread library or some other event queue (asio, qt or whatever), etc., and still also wait for a future. Well, you can use future::then, but it has overhead. -- gpd
On 19 Mar 2015 at 18:05, Giovanni Piero Deretta wrote:
Your future still allocates memory, and is therefore costing about 1000 CPU cycles.
1000 clock cycles seems excessive with a good malloc implementation.
Going to main memory due to a cache line miss costs 250 clock cycles, so no, it isn't. Obviously slower processors spin fewer cycles for a cache line miss.
Anyway, the plan is to add support for a custom allocator. I do not think you can realistically have a non-allocating future *in the general case* (you might optimise some cases, of course).
We disagree. They are not just feasible, but straightforward, though if you try doing a composed wait on them then yes they will need to be converted to shared state. Tony van Eerd did a presentation a few C++ Now's ago on non-allocating futures. I did not steal his idea subconsciously one little bit! :)
I understand what you are aiming at, but I think that the elidability is orthogonal. Right now I'm focusing on making the actual synchronisation fast and composable in the scenario where the program has committed to make a computation async.
This is fine until your compiler supports resumable functions.
Exactly as my C11 permit object is. Except mine allows C code and C++ code to interoperate and compose waits together.
Not at all. I admit not having studied permit in detail (the doc size is pretty daunting) but as far as I can tell the waiting thread will block in the kernel.
It can spin or sleep or toggle a file descriptor or HANDLE.
It provides a variety of ways on how to block, the user can't add more.
It provides a hook API with filter C functions which can, to a limited extent, provide some custom functionality. Indeed the file descriptor/HANDLE toggling is implemented that way. There is only so much genericity which can be done with C.
My current best understanding of Vicente's plans is that each thread has a thread local condvar. The sole cause of a thread sleeping, apart from i/o, is on that thread local condvar. One therefore has a runtime which keeps a registry of all thread local condvars, and can therefore deduce the correct thread local condvars to wake when implementing a unified wait system also compatible with Fibers/resumable functions.
That doesn't work if a program wants to block for example in select or spin on a memory location, or on an hardware register, or wait for a signal, or interoperate with a different userspace thread library, or some other event queue (asio, qt or whatever) etc. and still also wait for a future. Well, you can use future::then, but it has overhead.
I think where we are vaguely heading is that anything Boost.Coroutine-capable will convert blocking into coroutine scheduling. That way, if Thread v5 programs block doing ASIO or AFIO i/o under the bonnet, they'll schedule other fibre work where possible, and any condvar or mutex blocks could turn into ASIO/AFIO work. A bit like a "userspace WinRT", I suppose. But sure, C++ is not WinRT. If the programmer writes an infinite for loop, he gets blocking behaviour. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
On 19 Mar 2015 19:51, "Niall Douglas"
On 19 Mar 2015 at 18:05, Giovanni Piero Deretta wrote:
Your future still allocates memory, and is therefore costing about 1000 CPU cycles.
1000 clock cycles seems excessive with a good malloc implementation.
Going to main memory due to a cache line miss costs 250 clock cycles, so no it isn't. Obviously slower processors spin less cycles for a cache line miss.
Why would a memory allocation necessarily imply a cache miss? Eh, you are even assuming an L3 miss; that must be a poor allocator!
Anyway, the plan is to add support for a custom allocator. I do not think you can realistically have a non-allocating future *in the general case* (you might optimise some cases, of course).
We disagree. They are not just feasible, but straightforward, though if you try doing a composed wait on them then yes they will need to be converted to shared state. Tony van Eerd did a presentation a few C++ Now's ago on non-allocating futures. I did not steal his idea subconsciously one little bit! :)
I am aware of that solution. My issue with that design is that it requires an expensive RMW for every move. Do a few moves and it will quickly dwarf the cost of an allocation, especially considering that an OoO core will happily overlap computation with a cache miss, while the required membar will stall the pipeline on current CPUs (I'm thinking of x86, of course). That might change in the near future, though.
I understand what you are aiming at, but I think that the elidability is orthogonal. Right now I'm focusing on making the actual synchronisation fast and composable in the scenario where the program has committed to make a computation async.
This is fine until your compiler supports resumable functions.
This is funny :). A couple of months ago I was arguing with Gor Nishanov (author of the MS resumable functions paper) that heap-allocating the resumable function by default is unacceptable. And here I am arguing the other side :). OK, my compromise is not to allocate while the async operation is merely deferred and can still be executed synchronously. Lazily convert to heap allocation only when the operation needs to be executed truly asynchronously, basically when you actually create the promise (at that point the cost of the async setup will probably dwarf the allocation; and even in this case the allocation can be skipped if we know we will sync before a move, since then it is safe to allocate the shared state on the stack). This should allow the compiler to remove the abstraction completely if it can prove it safe. Still working on it; should have something in a few days. I guess I'm converging towards your design.
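A heavily simplified illustration of that compromise (purely a sketch of the idea: lazy_task and make_async are made-up names, and std::async stands in for the real work-stealing/work-requesting machinery):

#include <functional>
#include <future>
#include <utility>

// Sketch: start as a deferred, synchronous computation with no shared state;
// only allocate and hand the work off when someone actually asks for asynchrony.
template <class T>
class lazy_task {
public:
    explicit lazy_task(std::function<T()> fn) : fn_(std::move(fn)) {}

    // Synchronous path: no promise and no shared-state allocation.
    T get() {
        if (async_.valid()) return async_.get();
        return fn_();
    }

    // Escalate to a truly asynchronous computation (e.g. when another thread
    // wants to steal the work); only now does the shared state come into existence.
    void make_async() {
        if (!async_.valid())
            async_ = std::async(std::launch::async, fn_);
    }

private:
    std::function<T()> fn_;
    std::future<T> async_;
};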
Exactly as my C11 permit object is. Except mine allows C code and C++ code to interoperate and compose waits together.
Not at all. I admit not having studied permit in detail (the doc size is pretty daunting) but as far as I can tell the waiting thread will block in the kernel.
It can spin or sleep or toggle a file descriptor or HANDLE.
It provides a variety of ways on how to block, the user can't add more.
It provides a hook API with filter C functions which can, to a limited extent, provide some custom functionality. Indeed the file descriptor/HANDLE toggling is implemented that way. There is only so much genericity which can be done with C.
I believe my design is much simpler and more flexible; then again, it is trying to solve a different and narrower problem than full-scale synchronization of arbitrary threads. -- gpd
On 20 Mar 2015 at 9:19, Giovanni Piero Deretta wrote:
Your future still allocates memory, and is therefore costing about 1000 CPU cycles.
1000 clock cycles seems excessive with a good malloc implementation.
Going to main memory due to a cache line miss costs 250 clock cycles, so no it isn't. Obviously slower processors spin less cycles for a cache line miss.
Why would a memory allocation necessarily imply a cache miss. Eh you are even assuming an L3 miss, that must be a poor allocator!
LL caching is irrelevant; the problem here is the MOESI protocol, which renders caching void. What you ideally want is unidirectional, forward-only cache line movement between threads. So thread A modifies a line (M), signals thread B who then shares that line (S), and thread A's line becomes owned (O). If neither thread A nor thread B ever modifies that line after thread B has shared it, performance will be excellent, even under very heavy load, because main memory is never involved.

If, on the other hand, thread A or thread B does modify a shared or owned line, it gets more complicated. Newer architectures can be quite good at exchanging the S and O states if exactly the right two caches have that line (e.g. two CPUs on the same die). As soon as you go past two caches sharing a line though, all bets are off and everything gets synced through main memory.

All the above is why memory allocation is bad here: any shared state usually ends up being the worst case for MOESI. Modern memory allocators are very good at handling the same thread freeing a block it itself allocated when only that thread saw that allocation. They are poor at handling when a different thread frees, or even has visibility of, a block allocated by another thread. One brutally effective way of making memory allocation MOESI friendly is forcing 64-byte alignment on all blocks.

So you're right that a custom allocator could be useful here, though the problem with shared states is that they very easily leak visibility to more than two threads. However, because future-promise is extremely likely to have a different thread free a block than the thread which allocated it, you're going to have to introduce quite a bit of latency-inducing machinery in your custom allocator to handle that. Even if you get the average low, you're going to see big latency spikes from time to time. One technique is to have a dedicated worker thread which never sleeps, only spins, whose sole purpose is to free allocations made by all the other threads. You can really get memory allocation latencies down with that technique, but it's unacceptable on anything mobile or consumer focused.
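For example, forcing that 64-byte alignment on a shared-state block might look like this (a sketch: posix_memalign is POSIX-only and merely stands in for whatever the custom allocator would actually do, and 64 is an assumed line size):

#include <cstdlib>
#include <new>

// Sketch: give each shared state its own cache line so promise/future traffic
// never false-shares with neighbouring heap blocks.
struct alignas(64) shared_state_block {
    // ... value/exception storage, continuation pointer, state flags ...
    unsigned char storage[64];
};

shared_state_block* allocate_state() {
    void* p = nullptr;
    if (posix_memalign(&p, 64, sizeof(shared_state_block)) != 0)
        throw std::bad_alloc();
    return new (p) shared_state_block();
}

void free_state(shared_state_block* s) {
    s->~shared_state_block();
    std::free(s);
}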
I am aware of that solution My issue with that design is that it require an expensive rmw for every move. Do a few moves and it will quickly dwarf the cost of an allocation, especially considering that an OoO will happily overlap computation with a cache miss, while the required membar will stall the pipeline in current CPUs (I'm thinking of x86 of course). That might change in the near future though.
On Intel RMW is the same speed as non-atomic ops unless the cache line is Owned or Shared. There is a small penalty on ARM admittedly, but on the Cortex A15 I have here it is not much.
I understand what you are aiming at, but I think that the elidability is orthogonal. Right now I'm focusing on making the actual synchronisation fast and composable in the scenario where the program has committed to make a computation async.
This is fine until your compiler supports resumable functions.
This is funny :). A couple of months ago I was arguing with Gor Nishanov (author of MS resumable functions paper), that heap allocating the resumable function by default is unacceptable. And here I am arguing the other side :).
It is unacceptable. Chris's big point in his Concurrency alternative paper before WG21 is that future-promise is useless for 100k-1M socket scalability because to reach that you must have much lower latency than future-promise is capable of. ASIO's async_result system can achieve 100k-1M scalability. Future promise (as currently implemented) cannot. Thankfully WG21 appear to have accepted this about resumable functions in so far as I am aware. My hope is we can demonstrate that future-promise is easily 1M scalable using C++ 14 constexpr and lazy or no shared state allocation.
OK my compromise is to not allocate while the async operation is merely deferred but can still be executed synchronously. Lazily convert to heap allocation only when the operation needs to be executed truly asynchronously, basically until you actually create the promise (at that point the cost of the async setup will provably dwarf the allocation; and even in this case the allocation can be skipped if we know we will sync before a move, then it us safe to allocate the shared state on the stack). This should allow to compiler to remove the abstraction completely if it can prove it safe. Still working on it, should have something in a few days.
I guess I'm converging to your design.
I think the lazy conversion to shared state is a very wise design goal. Simple single future waits shouldn't need it, but for simplicity of implementation it's probably easier, and you're deallocating in the same thread as allocation, so on that you're good.
It provides a hook API with filter C functions which can, to a limited extent, provide some custom functionality. Indeed the file descriptor/HANDLE toggling is implemented that way. There is only so much genericity which can be done with C.
I believe my design is much simpler and flexible; then again is trying to solve a different and narrower problem than full scale synchronization of arbitrary threads.
I think compatibility with C is highly important. It is after all the most popular systems programming language, with excellent support from almost all other programming languages. As AFIO is expected to come with fully featured C bindings in the future, my need for a C compatible event/result object is obviously unavoidable. If C had one of course we'd be good, but it does not and POSIX did not want to add the one I proposed. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
On 20 Mar 2015 13:07, "Niall Douglas"
On 20 Mar 2015 at 9:19, Giovanni Piero Deretta wrote:
Your future still allocates memory, and is therefore costing about 1000 CPU cycles.
1000 clock cycles seems excessive with a good malloc implementation.
Going to main memory due to a cache line miss costs 250 clock cycles, so no it isn't. Obviously slower processors spin less cycles for a cache line miss.
Why would a memory allocation necessarily imply a cache miss. Eh you are even assuming an L3 miss, that must be a poor allocator!
[Snip]
Newer architectures can be quite good at exchanging the S and O states if exactly the right two caches have that line (e.g. two CPUs on the same die). As soon as you go past two caches sharing a line though, all bets are off and everything gets synced through main memory.
All the above is why memory allocation is bad here because any shared state can usually ends up being worst case performance for MOESI.
What's special about memory allocation here? Intuitively sharing futures on the stack might actually be worse especially if you have multiple futures being serviced by different threads.
Modern memory allocators are very good at handling the same thread freeing a block it itself allocated when only that thread saw that allocation. They are poor at handling when a different thread frees, or even has visibility of, a block allocated by another thread.
Because of the way ownership is relinquished, in my implementation the consumer usually both allocates and frees the state. Unless you use shared futures, but then that's the last of your problems. [Snip]

I am aware of that solution. My issue with that design is that it requires an expensive RMW for every move. Do a few moves and it will quickly dwarf the cost of an allocation, especially considering that an OoO core will happily overlap computation with a cache miss, while the required membar will stall the pipeline on current CPUs (I'm thinking of x86, of course). That might change in the near future, though.
On Intel RMW is the same speed as non-atomic ops unless the cache line is Owned or Shared.
Yes, if the thread does not own the cache line the communication cost dwarfs everything else, but in the normal case of an exclusive cache line, mfence, xchg, cmpxchg and friends cost 30-50 cycles and stall the CPU. Significantly more than the cost of non-serialising instructions. Not something I want to do in a move constructor. [Snip]
A couple of months ago I was arguing with Gor Nishanov (author of MS resumable functions paper), that heap allocating the resumable function by default is unacceptable. And here I am arguing the other side :).
It is unacceptable. Chris's big point in his Concurrency alternative paper before WG21 is that future-promise is useless for 100k-1M socket scalability because to reach that you must have much lower latency than future-promise is capable of. ASIO's async_result system can achieve 100k-1M scalability. Future promise (as currently implemented) cannot.
Thankfully WG21 appear to have accepted this about resumable functions in so far as I am aware.
I'm a big fan of Chris' proposal as well. I haven't seen any new papers on resumable functions; I would love to know where the committee is heading. -- gpd
On 20 Mar 2015 at 19:32, Giovanni Piero Deretta wrote:
What's special about memory allocation here? Intuitively sharing futures on the stack might actually be worse especially if you have multiple futures being serviced by different threads.
I think memory allocation is entirely wise for shared_future. I think auto allocation is wise for future, as only exactly one of those can exist at any time per promise.
On Intel RMW is the same speed as non-atomic ops unless the cache line is Owned or Shared.
Yes if the thread does not own the cache line the communication cost dwarfs everything else, but in the normal case of a exclusive cache line, mfence, xchg, cmpxchg and friends cost 30-50 Cycles and stall the CPU. Significantly more than the cost of non serialising instructions. Not something I want to do in a move constructor.
You're right and I'm wrong on this - I stated the claim above based on empirical testing where I found no difference in the use of the LOCK prefix. It would appear I had an inefficiency in my testing code. Agner says that for Haswell:

XADD: 5 uops, 7 latency
LOCK XADD: 9 uops, 19 latency
CMPXCHG: 6 uops, 8 latency
LOCK CMPXCHG: 10 uops, 19 latency

(Source: http://www.agner.org/optimize/instruction_tables.pdf)

So approximately, the LOCK prefix halves the throughput and triples the latency irrespective of the state of the cache line. Additionally, as I reported on this list maybe a year ago, first-gen Intel TSX provides no benefits and indeed a hefty penalty over simple atomic RMW. ARM and other CPUs provide load-linked/store-conditional, so RMW with those is indeed close to penalty-free if the cache line is exclusive to the CPU doing those ops. It's just that Intel is still incapable of low-latency lock gets, though it's enormously better than the Pentium 4.

All that said, I don't see a 50-cycle cost per move constructor as being a problem at all. Compilers are also pretty good at applying RVO (if you don't get in the way) to elide moves with observable effects, which any move constructor using atomics must cause. The number of actual 50-cycle move constructors generated is therefore usually minimal.
[Snip]
A couple of months ago I was arguing with Gor Nishanov (author of MS resumable functions paper), that heap allocating the resumable function by default is unacceptable. And here I am arguing the other side :).
It is unacceptable. Chris's big point in his Concurrency alternative paper before WG21 is that future-promise is useless for 100k-1M socket scalability because to reach that you must have much lower latency than future-promise is capable of. ASIO's async_result system can achieve 100k-1M scalability. Future promise (as currently implemented) cannot.
Thankfully WG21 appear to have accepted this about resumable functions in so far as I am aware.
I'm a big fan of Chris' proposal as well. I haven't seen any new papers on resumable functions; I would love to know where the committee is heading.
The argument between those two camps is essentially concurrency vs parallelism. I had thought the latter had won? My urgings by private email to those involved are that a substantial reconciliation needs to happen between the Concurrency TS and the Networking TS such that they are far more tightly integrated and not these separate opposing paradigm islands. A future-promise as efficient as async_result would be an excellent first step along such a reconciliation. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
On Sat, Mar 21, 2015 at 2:33 PM, Niall Douglas
On 20 Mar 2015 at 19:32, Giovanni Piero Deretta wrote:
What's special about memory allocation here? Intuitively sharing futures on the stack might actually be worse especially if you have multiple futures being serviced by different threads.
I think memory allocation is entirely wise for shared_future. I think auto allocation is wise for future, as only exactly one of those can exist at any time per promise.
I wasn't talking about shared_future. Think about something like this, assuming that the promise has a pointer to the future:

future<X> a = async(...);
future<X> b = async(...);
... do something which touches the stack ...

a and b are served by two different threads. If sizeof(future) is less than a cache line, every time their corresponding threads move or signal the promise, they will interfere with each other and with this thread doing its own work. The cache line starts bouncing around like crazy. And remember that with non-allocating futures, promise and future are touched by the other end at every move, not just on signal.
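The effect is easy to reproduce with two plain atomic counters standing in for the two futures (a sketch, not the libtask types; exact numbers will vary by machine):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>

// Two independent counters hammered by two threads: sharing a cache line makes
// the line ping-pong between cores; padding each counter to its own line does not.
struct packed { std::atomic<long> a{0}, b{0}; };
struct padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <class S>
long run_ms() {
    S s;
    auto work = [](std::atomic<long>& c) {
        for (int i = 0; i < 5'000'000; ++i)
            c.fetch_add(1, std::memory_order_relaxed);
    };
    auto start = std::chrono::steady_clock::now();
    std::thread t1(work, std::ref(s.a)), t2(work, std::ref(s.b));
    t1.join();
    t2.join();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}

int main() {
    std::printf("same cache line: %ld ms\n", run_ms<packed>());
    std::printf("separate lines:  %ld ms\n", run_ms<padded>());
}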
On Intel RMW is the same speed as non-atomic ops unless the cache line is Owned or Shared.
Yes if the thread does not own the cache line the communication cost dwarfs everything else, but in the normal case of a exclusive cache line, mfence, xchg, cmpxchg and friends cost 30-50 Cycles and stall the CPU. Significantly more than the cost of non serialising instructions. Not something I want to do in a move constructor.
You're right and I'm wrong on this - I stated the claim above on empirical testing where I found no difference in the use of the LOCK prefix. It would appear I had an inefficiency in my testing code: Agner says that for Haswell:
XADD: 5 uops, 7 latency LOCK XADD: 9 uops, 19 latency
CMPXCHG 6 uops, 8 latency LOCK CMPXCHG 10 uops, 19 latency
(Source: http://www.agner.org/optimize/instruction_tables.pdf)
So approximately one halves the throughput and triples the latency with the LOCK prefix irrespective of the state of the cache line.
That's already pretty good, actually. On Sandy and Ivy Bridge it was above 20 clocks. Also, the comparison shouldn't be with their non-locked counterparts (which aren't really ever used anyway), but with plain operations. Finally, there is the hidden cost of preventing any OoO execution, which won't appear in a synthetic benchmark.
Additionally as I reported on this list maybe a year ago, first gen Intel TSX provides no benefits and indeed a hefty penalty over simple atomic RMW.
Hopefully it will get better in the future. I haven't had the chance to try it yet, but TSX might help with implementing wait_all and wait_any, whose setup and teardown require a large number of atomic ops.
ARM and other CPUs provide load linked store conditional, so RMW with those is indeed close to penalty free if the cache line is exclusive to the CPU doing those ops. It's just Intel is still incapable of low latency lock gets, though it's enormously better than the Pentium 4.
The reason for the high cost is that RMWs have sequential consistency semantics. On the other hand, on Intel plain loads and stores have the desirable load-acquire and store-release semantics, so you do not need extra membars. -- gpd
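P.S. In code, the distinction being drawn is roughly this (a sketch of a one-shot promise-to-future handoff):

#include <atomic>

// The plain store/load handoff between promise and future needs only
// release/acquire, which on x86 compiles to ordinary MOV instructions; any
// read-modify-write (exchange, fetch_add, compare_exchange) instead implies a
// LOCK-prefixed instruction with the serialising cost discussed above.
int payload;
std::atomic<bool> ready{false};

void producer() {                                   // promise side
    payload = 42;
    ready.store(true, std::memory_order_release);   // plain store on x86
}

bool try_consume(int& out) {                        // future side
    if (!ready.load(std::memory_order_acquire))     // plain load on x86
        return false;
    out = payload;
    return true;
}

// By contrast, ready.exchange(true, std::memory_order_acq_rel) would become a
// LOCK XCHG and pay the full RMW cost.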
On 21 Mar 2015 at 18:36, Giovanni Piero Deretta wrote:
a and b are served by two different threads. if sizeof(future) is less than a cacheline, every time their corresponding threads move or signal the promise, they will interfere with each other and with this thread doing something. The cacheline starts bouncing around like crazy. And remember that with non allocating futures, promise and futures are touched by the other end at every move, not just on signal.
I hadn't even considered a non-allocating future which isn't aligned to 64 byte multiples.
Additionally as I reported on this list maybe a year ago, first gen Intel TSX provides no benefits and indeed a hefty penalty over simple atomic RMW.
hopefully will get better in the future. I haven't had the chance to try it yet, but TSX might help with implementing wait_all and wait_any whose setup and teardown require a large amount of atomic ops..
I posted comprehensive benchmarks on this list a year or so ago; try http://boost.2283326.n4.nabble.com/Request-for-feedback-on-design-of-concurrent-unordered-map-plus-notes-on-use-of-memory-transactions-td4665594.html. My conclusion was that first-gen Intel TSX isn't ready for real-world usage - in real-world code, the only benefits show up in heavily multithreaded, 90-95% read-only scenarios. I tried a TSX-based hash table, and was surprised at just how much slower the single- and dual-threaded scenarios were, sufficiently so that I concluded it would be a bad idea to have TSX turned on by default for almost all general-purpose algorithm implementations. Where TSX does prove useful, though, is for transactional GCC, which produces nothing like as penalised code as on non-TSX hardware.
ARM and other CPUs provide load linked store conditional, so RMW with those is indeed close to penalty free if the cache line is exclusive to the CPU doing those ops. It's just Intel is still incapable of low latency lock gets, though it's enormously better than the Pentium 4.
The reason of the high cost is that RMW have sequential consistency semantics. On the other hand on intel plain load and stores have desirable load_acquire and store_release semantics and you do not need extra membars.
Still, if Intel could do a no-extra-cost load-linked/store-conditional, _especially_ if those could be nested to, say, two or four levels, it would be a big win. That 19-cycle latency and pipeline flush is expensive, plus the unnecessary cache coherency traffic when you don't update the cache line. Plus, a two- or four-level nesting would allow the atomic update of a pair of pointers or two pairs of pointers, which is 98% of the use case of Intel TSX anyway. That said, if Intel TSX v2 didn't have the same overheads as a lock xchg, that's even better again. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
On Fri, Mar 20, 2015 at 5:19 AM, Giovanni Piero Deretta
On 19 Mar 2015 19:51, "Niall Douglas"
wrote: On 19 Mar 2015 at 18:05, Giovanni Piero Deretta wrote:
Your future still allocates memory, and is therefore costing about 1000 CPU cycles.
1000 clock cycles seems excessive with a good malloc implementation.
Going to main memory due to a cache line miss costs 250 clock cycles, so no it isn't. Obviously slower processors spin less cycles for a cache line miss.
Why would a memory allocation necessarily imply a cache miss. Eh you are even assuming an L3 miss, that must be a poor allocator!
Anyway, the plan is to add support for a custom allocator. I do not think you can realistically have a non-allocating future *in the general case* (you might optimise some cases, of course).
We disagree. They are not just feasible, but straightforward, though if you try doing a composed wait on them then yes they will need to be converted to shared state. Tony van Eerd did a presentation a few C++ Now's ago on non-allocating futures. I did not steal his idea subconsciously one little bit! :)
I am aware of that solution My issue with that design is that it require an expensive rmw for every move. Do a few moves and it will quickly dwarf the cost of an allocation, especially considering that an OoO will happily overlap computation with a cache miss, while the required membar will stall the pipeline in current CPUs (I'm thinking of x86 of course). That might change in the near future though.
Just to be clear, my non-allocating promise/future was meant to be more of a proof-of-concept, not necessarily optimal efficiency. I would need to recheck, but I think most of it requires only acquire or release, not full sequential consistency, if that helps. I rarely write anything that requires full sequential consistency. Anyhow, carry on. I like that there are people looking into this stuff. Looking forward to the outcomes. Tony
On Fri, Mar 27, 2015 at 8:24 PM, Gottlob Frege
On Fri, Mar 20, 2015 at 5:19 AM, Giovanni Piero Deretta
wrote: On 19 Mar 2015 19:51, "Niall Douglas"
wrote: On 19 Mar 2015 at 18:05, Giovanni Piero Deretta wrote:
[snip]
Anyway, the plan is to add support for a custom allocator. I do not think you can realistically have a non-allocating future *in the general case* (you might optimise some cases, of course).
We disagree. They are not just feasible, but straightforward, though if you try doing a composed wait on them then yes they will need to be converted to shared state. Tony van Eerd did a presentation a few C++ Now's ago on non-allocating futures. I did not steal his idea subconsciously one little bit! :)
I am aware of that solution My issue with that design is that it require an expensive rmw for every move. Do a few moves and it will quickly dwarf the cost of an allocation, especially considering that an OoO will happily overlap computation with a cache miss, while the required membar will stall the pipeline in current CPUs (I'm thinking of x86 of course). That might change in the near future though.
Just to be clear, my non-allocating promise/future was meant to be more of a proof-of-concept, not necessarily optimal efficiency. I would need to recheck, but I think most of it requires only acquire or release, not full sequential consistency, if that helps. I rarely write anything that requires full sequential consistency.
Hum, I do not think you can implement it with just plain loads and stores; I think you need at least one RMW there. Probably acq_rel is enough for that, but that's no less expensive than seq_cst on x86. -- gpd
On Fri, Mar 27, 2015 at 5:35 PM, Giovanni Piero Deretta
On Fri, Mar 27, 2015 at 8:24 PM, Gottlob Frege
wrote: Just to be clear, my non-allocating promise/future was meant to be more of a proof-of-concept, not necessarily optimal efficiency. I would need to recheck, but I think most of it requires only acquire or release, not full sequential consistency, if that helps. I rarely write anything that requires full sequential consistency.
Hum, I do not think you can implement it with just plain load and stores, I think you need at least one RMW there. Probably acq_rel is enough for that, but that's no less expensive than seq_cst on x86.
You are probably right, but for each move all you are doing is publishing where you moved to, so you need a release, and then an acquire when you want to read that, not acq_rel together on one operation. But I haven't thought about it since I gave the presentation, so I don't know. Maybe I'll have time to think about it at C++Now, which is coming up soon. (Feels soon to me at least - more things to prepare...) Tony
On 27 Mar 2015 at 21:35, Giovanni Piero Deretta wrote:
Just to be clear, my non-allocating promise/future was meant to be more of a proof-of-concept, not necessarily optimal efficiency. I would need to recheck, but I think most of it requires only acquire or release, not full sequential consistency, if that helps. I rarely write anything that requires full sequential consistency.
Hum, I do not think you can implement it with just plain load and stores, I think you need at least one RMW there. Probably acq_rel is enough for that, but that's no less expensive than seq_cst on x86.
To the hardware, yes. To the compiler, no - acquire-release allows significantly wider scope for optimisation. In my earlier benchmarking of test implementations, a CAS lock for each of the future and promise turned out to be faster than any other technique on non-transactional compilers. I had std::lock do the locking of both future and promise before modification. One could probably make that faster, but the ordering is tricky to make bug-free, and I suspect maintenance would be expensive. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
participants (4)
- Giovanni Piero Deretta
- Gottlob Frege
- Jeremy Maitin-Shepard
- Niall Douglas