On Mon, Dec 29, 2014 at 1:19 PM, Thomas M wrote:
On 29/12/2014 04:40, Gruenke,Matt wrote:
-----Original Message-----
From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz
Sent: Sunday, December 28, 2014 21:24
To: boost@lists.boost.org List
Subject: Re: [boost] Synchronization (RE: [compute] review)
On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
Why block when only the source is on the host? Are you worried it might go out of scope?
If so, that's actually a good point. I was just pondering how to write exception-safe code using local host data structures. I guess blocking on all operations involving them is a simple way to ensure nothing is read or written after they go out of scope. It's not the only way that comes to mind (nor the most efficient), but it does the job.
Yes, that is one of the major motivations. Avoiding potential race-conditions with host code accessing the memory at the same time is another. I'd be very open to other solutions.
I find it truly confusing that an algorithm can run either synchronously or asynchronously without its signature clearly and loudly indicating so. In template code (or in general) it can easily be unknown, or non-trivial to find out, whether the input/output ranges refer to the host or the device, and thus whether the algorithm will execute in synchronous or asynchronous mode, and what that implies for the rest of the code around the algorithm.
If I understood the behavior of transform correctly (and assuming that for device_vector the input/output ranges count as device-side [?]), am I correct that the following can easily fail?:
    compute::command_queue queue1(context, device);
    compute::command_queue queue2(context, device);

    compute::vector<float> device_vector(n, context);
    // copy some data to device_vector

    // use queue1
    boost::compute::transform(device_vector.begin(), device_vector.end(),
                              device_vector.begin(), compute::sqrt<float>(),
                              queue1);

    // use queue2
    compute::copy(device_vector.begin(), device_vector.end(),
                  some_host_vector.begin(), queue2);
And currently the way to make this behave properly would be to force queue1 to wait for completion of any enqueued job (note: it may be an out-of-order queue!) after transform has been called?
Well, this is essentially equivalent to having two separate host threads both reading and writing the same region of memory at the same time; of course you need to synchronize them. For this specific case you could just enqueue a barrier to ensure queue2 doesn't begin its operation before queue1 completes:

    // before calling copy() on queue2:
    queue2.enqueue_barrier(queue1.enqueue_marker());
One option would be to always treat algorithms as asynchronous at the API level (even if internally they may run synchronously) and always associate them with an event. Another is to provide a synchronous and an asynchronous overload. I'd certainly prefer to know whether it runs synchronously or asynchronously just by looking at the transform invocation itself.
Well, let me make this clearer: transform() always runs asynchronously. The only algorithm you really have to worry about is copy(), as it is responsible for moving data between the host and device and will do this synchronously. If you want an asynchronous copy, use copy_async(), which returns a future that can be used to wait for the copy operation to complete.
With respect to exception safety, is there any proper behavior defined by your library if transform has been enqueued to run asynchronously, but device_vector goes out of scope before it has completed (e.g. due to an exception thrown in the host code following the transform)? Or is it the user's responsibility to ensure that, whatever happens, device_vector lives until the transform has completed?
The user must ensure that the memory being written to remains valid until the operation completes. Simply imagine you are calling std::transform() on a std::vector<> from a separate std::thread: you must wait for that thread to complete its work before destroying the memory it is writing to. Operations on the compute device can be reasoned about in a similar manner.
I have some rough ideas, but they'd probably have a deeper impact on your API than you want, at this stage.
Instead, I'm thinking mostly about how to make exception-safe use of the async copy commands to/from host memory. I think async copies will quickly gain popularity with advanced users and will probably be one of the top optimization tips. I guess it'd be nice to have a scope guard that blocks on a boost::compute::event.
Here's another sketch, also considering the points above, though I obviously don't know if it's doable given the implementation and other design considerations I might miss, so apologies if it's nonsense.
If input/output ranges generally refer to iterators from the boost::compute library, then:
-) an iterator can store the container (or other data structure it belongs to, if any)
-) an algorithm can, via the iterators, "communicate" with the container(s)
For an input operation the data must remain available and unaltered from the time the operation is enqueued until its completion. So when transform (as an example) is launched, it can inform the input container that any subsequent modification of it (including destruction or setting new values through iterators) must wait until that input operation has completed, i.e. the first modifying operation blocks until it has finished. Similarly for the output range, except that there any read operation must also block until the output data from the transform has been written to it. So:
-) no matter what causes the destruction of a container (e.g. regularly reaching the end of a block, an exception, etc.), the lifetime of the container/iterators extends until the asynchronous operation on it has finished; thus thrown exceptions are implicitly handled.
-) to the user the code appears synchronous with respect to visible behavior, but can run asynchronously in the background.
Obviously a full-fledged version is neither trivial nor cheap with respect to performance (e.g. checking on every read/write to a container whether it must block), let alone the threading aspects. But maybe just parts of it are useful, e.g. deferring container destruction until no OpenCL operation is enqueued to work on the container (-> handling exceptions). I think there's a wide range of balances between what the implementation does automagically and what constraints are placed on the user not to do "stupid" things.
While these are interesting ideas, I feel like this sort of behavior is more high-level/advanced than what the Boost.Compute algorithms are meant to do. I have tried to mimic the "iterators and algorithms" paradigm from the STL as closely as possible, as I consider the design quite elegant. However, these sorts of techniques could definitely be implemented on top of Boost.Compute. I would be very interested to see a proof-of-concept demonstrating these ideas; would you be interested in working on this? -kyle