On 29/12/2014 04:40, Gruenke,Matt wrote:
-----Original Message-----
From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz
Sent: Sunday, December 28, 2014 21:24
To: boost@lists.boost.org List
Subject: Re: [boost] Synchronization (RE: [compute] review)
On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
Why block when only the source is on the host? Are you worried it might go out of scope?
If so, that's actually not a bad point. I was just pondering how to write exception-safe code using local host data structures. I guess blocking on all operations involving them is a simple way to ensure nothing is read or written after it has gone out of scope. Not the only way that comes to mind (nor the most efficient), but it does the job.
Yes, that is one of the major motivations. Avoiding potential race conditions with host code accessing the memory at the same time is another. I'd be very open to other solutions.
I find it truly confusing that an algorithm can run either synchronously or asynchronously without its signature clearly and loudly indicating so. In template code (or in general) it can easily be unknown, or non-trivial to find out, whether the input/output ranges refer to the host or the device, and thus whether the algorithm will execute synchronously or asynchronously, and what that implies for the rest of the code around the algorithm.

If I understood the behavior of transform correctly (and assuming that for device_vector the input/output ranges count as device side [?]), am I correct that the following can easily fail?

    compute::command_queue queue1(context, device);
    compute::command_queue queue2(context, device);

    compute::vector<float> device_vector(n, context);
    // copy some data to device_vector

    // use queue1
    boost::compute::transform(device_vector.begin(), device_vector.end(),
                              device_vector.begin(),
                              compute::sqrt<float>(), queue1);

    // use queue2
    compute::copy(device_vector.begin(), device_vector.end(),
                  some_host_vector.begin(), queue2);

And currently the way to make this behave properly would be to force queue1 to wait for completion of any enqueued job (note: it may be an out-of-order queue!) after transform has been called?

One way could be to always treat algorithms as asynchronous at the API level (even if internally they may run synchronously) and always associate them with an event. Another is to provide a synchronous and an asynchronous overload. I'd certainly prefer to know whether it runs synchronously or asynchronously just by looking at the transform invocation itself.

With respect to exception safety, is there any proper behavior defined by your library if transform has been enqueued to run asynchronously, but device_vector goes out of scope before it has completed (e.g. due to an exception thrown in the host code following the transform)? Or is it the user's responsibility to ensure that, whatever happens, device_vector must live until the transform has completed?
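To make the "two overloads" suggestion concrete, here is a minimal sketch in plain standard C++. It is not Boost.Compute code: std::async plays the role of a command queue, std::future<void> stands in for boost::compute::event, and the names transform_sqrt, transform_sqrt_async and demo_overloads are hypothetical. The only point is that the call site itself shows whether the caller gets a completion handle back.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <future>
#include <vector>

// All names here are hypothetical; std::future<void> is only a stand-in
// for a completion event such as boost::compute::event.

// Asynchronous form: the "_async" suffix and the returned future make it
// visible at the call site that the work may still be running on return.
// The caller must keep `data` alive and untouched until the future is ready.
std::future<void> transform_sqrt_async(std::vector<float>& data) {
    return std::async(std::launch::async, [&data] {
        std::transform(data.begin(), data.end(), data.begin(),
                       [](float x) { return std::sqrt(x); });
    });
}

// Synchronous form: returns only after the work has completed.
void transform_sqrt(std::vector<float>& data) {
    transform_sqrt_async(data).get();
}

// Mirrors the two-queue example: the dependent copy waits on the
// completion handle instead of relying on any queue ordering.
std::vector<float> demo_overloads() {
    std::vector<float> device_like = {1.0f, 4.0f, 9.0f};
    std::future<void> done = transform_sqrt_async(device_like);
    assert(done.valid());
    done.wait();                                // explicit synchronization point
    std::vector<float> host_copy(device_like);  // safe: the transform has finished
    return host_copy;
}
```

With this shape, a reader of the calling code can tell from the name and the returned handle alone that a wait is needed before the result is consumed.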
I have some rough ideas, but they'd probably have a deeper impact on your API than you want, at this stage.
Instead, I'm thinking mostly about how to make exception-safe use of the async copy commands to/from host memory. I think async copies will quickly gain popularity with advanced users, and will probably be one of the top optimization tips. I guess it'd be nice to have a scope guard that blocks on boost::compute::event.
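A rough sketch of such a scope guard, written against the standard library only so that it stands on its own. The class name wait_on_exit and the function demo_guard are hypothetical, and std::shared_future<void> is an assumed stand-in for boost::compute::event (which also offers a blocking wait). The worker thread simulates an asynchronous copy into local host memory.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <stdexcept>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical guard: blocks in its destructor until the wrapped operation
// has completed. std::shared_future<void> stands in for an event type.
class wait_on_exit {
public:
    explicit wait_on_exit(std::shared_future<void> ev) : ev_(std::move(ev)) {
        assert(ev_.valid());
    }
    wait_on_exit(const wait_on_exit&) = delete;
    wait_on_exit& operator=(const wait_on_exit&) = delete;
    ~wait_on_exit() {
        ev_.wait();  // wait() does not rethrow, so it is safe in a destructor
    }
private:
    std::shared_future<void> ev_;
};

// Simulates an async copy into a local host buffer, followed by host code
// that throws before the copy has finished. Returns whether the copy had
// completed by the time stack unwinding reached the catch handler.
bool demo_guard() {
    bool copy_finished = false;
    bool finished_at_catch = false;
    std::promise<void> done;
    std::shared_future<void> ev = done.get_future().share();
    std::thread worker;
    try {
        std::vector<int> host(4, 0);  // local host memory, destroyed on unwind
        worker = std::thread([&host, &copy_finished, &done] {
            std::this_thread::sleep_for(std::chrono::milliseconds(20));
            for (int& x : host) x = 7;  // the simulated "async copy"
            copy_finished = true;
            done.set_value();
        });
        wait_on_exit guard(ev);  // declared after host, so destroyed before it
        throw std::runtime_error("host-side failure");
    } catch (const std::runtime_error&) {
        finished_at_catch = copy_finished;
    }
    if (worker.joinable()) worker.join();
    return finished_at_catch;
}
```

The guard must be declared after the host buffer it protects, so that unwinding runs the guard's destructor (and its wait) before the buffer is released.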
Here's another sketch, also considering the points above, though I obviously don't know if it's doable given the implementation and other design considerations I might have missed, so apologies if it's nonsense.

If input/output ranges generally refer to iterators from the boost::compute library, then:

-) an iterator can store the container (or other data structure) it belongs to, if any
-) an algorithm can, via the iterators, "communicate" with the container(s)

For an input operation, the data must remain available and unaltered from the time the operation is enqueued until its completion. So when transform (as an example) is launched, it can inform the input data container that any subsequent modification of it (including destruction or setting new values through iterators) must wait until that input operation has completed, i.e. the first modifying operation blocks until the input operation has finished. Similarly for the output range, except that there any read operation must also block until the output data from the transform has been written to it.

So:

-) no matter what causes the destruction of containers (e.g. the regular end of a block being reached, an exception, etc.), the lifetime of the container/iterators extends until the asynchronous operation on it has finished; thus thrown exceptions are implicitly handled.
-) to the user the code appears synchronous with respect to visible behavior, but can run asynchronously in the background.

Obviously a full-fledged version is neither trivial nor cheap with respect to performance (e.g. checking on every read/write to a container whether it must block), let alone threading aspects. But maybe just parts of it are useful, e.g. deferring container destruction until no OpenCL operation is enqueued to work on the container (-> handling exceptions).

I think there's a wide range of possible balances between what the implementation does automagically and what constraints are placed on the user to not do "stupid" things.

Thomas
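A minimal sketch of the reduced variant (blocking reads plus deferred destruction), again in plain standard C++ rather than against the real library. Everything named here (tracked_buffer, attach, wait_all, fill_async and the demo functions) is hypothetical, std::shared_future<void> is an assumed stand-in for an OpenCL event, and the threading aspects mentioned above are ignored: the buffer is assumed to be used from a single host thread.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <future>
#include <stdexcept>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical container: remembers operations enqueued on it and blocks in
// its destructor (and before reads) until all of them have completed.
class tracked_buffer {
public:
    explicit tracked_buffer(std::size_t n) : data_(n, 0.0f) {}
    tracked_buffer(const tracked_buffer&) = delete;
    tracked_buffer& operator=(const tracked_buffer&) = delete;
    ~tracked_buffer() { wait_all(); }  // deferred destruction

    // Called by an algorithm to register an in-flight operation.
    void attach(std::shared_future<void> op) { pending_.push_back(std::move(op)); }

    void wait_all() {
        for (auto& op : pending_) op.wait();
        pending_.clear();
    }

    // A read blocks until every pending operation has finished.
    float at(std::size_t i) {
        wait_all();
        assert(i < data_.size());
        return data_[i];
    }

    std::vector<float>& raw() { return data_; }  // for algorithms only
private:
    std::vector<float> data_;
    std::vector<std::shared_future<void>> pending_;
};

// Hypothetical async algorithm that registers itself with its output buffer.
void fill_async(tracked_buffer& buf, float value) {
    std::vector<float>& raw = buf.raw();
    buf.attach(std::async(std::launch::async, [&raw, value] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        for (float& x : raw) x = value;
    }).share());
}

// Looks synchronous to the caller: the read waits for the pending fill.
float demo_read_blocks() {
    tracked_buffer buf(3);
    fill_async(buf, 2.5f);
    return buf.at(1);
}

// The buffer is destroyed by stack unwinding while an operation is pending;
// its destructor waits, so the operation never touches freed memory.
bool demo_exception_path() {
    bool finished = false;
    try {
        tracked_buffer buf(3);
        std::vector<float>& raw = buf.raw();
        buf.attach(std::async(std::launch::async, [&raw, &finished] {
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
            raw[0] = 1.0f;
            finished = true;
        }).share());
        throw std::runtime_error("host-side failure");
    } catch (const std::runtime_error&) {
        return finished;
    }
}
```

This covers only the output side; tracking pending reads of an input range separately from pending writes would need a second list, which is where the cost and complexity mentioned above start to show.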