Re: [boost] Synchronization (RE: [compute] review)

29 Dec 2014

On 29/12/2014 22:51, Kyle Lutz wrote:
...
On Mon, Dec 29, 2014 at 1:19 PM, Thomas M <firespot71@gmail.com> wrote:
...
On 29/12/2014 04:40, Gruenke,Matt wrote:
...
-----Original Message-----
From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz
Sent: Sunday, December 28, 2014 21:24
To: boost@lists.boost.org List
Subject: Re: [boost] Synchronization (RE: [compute] review)
...
On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
If I understood the behavior of transform correctly (and assuming that for
device_vector the input/output ranges count as device side [?]), am I
correct that the following can easily fail?:
compute::command_queue queue1(context, device);
compute::command_queue queue2(context, device);
compute::vector<float> device_vector(n, context);
// copy some data to device_vector
// use queue1
boost::compute::transform(device_vector.begin(), device_vector.end(),
                           device_vector.begin(),
                           compute::sqrt<float>(),
                           queue1);
// use queue2
compute::copy(device_vector.begin(), device_vector.end(),
               some_host_vector.begin(), queue2);
And currently the way to make this behave properly would be to force queue1
to wait for completion of any enqueued job (note: it may be an out-of-order
queue!) after transform has been called?
Well, this is essentially equivalent to having two separate
host-threads both reading and writing from the same region of memory
at the same time; of course you need to synchronize them.
For this specific case you could just enqueue a barrier to ensure
queue2 doesn't begin its operation before queue1 completes:
// before calling copy() on queue2:
queue2.enqueue_barrier(queue1.enqueue_marker());
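The host-thread analogy above can be sketched with the C++ standard library alone. This is not Boost.Compute or OpenCL code: the two std::async tasks merely stand in for the two command queues, the std::shared_future plays the role of an event, and the function name sqrt_then_copy is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <future>
#include <vector>

// Hypothetical helper modelling "transform on queue1, then copy on queue2".
// The shared_future stands in for an OpenCL event; nothing here is
// Boost.Compute API.
std::vector<float> sqrt_then_copy(std::vector<float>& data)
{
    // "queue1": take the square root of every element in place.
    std::shared_future<void> transform_done =
        std::async(std::launch::async, [&data] {
            for (float& x : data) x = std::sqrt(x);
        }).share();

    // "queue2": copy the data out, but only after the transform has finished.
    std::future<std::vector<float>> copied =
        std::async(std::launch::async, [&data, transform_done] {
            transform_done.wait();  // without this wait the two tasks race
            return std::vector<float>(data.begin(), data.end());
        });

    // Block until the copy is complete; both tasks are done when we return.
    return copied.get();
}
```

Dropping the wait() call gives the same hazard as the two-queue snippet: the copy may observe values from before, during, or after the transform.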
Sorry, I haven't expressed myself well. Yes, surely I must synchronize
it; it's just that with the OpenCL API itself I can normally provide a
(pointer to a) cl_event when calling clEnqueue... functions, which can
subsequently be used quite flexibly to coordinate other operations (i.e.
without going to the command queue level).
...
...
One way could be to have algorithms always treated as asynchronous at the
API level (even if internally they may run synchronously) and always
associated with an event. Another is to provide a synchronous and an
asynchronous overload. I'd certainly prefer to know whether it runs
synchronously or asynchronously just by looking at the transform invocation
itself.
Well, let me make this clearer: transform() always runs
asynchronously. The only algorithm you really have to worry about is
copy() as it is responsible for moving data between the host and
device and will do this synchronously. If you want an asynchronous
copy then use copy_async() which will return a future that can be used
to wait for the copy operation to complete.
Now I am really confused :) In this thread I have read, quoting:

[Gruenke, Matt]:
"My understanding, based on comments you've made to other reviewers, is 
that functions like boost::compute::transform() are asynchronous when 
the result is on the device, but block when the result is on the host. 
This is what I'm concerned about.  Is it true?"

[Kyle Lutz]
"Yes this is correct. In general, algorithms like transform() are
asynchronous when the input/output ranges are both on the device and
synchronous when one of the ranges is on the host."

This made me believe that some iterators in your library turn 
boost::compute::transform into a synchronous operation, and some into an 
asynchronous one. So now I suppose that this is not the case?
In comparison to the OpenCL runtime execution model, can I consider all 
your algorithms, except copy, as basically acting like 
clEnqueueNDRangeKernel calls, that is, always asynchronous?
...
...
Here's another sketch, also considering the points above.
While these are interesting ideas, I feel like this sort of
behavior is more high-level/advanced than what the Boost.Compute
algorithms are meant to do. I have tried to mimic as closely as possible
the "iterators and algorithms" paradigm from the STL as I consider the
design quite elegant.
However, these sorts of techniques could definitely be implemented on
top of Boost.Compute. I would be very interested to see a
proof-of-concept demonstrating these ideas; would you be interested in
working on this?
Interested yes, time is currently a problem though; I'd need to 
familiarize myself much more deeply with your implementation.
At this stage my main concern is exception safety: how one could relieve 
users, in a simple manner, of the need to manually ensure that 
objects do not go out of scope (due to a thrown exception) while an 
asynchronous OpenCL operation still needs them. Note that because your 
API can throw, I consider exception effects to be of much greater concern 
than with the (implicitly non-throwing) Khronos API; and just enqueuing 
a couple of commands can easily make a proper, manual cleanup 
non-trivial.
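To illustrate the kind of relief I have in mind, here is a minimal sketch using only the C++ standard library. It is an analogy, not Boost.Compute code: a detached thread stands in for the device, a std::future stands in for an event, and the names wait_on_exit, fill_async and enqueue_then_throw are all hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <stdexcept>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical guard: blocks in its destructor until the asynchronous
// operation it watches has finished. Not part of Boost.Compute.
class wait_on_exit
{
public:
    explicit wait_on_exit(std::future<void> pending) : pending_(std::move(pending)) {}
    wait_on_exit(const wait_on_exit&) = delete;
    wait_on_exit& operator=(const wait_on_exit&) = delete;
    ~wait_on_exit()
    {
        // wait() only blocks; it does not rethrow a stored exception.
        if (pending_.valid()) pending_.wait();
    }
private:
    std::future<void> pending_;
};

// Hypothetical stand-in for an asynchronous device operation: a detached
// thread writes into the caller's buffer and signals completion.
std::future<void> fill_async(std::vector<int>& buffer, int value,
                             std::atomic<bool>& finished)
{
    std::promise<void> done;
    std::future<void> result = done.get_future();
    std::thread([&buffer, &finished, value, p = std::move(done)]() mutable {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        for (int& x : buffer) x = value;
        finished.store(true);
        p.set_value();
    }).detach();
    return result;
}

// Returns true if the writer had finished before the buffer was destroyed.
bool enqueue_then_throw()
{
    std::atomic<bool> finished{false};
    try {
        std::vector<int> buffer(1000, 0);  // destroyed during unwinding
        wait_on_exit guard(fill_async(buffer, 7, finished));
        throw std::runtime_error("a later enqueue failed");
    } catch (const std::runtime_error&) {
        // The guard is destroyed before the buffer (reverse order of
        // construction), so the writer is done by the time we get here.
        return finished.load();
    }
}
```

Without the guard, the exception would destroy the buffer while the writer thread is still using it. In a Boost.Compute setting the destructor would presumably wait on an event or on the queue instead of a std::future; how that is best wired in is exactly the open question.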

cheers,
Thomas