On Mon, Dec 29, 2014 at 1:19 PM, Thomas M wrote:
On 29/12/2014 04:40, Gruenke,Matt wrote:
-----Original Message-----
From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz
Sent: Sunday, December 28, 2014 21:24
To: boost@lists.boost.org List
Subject: Re: [boost] Synchronization (RE: [compute] review)
On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
Why block when only the source is on the host? Are you worried it might go out of scope?
If so, that's actually a good point. I was just pondering how to write exception-safe code using local host data structures. I guess blocking on all operations involving them is a simple way to ensure nothing is read or written after they go out of scope. It's not the only way that comes to mind (nor the most efficient), but it does the job.
Yes, that is one of the major motivations. Avoiding potential race-conditions with host code accessing the memory at the same time is another. I'd be very open to other solutions.
I find it truly confusing that an algorithm can run either synchronously or asynchronously without its signature clearly and loudly indicating so. In template code (or in general) it can easily be unknown, or non-trivial to find out, whether the input/output ranges refer to the host or the device, and thus whether the algorithm will execute in synchronous or asynchronous mode, and what that implies for the rest of the code around the algorithm.
If I understood the behavior of transform correctly (and assuming that for device_vector the input/output ranges count as device-side [?]), am I correct that the following can easily fail?:
    compute::command_queue queue1(context, device);
    compute::command_queue queue2(context, device);

    compute::vector<float> device_vector(n, context);
    // copy some data to device_vector

    // use queue1
    boost::compute::transform(device_vector.begin(), device_vector.end(),
                              device_vector.begin(), compute::sqrt<float>(),
                              queue1);

    // use queue2
    compute::copy(device_vector.begin(), device_vector.end(),
                  some_host_vector.begin(), queue2);
And currently the way to make this behave properly would be to force queue1 to wait for completion of any enqueued job (note: it may be an out-of-order queue!) after transform has been called?
Well, this is essentially equivalent to having two separate host threads both reading and writing the same region of memory at the same time; of course you need to synchronize them. For this specific case you could just enqueue a barrier to ensure queue2 doesn't begin its operation before queue1 completes:

    // before calling copy() on queue2:
    queue2.enqueue_barrier(queue1.enqueue_marker());
One option would be to always treat algorithms as asynchronous at the API level (even if internally they may run synchronously) and always associate them with an event. Another is to provide a synchronous and an asynchronous overload. I'd certainly prefer to know whether it runs synchronously or asynchronously just by looking at the transform invocation itself.
Well, let me make this clearer: transform() always runs asynchronously. The only algorithm you really have to worry about is copy(), as it is responsible for moving data between the host and device and will do this synchronously. If you want an asynchronous copy, use copy_async(), which returns a future that can be used to wait for the copy operation to complete.
With respect to exception safety, is there any proper behavior defined by your library if transform has been enqueued to run asynchronously, but device_vector goes out of scope before it has completed (e.g. due to an exception thrown in the host code following the transform)? Or is it the user's responsibility to ensure that, whatever happens, device_vector lives until the transform has completed?
The user must ensure that the memory being written to remains valid until the operation completes. Simply imagine you are calling std::transform() on a std::vector<> from a separate std::thread: you must wait for that thread to complete its work before destroying the memory it is writing to. Operations on the compute device can be reasoned about in a similar manner.
I have some rough ideas, but they'd probably have a deeper impact on your API than you want, at this stage.
Instead, I'm thinking mostly about how to make exception-safe use of the async copy commands to/from host memory. I think async copies will quickly gain popularity with advanced users and will probably be one of the top optimization tips. I guess it'd be nice to have a scope guard that blocks on a boost::compute::event.
Here's another sketch, also considering the points above, though I obviously don't know if it's doable given the implementation and other design considerations I might miss, so apologies if it's nonsense.
If input/output ranges generally refer to iterators from the boost::compute library, then:
-) an iterator can store the container (or other data structure it belongs to, if any)
-) an algorithm can, via the iterators, "communicate" with the container(s)
For an input operation the data must remain available and unaltered from the time the operation is enqueued until its completion. So when transform (as an example) is launched, it can inform the input container that any subsequent modification of it (including destruction or setting new values through iterators) must wait until that input operation has completed, i.e. the first modifying operation blocks until it has finished. Similarly for the output range, except that there any read operation must also block until the output data from the transform has been written to it. So:
-) no matter what causes the destruction of a container (e.g. regularly reaching the end of a block, an exception, etc.), the lifetime of the container/iterators extends until the asynchronous operation on it has finished; thus thrown exceptions are implicitly handled.
-) to the user the code appears synchronous with respect to visible behavior, but can run asynchronously in the background.
Obviously a full-fledged version is neither trivial nor cheap with respect to performance (e.g. checking on every read/write to a container whether it must block), let alone the threading aspects. But maybe just parts of it are useful, e.g. deferring container destruction until no OpenCL operation is enqueued to work on the container (-> handling exceptions). I think there's a wide range of balances between what the implementation does automagically and what constraints are placed on the user not to do "stupid" things.
While these are interesting ideas, I feel like this sort of behavior is more high-level/advanced than what the Boost.Compute algorithms are meant to do. I have tried to mimic the "iterators and algorithms" paradigm from the STL as closely as possible, as I consider the design quite elegant. However, these sorts of techniques could definitely be implemented on top of Boost.Compute. I would be very interested to see a proof-of-concept demonstrating these ideas; would you be interested in working on this? -kyle