On Sun, Dec 21, 2014 at 3:44 AM, Thomas M wrote:
On 20/12/2014 09:32, Kyle Lutz wrote:
Can you give examples of where the OpenCL wrapper types in Boost.Compute are lacking?
What I missed was any interface to cl::... (the Khronos C++ wrapper API).
True. As of now you can convert between any of the types (by using the C++ wrapper type's function call operator to access the underlying OpenCL C type and then passing that to the Boost.Compute type's constructor) like so:

    cl::CommandQueue q1 = ...;
    boost::compute::command_queue q2(q1());

I'm working on adding more direct support for the C++ wrapper types, so in the future you will also be able to do this:

    cl::CommandQueue q1 = ...;
    boost::compute::command_queue q2 = q1;
One particular issue that makes me hesitant is the lack of OpenCL 2.0 support in the "official" C++ bindings. The OpenCL 2.0 specification was released over a year ago (November 2013). The first public OpenCL 2.0 implementation was released by Intel three months ago (September 2014, followed closely by an AMD implementation).
It is correct that Khronos presently only offers a C++ API for OpenCL 1.2, but in practice the far more serious, limiting constraint is that Nvidia only offers a 1.1 _implementation_ anyway. Any portable OpenCL code must restrict itself to 1.1.
Well, portable code need not restrict itself; it merely needs to provide implementations of its functionality which will also work with only OpenCL 1.1 (or even 1.0). For instance, calling some of the higher-level algorithms in Boost.Compute will automatically dispatch based on the version of OpenCL supported by the device and call the more efficient APIs if available, otherwise fall back to the older APIs. Boost.Compute tries as much as possible to shield users from these sorts of backwards-compatibility issues.
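For illustration, a minimal sketch of this kind of version-based dispatch, written here against the plain C API (the exact mechanism inside Boost.Compute may differ):

    #include <CL/cl.h>
    #include <cstdio>

    // Returns true if the device reports at least OpenCL 1.2.
    bool device_supports_cl12(cl_device_id device)
    {
        char version[64] = {0};
        clGetDeviceInfo(device, CL_DEVICE_VERSION,
                        sizeof(version), version, NULL);
        // The version string has the form "OpenCL <major>.<minor> <vendor info>".
        int major = 0, minor = 0;
        std::sscanf(version, "OpenCL %d.%d", &major, &minor);
        return major > 1 || (major == 1 && minor >= 2);
    }

An algorithm could then, e.g., call clEnqueueFillBuffer() (introduced in 1.2) when this returns true and fall back to a hand-written fill kernel otherwise.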
We could discuss issues of the Khronos C++ bindings and your wrapper to no end, but my main point remains: the Khronos version is official and established; an alternative version intended for wider use must be clearly superior and give people strong reasons to migrate (I am not only referring to programming itself, but also to people writing books on OpenCL, blogs, etc.). In my eyes, yours (that is, the core + utilities parts of your library) clearly fails to meet that bar.
Could you let me know what parts of the core wrapper API fail to meet your bar for quality?
Just as one analysis example:
In the plain C API, clEnqueueReadBuffer (surely a commonly used function) has this signature (we use it as the reference):
    cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
                               cl_mem buffer,
                               cl_bool blocking_read,
                               size_t offset,
                               size_t size,
                               void *ptr,
                               cl_uint num_events_in_wait_list,
                               const cl_event *event_wait_list,
                               cl_event *event);
The Khronos C++ bindings version is:
    cl_int cl::CommandQueue::enqueueReadBuffer(const Buffer &buffer,
                                               cl_bool blocking_read,
                                               ::size_t offset,
                                               ::size_t size,
                                               const void *ptr,
                                               const VECTOR_CLASS<Event> *events = NULL,
                                               Event *event = NULL);
And your "main" version is:
    void enqueue_read_buffer(const buffer &buffer,
                             size_t offset,
                             size_t size,
                             void *host_ptr,
                             const wait_list &events = wait_list());
The Khronos C++ version fully matches the C version in argument logic and order, and also returns a cl_int error code; the two correspond to each other, making it easy to use one or the other. Your version a) misses the cl_bool for blocking/non-blocking, b) misses the last event argument, and c) throws (I suppose) instead of returning an error code. Let's go through these:

a) blocking: If I remember my code inspection correctly, your version is always blocking, and a non-blocking version is (I guess?) given by enqueue_read_buffer_async, which has a different signature (return value) besides its different name. So instead of being able to set a single cl_bool and pass it as a standard argument, I need to take care which function (name) to call, what it returns, etc. Studying your library docs I find very little information on what makes them different; specifically, nowhere does it say that enqueue_read_buffer _is_ a blocking operation, it only says that it _enqueues_ a read command. Both functions then simply refer to clEnqueueReadBuffer(), which does not help matters at all given the different signature. Now take a look at the Khronos C++ documentation and one encounters a far, far more detailed description. Its signature is already quite clear, and with the elaborate description it's really clear.
Yes, I've split the blocking and non-blocking memory copy operations into separate functions. Personally, I've never been fond of APIs which drastically change behavior based on a single boolean flag. Also, this is more in line with the API provided by other libraries like Boost.ASIO (e.g. boost::asio::read() vs. boost::asio::async_read()). I'll also definitely work on improving the documentation for these functions.
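For readers comparing the two variants, a minimal usage sketch (assuming a command_queue queue, a buffer buf, and a suitably sized host_ptr are already set up):

    // blocking read: returns only once the data is in host memory
    queue.enqueue_read_buffer(buf, 0, size, host_ptr);

    // non-blocking read: returns immediately with an event to synchronize on
    boost::compute::event e =
        queue.enqueue_read_buffer_async(buf, 0, size, host_ptr);
    e.wait();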
b) Why your version lacks the last event argument entirely, I have no idea. Anything that can be done with it in the C / C++ API (e.g. OpenCL runtime-level profiling; controlling an out-of-order command queue / another command queue) seems to be undoable.
My main motivation for returning "void" was to prevent users from attempting to build asynchronous pipelines but accidentally using an event object returned from a synchronous API function, causing the whole thing to become synchronous. But I see your point that there may be legitimate uses for event objects associated with synchronous operations. It should be relatively easy to update these APIs (return void -> return event). I'll look into this.
c) error handling: I'd much prefer some policy setting which specifies if an exception is thrown on error (the usual custom in C++) or an error code is returned by the function (the usual OpenCL behaviour).
This policy setting is implemented using the Boost Exception library [1]. Users may define the BOOST_NO_EXCEPTIONS configuration macro which will keep the library from throwing exceptions and instead call a user-defined error handling function. This technique is used nearly universally in Boost and I think is superior to approaches which would change the function signature based on a policy configuration.
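For reference, a minimal sketch of the handler Boost expects when BOOST_NO_EXCEPTIONS is defined (the logging and abort policy here are placeholders, not prescribed by the library):

    // translation unit compiled with -DBOOST_NO_EXCEPTIONS
    #include <boost/throw_exception.hpp>
    #include <cstdio>
    #include <cstdlib>

    // With BOOST_NO_EXCEPTIONS defined, Boost (and thus Boost.Compute)
    // calls this user-defined handler instead of throwing.
    void boost::throw_exception(std::exception const &e)
    {
        std::fprintf(stderr, "OpenCL error: %s\n", e.what());
        std::abort(); // the handler must not return
    }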
Reviewer note: my original review said that the documentation is well done, but this did not refer to the core + utilities parts, of which I had always been critical.
I'll definitely continue to work on improving the documentation, especially to address the points you've made above.
Another huge issue in practice is reliability + stability. Not only does the OpenCL execution model by itself make committing errors relatively easy, but I guess many of us can tell stories of buggy OpenCL implementations having hit us deeply. When something goes wrong, figuring out where/why exactly can be a really slow, frustrating experience. So with respect to this, the bar for any additional layer developed outside Khronos (which has the OpenCL implementers on board!) must also be set very, very high.
I've worked hard to ensure that all error codes from OpenCL functions are checked and any errors are properly propagated back to the user. If you find any place where this error handling is missing, please let me know.
Let me reiterate that my main concern is not fixing the example above; my main point is that any alternative version must be rock-solid from the ground up on the one hand, plus offer considerable benefits to warrant migration considerations. Incidentally, if there is no Khronos OpenCL 2.0 C++ wrapping out there yet, have you ever given a thought to getting in touch with the Khronos Group to see if they are interested in working together and merging efforts?
Note for clarity: I am personally neither involved with the Khronos Group, nor an OpenCL implementer, nor otherwise involved in distributing OpenCL [no book author, etc.]; just a programmer using OpenCL.
On the other hand, to those rather new to OpenCL, a simplified, less error-prone design would be beneficial ...
I have put much consideration into this and ultimately I don't feel it is right for the data-copying to be made implicit and hidden away from the user. ... In any case, there is a simplified, high-level API available in Boost.Compute which allows kernels to directly operate on host memory. See the mapped_view class.
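A minimal usage sketch (the constructor arguments are inferred from the class's documented purpose; treat the details as illustrative, with context being an existing boost::compute::context):

    #include <boost/compute/container/mapped_view.hpp>

    float host_data[1024];
    // ... fill host_data ...

    // wrap the existing host memory; kernels can then operate on it
    // directly through the view's underlying mapped buffer
    boost::compute::mapped_view<float> view(host_data, 1024, context);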
mapped_view uses a direct host-memory mapped buffer scheme, through the CL_MEM_USE_HOST_PTR flag (again I had to go into the implementation code to confirm that initial suspicion; the docs don't say it, and other implementation schemes would also have been possible, e.g. it could have used CL_MEM_ALLOC_HOST_PTR). CL_MEM_USE_HOST_PTR is only very exceptionally useful in practice.
Umm, CL_MEM_USE_HOST_PTR and CL_MEM_ALLOC_HOST_PTR do very different things. In Boost.Compute, the mapped_view class provides an abstraction around the former, while the pinned_allocator class provides an abstraction around the latter.
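In plain C API terms (context, size, existing_host_ptr, and err assumed to be set up), the difference is:

    // CL_MEM_USE_HOST_PTR: the buffer wraps memory the host already owns;
    // the runtime may access existing_host_ptr directly
    cl_mem wrapped = clCreateBuffer(context,
                                    CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                    size, existing_host_ptr, &err);

    // CL_MEM_ALLOC_HOST_PTR: the runtime itself allocates host-accessible
    // (typically pinned) memory; no host pointer is supplied
    cl_mem pinned = clCreateBuffer(context,
                                   CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                   size, NULL, &err);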
However, I fully agree with you that users should not be forced to use exclusively the one or the other. My proposal aimed at offering a variety of high-level abstractions from which users can choose:
-) if an input or output to a Boost.Compute algorithm call is a plain host-side object (e.g. a std:: container), automatically copy the data to the OpenCL runtime (if input), run the kernel, and automatically copy the data from the OpenCL runtime back to the host (if output).
This is actually already implemented for the copy() and sort() algorithms. For example, this code will automatically copy the data to the device, sort it, and copy it back to the host:

    std::vector<int> vec = ...;
    boost::compute::sort(vec.begin(), vec.end());

I plan on implementing this support for host iterators more widely in the API, just haven't had the time.
-) if an input or output to an algorithm call is a plain OpenCL object (e.g. cl_mem or some C++ class wrapping it) the user will need to take care of any copying to host memory.
I'm not sure I understand this completely (maybe you could provide some example code?). Currently, like in the STL, the algorithms accept iterators rather than containers/memory objects. This is the same approach taken in Boost.Compute. However, it's easy to wrap an arbitrary OpenCL buffer with an iterator by using the buffer_iterator class [2]:

    // opencl memory buffer
    cl_mem mem = ...;

    // fill mem with zeros
    boost::compute::fill_n(make_buffer_iterator<int>(mem, 0),
                           memory_size / sizeof(int), 0, queue);
-) if an input or output to an algorithm call is something which links a plain host object (e.g. a std:: container) to an OpenCL object (e.g. cl_mem) - I have sketched such a class in my original post - ensure that upon any access to either, the data becomes automatically synchronized.
One can imagine additional / mixed schemes as well, temporarily disabling the default behaviour, etc.
This synchronized container concept is an interesting idea. If you have the time/desire to draw up a working proof-of-concept I'd be very interested in getting it integrated into Boost.Compute.
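As a rough starting point, a minimal sketch of such a container built on the existing API (the class name, the lazy dirty-flag design, and all details are hypothetical, not part of Boost.Compute):

    #include <vector>
    #include <boost/compute/command_queue.hpp>
    #include <boost/compute/container/vector.hpp>
    #include <boost/compute/algorithm/copy.hpp>

    // Hypothetical: keeps a host std::vector and a device
    // boost::compute::vector in sync, copying lazily on access.
    template<class T>
    class synced_vector
    {
    public:
        synced_vector(size_t n, boost::compute::command_queue &queue)
            : m_host(n), m_device(n, queue.get_context()),
              m_queue(queue), m_host_dirty(false), m_device_dirty(false)
        {
        }

        // host-side access: pull from the device first if it holds newer data
        std::vector<T>& host()
        {
            if(m_device_dirty){
                boost::compute::copy(m_device.begin(), m_device.end(),
                                     m_host.begin(), m_queue);
                m_device_dirty = false;
            }
            m_host_dirty = true; // caller may now modify the host data
            return m_host;
        }

        // device-side access: push from the host first if it holds newer data
        boost::compute::vector<T>& device()
        {
            if(m_host_dirty){
                boost::compute::copy(m_host.begin(), m_host.end(),
                                     m_device.begin(), m_queue);
                m_host_dirty = false;
            }
            m_device_dirty = true; // a kernel may now modify the device data
            return m_device;
        }

    private:
        std::vector<T> m_host;
        boost::compute::vector<T> m_device;
        boost::compute::command_queue &m_queue;
        bool m_host_dirty, m_device_dirty;
    };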
These are definitely ideas I've thought about and these kinds of tools could all be built upon the current API...
...
See my response above; I'm not a huge fan of these automatic data transfer abstractions. However, I could be in favor of offering this functionality in a higher-level API.
My recommendation and reviewer comment is: your library shall offer high(er)-level abstractions of things NOT already offered by Khronos, by building on top of the Khronos APIs (incl. their C++ wrapper). Your STL-ish algorithms and their supporting functionality are one aspect here; the features just discussed are another example: those who want low-level control already have it and can use it (through the Khronos APIs), while those who desire an "auto-synchronization" between host and OpenCL runtime presently do NOT have anything. Here the real benefit would come in.
2 other examples that come to mind (real-world applications from my work):
-) smoothly support utilizing multiple devices (e.g. auto-splitting kernel workloads across devices and data synchronization among devices; the details will depend quite a bit on whether the devices do or do not belong to the same context).
This is something I have had on the road-map for a long while. I agree that building the infrastructure for these device-distributed algorithms would be very useful.
-) specifying "operation sequences", like: "copy input data A to OpenCL, copy input data B, run kernels 1 and 2, let an intermediate result decide whether kernel 3 or 4 runs next, and then copy data C to the host". Such pre-specified sequences implicitly also help to improve performance (e.g. only the last command needs to be blocking / waited on).
This is also something I have played around with. Basically I'd like to have an API which allows users to define "pipelines" or "task-graphs" which hook up several different kernels/algorithms/memory-copies and produce an efficient set of operations to stream data through and extract the results. Any ideas you have on a potential API you'd like to see for this would be great. There is potentially some prior art in the C++ pipelines proposal [3] which may be interesting.
3) For float/double input containers, compute::accumulate falls back to a plain serial reduction.
Because floating-point addition is not associative. Doing this would lead to accumulate() producing different results on the device versus the host.
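(A minimal single-precision illustration of the non-associativity:)

    // Grouping changes the result: 2^24 + 1 is not representable as a float,
    // so each individual +1.0f on the left-hand side is lost to rounding.
    float a = 16777216.0f;           // 2^24
    float left  = (a + 1.0f) + 1.0f; // 16777216.0f
    float right = a + (1.0f + 1.0f); // 16777218.0f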
This argument is not clear to me. Floating-point results must, generally speaking, always be expected to deviate slightly between a plain C++ and an OpenCL execution (subject to the specific device hardware + driver version + compilation settings used; for pretty much any kernel). I'd say that when someone defers any floating-point calculation to OpenCL, this is implicitly acknowledged, in exchange for the benefit of a parallelized execution.
Strongly disagree, the floating-point operations on the device are well defined and their output should be identical to the host results (barring optimizations like "-cl-fast-relaxed-math"). More concretely, this test should always pass for any set of input data:

    float data[] = { 1.1f, 2.3f, 3.4f, 4.5f };
    std::vector<float> host_vec(data, data + 4);
    boost::compute::vector<float> device_vec(data, data + 4, queue);

    BOOST_CHECK_EQUAL(
        std::accumulate(host_vec.begin(), host_vec.end(), 0),
        boost::compute::accumulate(device_vec.begin(), device_vec.end(), 0, queue)
    );

I don't feel that sacrificing precision is always implicitly acknowledged and prefer to leave the choice of precision vs. performance up to the user.
True, the run-time compilation model provided by OpenCL does have some associated overhead. There are a few techniques in Boost.Compute which help mitigate this ... The other is support for offline caching of program binaries. This is enabled by defining "BOOST_COMPUTE_USE_OFFLINE_CACHE" and causes Boost.Compute to cache binaries for programs so that they are only compiled the very first time they are run on the system.
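Usage is just the macro definition, which must be in effect before any Boost.Compute header is included (typically set on the compiler command line):

    // e.g. g++ -DBOOST_COMPUTE_USE_OFFLINE_CACHE ..., or equivalently:
    #define BOOST_COMPUTE_USE_OFFLINE_CACHE
    #include <boost/compute.hpp>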
Has this been well tested, including scenarios such as OpenCL driver version changes between runs, and how does it interact with Nvidia's own auto-caching mechanism (which can drive one nuts when it fails to detect changes and happily uses an outdated binary)?
I am very confident that the caching system works and it will properly deal with changing driver versions. It was added nearly a year ago and I haven't encountered any problems with it or received any bug reports. And it essentially duplicates the functionality of NVIDIA's offline caching functionality but makes it available to all OpenCL platforms. I haven't had any ill effects using both together on the same system.
I would find it very useful if smart algorithms dispatched to a plain C++ algorithm when it is really predictable that a GPU execution will just waste time.
I disagree, I think the call on whether to execute the algorithm on the host or on an accelerator (with its associated overhead) should be left up to the user (or potentially a higher-level API built on top of Boost.Compute) ... While I agree that this would be a useful feature, I just don't think Boost.Compute is the right place for that logic to live.
It is meant as another example of a higher-level abstraction that can be done; in my opinion Boost.Compute would be the perfect place to put it in.
There is no need to implement a number of the suggested higher-level abstractions right away (or, for some, ever); this shall help find the right niche for your library (now and in the long run). As such, the recommendations / criticisms raised here are by no means intended to be destructive; I just bring in my view of how I see your library best serving the community.
These are definitely some good ideas. Thanks for all the feedback!

-kyle

[1] http://www.boost.org/doc/libs/1_57_0/libs/exception/doc/boost-exception.html
[2] http://kylelutz.github.io/compute/boost/compute/buffer_iterator.html
[3] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3534.html