Re: [boost] [compute] Review period starts today December 15, 2014, ends on December 24, 2014
On Fri, Dec 19, 2014 at 1:07 PM, Thomas M
Dear all,
Review of the Compute library starts today
Having spent the last 3 years on a larger-scale C++ project utilizing OpenCL as the main computing engine, I have no doubt that a C++ GPGPU library is worthwhile. My evaluation was based on studying the full docs, the tutorial examples, creating my own (simple) applications, and inspecting selected implementation details.
Thanks for the review! I've addressed your comments in-line below:
I liked most the portable STL-like algorithms; combined with fairly straightforward on-the-fly specification of kernel functions (including lambda expression support), GPGPU utilization becomes much more accessible for everyday C++ programs. The design of this library part is rather clean and aligns well with C++ standard customs. I have checked a handful of function interfaces and they correspond to their C++ STL variants.
However I have also encountered a number of issues, of which I consider the overall library's design/aim to be the most severe:
The Khronos Group has already provided (for years) a C++ bindings API itself (https://www.khronos.org/registry/cl/specs/opencl-cplusplus-1.2.pdf). Frankly the Khronos API is not an example of clean, modern C++, but it provides very fine-grained operations control (which can be crucial for effective GPGPU performance), is developed (although a bit lagged) alongside OpenCL itself, and is covered in every elaborate OpenCL textbook. It is thus IMHO the current de-facto C++ OpenCL wrapper standard. The proposed Boost library's core & utility parts (heavy in total!) seem to do just the same interface-wise, yet lack some important features, e.g. detailed control flow (blocking / event setting) or image classes, or deviate in subtle signature details, making it really difficult to grasp which does exactly what / behaves differently. A Boost library should not start from scratch but integrate with and extend the Khronos API, both at the C and C++ bindings level (e.g. providing the STL-like algorithms on top of them in a straightforward-to-use manner). Programmers could thus rely on established practices (personally I wouldn't switch away from the Khronos C++ API as my main underlying workhorse!) yet benefit from the extended functionality provided by the Boost library.
Can you give examples of where the OpenCL wrapper types in Boost.Compute are lacking? I am aware of the issue with image classes. While they do exist in Boost.Compute, their APIs are just not documented yet (though there are quite a few examples/tests which demonstrate their usage). I'll work on this.

One particular issue that makes me hesitant is the lack of OpenCL 2.0 support in the "official" C++ bindings. The OpenCL 2.0 specification was released over a year ago (November 2013). The first public OpenCL 2.0 implementation was released by Intel three months ago (September 2014, followed closely by an AMD implementation). Boost.Compute had OpenCL 2.0 support implemented a week later. As of today (over a year since the specification was released), there is still no support for OpenCL 2.0 in the Khronos C++ OpenCL wrapper. I don't think it would be prudent to restrict Boost.Compute to a subset of the OpenCL API merely because of shortcomings in the "official" C++ wrapper.

Other issues include lack of support for move semantics (as well as various other C++11 features) and differing object sizes for wrapped types versus the OpenCL C API types (e.g. "sizeof(cl::Device) != sizeof(cl_device_id)", which prevents the ability to pass arrays of wrapped object types directly to C APIs).

Overall I think implementing our own wrapper directly based on the C API reduces overall complexity (no workarounds for bugs/shortcomings in the C++ wrapper layer), allows us to fix bugs and issue updates quickly, and provides a more consistent and, IMHO, better API for OpenCL. Also, as they both just wrap the underlying OpenCL C API, interoperating between Boost.Compute wrapper types and the Khronos C++ wrapper types is trivial. If there are specific issues with the wrappers in Boost.Compute please submit a bug report [1].
On the other hand, to those rather new to OpenCL a simplified, less error-prone design would be beneficial; equally, this can raise everyone's productivity. The current design/implementation follows the typical OpenCL execution model incl. some of its caveats (see tutorial "Transforming Data"): explicitly copying input data to the device, executing kernel(s), and then copying the output back to the host. Frequently, however, the whole emphasis is on the kernel invocation (run an algorithm on the data), rendering the copying an implementation detail that is "done because it must be done" but otherwise makes code longer and constitutes an error source if forgotten.
I have put much consideration into this and ultimately I don't feel it is right for the data-copying to be made implicit and hidden away from the user. I prefer the API which makes the memory copying operations and the host-device synchronization points explicit. These "under-the-hood" copy operations can be major sources of under-performing GPU-based applications and I don't think insulating the programmer from them makes the situation better. In any case, there is a simplified, high-level API available in Boost.Compute which allows kernels to directly operate on host memory. See the mapped_view class [2].
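For example, a minimal sketch of that style of usage (assuming mapped_view's constructor takes a host pointer, an element count, and a context, as documented in [2]):

#include <vector>
#include <boost/compute.hpp>

namespace compute = boost::compute;

int main()
{
    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

    // plain host memory; no explicit copy calls anywhere
    std::vector<float> vec(10000, 2.0f);

    // wrap the host memory so kernels can operate on it directly
    compute::mapped_view<float> view(&vec[0], vec.size(), context);

    compute::transform(view.begin(), view.end(), view.begin(),
                       compute::sqrt<float>(), queue);

    queue.finish(); // a map/unmap may additionally be needed for
                    // guaranteed host visibility of the results
}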
I hence wonder if the following overall design would be more appropriate (this is by no means a request to do it this way; I am just trying to bring in alternative perspectives):
1) build on top of the Khronos C / C++ bindings API, i.e. use that as the base instead of the library's own core + utility parts
2) offer a high-level interface for algorithm execution that exposes users to as little OpenCL internals as possible while giving the algorithms lots of flexibility
3) offer a high-level interface that auto-connects STL containers with OpenCL memory objects, implemented based on standard C++ / Khronos API classes
4) offer a high-level interface that applies the algorithms directly to objects of the Khronos C / C++ API
The first point is rather obvious. To me the proposed library parts core + utility appear as just another C++ wrapper, and unless this is done extremely (!!!) well (i.e. offering every functionality the Khronos API does, yet making it clean from scratch, aligning it with standard C++, extending it by essential features etc., and ensuring rock-solid quality control for reliability; I'd set the bar really high here) I see no reason to do it. If people are forced to use the proposed library's core wrapper just to gain access to the other functionality (and there is good other functionality in there!) then I think there is a serious risk that a considerable number of people simply turn away altogether.
Well, I would question the premise that the Khronos C++ wrapper is done "extremely well" and has "rock-solid quality" ;-). Could you let me know which parts of the core wrapper API fail to meet your bar for quality?
With respect to the second point I suppose something like this is doable:
// BEGIN
compute::gpgpu_engine gpgpuEngine; // default initialization links it to a default device etc.

// create vector, fill with data - ordinary C++
std::vector<float> vec(10000);
std::generate(vec.begin(), vec.end(), rand);

compute::transform(vec.begin(), vec.end(), vec.begin(),
                   compute::sqrt<float>(), gpgpuEngine);

std::cout << vec[0]; // results are already on the host
// END
So an instance of gpgpu_engine (or whatever name) gets set up once and can, if needed, be customized in its behaviour (devices used, execution policies etc.). This engine can then internally (hidden from the user, like a std::vector manages its memory) manage buffers, and when transform now gets invoked it:

-) copies the data to one of its buffers (creating one if none is available)
-) runs the kernel
-) copies the data back to the host-side container (keeping the buffer for reuse later)
This would bring several advantages:

-) the code becomes very similar to ordinary C++ code -> short-handed, less error-prone
-) buffers can be recycled among multiple algorithm calls (e.g. the engine can cache a small number of buffers immediately available for future calls)
-) more efficient OpenCL runtime utilization (performance), because the whole operation sequence has been abstracted: e.g. the input copy operation can be enqueued in a non-blocking fashion, so the data transfer to the device and the Boost.Compute kernel preparation can occur concurrently; equally, while the kernel runs the copy-back-to-host command can already be enqueued
-) gpgpu_engine can encapsulate a number of policies that control its behaviour (providing sensible default configurations but allowing fine-grained control if desired), e.g.: error handling (throwing an exception vs. setting some plain old OpenCL error code); the device to execute on (if multiple are available); copying back to the host in a non-blocking manner (like copy_async); allowing the selection of a 'smart' execution path (for example, if the input data do not warrant the overhead of a GPU call [e.g. if they are too small or do not fit GPU computation problems well], defer the call to a plain STL-algorithm call or use OpenCL's built-in native C++ threading functionality); etc. It would be beneficial if those options could be temporarily overridden (something like the boost::ios_..._saver classes comes to mind).
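To make the proposal concrete, a rough sketch of such an engine (hypothetical code - gpgpu_engine does not exist; it is written here on top of the current Boost.Compute API purely for illustration, with buffer caching and policies omitted):

#include <vector>
#include <iterator>
#include <boost/compute.hpp>

namespace compute = boost::compute;

class gpgpu_engine
{
public:
    gpgpu_engine()
        : device_(compute::system::default_device()),
          context_(device_),
          queue_(context_, device_)
    {
    }

    // copy to device, run the kernel, copy back - all hidden from the caller
    template<class HostIterator, class Function>
    void transform(HostIterator first, HostIterator last,
                   HostIterator result, Function f)
    {
        typedef typename std::iterator_traits<HostIterator>::value_type T;

        compute::vector<T> buf(std::distance(first, last), context_);
        compute::copy(first, last, buf.begin(), queue_);       // host -> device
        compute::transform(buf.begin(), buf.end(), buf.begin(), f, queue_);
        compute::copy(buf.begin(), buf.end(), result, queue_); // device -> host
    }

private:
    compute::device device_;
    compute::context context_;
    compute::command_queue queue_;
};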
With respect to the third point I am thinking of something along the lines of:
template <typename T>
class vector_buffer
{
private:
    std::vector<T> vec;
    cl::Buffer buf;
};

The class ensures that the std::vector and the buffer are automatically synchronized whenever changes must become transparent (i.e. on access). Obviously this requires some thought as to whether get functions should grant access to the plain std::vector / cl_mem / cl::Buffer; however, for the present implementation I also don't see what would stop me from hijacking a cl_mem from a compute::vector and modifying the buffer arbitrarily outside the compute::vector class.
These are definitely ideas I've thought about and these kinds of tools could all be built upon the current API. I've played around with basically a "kernel functor" abstraction which provides a class with the regular C++ "operator()(Args...)" function and which stores internally an OpenCL kernel along with a command queue and any buffers it needs. This would then provide a simple, pure C++ interface for executing functions on the GPU that looks just like a normal C++ function call. I'd be very interested in pursuing this more.
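A bare-bones sketch of that idea (illustrative only; saxpy_functor is not a Boost.Compute API, it is just built from the existing kernel/program/command_queue classes):

#include <boost/compute.hpp>

namespace compute = boost::compute;

// wraps a compiled kernel and a queue behind a plain function-call operator
class saxpy_functor
{
public:
    saxpy_functor(const compute::context &context,
                  const compute::command_queue &queue)
        : queue_(queue)
    {
        const char source[] = BOOST_COMPUTE_STRINGIZE_SOURCE(
            __kernel void saxpy(__global float *y,
                                __global const float *x,
                                const float a)
            {
                const uint i = get_global_id(0);
                y[i] = a * x[i] + y[i];
            }
        );

        compute::program program =
            compute::program::build_with_source(source, context);
        kernel_ = program.create_kernel("saxpy");
    }

    // looks like an ordinary function call, but runs on the device
    void operator()(compute::vector<float> &y,
                    const compute::vector<float> &x,
                    float a)
    {
        kernel_.set_arg(0, y.get_buffer());
        kernel_.set_arg(1, x.get_buffer());
        kernel_.set_arg(2, a);
        queue_.enqueue_1d_range_kernel(kernel_, 0, y.size(), 0);
    }

private:
    compute::kernel kernel_;
    compute::command_queue queue_;
};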
For 4) I am thinking of overloads for the algorithms to (more or less) directly accept Khronos C / C++ objects. Some really light-weight adapters adding e.g. the required type + size data could do the trick.
This is fairly trivial. I have a patch in the works which would allow the Khronos C++ types to be passed anywhere that Boost.Compute wrapper types are used to make this all virtually transparent to the user.
In general I'd find it useful if a host object, a device object, and something linking a host object with a device object could each be an input / output of an algorithm, with the implementation taking care of automatic data transfer. So if the input refers to a host object the implementation automatically copies the data to the device, if the output is a host object it also automatically copies the result to it, etc.
See my response above, I'm not a huge fan of these automatic data transfer abstractions. However, I could be in favor of offering this functionality in a higher-level API.
A final but probably very important design consideration: I wonder whether Boost needs an OpenCL-computing library or a general parallelization library. Presently the GPGPU world is already split too much between CUDA and OpenCL as the main players (hardware vendors doing their parts ...), and the technology is moving really rapidly (APUs etc.). As Hartmut has already pointed out, one approach could be to use the current proposal as the foundation for a parallelization implementation: cut it down to the essentials of that and hide as many OpenCL implementation details as possible. A completely different approach could be to try coming up with a unifying parallelization framework that supports multiple backends (OpenCL, CUDA, others). Obviously this would be a tremendous amount of work (and getting that API right is probably extremely difficult - the last thing we'd need is just another restricted API causing more splitting) but in the long run it could be the more rewarding solution.
I think developing a unifying parallel framework which can intelligently dispatch algorithms to multiple back-ends is outside the scope of Boost.Compute (and seemingly much more in the remit of the proposed C++ Parallelism TS). I think OpenCL is more than sufficient for running code efficiently across a variety of parallel hardware including GPUs, multi-core CPUs, FPGAs and various other accelerators, which is why I chose it to build Boost.Compute on top of (along with it being cross-platform and an open standard).
Implementation details:
I have checked the implementation only briefly, mostly when questions arose for a few functions. Overall it looks OK and organized, yet I have encountered some issues.
1) Type safety [major issue - must be fixed before acceptance]: Type safety is not a strength of OpenCL, and this is reflected in parts of the implementation when it fails to add a proper conversion/protection layer. Using compute::reduce it was embarrassingly easy to produce rubbish results through the following code (modifying the provided tutorial code):
// BEGIN
compute::device device = compute::system::default_device();
compute::context context(device);
compute::command_queue queue(context, device);

// generate random data on the host - type is float
std::vector<float> host_vector(10000);
std::generate(host_vector.begin(), host_vector.end(), rand);

// create a vector on the device
compute::vector<float> device_vector(host_vector.size(), context);

// transfer data from the host to the device
compute::copy(host_vector.begin(), host_vector.end(), device_vector.begin(), queue);

double reduction_result = 0.0; // result is of type double
compute::reduce(device_vector.begin(), device_vector.end(), &reduction_result, queue);
std::cout << "result: " << reduction_result << std::endl;
// END
The input data is of type float while the result shall be stored in a double. This fails miserably under the current implementation, because after the reduction has completed the final value stored in device memory gets copied merely byte-wise to the target variable (using a plain type-ignorant clEnqueueReadBuffer), simply reading the 4 bytes of a float into an 8-byte double (4/8 bytes on my machine). I suppose reversing the types (double as input, float as output) will be even more spectacular, because 4 superfluous bytes simply overwrite the stack.
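(For completeness, a defensive workaround under the current implementation is to reduce into the input's value_type first and convert on the host - a minimal sketch:

float device_result = 0.0f; // matches the value_type of device_vector
compute::reduce(device_vector.begin(), device_vector.end(), &device_result, queue);
double reduction_result = device_result; // safe widening conversion on the host

though of course this should be the library's job, not the user's.)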
The same affects a plain compute::copy, for example if above the device_vector is of type double:
// BEGIN
// generate random data on the host - type is float
std::vector<float> host_vector(10000);
std::generate(host_vector.begin(), host_vector.end(), rand);

// create a vector on the device - type is double
compute::vector<double> device_vector(host_vector.size(), context);

// transfer data from the host to the device
compute::copy(host_vector.begin(), host_vector.end(), device_vector.begin(), queue);
// END
it equally goes bang, because the data are just copied byte-wise.
The library must provide strict type safety for all algorithms / data structures, where (in order of the preference that comes to mind) it should:

a) convert properly to the target type if possible (surely applicable above)
b) issue a compile-time error if a conversion is not possible
c) as a last fallback: throw a proper exception at runtime
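As a sketch of option b) - hypothetical code, nothing like this exists in the library today - an algorithm could guard the byte-wise copy-back at compile time:

#include <iterator>
#include <boost/static_assert.hpp>
#include <boost/type_traits/is_same.hpp>
#include <boost/compute.hpp>

// wrapper illustrating the idea for reduce(); checked_reduce is a made-up name
template<class InputIterator, class T>
void checked_reduce(InputIterator first, InputIterator last, T *result,
                    boost::compute::command_queue &queue)
{
    typedef typename std::iterator_traits<InputIterator>::value_type value_type;

    // refuse to compile when the byte-wise copy-back would read/write
    // the wrong number of bytes
    BOOST_STATIC_ASSERT((boost::is_same<value_type, T>::value));

    boost::compute::reduce(first, last, result, queue);
}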
Thanks for providing these test cases. I'll work on improving type-checking and error-reporting. If possible, could you submit these as bug reports to the issue tracker [1]?
2) When inspecting the code flow of the above copy operation, I missed a debug-mode check that the output range of a copy operation can hold that many elements; something like a safe iterator returned by device_vector.begin() -> a good implementation should raise an organized error instead of just overwriting memory.
This is a very good idea. I'll work on adding more assertions to verify these sorts of pre-conditions for the algorithms.
3) For float/double input containers compute::accumulate falls back to a plain serial reduction, performing element-wise additions (which is really slow on a GPU). This is because can_accumulate_with_reduce returns false, as it is only true for integral types. Is there a technical reason why it cannot work for floating types? How many algorithms are affected by a possible fallback to a plain serial execution order?
Because floating-point addition is not associative. Doing this would lead to accumulate() producing different results on the device versus the host. If the user is willing to trade accuracy for performance, they can call the reduce() algorithm directly. However, I don't think Boost.Compute should make this call universally, and thus it is left up to the user to decide. I think this is documented in the accumulate() function; I'll look into making it more clear.
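For illustration, the classic example of why the grouping of additions matters:

float a = 1.0e20f, b = -1.0e20f, c = 1.0f;
float left  = (a + b) + c; // == 1.0f
float right = a + (b + c); // == 0.0f, since c is absorbed by b's magnitude

A parallel reduction regroups the additions into a tree, so its result can legitimately differ from the serial left-to-right accumulation.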
4) For types not supported under both OpenCL and C++ (e.g. long double, bool, half), more specific error messages would be useful.
Noted, I'll work on producing better error messages for unsupported types/operations.
Note: listed above are only issues which I encountered during my few trials; there is no claim whatsoever of complete coverage.
Performance:
I have not really tested performance so I cannot say much about it. At times I spotted what appears to be unnecessary OpenCL runtime overhead (e.g. blocking commands, resetting kernel arguments upon each invocation), but I am not familiar enough with the implementation to judge whether this is really redundant.
If you find any unnecessary blocking/synchronization please report it to the bug tracker [1] and we'll get it fixed.
Invoking the OpenCL compiler always takes considerable time for any OpenCL program. The library compiles kernels on demand when they are first executed; while this is technically reasonable, I am not sure to what extent it is clear to everyone (foremost end-users of programs) that e.g. a simple accumulate of 100 ints may take several seconds to execute in total on first invocation, simply because the kernel compilation takes that much time. I guess this considerable penalty also somewhat discourages using the library to create a number of kernels on the fly.
True, the run-time compilation model provided by OpenCL does have some associated overhead. There are a few techniques in Boost.Compute which help mitigate this. The first is a run-time program cache which keeps built versions of commonly used kernels ready so that the compilation cost is only paid once per application run (see the program_cache class [3]). The other is support for offline caching of program binaries. This is enabled by defining "BOOST_COMPUTE_USE_OFFLINE_CACHE" and causes Boost.Compute to cache binaries for programs so that they are compiled only once, the very first time they are run on the system. I'll work on documenting the compilation-overhead issues as well as the caching functionality better.
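For example, enabling the offline binary cache is a one-line switch (the macro must be defined before including any Boost.Compute header):

// enable disk caching of compiled program binaries
#define BOOST_COMPUTE_USE_OFFLINE_CACHE
#include <boost/compute.hpp>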
I would find it very useful if smart algorithms dispatched to a plain C++ algorithm when it is really predictable that a GPU execution will just waste time (I have elaborated on this above). It's fairly trivial to construct data/algorithm combinations that are better not executed on the GPU; being able to rely on some auto-mechanism would relieve programmers.
I disagree, I think the call on whether to execute the algorithm on the host or on an accelerator (with its associated overhead) should be left up to the user (or potentially a higher-level API built on top of Boost.Compute). While I agree that this would be a useful feature, I just don't think Boost.Compute is the right place for that logic to live.
Answers to reviewer questions:
1. What is your evaluation of the design?
See comments above
2. What is your evaluation of the implementation?
See comments above
3. Documentation:
Overall I find the documentation well written and structured. As minor issues, at times it could be more explicit / elaborate (e.g. for compute::vector a short description is provided, but for several other containers there is none; what are the accepted types for predicates in the algorithms, etc.).
Will do.
The installation page misses that (on Windows) the OpenCL headers must be explicitly included (it leaves the impression that the library would do it on its own).
Hmm, the Boost.Compute headers do "#include <CL/cl.h>" internally.
The performance page should include more details with respect to overhead and specifically "unintuitive" ones such as kernel compilation time. Recommendations can be given when a problem is / is not suitable for GPU execution. It is unclear if the provided measurements refer to float or double; measurements for both should be provided.
Will do.
4. Overall usefulness of the library
I find the portable STL-ish algorithms (+ their supplements) useful. With respect to the core + utility parts, I think unnecessary competition with existing wrappers is introduced.
5. Did you try to use the library?
I used MSVC 12 and had no problems installing it and running a few little programs.
6. How much effort did you put into your evaluation?
I read the documentation, ran tutorial code and created a few example programs on my own (I did not run any of the pre-packaged examples). When questions arose I took close looks at the implementation (thus I focused more on in-depth analyses of selected components instead of testing overall into breadth).
7. Are you knowledgeable about the problem domain?
I consider myself knowledgeable of the problem domain.
8. The core question: Do you think the library should be accepted as a Boost library?
Generally speaking yes, but I recommend a major revision before acceptance is reconsidered. My greatest concern revolves around the overall design aim. I don't like the idea of competing with the Khronos API for general wrapping; I'd prefer a more light-weight library with the STL-ish algorithms (or other things) at the core, adding to what is already out there. The library needs to find its own niche, minimizing overlap and elaborating on its novelty/strengths. The implementation must become more robust; presently it is way too trivial to break things.
Thanks again for the review! There are lots of good ideas raised here. Let me know if you have any other feedback or if I can explain anything further/more clearly.

-kyle

[1] https://github.com/kylelutz/compute/issues
[2] http://kylelutz.github.io/compute/boost/compute/mapped_view.html
[3] http://kylelutz.github.io/compute/boost/compute/program_cache.html
On 20/12/2014 09:32, Kyle Lutz wrote:
Can you give examples of where the OpenCL wrapper types in Boost.Compute are lacking?
I missed any interface to cl::... (Khronos C++ wrapping API)
One particular issue that makes me hesitant is the lack of OpenCL 2.0 support in the "official" C++ bindings. The OpenCL 2.0 specification was released over a year ago (November 2013). The first public OpenCL 2.0 implementation was released by Intel three months ago (September 2014, followed closely by an AMD implementation).
It is correct that Khronos presently only offers a C++ API for OpenCL 1.2, but in practice the far more serious, limiting constraint is that Nvidia only offers a 1.1 _implementation_ anyway. Any portable OpenCL code must restrict itself to 1.1. We could discuss issues of the Khronos C++ bindings and your wrapper to no end, but my main point remains: the Khronos version is official and established; an alternative version intended for wider use must be clearly superior and give people strong reasons to migrate (I am not only referring to programming itself, but also to people writing books on OpenCL, blogs etc.). In my eyes yours (that is, the core + utility parts of your library) surely fails to meet that bar.
Could you let me know what parts of the core wrapper API fail to meet your bar for quality?
Just as one analysis example: for the plain C API, clEnqueueReadBuffer (surely a commonly used API function) has this signature (we use it as reference):

cl_int clEnqueueReadBuffer(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_read, size_t offset, size_t size, void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event);

The Khronos C++ bindings version is:

cl_int cl::CommandQueue::enqueueReadBuffer(const Buffer& buffer, cl_bool blocking_read, ::size_t offset, ::size_t size, const void * ptr, const VECTOR_CLASS<Event> * events = NULL, Event * event = NULL);

And your "main" version is:

void enqueue_read_buffer(const buffer & buffer, size_t offset, size_t size, void * host_ptr, const wait_list & events = wait_list());

The Khronos C++ version fully matches the C version in argument logic and order, and also returns a cl_int as error code - the two correspond to each other, making it easy to use one or the other. Your version a) misses the cl_bool for blocking/non-blocking, b) misses the last event argument, and c) throws (I suppose) instead of returning an error code. Let's go through these:

a) blocking: If I remember my code inspection correctly, your version is always blocking, and a non-blocking version is (I guess?) given by enqueue_read_buffer_async, which has a different signature (return value) besides its different name. So instead of being able to set a single cl_bool and pass it as a standard argument, I need to take care which function (name) to call, what it returns etc. Studying your library docs I find very little information on what makes them different etc.; specifically, nowhere does it say that enqueue_read_buffer _is_ a blocking operation, it only says that it _enqueues_ a read command. Both functions then simply refer to clEnqueueReadBuffer(), which does not help matters at all given the different signature. Now take a look at the Khronos C++ documentation and one encounters a far more detailed description. Its signature is already quite clear, and with the elaborate description it's really clear.

b) Why your version lacks the last event argument entirely, I have no idea. Anything that can be done with it in the C / C++ API (e.g. OpenCL runtime-level profiling; controlling an out-of-order command-queue / another command-queue) seems to be undoable.

c) error handling: I'd much prefer some policy setting which specifies whether an exception is thrown on error (the usual custom in C++) or an error code is returned by the function (the usual OpenCL behaviour).

reviewer note: My original review said that the documentation is well done, but this did not refer to the core + utility parts, of which I had always been critical.

Another huge issue in practice is reliability + stability. Not only does the OpenCL execution model by itself make committing errors relatively easy, but I guess many of us can tell stories of buggy OpenCL implementations having hit us deeply. When something goes wrong, figuring out where/why exactly can be a really slow, frustrating experience. So with respect to this, the bar for any additional layer developed outside Khronos (who have the OpenCL implementers on board!) must also be set very, very high.

Let me reiterate that my main concern is not fixing the example above; my main point is that any alternative version must be rock-solid from the ground up on the one hand, plus offer considerable benefits to warrant migration considerations.
Incidentally, if there is no Khronos OpenCL 2.0 C++ wrapping out there yet, have you ever thought about getting in touch with the Khronos Group to see whether they are interested in working together, merging efforts? Note for clarity: I am personally neither involved with the Khronos Group, nor an OpenCL implementer, nor otherwise involved with distributing OpenCL [no book author etc.]; just a programmer using OpenCL.
On the other hand to those rather new to OpenCL a simplified, less error-prone design would be beneficial ...
I have put much consideration into this and ultimately I don't feel it is right for the data-copying to be made implicit and hidden away from the user. ... In any case, there is a simplified, high-level API available in Boost.Compute which allows kernels to directly operate on host memory. See the mapped_view class [2].
mapped_view uses a direct host-memory mapped buffer scheme, through the CL_MEM_USE_HOST_PTR flag (again I had to go into the implementation code to confirm that initial suspicion - the docs don't say it, and other implementation schemes would also have been possible, e.g. it could have used CL_MEM_ALLOC_HOST_PTR). CL_MEM_USE_HOST_PTR is only very exceptionally useful in practice. However I fully agree with you that users should not be forced to use exclusively the one or the other. My proposal aimed at offering a variety of high-level abstractions from which users can choose:

-) if an input or output to a Boost.Compute algorithm call is a plain host-side object (e.g. a std:: container), automatically copy the data to the OpenCL runtime (if input), run the kernel, and automatically copy the data from the OpenCL runtime to the host (if output).

-) if an input or output to an algorithm call is a plain OpenCL object (e.g. cl_mem or some C++ class wrapping it), the user will need to take care of any copying to host memory.

-) if an input or output to an algorithm call is something which links a plain host object (e.g. a std:: container) to an OpenCL object (e.g. cl_mem) - I have sketched such a class in my original post - ensure that upon any access to either, the data become automatically synchronized.

One can imagine additional / mixed schemes as well, temporarily disabling the default behaviour, etc.
These are definitely ideas I've thought about and these kinds of tools could all be built upon the current API...
...
See my response above, I'm not a huge fan of these automatic data transfer abstractions. However, I could be in favor of offering this functionality in a higher-level API.
My recommendation and reviewer comment is: your library shall offer high(er)-level abstractions of things NOT already offered by Khronos, by building on top of the Khronos APIs (incl. their C++ wrapper). Your STL-ish algorithms and their supporting functionality are one aspect here; the stuff just discussed is another example: those who want low-level control already have it and can use it (through the Khronos APIs), while those who desire an "auto-synchronization" between host and OpenCL runtime presently do NOT have anything. Here the real benefit would come in. Two other examples come to mind (real-world applications from my work):

-) smoothly support utilizing multiple devices (e.g. auto-splitting kernel workload across devices and data synchronization among devices; the details will depend quite a bit on whether the devices do or do not belong to the same context).

-) specifying "operation sequences", like: "copy input data A to OpenCL, copy input data B, run kernels 1 and 2, let an intermediate result decide whether to run kernel 3 or 4 next, and then copy data C to the host". Such pre-specified sequences implicitly also help to improve performance (e.g. only the last command must be non-blocking).
3) for float/double input containers compute::accumulate falls back to a plain serial reduction.
Because floating-point addition is not associative. Doing this would lead to accumulate() producing different results on the device versus the host.
This argument is not clear to me. Floating-point results must, generally speaking, always be expected to deviate slightly between a plain C++ and OpenCL execution (subject to the specific device hardware + driver version + compilation settings used; for pretty much any kernel). I'd say that when someone defers any floating-point calculation to OpenCL then this is implicitly acknowledged, at the benefit of a parallelized execution.
True, the run-time compilation model provided by OpenCL does have some associated overhead. There are a few techniques in Boost.Compute which help mitigate this ... The other is support for offline caching of program binaries. This is enabled by defining "BOOST_COMPUTE_USE_OFFLINE_CACHE" and causes Boost.Compute to cache binaries for programs so that they are compiled only once, the very first time they are run on the system.
Has this been well-tested, including scenarios such as OpenCL driver version changes between runs, and how does it interact with Nvidia's own auto-caching mechanism (which can drive one nuts when it fails to detect changes and happily uses an outdated binary)?
I would find it very useful if smart algorithms dispatch the algorithm to a plain C++ algorithm if it's really predictable that a GPU execution will just waste time.
I disagree, I think the call on whether to execute the algorithm on the host or on an accelerator (with its associated overhead) should be left up to the user (or potentially a higher-level API built on top of Boost.Compute)... While I agree that this would be a useful feature, I just don't think Boost.Compute is the right place for that logic to live.
It is meant as another example of a higher-level abstraction that can be done; in my opinion Boost.Compute would be the perfect place to put it in. There is no need to implement a number of the suggested higher-level abstractions right away (or for some: ever); they shall help find the right niche for your library (now and in the long run). As such, the recommendations / criticisms raised here are by no means intended to be destructive; I just bring in my view of how I see your library best serving the community.

cheers,
Thomas
On Sun, Dec 21, 2014 at 3:44 AM, Thomas M
On 20/12/2014 09:32, Kyle Lutz wrote:
Can you give examples of where the OpenCL wrapper types in Boost.Compute are lacking?
I missed any interface to cl::... (Khronos C++ wrapping API)
True. As of now you can convert between any of the types (by just using the C++ wrapper type's function call operator to access the OpenCL C type and then passing that to the Boost.Compute type's constructor) like so:

cl::CommandQueue q1 = ...;
boost::compute::command_queue q2(q1());

I'm working on adding more direct support for the C++ wrapper types, so in the future you will also be able to do this:

cl::CommandQueue q1 = ...;
boost::compute::command_queue q2 = q1;
One particular issue that makes me hesitant is the lack of OpenCL 2.0 support in the "official" C++ bindings. The OpenCL 2.0 specification was released over a year ago (November 2013). The first public OpenCL 2.0 implementation was released by Intel three months ago (September 2014, followed closely by an AMD implementation).
It is correct that Khronos presently only offers a C++ API for OpenCL 1.2, but in practice the far more serious, limiting constraint is that Nvidia only offers a 1.1 _implementation_ anyway. Any portable OpenCL code must restrict itself to 1.1.
Well portable code need not restrict itself, but merely provide implementations of its functionality which will also work with only OpenCL 1.1 (or even 1.0). For instance, calling some of the higher-level algorithms in Boost.Compute will automatically dispatch based on the version of OpenCL supported by the device and call more efficient APIs if available, otherwise fall back to the older APIs. Boost.Compute tries as much as possible to shield users from these sorts of backwards-compatibility issues.
We could discuss issues of the Khronos C++ bindings and your wrapper to no end, but my main point remains: the Khronos version is official and established; an alternative version intended for wider use must be clearly superior and give people strong reasons to migrate (I am not only referring to programming itself, but also to people writing books on OpenCL, blogs etc.). In my eyes yours (that is, the core + utility parts of your library) surely fails to meet that bar.
Could you let me know what parts of the core wrapper API fail to meet your bar for quality?
Just as one analysis example:
For the plain C API, clEnqueueReadBuffer (surely a commonly used API function) has this signature (we use it as reference):
cl_int clEnqueueReadBuffer(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_read, size_t offset, size_t size, void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event);
The Khronos C++ bindings version is:
cl_int cl::CommandQueue::enqueueReadBuffer(const Buffer& buffer, cl_bool blocking_read, ::size_t offset, ::size_t size, const void * ptr, const VECTOR_CLASS<Event> * events = NULL, Event * event = NULL);
And your "main" version is:
void enqueue_read_buffer(const buffer & buffer, size_t offset, size_t size, void * host_ptr, const wait_list & events = wait_list());
The Khronos C++ version fully matches the C version in argument logic and order, and also returns a cl_int as error code - the two correspond to each other, making it easy to use one or the other. Your version a) misses the cl_bool for blocking/non-blocking, b) misses the last event argument, and c) throws (I suppose) instead of returning an error code. Let's go through these:

a) blocking: If I remember my code inspection correctly, your version is always blocking, and a non-blocking version is (I guess?) given by enqueue_read_buffer_async, which has a different signature (return value) besides its different name. So instead of being able to set a single cl_bool and pass it as a standard argument, I need to take care which function (name) to call, what it returns etc. Studying your library docs I find very little information on what makes them different etc.; specifically, nowhere does it say that enqueue_read_buffer _is_ a blocking operation, it only says that it _enqueues_ a read command. Both functions then simply refer to clEnqueueReadBuffer(), which does not help matters at all given the different signature. Now take a look at the Khronos C++ documentation and one encounters a far more detailed description. Its signature is already quite clear, and with the elaborate description it's really clear.
Yes, I've split the blocking and non-blocking memory copy operations into separate functions. Personally, I've never been fond of APIs which drastically change behavior based on a single boolean flag. Also, this is more in line with the API provided by other libraries like Boost.ASIO (e.g. boost::asio::read() vs. boost::asio::async_read()). I'll also definitely work on improving the documentation for these functions.
b) Why your version lacks the last event argument entirely, I have no idea. Anything that can be done with it in the C / C++ API (e.g. OpenCL runtime-level profiling; controlling an out-of-order command-queue / another command-queue) seems to be undoable.
My main motivation for returning "void" was to prevent users from attempting to build asynchronous pipelines but accidentally using an event object returned from a synchronous API function causing the whole thing to become synchronous. But I see your point that there may be legitimate uses for event objects associated with synchronous operations. It should be relatively easy to update these APIs (return void -> return event). I'll look into this.
c) error handling: I'd much prefer some policy setting which specifies whether an exception is thrown on error (the usual custom in C++) or an error code is returned by the function (the usual OpenCL behaviour).
This policy setting is implemented using the Boost Exception library [1]. Users may define the BOOST_NO_EXCEPTIONS configuration macro which will keep the library from throwing exceptions and instead call a user-defined error handling function. This technique is used nearly universally in Boost and I think is superior to approaches which would change the function signature based on a policy configuration.
reviewer note: My original review said that the documentation is well done, but this did not refer to the core + utility parts, of which I had always been critical.
I'll definitely continue to work on improving the documentation, especially to address the points you've made above.
Another huge issue in practice is reliability + stability. Not only does the OpenCL execution model by itself make committing errors relatively easy, but I guess many of us can tell stories of buggy OpenCL implementations having hit us deeply. When something goes wrong, figuring out where/why exactly can be a really slow, frustrating experience. So with respect to this, the bar for any additional layer developed outside Khronos (who have the OpenCL implementers on board!) must also be set very, very high.
I've worked hard to ensure that all error codes from OpenCL functions are checked and any errors are properly propagated back to the user. If you find any place where this error handling is missing, please let me know.
Let me reiterate that my main concern is not fixing the example above; my main point is that any alternative version must be rock-solid from the ground up on the one hand, plus offer considerable benefits to warrant migration considerations. Incidentally, if there is no Khronos OpenCL 2.0 C++ wrapping out there yet, have you ever thought about getting in touch with the Khronos Group to see whether they are interested in working together, merging efforts?
Note for clarity: I am personally neither involved with the Khronos Group, nor an OpenCL implementer, nor otherwise involved with distributing OpenCL [no book author etc.]; just a programmer using OpenCL.
On the other hand to those rather new to OpenCL a simplified, less error-prone design would be beneficial ...
I have put much consideration into this and ultimately I don't feel it is right for the data-copying to be made implicit and hidden away from the user. ... In any case, there is a simplified, high-level API available in Boost.Compute which allows kernels to directly operate on host memory. See the mapped_view class [2].
mapped_view uses a direct host-memory mapped buffer scheme, through the CL_MEM_USE_HOST_PTR flag (again I had to go into the implementation code to confirm that initial suspicion - the docs don't say it, and other implementation schemes would also have been possible, e.g. it could have used CL_MEM_ALLOC_HOST_PTR). CL_MEM_USE_HOST_PTR is only very exceptionally useful in practice.
Umm, CL_MEM_USE_HOST_PTR and CL_MEM_ALLOC_HOST_PTR do very different things. In Boost.Compute, the mapped_view class provides an abstraction around the former, while the pinned_allocator class provides an abstraction around the latter.
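In plain OpenCL C terms, the difference between the two flags looks like this (illustrative snippet; ctx, size and host_ptr are assumed to be set up elsewhere):

cl_int err;

// CL_MEM_USE_HOST_PTR: wrap existing, caller-owned host memory
// (the scheme mapped_view abstracts)
cl_mem m1 = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                           size, host_ptr, &err);

// CL_MEM_ALLOC_HOST_PTR: let the driver allocate (typically pinned)
// host-accessible memory (the scheme pinned_allocator abstracts)
cl_mem m2 = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                           size, NULL, &err);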
However I fully agree with you that users should not be forced to use exclusively the one or the other. My proposal aimed at offering a variety of high-level abstractions from which users can choose:
-) if an input or output to a Boost.compute algorithm call is a plain host side object (e.g. a std:: container), automatically copy the data to the OpenCL runtime (if input), run the kernel, and automatically copy the data from the OpenCL runtime to the host (if output).
This is actually already implemented for the copy() and sort() algorithms. For example, this code will automatically copy the data to the device, sort it, and copy it back to the host:

std::vector<int> vec = ...;
boost::compute::sort(vec.begin(), vec.end());

I plan on implementing this support for host iterators more widely in the API, just haven't had the time.
-) if an input or output to an algorithm call is a plain OpenCL object (e.g. cl_mem or some C++ class wrapping it) the user will need to take care of any copying to host memory.
I'm not sure I understand this completely (maybe you could provide some example code?). Currently, like in the STL, the algorithms in Boost.Compute accept iterators rather than containers/memory objects. However, it's easy to wrap an arbitrary OpenCL buffer with an iterator by using the buffer_iterator class [2]:

// opencl memory buffer
cl_mem mem = ...;

// fill mem with zeros
boost::compute::fill_n(make_buffer_iterator<int>(mem, 0),
                       memory_size / sizeof(int), 0, queue);
-) if an input or output to an algorithm call is something which links a plain host object (e.g. std:: container) to an OpenCL object (e.g. cl_mem) - I have sketched such a class in my original post - ensure that upon any access to either the data become automatically synchronized.
One can imagine additional / mixed schemes as well, temporarily disable default behaviour etc.
This synchronized container concept is an interesting idea. If you have the time/desire to draw up a working proof-of-concept I'd be very interested in getting it integrated into Boost.Compute.
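As a very rough starting point, such a proof-of-concept might look like this (hypothetical synced_vector class with dirty-flag tracking, sketched against the existing Boost.Compute API; illustrative only):

#include <vector>
#include <boost/compute.hpp>

namespace compute = boost::compute;

template<class T>
class synced_vector
{
public:
    synced_vector(size_t n, const compute::context &context,
                  compute::command_queue &queue)
        : host_(n), device_(n, context), queue_(queue),
          host_dirty_(false), device_dirty_(false)
    {
    }

    // host access: pull pending device-side changes first
    std::vector<T> &host()
    {
        if (device_dirty_) {
            compute::copy(device_.begin(), device_.end(),
                          host_.begin(), queue_);
            device_dirty_ = false;
        }
        host_dirty_ = true; // the caller may now write
        return host_;
    }

    // device access: push pending host-side changes first
    compute::vector<T> &device()
    {
        if (host_dirty_) {
            compute::copy(host_.begin(), host_.end(),
                          device_.begin(), queue_);
            host_dirty_ = false;
        }
        device_dirty_ = true;
        return device_;
    }

private:
    std::vector<T> host_;
    compute::vector<T> device_;
    compute::command_queue &queue_;
    bool host_dirty_, device_dirty_;
};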
These are definitely ideas I've thought about and these kinds of tools could all be built upon the current API...
...
See my response above, I'm not a huge fan of these automatic data transfer abstractions. However, I could be in favor of offering this functionality in a higher-level API.
My recommendation and reviewer comment is: your library shall offer high(er)-level abstractions of things NOT already offered by Khronos, by building on top of the Khronos APIs (incl. their C++ wrapper). Your STL-ish algorithms and their supporting functionality are one aspect here; the stuff just discussed is another example: those who want low-level control already have it and can use it (through the Khronos APIs), while those who desire an "auto-synchronization" between host and OpenCL runtime presently do NOT have anything. Here the real benefit would come in.
2 other examples that come to mind (real-world applications from my work):
-) smoothly support utilizing multiple devices (e.g. auto-splitting kernel workload across devices and data synchronization among devices; the details will depend quite a bit on whether the devices do or do not belong to the same context).
This is something I have had on the road-map for a long while. I agree that building the infrastructure for these device-distributed algorithms would be very useful.
-) specifying "operations-sequences", like: "Copy input data A to OpenCL, copy input data B, run kernel 1, 2, an intermediate result decides if next run kernel 3 or 4, and then copy data C to host". Such pre-specified sequences implicitly also help to improve performance (e.g. only the last command must be non-blocking).
This is also something I have played around with. Basically I'd like to have an API which allows users to define "pipelines" or "task-graphs" which hook up several different kernels/algorithms/memory-copies and produce an efficient set of operations to stream data through and extract the results. Any ideas you have on a potential API you'd like to see for this would be great. There is potentially some prior art in the C++ pipelines proposal [3] which may be interesting.
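For reference, here is roughly what such a sequence looks like when hand-rolled against the current API (a sketch; it relies on copy_async for the non-blocking host-to-device copy and on the in-order command queue for sequencing):

#include <vector>
#include <boost/compute.hpp>

namespace compute = boost::compute;

void sqrt_pipeline(std::vector<float> &data,
                   const compute::context &context,
                   compute::command_queue &queue)
{
    compute::vector<float> buf(data.size(), context);

    // non-blocking host -> device copy; the in-order queue sequences
    // the following commands behind it
    compute::copy_async(data.begin(), data.end(), buf.begin(), queue);

    compute::transform(buf.begin(), buf.end(), buf.begin(),
                       compute::sqrt<float>(), queue);

    // only this final device -> host copy blocks
    compute::copy(buf.begin(), buf.end(), data.begin(), queue);
}

A pipeline/task-graph API would essentially generate this enqueue pattern from a declarative description.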
3) for float/double input containers compute::accumulate falls back to a plain serial reduction.
Because floating-point addition is not associative. Doing this would lead to accumulate() producing different results on the device versus the host.
This argument is not clear to me. Floating-point results must, generally speaking, always be expected to deviate slightly between a plain C++ and OpenCL execution (subject to the specific device hardware + driver version + compilation settings used; for pretty much any kernel). I'd say that when someone defers any floating-point calculation to OpenCL then this is implicitly acknowledged, at the benefit of a parallelized execution.
Strongly disagree, the floating-point operations on the device are well defined and their output should be identical to the host results (barring optimizations like "-cl-fast-relaxed-math"). More concretely, this test should always pass for any set of input data:

float data[] = { 1.1f, 2.3f, 3.4f, 4.5f };
std::vector<float> host_vec(data, data + 4);
boost::compute::vector<float> device_vec(data, data + 4, queue);

BOOST_CHECK_EQUAL(
    std::accumulate(host_vec.begin(), host_vec.end(), 0.0f),
    boost::compute::accumulate(device_vec.begin(), device_vec.end(), 0.0f, queue)
);

I don't feel that sacrificing precision is always implicitly acknowledged and prefer to leave the choice of precision vs. performance up to the user.
True, the run-time compilation model provided by OpenCL does have some associated overhead. There are a few techniques in Boost.Compute which help mitigate this ... The other is support for offline caching of program binaries. This is enabled by defining "BOOST_COMPUTE_USE_OFFLINE_CACHE" and causes Boost.Compute to cache binaries for programs so that they are compiled only once, the very first time they are run on the system.
Has this been well-tested, including scenarios such as OpenCL driver version changes between runs and how it interacts with Nvidia's own auto-caching mechanism (which can drive one nuts when it fails to detect changes and happily uses an outdated binary).
I am very confident that the caching system works and it will properly deal with changing driver versions. It was added nearly a year ago and I haven't encountered any problems with it or received any bug reports. And it essentially duplicates the functionality of NVIDIA's offline caching functionality but makes it available to all OpenCL platforms. I haven't had any ill effects using both together on the same system.
I would find it very useful if smart algorithms dispatch the algorithm to a plain C++ algorithm if it's really predictable that a GPU execution will just waste time.
I disagree, I think the call on whether to execute the algorithm on the host or on an accelerator (with its associated overhead) should be left up to the user (or potentially a higher-level API built on top of Boost.Compute)... While I agree that this would be a useful feature, I just don't think Boost.Compute is the right place for that logic to live.
It is meant as another example of a higher-level abstraction that can be done; in my opinion Boost.Compute would be the perfect place to put it in.
There is no need to implement a number of the suggested higher-level abstractions right away (or for some: ever); they shall help find the right niche for your library (now and in the long run). As such, the recommendations / criticisms raised here are by no means intended to be destructive; I just bring in my view of how I see your library best serving the community.
These are definitely some good ideas. Thanks for all the feedback!

-kyle

[1] http://www.boost.org/doc/libs/1_57_0/libs/exception/doc/boost-exception.html
[2] http://kylelutz.github.io/compute/boost/compute/buffer_iterator.html
[3] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3534.html
On 21.12.2014 20:39, Kyle Lutz wrote:
On Sun, Dec 21, 2014 at 3:44 AM, Thomas M
wrote: Studying your library docs I find very little information on what makes them different etc.; specifically, nowhere does it say that enqueue_read_buffer _is_ a blocking operation, it only says that it _enqueues_ a read command. Both functions then simply refer to clEnqueueReadBuffer(), which does not help matters at all given the different signature.
Yes, I've split the blocking and non-blocking memory copy operations into separate functions. Personally, I've never been fond of APIs which drastically change behavior based on a single boolean flag. Also, this is more in line with the API provided by other libraries like Boost.ASIO (e.g. boost::asio::read() vs. boost::asio::async_read()).
As a library user I agree with the more explicit split between sync and async routines, a la ASIO, and I think Boost.Compute should follow this convention. However, I think in this case you should deviate from the OpenCL API names to make it more clear that things are different. Specifically, drop the "enqueue" word. Simply have "read_buffer" and "read_buffer_async". For me the "enqueue" word just makes things more confusing.
c) error handling: I'd much prefer some policy setting which specifies whether an exception is thrown on error (the usual custom in C++) or an error code is returned by the function (the usual OpenCL behaviour).
FWIW, again as a library user, I quite like ASIO's approach where each operation is overloaded to either fill an error_code or throw.
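A sketch of what that dual interface could look like for Boost.Compute (a hypothetical helper built on the existing throwing call; this read_buffer overload and its error-code parameter are made up for illustration):

#include <boost/compute.hpp>

namespace compute = boost::compute;

// error-code variant: reports failure through ec instead of throwing
inline void read_buffer(compute::command_queue &queue,
                        const compute::buffer &buf,
                        size_t offset, size_t size, void *host_ptr,
                        cl_int &ec)
{
    try {
        queue.enqueue_read_buffer(buf, offset, size, host_ptr);
        ec = CL_SUCCESS;
    }
    catch (compute::opencl_error &e) {
        ec = e.error_code();
    }
}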
This is also something I have played around with. Basically I'd like to have an API which allows users to define "pipelines" or "task-graphs" which hook up several different kernels/algorithms/memory-copies and produce an efficient set of operations to stream data through and extract the results.
Any ideas you have on a potential API you'd like to see for this would be great. There is potentially some prior art in the C++ pipelines proposal [3] which may be interesting.
In my "just for fun" Delphi.Compute library (written in Delphi, inspired by Boost.Compute) I made Copy() and Transform() return futures of the output buffers, as well as accept futures as parameters. Note, Delphi doesn't have iterators like C++ so my routines operate directly on buffers. So when Transform() say got a Future<Buffer> instead of a Buffer for a parameter, it would add the future's associated event to the wait list passed to clEnqueueNDRangeKernel (technically the Buffer type has an implicit conversion operator to an "immediate" Future<Buffer>). This made it pretty seamless to queue up everything and then just wait for the final read (the default "copy device buffer to host array and return it" call is blocking). The code looks sequential but would only block on that last read. I'm sure there are better ways, just thought I'd share.
Strongly disagree, the floating-point operations on the device are well defined and their output should be identical to the host results (barring optimizations like "-cl-fast-relaxed-math").
While I agree, I've found Intel's OpenCL CPU device to return results which make me think it uses some relaxed math regardless. With NVIDIA and AMD I can get (essentially) the same results as reference CPU calculations, but with Intel I sometimes get quite large discrepancies. Of course, it's possible I'm just doing it wrong...

Cheers - Asbjørn
In my "just for fun" Delphi.Compute library (written in Delphi, inspired by Boost.Compute) I made Copy() and Transform() return futures of the output buffers, as well as accept futures as parameters. Note, Delphi doesn't have iterators like C++ so my routines operate directly on buffers.
So when Transform() say got a Future<Buffer> instead of a Buffer for a parameter, it would add the future's associated event to the wait list passed to clEnqueueNDRangeKernel (technically the Buffer type has an implicit conversion operator to an "immediate" Future<Buffer>).

This made it pretty seamless to queue up everything and then just wait for the final read (the default "copy device buffer to host array and return it" call is blocking). The code looks sequential but would only block on that last read. I'm sure there are better ways, just thought I'd share.
This idea is pretty nifty and I've been pondering this exact way of implementing asynchrony in my library as well. Another method that I also like would be to just let the buffer store the futures. This is how Joel's/Numscale's NT2 works [0]; he gave an awesome talk about it at this year's Meeting C++ [1]. But this might just be a higher-level interface more applicable to expression trees as opposed to STL-like functions. A third way would be to return a future<void> from all meta-functions and allow meta-functions to take a future<void> - this would directly map to the stream/command_queue, but the interface is maybe not that meaningful any more. Does your implementation implicitly synchronize two command_queues if you issue a binary function with two different futures?

Sebastian

[0] https://github.com/MetaScale/nt2
[1] http://www.slideshare.net/joelfalcou/automatic-taskbased-code-generation-for...
On Sun, Dec 21, 2014 at 3:01 PM, Sebastian Schaetz
Asbjørn writes:

In my "just for fun" Delphi.Compute library (written in Delphi, inspired by Boost.Compute) I made Copy() and Transform() return futures of the output buffers, as well as accept futures as parameters. Note, Delphi doesn't have iterators like C++ so my routines operate directly on buffers.
So when Transform() say got a Future<Buffer> instead of a Buffer for a parameter, it would add the future's associated event to the wait list passed to clEnqueueNDRangeKernel (technically the Buffer type has an implicit conversion operator to an "immediate" Future<Buffer>).
This made it pretty seamless to queue up everything and then just wait for the final read (the default "copy device buffer to host array and return it" call is blocking). The code looks sequential but would only block on that last read.
This idea is pretty nifty and I've been pondering this exact way of implementing asynchrony as well in my library.
Another method that I also like would be to just let the buffer store the futures. This is how Joel's/NumScale's NT2 works [0]; he gave an awesome talk about it at this year's Meeting C++ [1]. But this might just be a higher-level interface, more applicable to expression trees as opposed to STL-like functions.
A third way would be to return a future<void> from all meta-functions and allow meta-functions to take a future<void> - this would directly map to the stream/command_queue but the interface is maybe not that meaningful any more.
Yeah, these are all interesting ideas. While I'm a bit hesitant to bake this behavior into the low-level command_queue interface, I'd be more than happy to support this type of API at a higher level.
I've also been meaning to look at Aura again more closely. Last time was many months ago and it looks like you've done quite a lot of work since then.
And if either of you are interested in trying to implement any of these ideas in Boost.Compute I'd be very happy to work together. There is already a "
Kyle Lutz writes:
Yeah, these are all interesting ideas. While I'm a bit hesitant to bake this behavior into the low-level command_queue interface, I'd be more than happy to support this type of API at a higher level.
I have the exact same problem, I'm hesitant to include these thoughts into Aura because I haven't found the best way to do it. I guess to find the correct API, a few variants have to be implemented and we have to see what makes most sense.
I've also been meaning to look at Aura again more closely. Last time was many months ago and it looks like you've done quite a lot of work since then.
You are of course free to look but I have to disappoint you: these thoughts are up until now just thoughts. I'll let you know once I have something.
On 22/12/2014 16:44, Sebastian Schaetz wrote:
Kyle Lutz writes:
Yeah, these are all interesting ideas. While I'm a bit hesitant to bake this behavior into the low-level command_queue interface, I'd be more than happy to support this type of API at a higher level.
I have the exact same problem, I'm hesitant to include these thoughts into Aura because I haven't found the best way to do it. I guess to find the correct API, a few variants have to be implemented and we have to see what makes most sense.
I've also been meaning to look at Aura again more closely. Last time was many months ago and it looks like you've done quite a lot of work since then.
You are of course free to look but I have to disappoint you: these thoughts are up until now just thoughts. I'll let you know once I have something.
I think Aura and Boost.Compute have merits as low-level, properly C++-styled wrappers on top of accelerator technology. It's important for them to be flexible and not too fancy, so that multiple fancier tools can reuse them.
On Sun, Dec 21, 2014 at 2:24 PM, Asbjørn wrote:
On 21.12.2014 20:39, Kyle Lutz wrote:
On Sun, Dec 21, 2014 at 3:44 AM, Thomas M
wrote: Studying your library docs I find very little information on what makes them different, etc.; specifically nowhere does it say that enqueue_read_buffer _is_ a blocking operation; it only says that it _enqueues_ a read command. Both functions then simply refer to clEnqueueReadBuffer(), which does not help matters at all given the different signatures.
Yes, I've split the blocking and non-blocking memory copy operations into separate functions. Personally, I've never been fond of APIs which drastically change behavior based on a single boolean flag. Also, this is more in line with the API provided by other libraries like Boost.ASIO (e.g. boost::asio::read() vs. boost::asio::async_read()).
As a library user I agree with the more explicit split between sync and async routines, ala ASIO, and I think Boost.Compute should follow this convention. However, I think in this case you should deviate from the OpenCL API names to make it clearer that things are different. Specifically, drop the "enqueue" word. Simply have "read_buffer" and "read_buffer_async". For me the "enqueue" word just makes things more confusing.
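Concretely, something like this (signatures only; the types and names here are illustrative, not a proposed final API):

    #include <cstddef>
    class buffer;
    class event;

    // Enqueues the read, then blocks until it has completed:
    void read_buffer(const buffer& b, std::size_t offset,
                     std::size_t size, void* host_ptr);

    // Enqueues the read and returns immediately with an event:
    event read_buffer_async(const buffer& b, std::size_t offset,
                            std::size_t size, void* host_ptr);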
Well one concern with removing the word "enqueue" from the name is that the function actually *does* enqueue a command in the queue. For instance, if you queue up a couple kernel launches (which are asynchronous) followed by a synchronous read (with enqueue_read_buffer()), the asynchronous operations in the queue will still be executed before the read operation (i.e. the normal FIFO queue behavior).
c) error handling: I'd much prefer some policy setting which specifies if an exception is thrown on error (the usual custom in C++) or an error code is returned by the function (the usual OpenCL behaviour).
FWIW, again as a library user, I quite like ASIO's approach where each operation is overloaded to either fill an error_code or throw.
Having two separate APIs with different error handling semantics is definitely possible. While I'm fairly happy with the current exception-based error handling, implementing an approach like ASIO's wouldn't be that difficult.
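Roughly, it would follow the usual pattern of implementing the throwing overload on top of the error-code one (a sketch with placeholder types; ASIO uses boost::system::error_code, I use std::error_code here only to keep the sketch self-contained):

    #include <system_error>
    class buffer;

    // Non-throwing flavour: reports failures through ec.
    void copy(const buffer& src, buffer& dst, std::error_code& ec);

    // Throwing flavour, layered on top of it.
    inline void copy(const buffer& src, buffer& dst)
    {
        std::error_code ec;
        copy(src, dst, ec);
        if (ec) throw std::system_error(ec);
    }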
This is also something I have played around with. Basically I'd like to have an API which allows users to define "pipelines" or "task-graphs" which hook up several different kernels/algorithms/memory-copies and produce an efficient set of operations to stream data through and extract the results.
Any ideas you have on a potential API you'd like to see for this would be great. There is potentially some prior art in the C++ pipelines proposal [3] which may be interesting.
In my "just for fun" Delphi.Compute library (written in Delphi, inspired by Boost.Compute) I made Copy() and Transform() return futures of the output buffers, as well as accept futures as parameters. Note, Delphi doesn't have iterators like C++ so my routines operate directly on buffers.
So when Transform() say got a Future<Buffer> instead of a Buffer for a parameter, it would add the future's associated event to the wait list passed to clEnqueueNDRangeKernel (technically the Buffer type has an implicit conversion operator to an "immediate" Future<Buffer>).
This made it pretty seamless to queue up everything and then just wait for the final read (the default "copy device buffer to host array and return it" call is blocking). The code looks sequential but would only block on that last read.
I'm sure there are better ways, just thought I'd share.
That sounds very cool. I'd definitely be interested in exploring an API like that. I've also been keeping my eye on the papers from the C++ concurrency working group (for instance, N3857). It would be good to align work in this direction for Boost.Compute with the proposed standards (where feasible).
Strongly disagree, the floating-point operations on the device are well defined and their output should be identical to the host results (barring optimizations like "-cl-fast-relaxed-math").
While I agree, I've found Intel's OpenCL CPU device to return results which make me think it uses some relaxed math regardless. With NVIDIA and AMD I can get (essentially) the same results as reference CPU calculations, but with Intel I sometimes get quite large discrepancies. Of course, it's possible I'm just doing it wrong...
Interesting, could you provide a test-case to reproduce this? In my testing (on Intel and others) I haven't found any problems with the code I supplied (though I've heard there may be issues with the precision of complex operations involving transcendental functions/sqrt()/etc. on some implementations). Thanks for your feedback! -kyle
On 22.12.2014 05:32, Kyle Lutz wrote:
Well one concern with removing the word "enqueue" from the name is that the function actually *does* enqueue a command in the queue. For instance, if you queue up a couple kernel launches (which are asynchronous) followed by a synchronous read (with enqueue_read_buffer()), the asynchronous operations in the queue will still be executed before the read operation (i.e. the normal FIFO queue behavior).
Fair enough. For me this is implied since you're operating on a command_queue, not say command_graph.
That sounds very cool. I'd definitely be interested in exploring an API like that. I've also been keeping my eye on the papers from the C++ concurrency working group (for instance, N3857). It would be good to align work in this direction for Boost.Compute with the proposed standards (where feasible).
I agree w.r.t. standards. Not a lot like that in the Delphi world so I tend to explore on my own :)
Interesting, could you provide a test-case to reproduce this? In my testing (on Intel and others) I haven't found any problems with the code I supplied (though I've heard there may be issues with the precision of complex operations involving transcendental functions/sqrt()/etc. on some implementations).
I'll have to verify, but I'm pretty sure it was my simple Legendre polynomial evaluation test. It simply evaluates P_10(x) for several million points x \in [-1, 1]. So just multiplications and additions/subtractions. I'll rerun the tests to make sure I'm not just imagining things :)
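For reference, the evaluation is just Bonnet's recurrence; a rough C++ reconstruction of the shape of the computation (not my actual Delphi source) is:

    // Bonnet's recurrence: (n+1) P_{n+1}(x) = (2n+1) x P_n(x) - n P_{n-1}(x).
    // The same body works essentially verbatim as an OpenCL C device function.
    double legendre10(double x)
    {
        double p0 = 1.0, p1 = x;   // P_0 and P_1
        for (int n = 1; n < 10; ++n) {
            double p2 = ((2 * n + 1) * x * p1 - n * p0) / (n + 1);
            p0 = p1;
            p1 = p2;
        }
        return p1;                 // P_10(x)
    }

Cheers - Asbjørn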
On 21/12/2014 23:24, Asbjørn wrote:
On 21.12.2014 20:39, Kyle Lutz wrote:
Strongly disagree, the floating-point operations on the device are well defined and their output should be identical to the host results (barring optimizations like "-cl-fast-relaxed-math").
While I agree, I've found Intel's OpenCL CPU device to return results which make me think it uses some relaxed math regardless. With NVIDIA and AMD I can get (essentially) the same results as reference CPU calculations, but with Intel I sometimes get quite large discrepancies. Of course, it's possible I'm just doing it wrong...
Intel's OpenCL CPU implementation (it was SDK 2013) is exactly the one where I am used to seeing results deviate, quite considerably indeed. I don't have it installed on this machine, but Kyle, could you run some kernel code on doubles yourself? My code used basic arithmetic operands, summed values up (up to tens of thousands), and there were several exp / log / sqrt along the way. I hadn't set any special compiler flag, certainly not -cl-fast-relaxed-math. My suspicion was that they use their own math library which provides highly optimized calculation variants. FWIW I am not even sure that a plain C++ program compiled with the Intel C++ compiler and linking in their math library will produce the same results as e.g. MSVC.
On 21 Dec 2014 at 11:39, Kyle Lutz wrote:
Yes, I've split the blocking and non-blocking memory copy operations into separate functions. Personally, I've never been fond of APIs which drastically change behavior based on a single boolean flag. Also, this is more in line with the API provided by other libraries like Boost.ASIO (e.g. boost::asio::read() vs. boost::asio::async_read()).
I'll also definitely work on improving the documentation for these functions.
b) why your version lacks the last event argument entirely, I have no idea. Anything that can be done with it in the C / C++ API (e.g. OpenCL runtime-level profiling; controlling an out-of-order command-queue / another command-queue) seems to be undoable.
My main motivation for returning "void" was to prevent users from attempting to build asynchronous pipelines but accidentally using an event object returned from a synchronous API function causing the whole thing to become synchronous. But I see your point that there may be legitimate uses for event objects associated with synchronous operations. It should be relatively easy to update these APIs (return void -> return event). I'll look into this.
c) error handling: I'd much prefer some policy setting which specifies if an exception is thrown on error (the usual custom in C++) or an error code is returned by the function (the usual OpenCL behaviour).
I won't have the time to contribute a review as I am in the rural United States until the 29th. But I will say this: if your library is sufficiently header implemented, I'd look into the async_result infrastructure used by ASIO.
Using async_result lets the caller of your API specify (a) whether that API should be async or sync, (b) how to return errors (exception, error code, or anything else), and (c) if async, what kind of operation token object (e.g. a std::future, a boost::future, a compute::future) to return. You'd probably still provide sync versions of APIs for convenience, but these would simply thunk into the async API with the appropriate async_result inputs.
If your library isn't sufficiently header implemented (like AFIO, which maintains a stable ABI in addition to API, and therefore async_result cannot pass through that ABI), then if you can wait until summer 2015 I should have my non-allocating basic_future infrastructure working for AFIO. This more generic future mash-up toolkit lets you roll futures which return error codes rather than exceptions (thanks to expected<T, E>, you simply supply error_code as type E). They should perfectly close the gap between the flexibility of ASIO's async_result and futures and the functional programming idioms of Hana and Expected in an ABI stable solution.
On Sun, Dec 21, 2014 at 5:33 PM, Niall Douglas wrote:
On 21 Dec 2014 at 11:39, Kyle Lutz wrote:
Yes, I've split the blocking and non-blocking memory copy operations into separate functions. Personally, I've never been fond of APIs which drastically change behavior based on a single boolean flag. Also, this is more in line with the API provided by other libraries like Boost.ASIO (e.g. boost::asio::read() vs. boost::asio::async_read()).
I'll also definitely work on improving the documentation for these functions.
b) why your version lacks the last event argument entirely, I have no idea. Anything that can be done with it in the C / C++ API (e.g. OpenCL runtime-level profiling; controlling an out-of-order command-queue / another command-queue) seems to be undoable.
My main motivation for returning "void" was to prevent users from attempting to build asynchronous pipelines but accidentally using an event object returned from a synchronous API function causing the whole thing to become synchronous. But I see your point that there may be legitimate uses for event objects associated with synchronous operations. It should be relatively easy to update these APIs (return void -> return event). I'll look into this.
c) error handling: I'd much prefer some policy setting which specifies if an exception is thrown on error (the usual custom in C++) or an error code is returned by the function (the usual OpenCL behaviour).
I won't have the time to contribute a review as I am in the rural United States until the 29th. But I will say this: if your library is sufficiently header implemented, I'd look into the async_result infrastructure used by ASIO.
I'd be very interested in hearing your feedback, even if it is after the formal review period. I also looked into trying to integrate Boost.Compute with ASIO a while back but didn't make much progress. From what I could tell, it seemed like asynchronous operations are handled via file descriptors (which I guess is required to pass them to select()/epoll()), which doesn't match up with what the OpenCL API makes available (essentially a blocking wait() API or a callback-based API). However, I'm not nearly as knowledgeable about ASIO internals as I'd like to be; any pointers/examples/guidance in this area would be very helpful (I'll definitely look into the async_result infrastructure that you mentioned).
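For example, the most direct bridge I can see from OpenCL's callback API to a host-side future would be something along these lines (an untested sketch; clSetEventCallback() requires OpenCL 1.1):

    #include <CL/cl.h>
    #include <future>

    // Adapt an OpenCL event to a std::future; the promise is
    // allocated here and deleted inside the callback.
    inline std::future<cl_int> when_complete(cl_event e)
    {
        auto* p = new std::promise<cl_int>;
        std::future<cl_int> f = p->get_future();
        clSetEventCallback(e, CL_COMPLETE,
            [](cl_event, cl_int status, void* user) {
                auto* pr = static_cast<std::promise<cl_int>*>(user);
                pr->set_value(status);   // CL_COMPLETE, or negative on error
                delete pr;
            },
            p);
        return f;
    }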
Using async_result lets the caller of your API specify (a) whether that API should be async or sync, (b) how to return errors (exception, error code, or anything else), and (c) if async, what kind of operation token object (e.g. a std::future, a boost::future, a compute::future) to return. You'd probably still provide sync versions of APIs for convenience, but these would simply thunk into the async API with the appropriate async_result inputs.
If your library isn't sufficiently header implemented (like AFIO, which maintains a stable ABI in addition to API, and therefore async_result cannot pass through that ABI), then if you can wait until summer 2015 I should have my non-allocating basic_future infrastructure working for AFIO. This more generic future mash-up toolkit lets you roll futures which return error codes rather than exceptions (thanks to expected<T, E>, you simply supply error_code as type E). They should perfectly close the gap between the flexibility of ASIO's async_result and futures and the functional programming idioms of Hana and Expected in an ABI stable solution.
Very interesting. I'll keep my eye on this. Thanks! -kyle
Thomas M writes:
We could discuss issues of the Khronos C++ bindings and your wrapper to no end, but my main point remains: The Khronos version is official and established; an alternative version intended for wider use must be clearly superior and give people strong reasons to migrate (I am not only referring to programming itself, but also people writing books on OpenCL, blogs etc.).
[snipping exemplary comparison of OpenCL C, OpenCL C++ and Boost.Compute]
I'd like to offer some more general thoughts on this: isn't there a precedent here? If we look at MPI and Boost.MPI we have something similar. There is a standard C implementation and a crude (but official and established) C++ wrapper around that (which was removed in MPI-3, by the way). And as far as I know, Boost.MPI builds on top of the C implementation of MPI and just does a much better job than the official C++ wrapper.
I think Boost.Compute is the better wrapper for OpenCL C. The weight of "official" and "established" should not play such a significant role if we're trying to think ahead.
Sebastian
On Monday, December 22, 2014 11:16, Sebastian Schaetz wrote:
Thomas M writes:
Isn't there a precedent here? If we look at MPI and Boost.MPI we have something similar. There is a standard C implementation and a crude (but official and established) C++ wrapper around that (which was removed in MPI-3, by the way). And as far as I know, Boost.MPI builds on top of the C implementation of MPI and just does a much better job than the official C++ wrapper.
To me, it seems more useful to focus on suitability of the solution (i.e. the proposed library) to a problem domain, rather than making legalistic arguments based on precedent.
For instance, you could consider Boost.Thread and Boost.ASIO. They would've been useful as simple pthreads and sockets API wrappers, respectively. But they went further. Why? I'd like to think it's because this would've left out too many platforms with similar APIs, or with native APIs that offered better performance than going through a pthreads or sockets emulation layer.
To this end, it seems to me an OpenCL-specific library for parallel computation leaves too many platforms out in the cold. While we could speculate about HSA and future acceleration APIs that would be omitted, there already exist multitudes of multicore and multiprocessor systems which don't support OpenCL but do support threading. In my opinion, this is no small point.
Matt
Gruenke,Matt writes:
To me, it seems more useful to focus on suitability of the solution (i.e. the proposed library) to a problem domain, rather than making legalistic arguments based on precedent.
For instance, you could consider Boost.Thread and Boost.ASIO. They would've been useful as simple pthreads and sockets API wrappers, respectively. But they went further. Why? I'd like to think it's because this would've left out too many platforms with similar APIs, or with native APIs that offered better performance than going through a pthreads or sockets emulation layer.
My argument is part of the discussion: "Should Boost.Compute be based on the OpenCL C++ or OpenCL C layer?". My argument makes no sense when trying to answer "Should Boost.Compute support one or multiple backends?".
Not a full review, but I just wanted to chime in on a couple of points.
On Saturday, December 20, 2014 3:33, Kyle Lutz wrote:
On Fri, Dec 19, 2014 at 1:07 PM, Thomas M wrote:
<snip>
One particular issue that makes me hesitant is the lack of OpenCL 2.0 support in the "official" C++ bindings. The OpenCL 2.0 specification was released over a year ago (November 2013). The first public OpenCL 2.0 implementation was released by Intel three months ago (September 2014, followed closely by an AMD implementation). Boost.Compute had OpenCL 2.0 support implemented a week later. As of today (over a year since the specification was released), there is still no support for OpenCL 2.0 in the Khronos C++ OpenCL wrapper. I don't think it would be prudent to restrict Boost.Compute to a subset of the OpenCL API merely because of shortcomings in the "official" C++ wrapper.
Khronos is receptive to contributions. You can file bugs on their APIs and even contact them regarding patches and other contributions. <snip>
A final but probably very important design consideration: I wonder whether Boost needs an OpenCL computing library or a general parallelization library. Presently the GPGPU world is already split too much between CUDA and OpenCL as main players (hardware vendors doing their parts ...), and technology is really rapidly moving (APUs etc.). As Hartmut has already pointed out, one approach could be to use the current proposal as a foundation for a parallelization implementation:
<snip>
I think developing a unifying parallel framework which can intelligently dispatch algorithms to multiple back-ends is outside the scope of Boost.Compute
Perhaps, but I strongly agree that a Boost.Compute library shouldn't be tied to OpenCL. Beyond CUDA, we should expect to see more OpenCL alternatives enabled by HSA. And it would seem highly desirable to have a TBB (or equivalent) backend, for the vast multitudes of multi-core machines that don't support OpenCL.
Matt
participants (7)
- Asbjørn
- Gruenke,Matt
- Joel FALCOU
- Kyle Lutz
- Niall Douglas
- Sebastian Schaetz
- Thomas M