Hi,

This is my review of Boost.Compute:

1. What is your evaluation of the design?

It seems logical to me. It is effectively a wrapper around OpenCL that provides implementations of higher-level algorithms, and allows interoperability with OpenCL and OpenGL.

The Boost.Compute name is a bit misleading, as Boost.Compute supports only OpenCL-enabled devices.

2. What is your evaluation of the implementation?

There is some code duplication (e.g. type traits) and various other bits and pieces that could be moved to existing Boost components. I think some effort should be spent towards that.

It seems that performance is on par with Thrust. However, there are other libraries out there (e.g. Bolt) and multiple devices, so a more extensive experimental evaluation is needed to say decidedly that it is a good implementation.

3. What is your evaluation of the documentation?

Overall, it is pretty good. Given the complexity of the accelerator programming model, a few more elaborate examples in the tutorial would be welcome.

4. What is your evaluation of the potential usefulness of the library?

This is difficult to answer. A lot of work has been put into this library and it seems the way to go. The interfaces are clean, the code looks solid and the developer willing.

However, there is limited vendor support, there are not enough benchmarks, and there are alternatives that already have both. Given that Boost.Compute is targeted at users who know a thing or two about performance, I don't know how they can be convinced to consider using Boost.Compute over Bolt or Thrust.

5. Did you try to use the library? With what compiler? Did you have any problems?

I did, using an AMD 7850 on Linux with gcc 4.8. The few examples I tried compiled and ran fine.

6. How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

I went over the documentation, glanced over the code and ran a few examples.

7. Are you knowledgeable about the problem domain?

I'm in the HPC field. I have extensive experience with MPI, OpenMP and pthreads, and less with TBB, CUDA and OpenCL.

8. Do you think the library should be accepted as a Boost library?

This will be a maybe. It is a well-written library with a few minor issues that can be resolved.

However, why would someone use Boost.Compute over what is out there? Average users can resort to Bolt or Thrust. Power users will probably always try to hand-tune their OpenCL or CUDA algorithm. How can we test it and prove its performance?

Regards,
Yiannis
On Tue, Dec 30, 2014 at 8:14 PM, Yiannis Papadopoulos
Hi,
This is my review of Boost.Compute:
1. What is your evaluation of the design?
It seems logical to me. It is effectively a wrapper around OpenCL that provides implementations of higher-level algorithms, and allows interoperability with OpenCL and OpenGL.
The Boost.Compute name is a bit misleading, as Boost.Compute supports only OpenCL-enabled devices.
2. What is your evaluation of the implementation?
There is some code duplication (e.g. type traits) and various other bits and pieces that can be moved to existing Boost components. I think there should be some effort spent towards that.
Could you let me know which type-traits you think are duplicated or should be moved elsewhere?
It seems that performance is on par with Thrust. However, there are other libraries out there (e.g. Bolt) and multiple devices, so a more extensive experimental evaluation is needed to say decidedly that it is a good implementation.
There are a large number of performance benchmarks under the "perf" directory [1] which can be used to measure and evaluate the performance of the library. But you're right that the performance page in the documentation currently only shows comparisons with the STL and Thrust; I'll work on adding others.
3. What is your evaluation of the documentation?
Overall, it is pretty good. Given the complexity of the accelerator programming model, a few more elaborate examples in the tutorial would be welcome.
Fully agree, I will continue to work on improving the documentation.
4. What is your evaluation of the potential usefulness of the library?
This is difficult to answer. A lot of work has been put in this library and it seems the way to go. The interfaces are clean, the code looks solid and the developer willing.
However, there is limited vendor support, there are not enough benchmarks, and there are alternatives that already have both. Given that Boost.Compute is targeted at users who know a thing or two about performance, I don't know how they can be convinced to consider using Boost.Compute over Bolt or Thrust.
5. Did you try to use the library? With what compiler? Did you have any problems?
I did, using an AMD 7850 on Linux with gcc 4.8. The few examples I tried compiled and ran fine.
6. How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
I went over the documentation, I glanced over the code and ran a few examples.
7. Are you knowledgeable about the problem domain?
I'm in the HPC field. I have extensive experience with MPI, OpenMP, pthreads, and less with TBB, CUDA and OpenCL.
8. Do you think the library should be accepted as a Boost library?
This will be a maybe. It is a well-written library with a few minor issues that can be resolved.
However, why would someone use Boost.Compute over what is out there? Average users can resort to Bolt or Thrust. Power users will probably always try to hand-tune their OpenCL or CUDA algorithm. How can we test it and prove its performance?
Yes, Thrust and Bolt are alternatives. The problem is that each is incompatible with the other. Thrust works on NVIDIA GPUs while Bolt only works on AMD GPUs. Choosing one will preclude your code from working on devices from the other.

On the other hand, code written with Boost.Compute will work on any device with an OpenCL implementation. This includes NVIDIA GPUs, AMD GPUs/CPUs, Intel GPUs/CPUs as well as other more exotic architectures (Xeon Phi, FPGAs, Parallella Epiphany, etc.). Furthermore, unlike CUDA/Thrust, Boost.Compute requires no special compiler or compiler-extensions in order to execute code on GPUs; it is a pure library-level solution which is compatible with any standard C++ compiler.

Also, Boost.Compute does allow users to access the low-level APIs and execute their own hand-rolled kernels (and even interleave their custom operations with the high-level algorithms available in Boost.Compute). I think using Boost.Compute in this way allows for both rapid development and the ability to fully optimize kernels for specific operations where necessary.

Thanks for the review. Let me know if I can explain anything more clearly.

-kyle

[1] https://github.com/kylelutz/compute/tree/master/perf
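For illustration, a minimal sketch of the portable, high-level usage Kyle describes, using only the documented public Boost.Compute API (the data and sizes here are illustrative, not from the thread):

    #include <algorithm>
    #include <cstdlib>
    #include <vector>
    #include <boost/compute/system.hpp>
    #include <boost/compute/algorithm/copy.hpp>
    #include <boost/compute/algorithm/sort.hpp>
    #include <boost/compute/container/vector.hpp>

    namespace compute = boost::compute;

    int main()
    {
        // pick whatever OpenCL device is available (GPU, CPU, accelerator)
        compute::device dev = compute::system::default_device();
        compute::context ctx(dev);
        compute::command_queue queue(ctx, dev);

        // host data
        std::vector<float> host(1024);
        std::generate(host.begin(), host.end(), rand);

        // copy to the device, sort there, copy back
        compute::vector<float> device_vec(host.size(), ctx);
        compute::copy(host.begin(), host.end(), device_vec.begin(), queue);
        compute::sort(device_vec.begin(), device_vec.end(), queue);
        compute::copy(device_vec.begin(), device_vec.end(), host.begin(), queue);

        return 0;
    }

The same binary runs on any platform with an OpenCL driver; no nvcc or vendor-specific compiler extension is involved.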
-----Original Message-----
From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz
Sent: Wednesday, December 31, 2014 0:58
To: boost@lists.boost.org List
Subject: Re: [boost] [compute] Review
Bolt only works on AMD GPUs.
I wasn't aware that it was AMD-specific (aside from the C++ AMP backend). They claim support for "any available OpenCL(tm) capable accelerated compute unit". I don't have a Nvidia card on which to test it.

https://github.com/HSA-Libraries/Bolt

They *do* support Intel TBB.

http://hsa-libraries.github.io/Bolt/html/buildingTBB.html

Matt
On Tue, Dec 30, 2014 at 11:29 PM, Gruenke,Matt
-----Original Message-----
From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz
Sent: Wednesday, December 31, 2014 0:58
To: boost@lists.boost.org List
Subject: Re: [boost] [compute] Review
Bolt only works on AMD GPUs.
I wasn't aware that it was AMD-specific (aside from the C++ AMP backend). They claim support for "any available OpenCL(tm) capable accelerated compute unit". I don't have a Nvidia card on which to test it.
Yeah, while Bolt has an OpenCL backend for GPU execution, it depends on the "OpenCL Static C++ Kernel Language" extension which is only implemented by AMD. -kyle
On 12/30/2014 11:57 PM, Kyle Lutz wrote:
On Tue, Dec 30, 2014 at 8:14 PM, Yiannis Papadopoulos
wrote: Hi,
This is my review of Boost.Compute:
2. What is your evaluation of the implementation?
There is some code duplication (e.g. type traits) and various other bits and pieces that can be moved to existing Boost components. I think there should be some effort spent towards that.
Could you let me know which type-traits you think are duplicated or should be moved elsewhere?
For example, is_fundamental<T> is already implemented in Boost.TypeTraits. Or type_traits/type_name.hpp may be able to leverage Boost.TypeIndex?
8. Do you think the library should be accepted as a Boost library?
This will be a maybe. It is a well-written library with a few minor issues that can be resolved.
However, why would someone use Boost.Compute over what is out there? Average users can resort to Bolt or Thrust. Power users will probably always try to hand-tune their OpenCL or CUDA algorithm. How can we test it and prove its performance?
Yes, Thrust and Bolt are alternatives. The problem is that each is incompatible with the other. Thrust works on NVIDIA GPUs while Bolt only works on AMD GPUs. Choosing one will preclude your code from working on devices from the other.
On the other hand, code written with Boost.Compute will work on any device with an OpenCL implementation. This includes NVIDIA GPUs, AMD GPUs/CPUs, Intel GPUs/CPUs as well as other more exotic architectures (Xeon Phi, FPGAs, Parallella Epiphany, etc.). Furthermore, unlike CUDA/Thrust, Boost.Compute requires no special compiler or compiler-extensions in order to execute code on GPUs; it is a pure library-level solution which is compatible with any standard C++ compiler.
Also, Boost.Compute does allow for users to access the low-level APIs and execute their own hand-rolled kernels (and even interleave their custom operations with the high-level algorithms available in Boost.Compute). I think using Boost.Compute in this way allows for both rapid development and the ability to fully-optimize kernels for specific operations where necessary.
Thanks for the review. Let me know if I can explain anything more clearly.
-kyle
I realize that, but the question is: what is the advantage of Boost.Compute over doing something like:

template <typename InputIterator, typename EqualityComparable>
auto count(InputIterator first, InputIterator last,
           const EqualityComparable& value)
{
#ifdef THRUST
    return thrust::count(first, last, value);
#elif BOLT
    return bolt::cl::count(first, last, value);
#elif STL
    return std::count(first, last, value);
#endif
}

where first and last are iterators on some vector<> that is #ifdefed similarly (or just use some template magic to invoke the right algorithm based on the container type). I have this concern, and IMO users might ask themselves the same while shopping for GPU libraries.
On Wed, Dec 31, 2014 at 9:57 AM, Ioannis Papadopoulos
On 12/30/2014 11:57 PM, Kyle Lutz wrote:
On Tue, Dec 30, 2014 at 8:14 PM, Yiannis Papadopoulos
wrote: Hi,
This is my review of Boost.Compute:
2. What is your evaluation of the implementation?
There is some code duplication (e.g. type traits) and various other bits and pieces that can be moved to existing Boost components. I think there should be some effort spent towards that.
Could you let me know which type-traits you think are duplicated or should be moved elsewhere?
For example, is_fundamental<T> is already implemented in Boost.TypeTraits. Or type_traits/type_name.hpp may be able to leverage Boost.TypeIndex?
True, there is a boost::is_fundamental<T> (and a std::is_fundamental<T> in C++11), but these have different semantics than boost::compute::is_fundamental<T>. For Boost.Compute, the is_fundamental<T> trait returns true if the type T is fundamental on the device (i.e. an OpenCL built-in type). For example, the float4_ type is an aggregate type on the host (i.e. std::is_fundamental<float4_>::value == false) but is a built-in type in OpenCL (i.e. boost::compute::is_fundamental<float4_>::value == true). As for type_name<T>(), it returns a string with the OpenCL type name for the C++ type, which can actually be very different from the C++ type name (e.g. type_name<Eigen::Vector2f>() == "float2").
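To make the distinction concrete, a small sketch (an editor's example using the documented Boost.Compute type-trait headers; the header paths are assumed to match the library's source tree):

    #include <type_traits>
    #include <boost/compute/types/fundamental.hpp>
    #include <boost/compute/type_traits/is_fundamental.hpp>
    #include <boost/compute/type_traits/type_name.hpp>

    using boost::compute::float4_;

    // float4_ is an aggregate to the host compiler...
    static_assert(!std::is_fundamental<float4_>::value,
                  "not fundamental on the host");

    // ...but maps to OpenCL's built-in float4 vector type on the device
    static_assert(boost::compute::is_fundamental<float4_>::value,
                  "fundamental on the device");

    // type_name<T>() yields the OpenCL spelling of the C++ type,
    // e.g. boost::compute::type_name<float4_>() == "float4"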
8. Do you think the library should be accepted as a Boost library?
This will be a maybe. It is a well-written library with a few minor issues that can be resolved.
However, why would someone use Boost.Compute against what is out there? Average users can resort to Bolt or Thrust. Power users will probably always try to hand-tune their OpenCL or CUDA algorithm. How can we test it and prove its performance?
Yes, Thrust and Bolt are alternatives. The problem is that each is incompatible with the other. Thrust works on NVIDIA GPUs while Bolt only works on AMD GPUs. Choosing one will preclude your code from working on devices from the other.
On the other hand, code written with Boost.Compute will work on any device with an OpenCL implementation. This includes NVIDIA GPUs, AMD GPUs/CPUs, Intel GPUs/CPUs as well as other more exotic architectures (Xeon Phi, FPGAs, Parallella Epiphany, etc.). Furthermore, unlike CUDA/Thrust, Boost.Compute requires no special compiler or compiler-extensions in order to execute code on GPUs; it is a pure library-level solution which is compatible with any standard C++ compiler.
Also, Boost.Compute does allow for users to access the low-level APIs and execute their own hand-rolled kernels (and even interleave their custom operations with the high-level algorithms available in Boost.Compute). I think using Boost.Compute in this way allows for both rapid development and the ability to fully-optimize kernels for specific operations where necessary.
Thanks for the review. Let me know if I can explain anything more clearly.
-kyle
I realize that, but the question is: what is the advantage of Boost.Compute over doing something like:

template <typename InputIterator, typename EqualityComparable>
auto count(InputIterator first, InputIterator last,
           const EqualityComparable& value)
{
#ifdef THRUST
    return thrust::count(first, last, value);
#elif BOLT
    return bolt::cl::count(first, last, value);
#elif STL
    return std::count(first, last, value);
#endif
}

where first and last are iterators on some vector<> that is #ifdefed similarly (or just use some template magic to invoke the right algorithm based on the container type). I have this concern, and IMO users might ask themselves the same while shopping for GPU libraries.
Well if we took this approach, the library would have to be compiled separately for each different compute device rather than being portable to any system with an OpenCL implementation. And while this is a trivial example for count(), implementing this for more complicated algorithms which take user-defined operators or work with higher-level iterators (e.g. transform_iterator or zip_iterator) would be much more difficult. I think an approach like this would ultimately be more complex, harder to maintain, and less flexible in the interfaces/functionality we could offer.
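To illustrate Kyle's point about user-defined operators, an editor's sketch of the kind of call that is hard to express portably via #ifdef dispatch (BOOST_COMPUTE_FUNCTION is the library's documented macro; the function name and body here are made up for the example):

    #include <boost/compute/function.hpp>
    #include <boost/compute/system.hpp>
    #include <boost/compute/algorithm/transform.hpp>
    #include <boost/compute/container/vector.hpp>

    namespace compute = boost::compute;

    // the operator's body is OpenCL C, JIT-compiled at runtime for
    // whatever device the queue targets
    BOOST_COMPUTE_FUNCTION(float, scale_and_shift, (float x),
    {
        return 2.0f * x + 1.0f;
    });

    void apply(compute::vector<float>& v, compute::command_queue& queue)
    {
        // Thrust would need a __device__ functor compiled by nvcc;
        // Bolt needs its own AMD-only kernel extension. Here the same
        // host binary works on any OpenCL device.
        compute::transform(v.begin(), v.end(), v.begin(),
                           scale_and_shift, queue);
    }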
Just to be clear, I am not dissing your work: I really like it and your positive attitude for addressing issues.
Not at all, I appreciate your feedback. Thanks! -kyle
As someone who's looked at the sources of both Bolt and Boost.Compute, I can attest Kyle has some well-written code and one of the cleanest, most concise and straightforward approaches to several common issues that get much more complicated when trying to do C++ AOT/JIT metaprogramming over OpenCL to deal with templates.
Re Bolt, AMD does indeed use the static kernel extension in Bolt, and they suffer a lot of copypasta - from what I can tell it's fairly unnecessary copypasta-ing and unnecessary usage of templates (don't get me wrong, I like type traits and lots of templates for subroutines and specializations!). Nonetheless that extension was used, and thus Bolt is not compatible with other implementations (NVIDIA, Intel), so we have 2 different libraries that are effectively STL-like and solve the same algorithms but only work on their respective platforms.
On the flip side, NVIDIA seems not to be interested in good OpenCL performance (or a public 1.2+ version) for whatever reasons, so a library that runs well on any OpenCL-compliant device would be good for those needing a portable GPU-backed STL-like set of algorithms. But that puts the library's choice a bit at odds with performance, since most performance-interested parties would probably use CUDA out of the belief that it will perform significantly better than NVIDIA's OpenCL implementation, or OpenCL in general. I'd be very interested in seeing a valid and direct comparison of a few common algorithms implemented in the same fashion in OpenCL and CUDA, in addition to the library-to-library performance charts Boost.Compute already has [here: https://kylelutz.github.io/compute/boost_compute/performance.html] - this might take away some of the research community's fear of OpenCL vs CUDA, especially given the strength of the Boost name.
A few tidbits relating to the work other than that though:
- One of the things I don't like is that the scan operation recursively calls itself and reallocates memory a bunch of times - Bolt seems to have taken a better approach here, which reduces the global memory usage too. I don't like additional memory allocations or things that could incur unnecessary/unexpected latencies if the problem changes size. I assume this may be the case in a few different spots too?
- If I were to use the library, it probably would be for sorting - I think you should take a peek at the techniques used in Bolt to make the radix sort and scan faster. Where available and requested, OpenCL 2.0's work-group functions could also be preferred.
- I personally haven't found a place to use libraries like Thrust/Bolt/Boost.Compute - their whole idea is that you have a single huge workload, rather than the batch of small-to-medium workloads I find much more common in my own work - I have only rolled my own specialized kernels so far. Who is the typical userbase of these libraries? As such I do not foresee myself as a user at present.
- Would like some way to make the runtime dump the kernels the library has generated.
- Based on my current understanding of OpenCL on Altera - and likely on Xilinx as well - this library's AOT/JIT technique will not work on FPGAs. For those you would need to take the kernels used through a suite of tools which can take hours to process, and only later can you load a program and call the kernels - the chief problem is that no actual JIT is allowed in that domain. I don't hold this against Boost.Compute - of course the FPGA design flow is going to be different, and no one has a library for them anyway.
To try and be formal I'll put in my own review if anyone's still reading:
1. What is your evaluation of the design?
High quality.
2. What is your evaluation of the implementation?
High quality. Could use some refinement and maybe optimization in some functions, but these changes are easy to make and will not break the API - the library has many functions with very good performance and is better and more skillfully maintained than others I have compared it to.
3. What is your evaluation of the documentation?
Good enough.
4. What is your evaluation of the potential usefulness of the library?
Fills a portability gap among STL-like libraries such as Thrust and Bolt, and looks to have more STL algorithms implemented. I don't know who uses these libraries, but there are several alternatives if you give up code portability, which indicates userbases exist.
5. Did you try to use the library? With what compiler? Did you have
any problems?
No
6. How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?
I have evaluated these libraries myself several times over the last year on specific components; I had to get deep into the guts of those parts.
7. Are you knowledgeable about the problem domain?
GPGPU yes - not in who uses these GPU-backed STL-like libraries, though.
-----Original Message-----
From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Jason Newton
Sent: Thursday, January 01, 2015 4:46
To: boost@lists.boost.org
Subject: Re: [boost] [compute] Review
I'd be very interested in seeing a valid and direct comparison of a few common algorithms implemented in the same fashion in OpenCL and CUDA, in addition to the library-to-library performance charts Boost.Compute already has [here: https://kylelutz.github.io/compute/boost_compute/performance.html] - this might take away some of the research community's fear of OpenCL vs CUDA, especially given the strength of the Boost name.
The biggest issue I see with those performance comparisons is that they're not very realistic. While a direct comparison of individual operation performance is certainly interesting, a more relevant metric would be to implement and compare solutions to real-world problems. This will expose various copying and synchronization overheads, which are doubtlessly of concern to many current GPGPU practitioners and potential Boost.Compute users.
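As a sketch of the kind of end-to-end measurement this suggests (an editor's example, not from the thread; it uses only documented Boost.Compute calls, and queue.finish() makes the wall-clock timing cover all queued device work):

    #include <chrono>
    #include <iostream>
    #include <vector>
    #include <boost/compute/system.hpp>
    #include <boost/compute/algorithm/copy.hpp>
    #include <boost/compute/algorithm/reduce.hpp>
    #include <boost/compute/container/vector.hpp>

    namespace compute = boost::compute;

    int main()
    {
        compute::device dev = compute::system::default_device();
        compute::context ctx(dev);
        compute::command_queue queue(ctx, dev);

        std::vector<float> host(1 << 24, 1.0f);

        auto t0 = std::chrono::steady_clock::now();

        // end-to-end: device allocation + upload + reduce + download
        compute::vector<float> device_vec(host.size(), ctx);
        compute::copy(host.begin(), host.end(), device_vec.begin(), queue);
        float sum = 0.0f;
        compute::reduce(device_vec.begin(), device_vec.end(), &sum, queue);
        queue.finish();  // wait for all queued work before stopping the clock

        auto t1 = std::chrono::steady_clock::now();
        std::cout << "sum = " << sum << ", end-to-end ms = "
                  << std::chrono::duration<double, std::milli>(t1 - t0).count()
                  << std::endl;
    }

Timing around the whole block, rather than just the kernel, is what surfaces the transfer and synchronization costs described above.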
A few tidbits relating to the work other than that though: - One of the things I don't like is that the scan operation recursively calls itself and reallocates memory a bunch of times - Bolt seems to have taken a better approach here, which reduces the global memory usage too. I don't like additional memory allocations or things that could incur unnecessary/unexpected latencies if the problem changes size.
At the very least, documentation of the memory requirements might help users partition their data appropriately to avoid device memory exhaustion.

Matt
participants (5)
- Gruenke,Matt
- Ioannis Papadopoulos
- Jason Newton
- Kyle Lutz
- Yiannis Papadopoulos