Hi,

This week I presented a proposal to the C++ standards committee to provide a standard library component for SIMD computation based on the library in development Boost.SIMD (not yet a Boost library). I was hoping to get feedback on the interface and establish an API that would satisfy hardware and compiler vendors.

Unfortunately the concurrency/parallelism group has decided that they do not want C++ to provide types representing SIMD registers. I'm afraid I don't quite understand the rationale for such a refusal; proposing more high-level constructs similar to valarray (or to our own library NT2) was suggested, but that's obviously a more complex and limited API, not a basic building block for portably programming a specific processor unit.

Development of Boost.SIMD will still proceed, aiming for integration in Boost, but standardization appears to be definitely out of the question.

Any feedback on the API presented in the proposal is welcome.
http://open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3571.pdf
hi mathias,
Unfortunately the concurrency/parallelism group has decided that they do not want C++ to provide types representing SIMD registers. I'm afraid I don't quite understand the rationale for such a refusal; proposing more high-level constructs similar to valarray (or to our own library NT2) was suggested, but that's obviously a more complex and limited API, not a basic building block for portably programming a specific processor unit.
hmm ... that's quite unfortunate ... a memory-based interface like valarray just does not compose and is pretty useless for many use cases ... would be quite interested to hear the reasons for the refusal, do you have any details?

cheers, tim
On 18/04/13 13:28, Tim Blechmann wrote:
hi mathias,
Unfortunately the concurrency/parallelism group has decided that they do not want C++ to provide types representing SIMD registers. I'm afraid I don't quite understand the rationale for such a refusal; proposing more high-level constructs similar to valarray (or to our own library NT2) was suggested, but that's obviously a more complex and limited API, not a basic building block for portably programming a specific processor unit.
hmm ... that's quite unfortunate ... a memory-based interface like valarray just does not compose and is pretty useless for many use cases ... would be quite interested to hear the reasons for the refusal, do you have any details?
I'm not sure I can publish the minutes here.
On Thu, Apr 18, 2013 at 2:46 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
Hi,
This week I presented a proposal to the C++ standards committee to provide a standard library component for SIMD computation based on the library in development Boost.SIMD (not yet a Boost library).
I was hoping to get feedback on the interface and establish an API that would satisfy hardware and compiler vendors.
Unfortunately the concurrency/parallelism group has decided that they do not want C++ to provide types representing SIMD registers. I'm afraid I don't quite understand the rationale for such a refusal; proposing more high-level constructs similar to valarray (or to our own library NT2) was suggested, but that's obviously a more complex and limited API, not a basic building block for portably programming a specific processor unit.
This is a shame. Is the rationale or the official response from the working group available somewhere?

Development of Boost.SIMD will still proceed, aiming for integration in Boost, but standardization appears to be definitely out of the question. Any feedback on the API presented in the proposal is welcome. http://open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3571.pdf
I have a few questions:

1. When a particular algorithm or pack configuration is not supported by the hardware, is the implementation required to emulate it with scalar or partially vectorized operations? And what is the behavior of Boost.SIMD in this regard?

2. It looks like the proposal does not define any means to discover the availability of certain operations and pack configurations in hardware. How would algorithm versioning work with this proposal? I'm not assuming each algorithm and operation would dispatch its implementation based on the hardware check, as this would be too slow.

3. Does it support division and modulus for integers? Is it supported by any hardware? This is out of curiosity; I'm not familiar with implementations other than SSE/AVX.

4. How would advanced operations be implemented, such as FMA and integer madd? Is it through additional library provided functions? IMHO, the availability of these operations is often crucial for performance of the user's algorithm, if it is more complicated than just accumulating integers.

5. Do you have any plans or time frames for Boost.SIMD inclusion? What is the state of the library?

I also want to encourage you to continue this work. This is a very interesting area and there is certainly demand for a higher-level abstraction of SIMD operations. Keep up the good work, and perhaps one day we'll see SIMD in the Standard after all.
On Thu, 18 Apr 2013, Andrey Semashev wrote:
1. When a particular algorithm or pack configuration is not supported by the hardware, is the implementation required to emulate it with scalar or partially vectorized operations?
Yes (according to my recollection of reading the paper).
3. Does it support division and modulus for integers?
Why not?
Is it supported by any hardware?
At least some special cases are, like division by a power of 2. And if the divisor is constant, you can also let the implementation handle turning it into a multiplication. And the general case might very well be supported in the future, if it isn't already.
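As an illustration of that special case (not from the proposal; the function name and the SSE2 spelling are mine):

#include <emmintrin.h>  // SSE2

// Division by a constant power of two reduces to a shift, which SIMD
// hardware supports directly. Caveat: an arithmetic shift rounds toward
// negative infinity while C++ division truncates toward zero, so this
// equals x / 8 only for non-negative lanes.
__m128i div_by_8(__m128i x)
{
    return _mm_srai_epi32(x, 3);  // per-lane arithmetic shift right by 3
}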
4. How would advanced operations be implemented, such as FMA and integer madd? Is it through additional library provided functions? IMHO, the availability of these operations is often crucial for performance of the user's algorithm, if it is more complicated than just accumulating integers.
If you only want fma as a fast way to compute a+b*c, you could just let your compiler optimize an addition and a multiplication to fma. They are not bad at that. If you rely on the extra accuracy of fma, then library functions seem necessary.

-- Marc Glisse
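In scalar terms the two options look like this; a minimal sketch using the C++11 std::fma from <cmath> (the function names are mine):

#include <cmath>

// Option 1: plain multiply-add. The compiler may or may not contract this
// into a single fma instruction, depending on flags and target.
double maybe_fused(double a, double b, double c)
{
    return a * b + c;
}

// Option 2: request the fused operation explicitly. std::fma guarantees a
// single rounding, at the cost of a library call where no FMA unit exists.
double always_fused(double a, double b, double c)
{
    return std::fma(a, b, c);
}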
On Friday 19 April 2013 01:21:58 Marc Glisse wrote:
On Thu, 18 Apr 2013, Andrey Semashev wrote:
3. Does it support division and modulus for integers?
Why not?
Is it supported by any hardware?
At least some special cases are, like division by a power of 2.
I think these special cases are better coded explicitly.
And if the divisor is constant, you can also let the implementation handle turning it into a multiplication.
Does the compiler do that with user-defined operators (which are user-defined in the case of packs)? Or do you mean the implementation of the operator will handle that? The latter means that the division will be very slow, but OK, since the division is slow even in hardware...
4. How would advanced operations be implemented, such as FMA and integer madd? Is it through additional library provided functions? IMHO, the availability of these operations is often crucial for performance of the user's algorithm, if it is more complicated than just accumulating integers.

If you only want fma as a fast way to compute a+b*c, you could just let your compiler optimize an addition and a multiplication to fma. They are not bad at that. If you rely on the extra accuracy of fma, then library functions seem necessary.
According to my experience, compilers are reluctant to pattern-match intrinsics and replace them with other intrinsics (which is a good thing). So if the user's code a*b+c*d is equivalent to two _mm_mullo_epi16/_mm_mulhi_epi16 and an _mm_add_epi32, then that's what you'll get in the output instead of a single _mm_madd_epi16. Note also that _mm_madd_epi16 requires a special layout of its operands in xmm register elements, which is also a blocker for the compiler optimization.

Regarding FMA, this is probably easier for compilers, but due to the difference in accuracy I don't expect compilers to perform this optimization lightly (i.e. without a specific compiler switch explicitly allowing it). And a switch, being a global option, may not be suitable in every place of the application. So having a way to explicitly express the programmer's intention is useful here too.

I think special operations like FMA, madd, hadd/hsub, avg, and min/max should be provided as functions. Also, it might be helpful to be able to convert packs to the compiler-specific types, like __m128i, and back, to be able to use other more special intrinsics that are not available as functions or to interoperate with inline assembler.

What I also forgot to ask is how the paper and Boost.SIMD handle overflowing and saturating integer arithmetic. I assume the operators on packs implement overflowing operations, since that's how scalar operations work. Is it possible to do saturating operations then?
On 19/04/2013 07:55, Andrey Semashev wrote:
On Friday 19 April 2013 01:21:58 Marc Glisse wrote:
On Thu, 18 Apr 2013, Andrey Semashev wrote:
3. Does it support division and modulus for integers?

Why not?
Is it supported by any hardware?

At least some special cases are, like division by a power of 2.

I think these special cases are better coded explicitly.
And if the divisor is constant, you can also let the implementation handle turning it into a multiplication.

Does the compiler do that with user-defined operators (which are user-defined in the case of packs)? Or do you mean the implementation of the operator will handle that? The latter means that the division will be very slow, but OK, since the division is slow even in hardware...
4. How would advanced operations be implemented, such as FMA and integer madd? Is it through additional library provided functions? IMHO, the availability of these operations is often crucial for performance of the user's algorithm, if it is more complicated than just accumulating integers.

If you only want fma as a fast way to compute a+b*c, you could just let your compiler optimize an addition and a multiplication to fma. They are not bad at that. If you rely on the extra accuracy of fma, then library functions seem necessary.

According to my experience, compilers are reluctant to pattern-match intrinsics and replace them with other intrinsics (which is a good thing). So if the user's code a*b+c*d is equivalent to two _mm_mullo_epi16/_mm_mulhi_epi16 and an _mm_add_epi32, then that's what you'll get in the output instead of a single _mm_madd_epi16. Note also that _mm_madd_epi16 requires a special layout of its operands in xmm register elements, which is also a blocker for the compiler optimization.
Regarding FMA, this is probably easier for compilers, but due to the difference in accuracy I don't expect compilers to perform this optimization lightly (i.e. without a specific compiler switch explicitly allowing it). And a switch, being a global option, may not be suitable in every place of the application. So having a way to explicitly express the programmer's intention is useful here too.
I think special operations like FMA, madd, hadd/hsub, avg, and min/max should be provided as functions. Also, it might be helpful to be able to convert packs to the compiler-specific types, like __m128i, and back, to be able to use other more special intrinsics that are not available as functions or to interoperate with inline assembler.
What I also forgot to ask is how the paper and Boost.SIMD handle overflowing and saturating integer arithmetic. I assume the operators on packs implement overflowing operations, since that's how scalar operations work. Is it possible to do saturating operations then?
They are already present in Boost.SIMD.

Overflowing operations are the current operators + - * /, plus abs and neg. Saturating operations are abss, adds, subs, muls, divs, negs (the final s standing for saturated). abss and negs differ from abs and neg in that they handle Valmin -> Valmax for integers (versus Valmin -> Valmin for the standard ones). We also have saturate<A>(a), which returns the saturated value of a in the type A (the available types in Boost.SIMD are signed/unsigned integers of 8, 16, 32 and 64 bits, and packs of such).

All operations in Boost.SIMD are coded in such a way that if proper intrinsics do not exist, or there is no speed benefit in using them, they fall back to mapping the available scalar implementation over each element of the SIMD vector (note, however, that this is very uncommon). This is the case for integer division on 64-bit integers. (Without going too far into the implementation, it is often speedier, when possible, to use floating-point division intrinsics to implement integer division on today's processors.)
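A hypothetical usage sketch of that distinction; the function names follow the reply above, but the header path and namespace are assumptions, not verbatim Boost.SIMD API:

#include <boost/simd/pack.hpp>  // assumed header location
#include <cstdint>

namespace bs = boost::simd;

void sums(bs::pack<std::int16_t> const& a, bs::pack<std::int16_t> const& b)
{
    bs::pack<std::int16_t> wrapped   = a + b;          // overflowing: wraps on overflow
    bs::pack<std::int16_t> saturated = bs::adds(a, b); // saturating: clamps to INT16_MAX/INT16_MIN
}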
On 19/04/13 06:55, Andrey Semashev wrote:
According to my experience, compilers are reluctant to pattern-match intrinsics and replace them with other intrinsics (which is a good thing). So if the user's code a*b+c*d is equivalent to two _mm_mullo_epi16/_mm_mulhi_epi16 and an _mm_add_epi32, then that's what you'll get in the output instead of a single _mm_madd_epi16. Note also that _mm_madd_epi16 requires a special layout of its operands in xmm register elements, which is also a blocker for the compiler optimization.
_mm_madd_epi16 is not a vertical operation, so it's a fairly special function, and you can't expect the compiler to recognize cases where it can use it. _mm_macc_epi16 is the vertical one (XOP only), and rather easier on the optimizer. There are fma and correct_fma functions in any case.
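For reference, a comment-only sketch of the lane-wise difference (semantics only; lane indexing is illustrative):

// Vertical multiply-add (XOP _mm_macc_epi16), per 16-bit lane i:
//   r[i] = a[i] * b[i] + c[i]
//
// Horizontal _mm_madd_epi16, per 32-bit result lane i (16x16 -> 32-bit
// products of adjacent lane pairs, then summed):
//   r[i] = a[2*i] * b[2*i] + a[2*i + 1] * b[2*i + 1]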
Regarding FMA, this is probably easier for compilers, but due to the difference in accuracy I don't expect compilers to perform this optimization lightly (i.e. without a specific compiler switch explicitly allowing it).
They do. A compiler is allowed to use higher precision for intermediate results whenever it wants. This is what also allows compilers to use 80 bits of precision for operations on float or double.
I think special operations like FMA, madd, hadd/hsub, avg, and min/max should be provided as functions. Also, it might be helpful to be able to convert packs to the compiler-specific types, like __m128i, and back, to be able to use other more special intrinsics that are not available as functions or to interoperate with inline assembler.
What I also forgot to ask is how the paper and Boost.SIMD handle overflowing and saturating integer arithmetic. I assume the operators on packs implement overflowing operations, since that's how scalar operations work. Is it possible to do saturating operations then?
The standard proposal tried to keep things simple, the library itself has quite a few more things.
On Sunday 21 April 2013 11:34:14 Mathias Gaunard wrote:
On 19/04/13 06:55, Andrey Semashev wrote:
According to my experience, compilers are reluctant to pattern-match intrinsics and replace them with other intrinsics (which is a good thing). So if the user's code a*b+c*d is equivalent to two _mm_mullo_epi16/_mm_mulhi_epi16 and an _mm_add_epi32, then that's what you'll get in the output instead of a single _mm_madd_epi16. Note also that _mm_madd_epi16 requires a special layout of its operands in xmm register elements, which is also a blocker for the compiler optimization.
_mm_madd_epi16 is not a vertical operation, so it's a fairly special function, and you can't expect the compiler to recognize cases where it can use it.
That's my point. Nonetheless this operation is very useful in some cases and I would like to be able to use it with Boost.SIMD. Same as many other special operations.
I think special operations like FMA, madd, hadd/hsub, avg, and min/max should be provided as functions. Also, it might be helpful to be able to convert packs to the compiler-specific types, like __m128i, and back, to be able to use other more special intrinsics that are not available as functions or to interoperate with inline assembler.
What I also forgot to ask is how the paper and Boost.SIMD handle overflowing and saturating integer arithmetic. I assume the operators on packs implement overflowing operations, since that's how scalar operations work. Is it possible to do saturating operations then?
The standard proposal tried to keep things simple, the library itself has quite a few more things.
So, is it possible to convert pack to __m128i & co. and back in Boost.SIMD?
Andrey Semashev
4. How would advanced operations be implemented, such as FMA and integer madd? Is it through additional library provided functions? IMHO, the availability of these operations is often crucial for performance of the user's algorithm, if it is more complicated than just accumulating integers.
I have a more fundamental question. How will you handle vector masks? You aren't going to get performance on future implementations without it. -David
18.04.2013 14:46, Mathias Gaunard:
Unfortunately the concurrency/parallelism group has decided that they do not want C++ to provide types representing SIMD registers.
I would also like to know details of rejection.
I'm afraid I don't quite understand the rationale for such a refusal; proposing more high-level constructs similar to valarray (or to our own library NT2) was suggested, but that's obviously a more complex and limited API, not a basic building block for portably programming a specific processor unit.
I am using the Eigen library in my projects; internally it has its own abstraction around SIMD instructions and several backends for different instruction sets. Such high-level libraries would clearly benefit from some standard way to do SIMD operations.

However, I see one major drawback of a low-level SIMD interface in the ISO standard: it is not as future-proof as a higher-level API. SIMD instruction sets are expanding and becoming more complex, and I am not sure how a low-level library is supposed to catch future trends. For instance, there is the FMA instruction "d=a+b*c" - yes, your proposal has an appropriate fma function in <cmath>. But imagine that some new architecture had a "double FMA" instruction like "f=a+b*c+d*e", or an even more complex instruction such as "2x2 matrix multiplication". In order to support such new instructions, the low-level library would have to add new functions - i.e. wait for a new version of the ISO standard. And until a new version of the standard adopts these new functions, the low-level interface would not be competitive - users (higher-level library developers) would again use compiler-specific intrinsics. In the case of a higher-level interface (like Eigen's), only the internal implementation needs to be adjusted in order to get the benefits.

-- Evgeny Panasyuk
On 18/04/13 14:29, Evgeny Panasyuk wrote:
For instance, there is the FMA instruction "d=a+b*c" - yes, your proposal has an appropriate fma function in <cmath>. But imagine that some new architecture had a "double FMA" instruction like "f=a+b*c+d*e", or an even more complex instruction such as "2x2 matrix multiplication".
It is relatively easy for compilers to transform a*b+c to fma(a,b,c) (even if the operations involved are SIMD intrinsics). As a matter of fact, compilers already do it.
20.04.2013 3:37, Mathias Gaunard:
For instance, there is the FMA instruction "d=a+b*c" - yes, your proposal has an appropriate fma function in <cmath>. But imagine that some new architecture had a "double FMA" instruction like "f=a+b*c+d*e", or an even more complex instruction such as "2x2 matrix multiplication".
It is relatively easy for compilers to transform a*b+c to fma(a,b,c) (even if the operations involved are SIMD intrinsics). As a matter of fact, compilers already do it.
And what is your point? Do you mean that we should rely on the auto-vectorizer? Quote from the proposal: "Autovectorizers have the ability to detect code fragments that can be vectorized. This automatic process finds its limits when the user code is not presenting a clear vectorizable pattern (i.e. complex data dependencies, non-contiguous memory accesses, aliasing or control flows). The SIMD code generation stays fragile and the resulting instruction flow may be suboptimal compared to an explicit vectorization."

-- Evgeny Panasyuk
On Mon, Apr 22, 2013 at 6:32 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 21/04/13 14:01, Evgeny Panasyuk wrote:
And what is your point? Do you mean that we should rely on the auto-vectorizer?
This has nothing to do with auto-vectorization.
I think the argument was that in one regard you point out that we cannot rely on the compiler to optimize the code, and in the other you suggest the opposite. Although I admit that the expression transform to FMA is simpler for the compiler to handle, I would still prefer to explicitly spell it out as a function call. In general, when writing SIMD code, I would prefer to spell out as much as possible and leave only the lowest-level optimizations to the compiler (such as instruction scheduling, register allocation and spilling, maybe CSE and DCE, things like that).
On 22/04/13 16:03, Andrey Semashev wrote:
I think the argument was that in one regard you point out that we cannot rely on compiler to optimize the code and in the other you suggest the opposite.
Some optimizations are trivial, some require complex analysis and transformation algorithms. I don't think there is anything wrong with relying on the former.
Although I admit that expression transform to FMA is simpler for the compiler to handle, I would still prefer to explicitly spell it out as a function call.
You can still spell it out explicitly if you want to.
On Thu, 18 Apr 2013, Mathias Gaunard wrote:
Development of Boost.SIMD will still proceed, aiming for integration in Boost, but standardization appears to be definitely out of the question. Any feedback on the API presented in the proposal is welcome. http://open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3571.pdf
Copying here my earlier comments so they are in the same place as others'.
Some of them are only relevant for standardization, not for a Boost library.
Hello,
a few comments while reading N3571.
pack seems more similar to std::array than std::tuple to me.
On 19/04/13 00:29, Marc Glisse wrote:
On Thu, 18 Apr 2013, Mathias Gaunard wrote:
Development of Boost.SIMD will still proceed, aiming for integration in Boost, but standardization appears to be definitely out of the question. Any feedback on the API presented in the proposal is welcome. http://open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3571.pdf
Copying here my earlier comments so they are in the same place as others'. Some of them are only relevant for standardization, not for a Boost library.
I wasn't subscribed to c++-lib-ext at the time, so I missed them.
Hello,
a few comments while reading N3571.
pack seems more similar to std::array than std::tuple to me.
It's statically-sized, so both runtime and compile-time access are possible. Compile-time access could be significantly more efficient.
We could even dream of merging pack and array into a single type.
I don't think that's a good idea; pack has a very strong numerical semantic. You don't really want to do +, * or / on arrays.
Plus pack
As much as possible, I would like to avoid having a different interface for vectors and scalars. We have std::min for scalars, we can overload it for vectors instead of having simd::min.
The idea is that any SIMD code should be valid scalar code as well. I'm not sure whether we want the selection for functions to use ADL or whether it should be std::min. I have no strong opinion on this.
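A minimal sketch of the ADL option (the simd::min overload named in the comment is an assumption for illustration):

#include <algorithm>

// Generic code valid for both scalars and packs: bring std::min into scope
// as the fallback, and let argument-dependent lookup select a pack overload
// (e.g. a hypothetical simd::min) when T is a pack type.
template <typename T>
T clamp_upper(T const& x, T const& hi)
{
    using std::min;
    return min(x, hi);
}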
We have ?: for scalars, you don't need to restrict yourself to a pure library
Overloading ?: would probably have to be a standalone language extension to core, no?
Masking: it is a bit strange to be able to do pack<double> & int but not double & int. Currently in gcc we require that you (reinterpret) cast the pack<double> to a pack<some integer>, do the masking and go back.
It was a wish from my colleague; I personally think it might be better to align the pack operators to be the same as the scalar equivalents, and therefore not allow it without a cast.
Any policy on reinterpret_cast-ing a pack to a pack of a different type?
In Boost.SIMD there is a bitwise_cast<To>(from) function, which is essentially the same as "To to; memcpy(&to, &from, sizeof(from)); return to;". This can be optimized to a reinterpret_cast in some cases, but reinterpret_cast itself is dangerous because of aliasing issues.
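Spelled out, that is roughly the following (a sketch, not the actual Boost.SIMD source):

#include <cstring>

template <typename To, typename From>
To bitwise_cast(From const& from)
{
    static_assert(sizeof(To) == sizeof(From),
                  "bitwise_cast requires types of equal size");
    To to;
    std::memcpy(&to, &from, sizeof(from));  // well-defined, unlike reinterpret_cast
    return to;
}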
template <class T, std::size_t N = unspecified>
struct alignas(sizeof(T) * N) pack;
Do you really want to specify that large an alignment? You give examples with N=100...
N can never be 100, it's a power of 2. It would be possible to relax the alignment requirement somewhat however.
Maybe operator[] const could return by value if it wants to?
Yes.
Since the splat constructor is implicit, you may not need to document all the mixed operations.
It was to make things clearer, but I guess it might not be necessary.
Any notion of a subvector?
No, though it would probably be useful to be able to split any vector in two. Duly noted.
For gather and others, the proposal accepts mixing vector sizes.
For scatter/gather, it doesn't accept a number of indices different from the size of the vectors being loaded. Conversions are allowed to happen, however: you can load from a uint8* into a vector of int32.
cmath functions: it is not clear what signatures are supported, in particular for functions that use several types (ldexp has double and int). The list doesn't seem to exactly match cmath, actually. frexp takes an int*, does the vector version take a pointer to a vector, or some scatter-like vector?
We have some equivalents in Boost.SIMD, to make things simpler we use int32 for float and int64 for double, so that the sizes of vectors are the same.
Traits: are those supposed to be template aliases? Or to derive from what they are supposed to "return"? Or have a typedef ... type; inside?
They're metafunctions, i.e. classes with a type member typedef.
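For instance, a trait in that style might look like this (a hypothetical illustration, not the proposal's actual trait set):

#include <cstddef>

template <typename T, std::size_t N> struct pack;  // as in N3571

// Classic metafunction: the result is exposed as a nested "type" typedef.
template <typename T>
struct scalar_of { typedef T type; };              // scalars map to themselves

template <typename T, std::size_t N>
struct scalar_of<pack<T, N>> { typedef T type; };  // packs map to their element type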
For transform and accumulate, I have seen other proposals that specify new versions more in terms of permissions (what the compiler is allowed to do) and less implementation. Depending on the tag argument you pass to transform/accumulate, you give the compiler permission to reorder the operations, or do other transformations and it then deduces that it can parallelize and/or vectorize. Looks nice. Note that it doesn't contradict this proposal, simd::transform can always forward to std::transform(..., vectorizable_tag()) (or in the reverse direction).
simd::transform takes a function object that must be valid with both scalar and pack values; std::transform only requires that the function object be valid with scalar values. That's the main difference between the two functions.
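A sketch of that requirement: the callable has to be polymorphic over scalars and packs, which a templated function object satisfies (a C++11 non-generic lambda would not); the functor name is mine:

// Usable with simd::transform: callable with both float and pack<float>.
// The scalars a and b mix with pack operands via the implicit splat
// constructor discussed earlier.
struct axpb
{
    float a, b;

    template <typename T>
    T operator()(T const& x) const { return a * x + b; }
};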
On Sat, 20 Apr 2013, Mathias Gaunard wrote:
We have ?: for scalars, you don't need to restrict yourself to a pure library
Overloading ?: would probably have to be a standalone language extension to core, no?
Well, yes, it touches core, but there is a large difference between allowing one specific library type in ?: and letting users overload it as they please. Anyway, since standardization seems to be out, you can forget this comment.

-- Marc Glisse
Mathias Gaunard
This week I presented a proposal to the C++ standards committee to provide a standard library component for SIMD computation based on the library in development Boost.SIMD (not yet a Boost library).
Unfortunately the concurrency/parallelism group has decided that they do not want C++ to provide types representing SIMD registers. I'm afraid I don't quite understand the rationale for such a refusal;
It's pretty clear to me. A better approach is to add language constructs to help the compiler do the vectorization. It's 2013; we should be finished with requiring people to hand-vectorize code.

Adding things like "restrict" and/or keywords like "concurrent," ways to disambiguate possible aliases, describe unknown loop dependences, etc. are going to be much more flexible and fruitful long-term than providing a library that is tied to a particular model of parallelism (and a narrow model of vectorization, BTW).

Please look at what compilers from PGI, Intel, CAPS and, yes, Cray do to help users parallelize code. A reading of the pragma descriptions in the various compiler manuals would be informative.

-David
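For comparison, a sketch of that style: the programmer asserts the properties and leaves vectorization to the compiler. "__restrict" and "#pragma ivdep" are common vendor extensions (spellings vary by compiler), not standard C++:

// The annotations promise no aliasing between x and y and no loop-carried
// dependences; the compiler is then free to vectorize the loop itself.
void saxpy(float* __restrict y, float const* __restrict x, float a, int n)
{
    #pragma ivdep  // e.g. "#pragma GCC ivdep" for GCC
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}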
participants (7)
- Andrey Semashev
- dag@cray.com
- Evgeny Panasyuk
- jtl
- Marc Glisse
- Mathias Gaunard
- Tim Blechmann