Re: [boost] Going forward with Boost.SIMD

19 Apr 2013

On Friday 19 April 2013 01:21:58 Marc Glisse wrote:
...
On Thu, 18 Apr 2013, Andrey Semashev wrote:
...
3. It supports division and modulus for integers?
Why not?
...
Is it supported by any hardware?
At least some special cases are, like division by a power of 2.
I think these special cases are better coded explicitly.
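For illustration, this is what explicitly coding the power-of-two special case can look like for signed 16-bit lanes. It is a sketch using raw SSE2 intrinsics (it assumes an x86 target) and is not code from Boost.SIMD or the paper. The helper name `div_pow2_epi16` is hypothetical. A bare arithmetic shift rounds toward negative infinity, so negative lanes need a bias to match the truncating scalar `/`.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; assumes an x86 target

// Hypothetical helper: divides each signed 16-bit lane by 2^K (0 <= K <= 15),
// rounding toward zero like the scalar operator/. An arithmetic shift alone
// would round toward negative infinity, so negative lanes first get a bias
// of 2^K - 1.
template <int K>
inline __m128i div_pow2_epi16(__m128i x) {
    static_assert(K >= 0 && K <= 15, "shift count must fit a 16-bit lane");
    __m128i sign = _mm_srai_epi16(x, 15);  // 0xFFFF in negative lanes, 0 elsewhere
    __m128i bias = _mm_and_si128(sign, _mm_set1_epi16(static_cast<short>((1 << K) - 1)));
    return _mm_srai_epi16(_mm_add_epi16(x, bias), K);
}
```

Called as `div_pow2_epi16<2>(v)`, it costs three cheap integer operations plus a shift, versus a sequence of scalar divisions.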
...
And if the
divisor is constant, you can also let the implementation handle turning it
into a multiplication.
Does the compiler do that with user-defined operators (and operators on packs
are user-defined)? Or do you mean that the implementation of the operator will
handle that? The latter means that the division will be very slow, but OK,
since division is slow even in hardware...
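When the divisor is a compile-time constant, the usual strength reduction replaces the division with a multiply-high and a shift. Below is a sketch for unsigned 16-bit lanes divided by 3, assuming SSE2. The helper name is hypothetical. The magic constant 0xAAAB = ceil(2^17 / 3) is specific to this divisor and lane width. A general implementation would have to derive such constants per divisor.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; assumes an x86 target

// Hypothetical helper: divides each unsigned 16-bit lane by the constant 3
// without a divide instruction. For every 16-bit x, x / 3 == (x * 0xAAAB) >> 17.
// _mm_mulhi_epu16 returns the upper 16 bits of each 32-bit product (>> 16),
// and one more logical shift supplies the 17th bit.
inline __m128i div3_epu16(__m128i x) {
    const __m128i magic = _mm_set1_epi16(static_cast<short>(0xAAAB));
    return _mm_srli_epi16(_mm_mulhi_epu16(x, magic), 1);
}
```

The identity holds because 0xAAAB = (2^17 + 1) / 3, so the product overshoots x/3 by at most 65535 / (3 * 2^17) < 1/6. That is never enough to cross the next integer.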
...
...
4. How would advanced operations be implemented, such as FMA and integer
madd? Is it through additional library-provided functions? IMHO, the
availability of these operations is often crucial for the performance of the
user's algorithm, if it is more complicated than just accumulating
integers.
If you only want fma as a fast way to compute a+b*c, you could just let
your compiler fuse an addition and a multiplication into an fma. Compilers are
not bad at that. If you rely on the extra accuracy of fma, then library
functions seem necessary.
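The "extra accuracy" case is where an explicit function is needed. A classic example is the error-free product. Because fma rounds only once, `fma(a, b, -a*b)` recovers the exact rounding error of the product. Below is a scalar sketch with `std::fma` from `<cmath>`. The name `two_product` is just illustrative.

```cpp
#include <cmath>  // std::fma

// Illustrative helper ("TwoProduct"): returns p = fl(a*b) and stores in err
// the exact rounding error, so that a*b == p + err holds exactly (barring
// overflow/underflow). This relies on std::fma rounding a*b + c only once;
// with a separately rounded multiply and add, err would always be 0.
inline double two_product(double a, double b, double& err) {
    double p = a * b;
    err = std::fma(a, b, -p);
    return p;
}
```

A SIMD version of this technique needs an fma that is guaranteed to be fused. Whether the compiler happens to contract `a*b + c` is not enough.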
In my experience, compilers are reluctant to pattern-match intrinsics and
replace them with other intrinsics (which is a good thing). So if the user's
code a*b+c*d is equivalent to two _mm_mullo_epi16/_mm_mulhi_epi16 pairs and an
_mm_add_epi32, then that's what you'll get in the output instead of a single
_mm_madd_epi16. Note also that _mm_madd_epi16 requires a special layout of its
operands in the xmm register elements, which is another blocker for this
compiler optimization.
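To make the madd point concrete, the sketch below computes the same pairwise dot product twice, using SSE2 intrinsics (x86 only). The helper names are hypothetical. The first version uses the single `_mm_madd_epi16`. The second works lane-wise with `_mm_mullo_epi16`/`_mm_mulhi_epi16`, unpacks, shuffles and `_mm_add_epi32`. The extra unpack and shuffle steps are the layout mismatch described above.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; assumes an x86 target

// One instruction (pmaddwd): r[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1],
// with signed 16-bit inputs and 32-bit results.
inline __m128i dot_pairs_madd(__m128i a, __m128i b) {
    return _mm_madd_epi16(a, b);
}

// The same result built from separate multiplies and adds.
inline __m128i dot_pairs_emulated(__m128i a, __m128i b) {
    __m128i lo = _mm_mullo_epi16(a, b);       // low 16 bits of each product
    __m128i hi = _mm_mulhi_epi16(a, b);       // high 16 bits of each product
    __m128i p0 = _mm_unpacklo_epi16(lo, hi);  // full 32-bit products, lanes 0..3
    __m128i p1 = _mm_unpackhi_epi16(lo, hi);  // full 32-bit products, lanes 4..7
    // Regroup so that even-indexed and odd-indexed products line up.
    __m128 f0 = _mm_castsi128_ps(p0);
    __m128 f1 = _mm_castsi128_ps(p1);
    __m128i even = _mm_castps_si128(_mm_shuffle_ps(f0, f1, _MM_SHUFFLE(2, 0, 2, 0)));
    __m128i odd  = _mm_castps_si128(_mm_shuffle_ps(f0, f1, _MM_SHUFFLE(3, 1, 3, 1)));
    return _mm_add_epi32(even, odd);
}
```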

Regarding FMA, this is probably easier for compilers, but due to the
difference in accuracy I don't expect compilers to perform this optimization
lightly (i.e. without a specific compiler switch explicitly allowing it). And
a switch, being a global option, may not be suitable for every part of the
application. So having a way to explicitly express the programmer's intention
is useful here too.

I think special operations like FMA, madd, hadd/hsub, avg and min/max should be
provided as functions. Also, it might be helpful to be able to convert packs
to the compiler-specific types, like __m128i, and back, in order to use more
specialized intrinsics that are not available as functions, or to interoperate
with inline assembler.
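Here is a minimal sketch of both suggestions, using a hypothetical `pack_u8x16` type over SSE2. This is not the Boost.SIMD interface. Special operations such as rounded average and minimum become named functions. An explicit conversion to and from `__m128i` lets the user drop down to any intrinsic the wrapper does not cover.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; assumes an x86 target

// Hypothetical pack of sixteen unsigned 8-bit lanes (not the Boost.SIMD API).
class pack_u8x16 {
public:
    explicit pack_u8x16(__m128i v) : v_(v) {}          // from the native type
    explicit operator __m128i() const { return v_; }   // back to the native type

    // Ordinary arithmetic maps to operators (wrapping, like unsigned scalars).
    friend pack_u8x16 operator+(pack_u8x16 a, pack_u8x16 b) {
        return pack_u8x16(_mm_add_epi8(a.v_, b.v_));
    }

private:
    __m128i v_;
};

// Special operations with no C++ operator become named functions.
inline pack_u8x16 average(pack_u8x16 a, pack_u8x16 b) {
    // Rounded average (a + b + 1) >> 1, computed without 8-bit overflow.
    return pack_u8x16(_mm_avg_epu8(static_cast<__m128i>(a), static_cast<__m128i>(b)));
}

inline pack_u8x16 minimum(pack_u8x16 a, pack_u8x16 b) {
    return pack_u8x16(_mm_min_epu8(static_cast<__m128i>(a), static_cast<__m128i>(b)));
}
```

Anything the wrapper does not expose, e.g. `_mm_sad_epu8`, remains reachable through `static_cast<__m128i>(p)`, and the result can be wrapped back with the explicit constructor.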

I also forgot to ask: how do the paper and Boost.SIMD handle overflowing and
saturating integer arithmetic? I assume the operators on packs implement
overflowing (wrap-around) operations, since that's how scalar operations work.
Is it possible to do saturating operations then?
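For reference, SSE2 already provides both flavors for 16-bit lanes. `_mm_add_epi16` wraps modulo 2^16, while `_mm_adds_epi16` clamps to [-32768, 32767]. For scalar signed integers, overflow is undefined behavior in C++. The wrap-around behavior therefore corresponds to the unsigned scalar case. Below is a sketch with hypothetical helper names, assuming an x86 target. C++ has no saturating operator, so a pack library would have to expose the second one as a named function.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; assumes an x86 target

// Hypothetical helper: wrapping add, results taken modulo 2^16 (paddw).
inline __m128i add_wrap_i16(__m128i a, __m128i b) { return _mm_add_epi16(a, b); }

// Hypothetical helper: saturating add, results clamped to [-32768, 32767] (paddsw).
inline __m128i add_sat_i16(__m128i a, __m128i b) { return _mm_adds_epi16(a, b); }
```

With every lane at 32767, adding 1 yields -32768 through `add_wrap_i16` but stays at 32767 through `add_sat_i16`. The unsigned counterpart is `_mm_adds_epu16`.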