On Friday 19 April 2013 01:21:58 Marc Glisse wrote:
On Thu, 18 Apr 2013, Andrey Semashev wrote:
3. It supports division and modulus for integers?
Why not?
Is it supported by any hardware?
At least some special cases are, like division by a power of 2.
I think these special cases are better coded explicitly.
And if the divisor is constant, you can also let the implementation handle turning it into a multiplication.
Does the compiler do that with user-defined operators (which are user-defined in the case of packs)? Or do you mean the implementation of the operator will handle that? The latter means that the division will be very slow, but that's ok, since division is slow even in hardware...
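To make the two special cases above concrete, here is a minimal scalar sketch (not taken from the paper or from Boost.SIMD) of a division by a power of 2 done as a shift, and a division by the constant 10 done as a multiplication followed by a shift. The function names div_pow2 and div10 are made up for illustration. A pack implementation would presumably apply the same arithmetic per lane with a multiply-high style instruction, but that part is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Division by a power of two: for unsigned values this is a plain right shift.
inline std::uint32_t div_pow2(std::uint32_t x, unsigned shift)
{
    assert(shift < 32);  // shifting by the full bit width or more is undefined
    return x >> shift;
}

// Division by the constant 10 without a divide instruction.
// 0xCCCCCCCD == ceil(2^35 / 10), and for every 32-bit unsigned x
// (x * 0xCCCCCCCD) >> 35 equals x / 10 when the product is kept in 64 bits.
inline std::uint32_t div10(std::uint32_t x)
{
    const std::uint64_t product = static_cast<std::uint64_t>(x) * 0xCCCCCCCDull;
    return static_cast<std::uint32_t>(product >> 35);
}
```

Note that the shift form only matches C++ division for unsigned (or non-negative) values; signed division rounds toward zero, so negative inputs need a correction step.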
4. How would advanced operations be implemented, such as FMA and integer madd? Is it through additional library provided functions? IMHO, the availability of these operations is often crucial for performance of the user's algorithm, if it is more complicated than just accumulating integers.
If you only want fma as a fast way to compute a+b*c, you could just let your compiler optimize an addition and a multiplication to fma. They are not bad at that. If you rely on the extra accuracy of fma, then library functions seem necessary.
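The accuracy difference mentioned above can be shown with scalar doubles. The sketch below uses std::fma from <cmath>; the helper names mul_then_add and fused are hypothetical, and the volatile temporary is only there to stop the compiler from contracting the two operations into a single fma on its own.

```cpp
#include <cmath>

// Two roundings: the product is rounded to double, then the sum is rounded.
inline double mul_then_add(double a, double b, double c)
{
    volatile double product = a * b;  // forces the product to be rounded and stored
    return product + c;
}

// One rounding: std::fma computes a*b + c as if exactly, then rounds once.
inline double fused(double a, double b, double c)
{
    return std::fma(a, b, c);
}
```

With a = 1 + 2^-27, b = 1 - 2^-27 and c = -1, the exact product is 1 - 2^-54, which rounds to 1.0 as a double. The unfused version therefore returns 0, while the fused version returns -2^-54. This is the kind of difference that makes a global compiler switch a blunt tool.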
In my experience, compilers are reluctant to pattern-match intrinsics and replace them with other intrinsics (which is a good thing). So if the user's code a*b+c*d is equivalent to two _mm_mullo_epi16/_mm_mulhi_epi16 and _mm_add_epi32, then that's what you'll get in the output instead of a single _mm_madd_epi16. Note also that _mm_madd_epi16 requires a special layout of its operands in xmm register elements, which is also a blocker for the compiler optimization.

Regarding FMA, this is probably easier for compilers, but due to the difference in accuracy I don't expect compilers to perform this optimization lightly (i.e. without a specific compiler switch explicitly allowing it). And a switch, being a global option, may not be suitable in every place of the application. So having a way to explicitly express the programmer's intention is useful here too.

I think special operations like FMA, madd, hadd/hsub, avg, min/max should be provided as functions. Also, it might be helpful to be able to convert packs to the compiler-specific types, like __m128i, and back, to be able to use other more special intrinsics that are not available as functions, or to interoperate with inline assembler.

What I also forgot to ask is how the paper and Boost.SIMD handle overflowing and saturating integer arithmetic? I assume the operators on packs implement overflowing operations, since that's how scalar operations work. Is it possible to do saturating operations then?
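For readers less familiar with these instructions, here is a portable scalar model of two of the operations discussed above: the pairwise multiply-add that _mm_madd_epi16 performs, and the difference between a wrapping and a saturating 8-bit unsigned add (per lane, what _mm_add_epi8 and _mm_adds_epu8 compute). The names madd_model, add_wrapping and add_saturating are hypothetical. This is only a sketch of the semantics, not how the paper or Boost.SIMD exposes them.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of _mm_madd_epi16: adjacent 16-bit elements are multiplied
// pairwise and the two 32-bit products are summed, so 8 inputs give 4 outputs.
// This is the "special layout": the values to be combined must sit next to
// each other in the same register.
inline std::array<std::int32_t, 4> madd_model(const std::array<std::int16_t, 8>& a,
                                              const std::array<std::int16_t, 8>& b)
{
    std::array<std::int32_t, 4> r{};
    for (std::size_t i = 0; i < 4; ++i) {
        // Sum in unsigned arithmetic so the single overflowing case
        // (all four inputs equal to -32768) wraps instead of being undefined.
        const std::uint32_t lo = static_cast<std::uint32_t>(std::int32_t(a[2 * i]) * b[2 * i]);
        const std::uint32_t hi = static_cast<std::uint32_t>(std::int32_t(a[2 * i + 1]) * b[2 * i + 1]);
        r[i] = static_cast<std::int32_t>(lo + hi);
    }
    return r;
}

// Wrapping (modular) add, as the ordinary unsigned operator behaves.
inline std::uint8_t add_wrapping(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a + b);  // a + b is computed as int, then truncated
}

// Saturating add: the result is clamped to the largest representable value.
inline std::uint8_t add_saturating(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(std::min(a + b, 255));
}
```

For example, adding 200 and 100 gives 44 with the wrapping version and 255 with the saturating one. Since operator+ can only mean one of the two, the saturating form would need a named function or a distinct pack type.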