On 19/04/13 06:55, Andrey Semashev wrote:
In my experience, compilers are reluctant to pattern-match intrinsics and replace them with other intrinsics (which is a good thing). So if the user writes a*b + c*d as two _mm_mullo_epi16/_mm_mulhi_epi16 multiplies and an _mm_add_epi32, that's what you'll get in the output instead of a single _mm_madd_epi16. Note also that _mm_madd_epi16 requires a special layout of its operands in the xmm register elements, which is another obstacle to this compiler optimization.
_mm_madd_epi16 is not a vertical operation, so it's a fairly special function, and you can't expect the compiler to recognize the cases where it could use it. _mm_macc_epi16 is the vertical one (XOP only) and much easier on the optimizer. There are fma and correct_fma functions in any case.
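To make the horizontal/vertical distinction concrete, here is a small SSE2 illustration with raw intrinsics (plain C++, not the library interface):

    #include <emmintrin.h>  // SSE2
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        __m128i a = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8);
        __m128i b = _mm_setr_epi16(10, 20, 30, 40, 50, 60, 70, 80);

        // Horizontal: each 16-bit product is widened to 32 bits and
        // *adjacent pairs* are summed, yielding four 32-bit lanes:
        //   r[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1]
        __m128i madd = _mm_madd_epi16(a, b);

        std::int32_t r[4];
        _mm_storeu_si128(reinterpret_cast<__m128i *>(r), madd);
        std::printf("%d %d %d %d\n", r[0], r[1], r[2], r[3]);
        // prints: 50 250 610 1130

        // Vertical: eight independent lanes, each product truncated to
        // 16 bits -- a different shape and different results entirely.
        __m128i vert = _mm_add_epi16(_mm_mullo_epi16(a, b), _mm_set1_epi16(1));
        (void)vert;  // lane i holds lo16(a[i] * b[i]) + 1
        return 0;
    }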
Regarding FMA, this is probably easier for compilers, but due to the difference in accuracy I don't expect compilers to perform this optimization lightly (i.e. without a specific compiler switch explicitly allowing it).
They do. A compiler is allowed to use higher precision for intermediate results whenever it wants; this is also what allows compilers to use 80 bits of precision for operations on float or double.
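The accuracy difference is easy to demonstrate with plain std::fma (standard C++, nothing library-specific): rounding a*b separately loses exactly the term that a fused operation keeps.

    #include <cmath>
    #include <cstdio>

    int main()
    {
        double e = 1.0 / 134217728.0;  // 2^-27, exactly representable
        double a = 1.0 + e;
        double b = 1.0 - e;
        double c = -1.0;

        // a*b = 1 - 2^-54, which rounds to exactly 1.0 in double.
        volatile double prod = a * b;          // force a separate rounding
        double separate = prod + c;            // 0.0
        double fused    = std::fma(a, b, c);   // -2^-54: one final rounding

        std::printf("%g %g\n", separate, fused);  // 0 -5.55112e-17
        return 0;
    }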
I think special operations like FMA, madd, hadd/hsub, avg, min/max should be provided as functions. Also, it might be helpful to be able to convert packs to the compiler-specific types, like __m128i, and back, in order to use more specialized intrinsics that are not available as functions, or to interoperate with inline assembler.
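As a purely hypothetical sketch of such an escape hatch (from_native/to_native are placeholder names, not a confirmed proposal or Boost.SIMD interface):

    #include <emmintrin.h>  // SSE2

    // Placeholder pack type; a real pack<T, N> template would be richer.
    struct pack_u8x16
    {
        __m128i native;  // the compiler-specific register type underneath

        static pack_u8x16 from_native(__m128i v) { return pack_u8x16{v}; }
        __m128i to_native() const { return native; }
    };

    // With such conversions, intrinsics not wrapped as functions remain
    // reachable, e.g. SSE2 sum of absolute differences (the result holds
    // two partial sums in its 64-bit lanes):
    inline pack_u8x16 sad(pack_u8x16 a, pack_u8x16 b)
    {
        return pack_u8x16::from_native(
            _mm_sad_epu8(a.to_native(), b.to_native()));
    }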
What I also forgot to ask is how the paper and Boost.SIMD handle overflowing and saturating integer arithmetic. I assume the operators on packs implement overflowing (wrapping) operations, since that's how scalar operations work. Is it possible to do saturating operations then?
The standard proposal tried to keep things simple; the library itself has quite a few more things.
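At the instruction level both flavours exist side by side, e.g. in SSE2 (raw intrinsics for illustration; this is not the library's spelling of a saturating add):

    #include <emmintrin.h>  // SSE2
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        __m128i a = _mm_set1_epi16(30000);
        __m128i b = _mm_set1_epi16(10000);

        __m128i wrap = _mm_add_epi16(a, b);   // modular: wraps to -25536
        __m128i sat  = _mm_adds_epi16(a, b);  // saturating: clamps to 32767

        std::int16_t w[8], s[8];
        _mm_storeu_si128(reinterpret_cast<__m128i *>(w), wrap);
        _mm_storeu_si128(reinterpret_cast<__m128i *>(s), sat);
        std::printf("wrapping: %d  saturating: %d\n", w[0], s[0]);
        // prints: wrapping: -25536  saturating: 32767
        return 0;
    }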