On Thu, 18 Apr 2013, Andrey Semashev wrote:
1. When a particular algorithm or pack configuration is not supported by the hardware, is the implementation required to emulate it with scalar or partially vectorized operations?
Yes (according to my recollection of reading the paper).
3. It supports division and modulus for integers?
Why not?
Is it supported by any hardware?
At least some special cases are, like division by a power of 2. And if the divisor is constant, you can also let the implementation handle turning it into a multiplication. And the general case might very well be supported in the future, if it isn't already.
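To illustrate the two special cases above, here is a minimal sketch in plain C++. The names pack4, div_pow2 and div3 are made up for this example and are not taken from the proposal; pack4 is just a std::array standing in for a 4-lane pack, and the loops are what an implementation might map to vector shifts and multiplies. It only covers unsigned 32-bit lanes; signed division needs extra rounding fix-ups that are not shown.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 4-lane SIMD pack of unsigned 32-bit integers.
using pack4 = std::array<std::uint32_t, 4>;

// Unsigned division by 2^k is a right shift applied to every lane.
inline pack4 div_pow2(const pack4& x, unsigned k) {
    assert(k < 32);  // shifting by 32 or more is undefined for 32-bit lanes
    pack4 r{};
    for (std::size_t i = 0; i < x.size(); ++i) r[i] = x[i] >> k;
    return r;
}

// Unsigned division by the constant 3: multiply by ceil(2^33 / 3) = 0xAAAAAAAB
// in 64-bit arithmetic and keep the bits above position 33.
inline pack4 div3(const pack4& x) {
    pack4 r{};
    for (std::size_t i = 0; i < x.size(); ++i) {
        const std::uint64_t wide = static_cast<std::uint64_t>(x[i]) * 0xAAAAAAABull;
        r[i] = static_cast<std::uint32_t>(wide >> 33);
    }
    return r;
}
```

Other constant divisors need their own multiplier and shift, which is exactly the bookkeeping one would want the implementation (or the compiler) to do.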
4. How would advanced operations be implemented, such as FMA and integer madd? Is it through additional library provided functions? IMHO, the availability of these operations is often crucial for performance of the user's algorithm, if it is more complicated than just accumulating integers.
If you only want fma as a fast way to compute a+b*c, you could just let your compiler optimize an addition and a multiplication to fma. Compilers are not bad at that. If you rely on the extra accuracy of fma, then library functions seem necessary.

-- 
Marc Glisse
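A minimal scalar sketch of the two uses of fma discussed above. The helper names mul_add and product_error are made up for this example. Whether the compiler actually fuses the first function into one instruction depends on the target and on flags such as -ffp-contract, so that part is an assumption about your toolchain; std::fma from <cmath> is the standard library function that guarantees a single rounding.

```cpp
#include <cassert>
#include <cmath>

// Speed only: written as a plain multiply and add. A compiler may contract
// this into one FMA instruction when the target and flags permit it.
inline double mul_add(double a, double b, double c) {
    return a + b * c;
}

// Accuracy: recover the rounding error of the product a*b. This relies on
// std::fma rounding only once, so it cannot be left to optional contraction.
inline double product_error(double a, double b) {
    const double p = a * b;     // rounded product
    assert(std::isfinite(p));   // the error term is meaningless after overflow
    return std::fma(a, b, -p);  // exact a*b minus p, rounded once
}
```

For example, with a = b = 1 + 2^-27 the exact product is 1 + 2^-26 + 2^-54; the last term is lost when the product is rounded to double, and product_error returns it.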