
On 24/04/13 22:47, dag@cray.com wrote:
Mathias Gaunard
writes: Automatic parallelization will never beat code optimized by experts. Experts program each type of parallelism by taking into account its specificities.
That is hyperbole. "Never" is a strong word.
A compiler can only perform the optimization that it has been engineered to do. A human can study the code and find the best optimizations available for the algorithm at hand. Until compilers become self-aware, they'll never be better than what a human can do.
Scalar predication hasn't changed the way people program because compilers do the if-conversion. As it should be with vectors.
[...]
I have trouble seeing how one would use the SIMD library to make it easier to write predicated vector code. Can you sketch it out?
As you said yourself, if-conversion can be done by the compiler with vectors just as easily as with scalars. The library has an if_else(cond, a, b) function, similar to the ?: ternary operator. You cannot write

    if (cond) { x = foo; y = bar; }

but you can write

    x = if_else(cond, foo, x);
    y = if_else(cond, bar, y);

In the current implementation on MIC, if_else is implemented as a predicated move; the compiler could optimize this further by fusing the predicate with whatever operation computes a or b. On SSE4 it uses a blend instruction. On other SIMD architectures it uses a combination of two or three bitwise instructions.

In the library itself (though not in the proposal) there are also a couple of other functions where an operation is directly masked or predicated, like seladd and selsub, which perform predicated addition/subtraction. There is also a conditional store, because writing to memory is a special case.
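To make the semantics concrete, here is a minimal scalar model of the idiom. This is an illustrative sketch, not the library's actual implementation: `pack` here is just a `std::array` stand-in, and a real backend would lower the per-lane select to blendvps (SSE4), bitwise select, or a predicated move (MIC).

```cpp
#include <array>
#include <cstddef>

// Scalar stand-in for a SIMD pack (illustrative, not the library's type).
template <typename T, std::size_t N>
using pack = std::array<T, N>;

// if_else(cond, a, b): per-lane select, the vector analogue of ?:.
template <typename T, std::size_t N>
pack<T, N> if_else(const pack<bool, N>& cond,
                   const pack<T, N>& a,
                   const pack<T, N>& b)
{
    pack<T, N> r{};
    for (std::size_t i = 0; i < N; ++i)
        r[i] = cond[i] ? a[i] : b[i];  // each lane picks a[i] or b[i]
    return r;
}
```

With this, the branchy scalar code `if (cond) { x = foo; }` becomes the branch-free `x = if_else(cond, foo, x);`, which is exactly the if-conversion a compiler performs on scalars.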
Predication allows much more efficient vectorization of many common idioms. A SIMD library without support for it will miss those idioms, and the compiler auto-vectorizer will get better performance.
Not many. SIMD programming does have its own idioms, though. Interestingly enough, some of them are apparently not always known even by the people designing the hardware!
So the user has to write multiple versions of loop nests, potentially one for each target architecture? I don't see the advantage of this approach.
C++ supports generic and generative programming. You don't actually write multiple versions; you write a single generic one. As an example, there are also simple C++ utilities that you can use to automatically unroll a loop by a factor chosen at compile time. Not strictly SIMD-related, though.
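The compile-time unrolling utility mentioned above can be sketched in a few lines. This is my own illustrative version, not the library's actual utility; it uses a C++17 fold expression over an index sequence to repeat a loop body a fixed number of times.

```cpp
#include <cstddef>
#include <utility>

// Call body(0), body(1), ..., body(Factor - 1), expanded at compile time.
template <std::size_t Factor, typename Body, std::size_t... I>
void unroll_impl(Body body, std::index_sequence<I...>)
{
    (body(I), ...);  // fold expression: one call per index in the pack
}

// Public entry point: the unroll factor is a template parameter,
// so changing it for a different target is a one-token edit.
template <std::size_t Factor, typename Body>
void unroll(Body body)
{
    unroll_impl<Factor>(body, std::make_index_sequence<Factor>{});
}
```

For example, `unroll<4>([&](std::size_t i) { process(base + i); });` emits four calls with no runtime loop; the same generic source serves every unroll factor.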
From my experience, it is still fairly reliable. There are differences in performance, but they're mostly due to differences in how well the hardware handles a particular application domain.
Well yes, that's one of the main issues.
I don't see how it is an issue. Not all hardware has to be equal, and some algorithms will perform better on some types of hardware. Consider GPUs, for example. The fact that they mostly don't have cache means that the algorithms you use for FFT or matrix multiplication are entirely different from those used on a CPU. A compiler has no way of generating the optimal algorithm from the other; that's something that must be done manually. Likewise, an algorithm that relies heavily on division performance will be slower when moved to an architecture without native division; there you could use a different algorithm built on different operations.