On 24/04/13 20:00, dag@cray.com wrote:
> All of the scalar and complex arithmetic using simple binary operators can be easily vectorized if the compiler has knowledge about dependencies. That is why I suggest standardizing keywords, attributes and/or pragmas rather than a specific parallel model provided by a library. The former is more general and gives the compiler more freedom during code generation.
>
> But see, that's exactly the problem. Look at the X1. It has multiple levels of parallelism. So do Intel MIC and GPUs. The compiler has to balance multiple parallel models simultaneously. When you hard-code vector loops you remove some of the compiler's freedom to transform loops and improve parallelism.
Automatic parallelization will never beat code optimized by experts. Experts program each type of parallelism by taking its specificities into account. A one-size-fits-all model for all kinds of parallelism is nice, but limited; using a dedicated tool for each type of parallelism is the right approach for maximum performance. While it could be argued that experts should use the lowest-level API to reach their goals, such libraries can still make experts much more productive. Memory layout is another interesting point in favor of a library: a C++ compiler cannot change the memory layout on its own to make it more vectorization-friendly. By providing the right types and primitives, a library makes users aware of the issues at hand and lets them state explicitly how a given algorithm is to be vectorized.
> For specialized operations like horizontal add, saturating arithmetic, etc. we will need intrinsics or functions that will be necessarily target-dependent.
The proposal suggests providing vectorized variants of all mathematical functions in the C++ standard (the Boost.SIMD library covers C99, TR1 and more). That's quite a lot of functions. Should all of them be made compiler built-ins? That doesn't sound like a very scalable or extensible approach. You'll probably want different algorithms for the SIMD variants of these functions, so having the compiler auto-vectorize the scalar variants doesn't sound like a good idea either.
> Vector masks fundamentally change the model. They drastically affect control flow.
Some processors have had predication at the scalar level for quite some time, and it hasn't drastically changed the way people program. Predication is similar to doing two instructions in one (any instruction can also do a blend for free), and fusing those two instructions into one is something a compiler should be able to do pretty well. It doesn't sound very different from what a compiler must do for VLIW codegen, though I have little knowledge of compilers. The fact that this is a library doesn't mean the compiler shouldn't perform the same optimizations on vector types that it does on scalar ones. While I can see the benefit of this feature for a compiler that wants to generate SIMD from arbitrary code, dedicated SIMD code does not depend on it so heavily that it cannot be covered by a couple of additional functions.
> Longer vectors can also dramatically change the generated code. It is *not* simply a matter of using larger strips for stripmined loops. One often will want to vectorize different loops in a nest based on the hardware's maximum vector length.
I don't see what the problem is here. This is C++: you can write generic code for arbitrary vector lengths. It is up to users to use generative programming techniques to make their code depend on this parameter and stay portable, and the library tries to make that as easy as possible.
> A library-based short vector model like the SIMD library is very non-portable from a performance perspective.
From my experience, performance is still fairly consistent across targets. There are differences, but they're mostly due to how well each piece of hardware handles a particular application domain.