Mathias Gaunard
The proposed SIMD library supports many architectures and has been deployed in a range of software, from academic codes to production systems, with complex and varied usage patterns. It has delivered significant performance gains in cases where optimizing compilers achieved little, even when the loops were specifically written to be optimizer-friendly. I wouldn't call it an inefficient model.
I said *relatively* inefficient. It's the best we have on commodity processors right now, unfortunately. Really, investigate past vector architectures. I would start with the Cray X1 or X2, because I am biased and it's a fairly straightforward RISC-like vector ISA with a lot of features built on decades of vectorization and parallelization experience.

I'm not knocking the SIMD library itself. I can certainly see how it would be a useful bridge between current and future architectures. I just don't think we should standardize something that is going to change rapidly. All of the scalar and complex arithmetic using simple binary operators can be vectorized easily if the compiler has knowledge of the dependencies. That is why I suggest standardizing keywords, attributes, and/or pragmas rather than a specific parallel model provided by a library. The former is more general and gives the compiler more freedom during code generation. For specialized operations like horizontal add, saturating arithmetic, etc., we will need intrinsics or functions that are necessarily target-dependent.
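As a concrete sketch of the annotation approach (using OpenMP's `#pragma omp simd` purely as an existing example of such a directive; a standardized keyword or attribute carrying the same information would do just as well):

```cpp
#include <cstddef>

// The pragma asserts that the iterations are independent. Given that,
// the compiler is free to pick the vector length, strip-mine, unroll,
// or interleave however the target demands; nothing about the vector
// shape is hard-coded in the source.
void saxpy(float a, const float* x, float* y, std::size_t n)
{
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The same source can then be compiled well for a 4-lane SSE target, a 16-lane MIC target, or a classic long-vector machine, without rewriting the loop.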
The library doesn't aim to do every kind of parallelization, just the SIMD part. Other parallelization and optimization tasks still have to be done alongside it.
But see, that's exactly the problem. Look at the X1: it has multiple levels of parallelism. So do Intel MIC and GPUs. The compiler has to balance multiple parallel models simultaneously, and when you hard-code vector loops you remove some of the compiler's freedom to transform loops and improve parallelism.
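To illustrate what gets lost, here is that kind of hand-vectorized loop at a fixed width (raw SSE intrinsics rather than the proposed library's types, but the fixed-width structure is the point):

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// The 4-float strip width, the remainder handling, and the instruction
// selection are all baked into the source. The compiler can no longer
// re-vectorize this at 8 or 16 lanes, interchange it with an outer loop,
// or rebalance it against other levels of parallelism in the machine.
void scale_by_two(float* a, const float* b, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        __m128 v = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_mul_ps(v, _mm_set1_ps(2.0f)));
    }
    for (; i < n; ++i)  // scalar remainder, also hard-coded
        a[i] = 2.0f * b[i];
}
```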
See Intel MIC. This stuff is coming much faster than most people realize. From where I sit (developing compilers professionally for vector architectures), the path is clear, and it is not the current SSE/AVX model.
I wouldn't say that MIC is that different from SSE/AVX. Scatter, predication, conversion on load/store: those are just extras; they don't fundamentally change the model at all.
Vector masks fundamentally change the model; they drastically affect control flow. Longer vectors can also dramatically change the generated code. It is *not* simply a matter of using larger strips for stripmined loops: one will often want to vectorize different loops in a nest depending on the hardware's maximum vector length. A library-based short-vector model like the SIMD library is therefore very non-portable from a performance perspective. It is exactly for this reason that things like OpenACC are rapidly replacing CUDA in production codes.

Libraries are great for a lot of things. General parallel code generation is not one of them.
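To make the mask point concrete, here is a sketch of predicated execution. I'm using AVX-512 intrinsics purely as one example of a masked ISA; a compiler for such a target would generate this from the plain scalar loop, turning the branch into a per-lane mask:

```cpp
#include <cstddef>
#include <immintrin.h>  // AVX-512F, used here only as a concrete masked ISA

// Clamp negative elements to zero. The conditional from the scalar loop
// disappears in the vector body: the comparison produces a mask, and the
// store writes only the lanes where the mask is set. No branch remains.
void clamp_negatives(float* a, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16)
    {
        __m512    v = _mm512_loadu_ps(a + i);
        __mmask16 m = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_LT_OQ);
        _mm512_mask_storeu_ps(a + i, m, _mm512_setzero_ps());
    }
    for (; i < n; ++i)  // scalar remainder
        if (a[i] < 0.0f)
            a[i] = 0.0f;
}
```

Control flow became data. That is not a transformation a fixed-width, unmasked short-vector model expresses naturally.

-David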