Re: [boost] Going forward with Boost.SIMD
A GPU is an accelerator for large regular computations, and it requires sending data over and copying results back. It's also programmed with a very constrained programming model that cannot express all kinds of operations efficiently.
A CPU, on the other hand, is a very flexible processor, and all the memory is already there. You can make it do a lot of complex computations (irregular, sparse or iterative), it can do dynamic scheduling and work stealing, and you have fine-grained control over all components and how they work together.
However, SIMD has been around for 25 years and is still on the roadmap of future processors. Across all that time it has mostly stayed the same.
On the other hand GPU computing is relatively new and is evolving a lot. It's also quite trendy and buzzword-y, and is in reality not as fast and versatile as marketing makes it out to be. A lot of people seem to be intent on standardizing GPU technology rather than SIMD technology; that's quite a shame.
:) You're actually wrong on that, and it's one of the first big surprises anyone who sits on ISO committees experiences: the change in scope of definitions. When you're coming at things from the level of international engineering standards, a computer's CPU is not defined as anything approximating what any of us use on a regular basis. It includes large NUMA clusters, it includes Cray supercomputers all of which don't do SIMD anything like how a PC does. It *also* includes tiny embedded 8-bit CPUs, the kind you find in watches, inlined in wiring, that sort of thing.
Some of those tiny CPUs, believe it or not, do SIMD and have done SIMD for donkey's years, but it's in a very primitive way. Some of those CPUs, for example, work in SIMD 3 x 8 bit = 24-bit or even 3 x 9 bit = 27-bit not 32-bit integers, that sort of thing. Yet international engineering standards must *always* target the conservative majority, and PCs or even CPUs designed more recently than the 1990s are always in a minority in that frame of reference.
Don't get me wrong: you could standardize desktop class SIMD on its own. But generally you need to hear noise complaining about the costs of lack of standardization, and I'm not aware of much regarding SIMD on CPUs (it's different on GPUs where hedge funds and oil/gas frackers regularly battle lack of interop).
Thing is, had Intel decided Larrabee was worth pushing to the mass market - and it was a close thing - PC based SIMD would look completely different now and we wouldn't be using SSE/NEON/AVX, which is really an explicit prefetch opcode set for directly programming the ALUs and bypassing the out of order logic, not true SIMD (these are Intel's own words, not mine). As it is, convergence will simply take longer.
Some of those new ARM dense cluster servers look awfully like Larrabee actually, 1000+ NUMA ARM Cortex A9's in a single rack, and their density appears to be growing exponentially for now. Given all this change going on, I'd still wait and see.
Niall
Niall Douglas
You're actually wrong on that, and it's one of the first big surprises anyone who sits on ISO committees experiences: the change in scope of definitions. When you're coming at things from the level of international engineering standards, a computer's CPU is not defined as anything approximating what any of us use on a regular basis. It includes large NUMA clusters, it includes Cray supercomputers all of which don't do SIMD anything like how a PC does. It *also* includes tiny embedded 8-bit CPUs, the kind you find in watches, inlined in wiring, that sort of thing. Some of those tiny CPUs, believe it or not, do SIMD and have done SIMD for donkey's years, but it's in a very primitive way. Some of those CPUs, for example, work in SIMD 3 x 8 bit = 24-bit or even 3 x 9 bit = 27-bit not 32-bit integers, that sort of thing. Yet international engineering standards must *always* target the conservative majority, and PCs or even CPUs designed more recently than the 1990s are always in a minority in that frame of reference.
Exactly. I urge anyone working on parallelism-related stuff to investigate the many vector and parallel architectures that have been developed over the decades. The proposed SIMD library is a *very* small slice of what's been done and it is a relatively inefficient model at that. It was developed in the 1990's when we had much less die area and couldn't afford to do "real" vector ISAs in microprocessors. The world has changed since then.
Thing is, had Intel decided Larrabee was worth pushing to the mass market - and it was a close thing - PC based SIMD would look completely different now and we wouldn't be using SSE/NEON/AVX
As it is, convergence will simply take longer.
See Intel MIC. This stuff is coming much faster than most people realize. From where I sit (developing compilers professionally for vector architectures), the path is clear and it is not the current SSE/AVX model. -David
On 24/04/13 18:31, dag@cray.com wrote:
Exactly. I urge anyone working on parallelism-related stuff to investigate the many vector and parallel architectures that have been developed over the decades. The proposed SIMD library is a *very* small slice of what's been done and it is a relatively inefficient model at that. It was developed in the 1990's when we had much less die area and couldn't afford to do "real" vector ISAs in microprocessors. The world has changed since then.
The proposed SIMD library supports many architectures and has been deployed in several pieces of software, from academia to production, with complex and varied usage patterns. It has given significant performance gains where optimizing compilers achieved little, even when loops were written specifically to be optimizer-friendly. I wouldn't call it an inefficient model. It doesn't aim to do all sorts of parallelization, just the SIMD part. Other parallelization and optimization tasks must be done in addition to its usage.
See Intel MIC. This stuff is coming much faster than most people realize. From where I sit (developing compilers professionally for vector architectures), the path is clear and it is not the current SSE/AVX model.
I wouldn't say that MIC is that different from SSE/AVX. Scatter, predication, conversion on load/store. That's just extras, it doesn't fundamentally change the model at all.
Mathias Gaunard
The proposed SIMD library supports many architectures and has been deployed in several pieces of software, from academia to production, with complex and varied usage patterns. It has given significant performance gains where optimizing compilers achieved little, even when loops were written specifically to be optimizer-friendly. I wouldn't call it an inefficient model.
I said *relatively* inefficient. It's the best we have on commodity processors right now, unfortunately. Really, investigate past vector architectures. I would start with the Cray X1 or X2 because I am biased and it's a pretty straightforward RISC-like vector ISA. It has a lot of features implemented based on decades of vectorization and parallelization experience.
I'm not knocking the SIMD library itself. I certainly see how it would be a useful bridge between current and future architectures. I just don't think we should standardize something that's going to rapidly change.
All of the scalar and complex arithmetic using simple binary operators can be easily vectorized if the compiler has knowledge about dependencies. That is why I suggest standardizing keywords, attributes and/or pragmas rather than a specific parallel model provided by a library. The former is more general and gives the compiler more freedom during code generation. For specialized operations like horizontal add, saturating arithmetic, etc. we will need intrinsics or functions that will be necessarily target-dependent.
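[For illustration, a minimal sketch of the pragma style dag describes, using OpenMP 4.0's "omp simd" directive purely as a stand-in; what a standardized C++ keyword or attribute would actually look like is an open question. The annotation asserts that iterations are independent, and the compiler stays free to choose vector length, unrolling and instruction selection:]

    // Sketch only: the "omp simd" pragma tells the compiler the loop
    // iterations are independent, without fixing any vector width.
    void saxpy(float a, const float* x, float* y, int n)
    {
        #pragma omp simd
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }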
It doesn't aim to do all sorts of parallelization, just the SIMD part. Other parallelization and optimization tasks must be done in addition to its usage.
But see that's exactly the problem. Look at the X1. It has multiple levels of parallelism. So does Intel MIC and GPUs. The compiler has to balance multiple parallel models simultaneously. When you hard-code vector loops you remove some of the compiler's freedom to transform loops and improve parallelism.
See Intel MIC. This stuff is coming much faster than most people realize. From where I sit (developing compilers professionally for vector architectures), the path is clear and it is not the current SSE/AVX model.
I wouldn't say that MIC is that different from SSE/AVX. Scatter, predication, conversion on load/store. That's just extras, it doesn't fundamentally change the model at all.
Vector masks fundamentally change the model. They drastically affect control flow. Longer vectors can also dramatically change the generated code. It is *not* simply a matter of using larger strips for stripmined loops. One often will want to vectorize different loops in a nest based on the hardware's maximum vector length.
A library-based short vector model like the SIMD library is very non-portable from a performance perspective. It is exactly for this reason that things like OpenACC are rapidly replacing CUDA in production codes. Libraries are great for a lot of things. General parallel code generation is not one of them. -David
On 24/04/13 20:00, dag@cray.com wrote:
All of the scalar and complex arithmetic using simple binary operators can be easily vectorized if the compiler has knowledge about dependencies. That is why I suggest standardizing keywords, attributes and/or pragmas rather than a specific parallel model provided by a library. The former is more general and gives the compiler more freedom during code generation.
But see that's exactly the problem. Look at the X1. It has multiple levels of parallelism. So does Intel MIC and GPUs. The compiler has to balance multiple parallel models simultaneously. When you hard-code vector loops you remove some of the compiler's freedom to transform loops and improve parallelism.
Automatic parallelization will never beat code optimized by experts. Experts program each type of parallelism by taking into account its specificities. A one-size-fits-all model for all kinds of parallelism is nice, but limited; using a dedicated tool for each type of parallelism is the right approach for maximum performance. While it could be argued that experts should use the lowest-level API to reach their goals, such libraries can still make experts much more productive.
An interesting point in favor of a library is also memory layout. A C++ compiler cannot change the memory layout on its own to make it more friendly to vectorize. By providing the right types and primitives, the library makes users aware of the issues at hand and empowers them to state explicitly how a given algorithm is to be vectorized.
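[To make the memory-layout point concrete, here is a plain-C++ sketch of the array-of-structures to structure-of-arrays transformation; the type names are illustrative, not part of the proposal:]

    #include <vector>

    // Array-of-structures: natural to write, but x, y and z are interleaved
    // in memory, so a loop touching only x strides through memory and
    // vectorizes poorly.
    struct particle { float x, y, z; };
    typedef std::vector<particle> particles_aos;

    // Structure-of-arrays: each component is contiguous, mapping directly
    // onto vector loads. A conforming C++ compiler cannot rewrite the layout
    // above into this one on its own; the programmer must choose it.
    struct particles_soa
    {
        std::vector<float> x, y, z;
    };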
For specialized operations like horizontal add, saturating arithmetic, etc. we will need intrinsics or functions that will be necessarily target-dependent.
The proposal suggests providing vectorized variants of all mathematical functions in the C++ standard (the Boost.SIMD library covers C99, TR1 and more). That's quite a lot of functions. Should all these functions be made compiler built-ins? That doesn't sound like a very scalable and extensible approach. You'll probably want to use different algorithms for the SIMD variants of these functions, so having the compiler auto-vectorize the scalar variant doesn't sound like a good idea either.
Vector masks fundamentally change the model. They drastically affect control flow.
Some processors have had predication at the scalar level for quite some time. It hasn't drastically changed the way people program. It is similar to doing two instructions in one (any instruction can also do a blend for free), and optimizing those separate instructions into one is something that a compiler should be able to do pretty well. It doesn't sound very unlike what a compiler must do for VLIW codegen to me, but then I have little knowledge of compilers.
The fact that it is a library doesn't mean that the compiler shouldn't perform on vector types the same optimizations that it does on scalar ones. While I can see the benefit of this feature for a compiler that wants to generate SIMD for arbitrary code, dedicated SIMD code will not depend on it so heavily that it cannot be covered by a couple of additional functions.
Longer vectors can also dramatically change the generated code. It is *not* simply a matter of using larger strips for stripmined loops. One often will want to vectorize different loops in a nest based on the hardware's maximum vector length.
I don't see what the problem is here. This is C++. You can write generic code for arbitrary vector lengths. It is up to the user to use generative programming techniques to make his code depend on this parameter and be portable. The library tries to make this as easy as possible.
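[A sketch of what such width-generic code could look like, in the spirit of the proposal; the names pack, load, store and static_size follow my reading of Boost.SIMD and may not match the final API exactly. For brevity it assumes n is a multiple of the pack width and the pointers are suitably aligned:]

    #include <cstddef>
    #include <boost/simd/sdk/simd/pack.hpp> // header path is illustrative

    // The vector width is a compile-time property of pack<T>, so the same
    // source adapts to SSE, AVX, MIC, etc. without rewriting the loop.
    template <typename T>
    void multiply(const T* a, const T* b, T* out, std::size_t n)
    {
        typedef boost::simd::pack<T> pack_t;
        std::size_t const w = pack_t::static_size; // 4 for SSE floats, 8 for AVX
        for (std::size_t i = 0; i != n; i += w)
        {
            pack_t va = boost::simd::load<pack_t>(&a[i]);
            pack_t vb = boost::simd::load<pack_t>(&b[i]);
            boost::simd::store(va * vb, &out[i]); // one SIMD multiply per pack
        }
    }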
A library-based short vector model like the SIMD library is very non-portable from a performance perspective.
From my experience, it is still fairly reliable. There are differences in performance, but they're mostly due to differences in the hardware capabilities at solving a particular application domain well.
Mathias Gaunard
Automatic parallelization will never beat code optimized by experts. Experts program each type of parallelism by taking into account its specificities.
That is hyperbole. "Never" is a strong word.
An interesting point in favor of a library is also memory layout. A C++ compiler cannot change the memory layout on its own to make it more friendly to vectorize. By providing the right types and primitives, the library makes users aware of the issues at hand and empowers them to state explicitly how a given algorithm is to be vectorized.
I agree that libraries to make data shaping easier are useful!
For specialized operations like horizontal add, saturating arithmetic, etc. we will need intrinsics or functions that will be necessarily target-dependent.
The proposal suggests providing vectorized variants of all mathematical functions in the C++ standard (the Boost.SIMD library covers C99, TR1 and more). That's quite a lot of functions.
But not the special ones I mentioned.
Should all these functions be made compiler built-ins? That doesn't sound like a very scalable and extensible approach.
I dunno, we do a lot of that here.
Vector masks fundamentally change the model. They drastically affect control flow.
Some processors have had predication at the scalar level for quite some time. It hasn't drastically changed the way people program.
Scalar predication hasn't changed the way people program because compilers do the if-conversion. As it should be with vectors.
It is similar to doing two instructions in one (any instruction can also do a blend for free), and optimizing those separate instructions into one is something that a compiler should be able to do pretty well. It doesn't sound very unlike what a compiler must do for VLIW codegen to me, but then I have little knowledge of compilers.
I have trouble seeing how one would use the SIMD library to make it easier to write predicated vector code. Can you sketch it out?
The fact that it is a library doesn't mean that the compiler shouldn't perform on vector types the same optimizations that it does on scalar ones.
Of course it will. But the library user has already made the choice of what to vectorize. Many times it will be the right choice, but not always.
While I can see the benefit of this feature for a compiler that wants to generate SIMD for arbitrary code, dedicated SIMD code will not depend on it so heavily that it cannot be covered by a couple of additional functions.
Predication allows much more efficient vectorization of many common idioms. A SIMD library without support for it will miss those idioms and the compiler auto-vectorizer will get better performance.
Longer vectors can also dramatically change the generated code. It is *not* simply a matter of using larger strips for stripmined loops. One often will want to vectorize different loops in a nest based on the hardware's maximum vector length.
I don't see what the problem is here. This is C++. You can write generic code for arbitrary vector lengths. It is up to the user to use generative programming techniques to make his code depend on this parameter and be portable. The library tries to make this as easy as possible.
So the user has to write multiple versions of loops nests, potentially one for each target architecture? I don't see the advantage of this approach.
A library-based short vector model like the SIMD library is very non-portable from a performance perspective.
From my experience, it is still fairly reliable. There are differences in performance, but they're mostly due to differences in the hardware capabilities at solving a particular application domain well.
Well yes, that's one of the main issues. -David
On 24/04/13 22:47, dag@cray.com wrote:
Mathias Gaunard writes:
Automatic parallelization will never beat code optimized by experts. Experts program each type of parallelism by taking into account its specificities.
That is hyperbole. "Never" is a strong word.
A compiler can only perform the optimization that it has been engineered to do. A human can study the code and find the best optimizations available for the algorithm at hand. Until compilers become self-aware, they'll never be better than what a human can do.
Scalar predication hasn't changed the way people program because compilers do the if-conversion. As it should be with vectors.
[...]
I have trouble seeing how one would use the SIMD library to make it easier to write predicated vector code. Can you sketch it out?
As you said yourself, the if-conversion can be done by the compiler with vectors just as easily as it can be done with scalars.
The library has an if_else(cond, a, b) function (similar to the ?: ternary operator). You cannot write
    if(cond) { x = foo; y = bar; }
but you can write
    x = if_else(cond, foo, x);
    y = if_else(cond, bar, y);
In the current implementation on MIC, if_else is implemented as a predicated move. The compiler could optimize this by fusing the predicate with whatever operation is done to compute a or b. On SSE4 it uses a blend instruction. On other SIMD architectures it uses a combination of two or three bitwise instructions.
In the library itself, but not in the proposal, there are also a couple of other functions where an operation is directly masked or predicated, like seladd and selsub, which perform predicated addition/subtraction. There is also a conditional store, because writing to memory is a special thing.
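[A small sketch of branch-free selection with if_else as described above; pack and if_else are as I understand them from the proposal. Both arms are computed for every lane, then blended by the mask, which is the vector analogue of scalar if-conversion:]

    // Sketch: out lane = v > 0 ? v * 2 : v / 2, with no branch.
    template <typename T>
    boost::simd::pack<T> halve_or_double(boost::simd::pack<T> v)
    {
        typedef boost::simd::pack<T> pack_t;
        // per-lane predicate: true where v is positive
        return boost::simd::if_else(v > pack_t(0), v * pack_t(2), v / pack_t(2));
    }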
Predication allows much more efficient vectorization of many common idioms. A SIMD library without support for it will miss those idioms and the compiler auto-vectorizer will get better performance.
Not many SIMD programming idioms, though. Yes, SIMD programming has its own idioms. Interestingly enough, some of them are apparently not always known by the people designing the hardware!
So the user has to write multiple versions of loops nests, potentially one for each target architecture? I don't see the advantage of this approach.
C++ supports generic and generative programming. You don't actually write multiple versions. You just write one that is generic. As an example, there are also simple C++ utilities that you can use to automatically unroll a loop by a given factor chosen at compile-time. Not strictly SIMD-related though.
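[A hypothetical sketch of the kind of compile-time unrolling utility mentioned above; this is not the library's actual API, and it uses C++14 index sequences for brevity where a 2013-era version would hand-roll the expansion:]

    #include <cstddef>
    #include <initializer_list>
    #include <utility>

    // unroll<N>(f, base) expands to f(base), f(base+1), ..., f(base+N-1)
    // with N fixed at compile time, so the compiler emits straight-line code.
    template <std::size_t N, typename F, std::size_t... Is>
    void unroll_impl(F&& f, std::size_t base, std::index_sequence<Is...>)
    {
        (void)std::initializer_list<int>{ (f(base + Is), 0)... };
    }

    template <std::size_t N, typename F>
    void unroll(F&& f, std::size_t base)
    {
        unroll_impl<N>(std::forward<F>(f), base, std::make_index_sequence<N>{});
    }

    // Usage: process four elements per iteration of the outer loop.
    // for (std::size_t i = 0; i + 4 <= n; i += 4)
    //     unroll<4>([&](std::size_t j) { out[j] = a[j] + b[j]; }, i);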
From my experience, it is still fairly reliable. There are differences in performance, but they're mostly due to differences in the hardware capabilities at solving a particular application domain well.
Well yes, that's one of the main issues.
I don't see how it is an issue. Not all hardware has to be equal. Some algorithms will also perform better on some types of hardware.
Consider GPUs, for example. The fact that they mostly don't have cache means that the algorithms you use for FFT or matrix multiplication are entirely different from those used on a CPU. A compiler would have no way of generating the optimal algorithm from the other; that's something that must be done manually. Likewise, if you write an algorithm that relies a lot on division performance, it will be slower when moved to an architecture without native division, and you could use a different algorithm that uses different operations.
Hi,
On 13:00 Wed 24 Apr, dag@cray.com wrote:
All of the scalar and complex arithmetic using simple binary operators can be easily vectorized if the compiler has knowledge about dependencies. That is why I suggest standardizing keywords, attributes and/or pragmas rather than a specific parallel model provided by a library. The former is more general and gives the compiler more freedom during code generation.
It seems like the auto-parallelizing compiler is constantly just a couple of years away. I know there is progress, but apparently the complexity of today's architectures counteracts this.
But see that's exactly the problem. Look at the X1. It has multiple levels of parallelism. So does Intel MIC and GPUs. The compiler has to balance multiple parallel models simultaneously. When you hard-code vector loops you remove some of the compiler's freedom to transform loops and improve parallelism.
But isn't the current programming model broken? If you let the programmer write loops which the compiler will aim to parallelize, then the programmer will still always think of the iterations as running sequentially, thus creating an "impedance mismatch". Programming models such as Intel's ispc or Nvidia's CUDA fare so well because they exhibit an acceptable amount of parallelism to the user, while simultaneously maintaining some leeway for the compiler.
A library-based short vector model like the SIMD library is very non-portable from a performance perspective. It is exactly for this reason that things like OpenACC are rapidly replacing CUDA in production codes. Libraries are great for a lot of things. General parallel code generation is not one of them.
CUDA is being rapidly replaced by things like OpenACC? Hmm, in my world people are still rubbing their eyes as they slowly realize that this "#pragma omp parallel for" gives them poor speedups, even on quad-core UMA nodes. And seeing how "well" the auto-magical offload mode on MIC works, they are very suspicious of things like OpenACC.
Best
-Andreas
All of the scalar and complex arithmetic using simple binary operators can be easily vectorized if the compiler has knowledge about dependencies. That is why I suggest standardizing keywords, attributes and/or pragmas rather than a specific parallel model provided by a library. The former is more general and gives the compiler more freedom during code generation.
It seems like the auto-parallelizing compiler is constantly just a couple of years away. I know there is progress, but apparently the complexity of today's architectures counteracts this.
Such compilers exist today. gcc and clang are not among them, but they are improving. Compilers exist in the field today that generate CPU/GPU code that outperforms hand-coded CUDA. Compilers exist in the field today that vectorize and parallelize code that outperforms hand-parallelized code.
There will always be cases where hand-tuning will win. The question is whether standardizing a library to help these cases, which exist in a narrow model of parallelism, is a good idea. Hand-tuned scalar code can beat compiler-generated code yet we don't advocate people write in asm all the time. Hand-written vector code is really just a slightly higher form of asm. Even with operator overloading the user still has to explicitly think about strip mining, hardware capabilities and data arrangement.
I would much rather see array syntax notation in standard C++ than a library that provides one restricted form of parallelism. A SIMD library is fine, maybe even great! But not in the standard.
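[For contrast, standard C++ already has one limited form of array notation in std::valarray; a sketch of the style being gestured at here, with the caveat that real implementations rarely vectorize valarray aggressively:]

    #include <valarray>

    // The point is the notation itself: no strip mining or vector width
    // appears in the source, leaving the compiler full freedom in codegen.
    std::valarray<float> saxpy(float a,
                               const std::valarray<float>& x,
                               const std::valarray<float>& y)
    {
        return a * x + y; // element-wise across the whole arrays
    }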
But see that's exactly the problem. Look at the X1. It has multiple levels of parallelism. So does Intel MIC and GPUs. The compiler has to balance multiple parallel models simultaneously. When you hard-code vector loops you remove some of the compiler's freedom to transform loops and improve parallelism.
But isn't the current programming model broken? If you let the programmer write loops which the compiler will aim to parallelize, then the programmer will still always think of the iterations as running sequentially, thus creating an "impedance mismatch".
Or it's providing a level of abstraction convenient for the user.
Programming models such as Intel's ispc or Nvidia's CUDA fare so well because they exhibit an acceptable amount of parallelism to the user, while simultaneously maintaining some leeway for the compiler.
As mentioned before, CUDA is on its way out for many codes. Yes, there are models of parallelism that have proven useful. Co-Array Fortran is one example. These models are generally implemented in languages in a way that provides freedom to the compiler to optimize as it sees fit. Putting too many constraints on the implementation doesn't work well.
A library-based short vector model like the SIMD library is very non-portable from a performance perspective. It is exactly for this reason that things like OpenACC are rapidly replacing CUDA in production codes. Libraries are great for a lot of things. General parallel code generation is not one of them.
CUDA is being rapidly replaced by things like OpenACC? Hmm, in my world people are still rubbing their eyes as they slowly realize that this "#pragma omp parallel for" gives them poor speedups, even on quad-core UMA nodes. And seeing how "well" the auto-magical offload mode on MIC works, they are very suspicious of things like OpenACC.
Knights Corner is not a particularly good implementation of the MIC concept. That has been known for a while. It's a step on a path. I referenced it for the ISA concepts, not the microarchitecture implementation.
CUDA *is* being replaced by OpenACC in our customers' codes. Not overnight, but every month we see more use of OpenACC.
OpenMP has some real deficiencies when it comes to efficient parallelization. That is one reason I didn't mention it in my list of suggestions. Still, it is quite useful for certain types of codes. I'm not advocating for any particular parallel model. I'm advocating for the tools to let the compiler choose the most appropriate model for a given piece of code. -David
On 24/04/13 23:00, dag@cray.com wrote:
Compilers exist in the field today that generate CPU/GPU code that outperforms hand-coded CUDA. Compilers exist in the field today that vectorize and parallelize code that outperforms hand-parallelized code.
That just means that the hand-parallelized code was badly done. Can you beat optimized libraries like CUBLAS or CUFFT? Can you generate an optimized GPU sort from the code of std::sort? I have seen the published results of many different types of auto-parallelization technology. Even when specifically engineered to parallelize specific algorithms, they still don't beat the state-of-the-art optimized implementations, and sometimes are quite far from them.
Hand-tuned scalar code can beat compiler-generated code yet we don't advocate people write in asm all the time.
There is no need to go down to asm to optimize scalar code; you can optimize in C or C++. A simple optimization like scalarization, for example, is not done reliably by today's compilers, and doing it manually can help performance. Likewise, doing register rotation explicitly can also help performance tremendously. Unrolling or pipelining can also be done at the source level, and give performance benefits even on modern out-of-order architectures. It's all a matter of how important a specific piece of code is and how much work it would take to make it faster.
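[An illustrative example of the manual scalarization mentioned above; the example is mine, not from the thread. Compilers often cannot prove that out and data do not alias, so without the local variable they reload and store *out on every iteration:]

    // Hoist the memory reference into a local so it lives in a register.
    void sum_into(float* out, const float* data, int n)
    {
        float acc = *out;            // scalarize: read memory once
        for (int i = 0; i < n; ++i)
            acc += data[i];          // accumulate entirely in a register
        *out = acc;                  // write back once
    }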
CUDA *is* being replaced by OpenACC in our customers' codes. Not overnight, but every month we see more use of OpenACC.
I don't know much about Cray, but I would think that your customers probably do not represent the whole of CUDA users at large.
participants (4)
- Andreas Schäfer
- dag@cray.com
- Mathias Gaunard
- Niall Douglas