Re: [boost] Going forward with Boost.SIMD

25 Apr 2013

On 24/04/13 23:00, dag@cray.com wrote:
...
Compilers exist in the field today that generate CPU/GPU code that
outperforms hand-coded CUDA.  Compilers exist in the field today that
vectorize and parallelize code that outperforms hand-parallelized code.
That just means that the hand-parallelized code was badly done.
Can you beat optimized libraries like CUBLAS or CUFFT ?
Can you generate an optimized GPU sort from the code of std::sort ?

I have seen the published results of many different types of 
auto-parallelization technology. Even when specifically engineered to 
parallelize specific algorithms they still don't beat the state of the 
art optimized implementation, and sometimes are quite far from it.
...
Hand-tuned scalar code can beat compiler-generated code yet we don't
advocate people write in asm all the time.
There is no need to go down to ASM to optimize scalar code, you can 
optimize with C or C++.
A simple optimization like scalarization for example is not done 
reliably by today's compilers, and doing it manually can help performance.
Likewise doing register rotation explicitly can also help performance 
tremendously.
Unrolling or pipelining can also be done at the source level, and give 
performance benefits even on modern out-of-order architectures.
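
As a rough illustration of the source-level techniques mentioned above (this 
is not code from the thread, and the function names are made up for the 
example), here is a reduction written two ways in plain C++. The second 
version keeps the running sum in local scalars instead of going through a 
pointer, which is one reading of "scalarization", and unrolls by four with 
independent accumulators. Whether it is actually faster depends on the 
compiler, flags and target, so treat it as a sketch to benchmark rather than 
a guaranteed win.

```cpp
#include <cstddef>

// Baseline: accumulates through a pointer. If the compiler cannot prove
// that `out` does not alias `x`, it may reload and store *out on every
// iteration instead of keeping the sum in a register.
void sum_naive(const float* x, std::size_t n, float* out) {
    *out = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        *out += x[i];
}

// Manual version: the sum lives in local scalars, and the loop is unrolled
// by 4 with independent accumulators so the additions do not form a single
// dependency chain.
float sum_unrolled(const float* x, std::size_t n) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    float acc = (a0 + a1) + (a2 + a3);
    // Remainder loop for the last n % 4 elements.
    for (; i < n; ++i)
        acc += x[i];
    return acc;
}
```

Note that the unrolled version changes the order of the floating-point 
additions, so its result can differ from the naive one in the last bits. 
That is also why a compiler will not make this transformation on its own 
unless it is told that reassociation is acceptable.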

It's all a matter of how important a specific piece of code is and how 
much work it would take to make it faster.
...
CUDA *is* being replaced by OpenACC in our customers' codes.  Not
overnight, but every month we see more use of OpenACC.
I don't know much about Cray, but I would think that your customers 
probably do not represent the whole of CUDA users at large.

Mathias Gaunard