On 24/04/13 23:00, dag@cray.com wrote:
Compilers exist in the field today that generate CPU/GPU code that outperforms hand-coded CUDA. Compilers exist in the field today that vectorize and parallelize code that outperforms hand-parallelized code.
That just means that the hand-parallelized code was badly done. Can you beat optimized libraries like CUBLAS or CUFFT? Can you generate an optimized GPU sort from the code of std::sort? I have seen the published results of many different types of auto-parallelization technology. Even when engineered to parallelize particular algorithms, they still don't beat the state-of-the-art optimized implementations, and sometimes they are quite far from them.
Hand-tuned scalar code can beat compiler-generated code, yet we don't advocate people write in asm all the time.
There is no need to go down to asm to optimize scalar code; you can optimize in C or C++. A simple optimization like scalarization, for example, is not done reliably by today's compilers, and doing it manually can help performance. Likewise, doing register rotation explicitly can also help performance tremendously. Unrolling or pipelining can also be done at the source level, and gives performance benefits even on modern out-of-order architectures. It's all a matter of how important a specific piece of code is and how much work it would take to make it faster.
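To make the source-level point concrete, here is a minimal sketch of scalarization and manual unrolling applied to a simple reduction. The function names are hypothetical and not taken from any library, and whether either variant is actually faster depends on the compiler, flags, and target, so treat it as an illustration rather than a benchmark.

```cpp
#include <cstddef>

// Baseline: accumulates through the output pointer. Since `out` might alias
// an element of `a`, the compiler may have to load and store *out on every
// iteration instead of keeping the sum in a register.
void sum_naive(const float* a, std::size_t n, float* out) {
    *out = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        *out += a[i];
}

// Scalarized: the running sum lives in a local variable, so it can stay in
// a register and is written back to memory only once.
void sum_scalarized(const float* a, std::size_t n, float* out) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i];
    *out = acc;
}

// Scalarized and unrolled by 4 with independent accumulators, which breaks
// the single dependency chain on the sum. Note: this reorders the
// floating-point additions, so results can differ slightly from the
// versions above for general inputs.
void sum_unrolled(const float* a, std::size_t n, float* out) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i];
        acc1 += a[i + 1];
        acc2 += a[i + 2];
        acc3 += a[i + 3];
    }
    float acc = (acc0 + acc1) + (acc2 + acc3);
    for (; i < n; ++i)  // remainder elements when n is not a multiple of 4
        acc += a[i];
    *out = acc;
}
```

All three functions compute the same sum when `out` does not alias `a`. The unrolled version changes the order of the additions, which is one reason a compiler will not make this transformation on floating-point code unless it is told that reassociation is acceptable.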
CUDA *is* being replaced by OpenACC in our customers' codes. Not overnight, but every month we see more use of OpenACC.
I don't know much about Cray, but I would think that your customers probably do not represent the whole of CUDA users at large.