
On 24/04/13 22:47, dag@cray.com wrote:
Mathias Gaunard
writes: Automatic parallelization will never beat code optimized by experts. Experts program each type of parallelism by taking into account its specificities.
That is hyperbole. "Never" is a strong word.
A compiler can only perform the optimization that it has been engineered to do. A human can study the code and find the best optimizations available for the algorithm at hand. Until compilers become self-aware, they'll never be better than what a human can do.
Scalar predication hasn't changed the way people program because compilers do the if-conversion. As it should be with vectors.
[...]
I have trouble seeing how one would use the SIMD library to make it easier to write predicated vector code. Can you sketch it out?
As you said yourself, if-conversion can be done by the compiler with vectors just as easily as with scalars. The library has an if_else(cond, a, b) function, similar to the ?: ternary operator. You cannot write

    if (cond) { x = foo; y = bar; }

but you can write

    x = if_else(cond, foo, x);
    y = if_else(cond, bar, y);

In the current implementation on MIC, if_else is implemented as a predicated move; the compiler could optimize this further by fusing the predicate with whatever operation computes a or b. On SSE4 it uses a blend instruction. On other SIMD architectures it uses a combination of two or three bitwise instructions.

In the library itself (though not in the proposal) there are also a couple of other functions where an operation is directly masked or predicated, like seladd and selsub, which perform predicated addition/subtraction. There is also a conditional store, because writing to memory is a special case.
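To make the semantics concrete, here is a minimal scalar model of the idiom. This is an illustrative sketch, not the library's actual implementation: `pack` here is just a `std::array` stand-in, and a real backend would lower the per-lane select to blendvps (SSE4), bitwise select, or a predicated move (MIC).

```cpp
#include <array>
#include <cstddef>

// Scalar stand-in for a SIMD pack (illustrative, not the library's type).
template <typename T, std::size_t N>
using pack = std::array<T, N>;

// if_else(cond, a, b): per-lane select, the vector analogue of ?:.
template <typename T, std::size_t N>
pack<T, N> if_else(const pack<bool, N>& cond,
                   const pack<T, N>& a,
                   const pack<T, N>& b)
{
    pack<T, N> r{};
    for (std::size_t i = 0; i < N; ++i)
        r[i] = cond[i] ? a[i] : b[i];  // each lane picks a[i] or b[i]
    return r;
}
```

With this, the branchy scalar code `if (cond) { x = foo; }` becomes the branch-free `x = if_else(cond, foo, x);`, which is exactly the if-conversion a compiler performs on scalars.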
Predication allows much more efficient vectorization of many common idioms. A SIMD library without support for it will miss those idioms, and the compiler auto-vectorizer will get better performance.
Not many. SIMD programming does have its own idioms, though. Interestingly enough, some of them are apparently not always known even by the people designing the hardware!
So the user has to write multiple versions of loop nests, potentially one for each target architecture? I don't see the advantage of this approach.
C++ supports generic and generative programming. You don't actually write multiple versions; you write a single generic one. As an example, there are also simple C++ utilities that you can use to automatically unroll a loop by a factor chosen at compile time. Not strictly SIMD-related, though.
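The compile-time unrolling utility mentioned above can be sketched in a few lines. This is my own illustrative version, not the library's actual utility; it uses a C++17 fold expression over an index sequence to repeat a loop body a fixed number of times.

```cpp
#include <cstddef>
#include <utility>

// Call body(0), body(1), ..., body(Factor - 1), expanded at compile time.
template <std::size_t Factor, typename Body, std::size_t... I>
void unroll_impl(Body body, std::index_sequence<I...>)
{
    (body(I), ...);  // fold expression: one call per index in the pack
}

// Public entry point: the unroll factor is a template parameter,
// so changing it for a different target is a one-token edit.
template <std::size_t Factor, typename Body>
void unroll(Body body)
{
    unroll_impl<Factor>(body, std::make_index_sequence<Factor>{});
}
```

For example, `unroll<4>([&](std::size_t i) { process(base + i); });` emits four calls with no runtime loop; the same generic source serves every unroll factor.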
From my experience, it is still fairly reliable. There are differences in performance, but they're mostly due to differences in how well the hardware handles a particular application domain.
Well yes, that's one of the main issues.
I don't see how it is an issue. Not all hardware has to be equal, and some algorithms will perform better on some types of hardware. Consider GPUs, for example. The fact that they mostly don't have cache means that the algorithms you use for FFT or matrix multiplication are entirely different from those used on a CPU. A compiler has no way of generating the optimal algorithm from the other; that's something that must be done manually. Likewise, an algorithm that relies heavily on division performance will be slower when moved to an architecture without native division; there you could use a different algorithm built on different operations.