On 10/25/2016 01:41 AM, Andreas Schäfer wrote:
On 07:50 Fri 21 Oct, Larry Evans wrote:
I can't imagine how anything could be faster than the soa_emitter_static_t because it uses a tuple of std::array.
I'd guess that the soa_emitter_block_t is only faster by luck (maybe the machine happened to be less busy with other tasks during the soa_emitter_block_t run). I think the reason the different implementation techniques come out so close is that the particle model is memory bound (i.e. it moves a lot of data while each particle update involves relatively few calculations).
The difference becomes larger if you're using only a few particles: then all particles sit in the upper levels of the cache and the CPU doesn't have to wait as much for the data. It would also be worthwhile to try a more complex particle model (e.g. by adding interactions between the particles). With increased computational intensity (floating point operations per byte moved) the delta between the different strategies should grow much larger.
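To make "memory bound" concrete, here is a minimal sketch of a position update with low arithmetic intensity (the field names are illustrative, not the benchmark's actual layout): each particle streams many bytes through the memory system but needs only a handful of floating point operations.

#include <cstddef>
#include <vector>

// SoA position update: per particle this reads 6 floats and writes 3 floats,
// but performs only 6 floating point operations, so for large particle
// counts the memory system, not the ALUs, limits throughput.
void update_positions(std::vector<float>& x, std::vector<float>& y, std::vector<float>& z,
                      const std::vector<float>& vx, const std::vector<float>& vy,
                      const std::vector<float>& vz, float dt)
{
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] += vx[i] * dt;
        y[i] += vy[i] * dt;
        z[i] += vz[i] * dt;
    }
}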
Thanks for the explanation. The latest version of the benchmark, d6ee370606f7f167dedb93e174459c6c7c4d8c19, reports the relative difference of the times: https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L823 So, based on what you say above, I guess that when particle_count: https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L135 increases to the point where the cache overflows, the relative differences between methods should diverge sharply?
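In other words, each method's time is reported relative to a baseline duration, roughly like the following sketch (identifiers are illustrative, not the benchmark's actual code):

#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Print each method's duration divided by a baseline duration (the AoS run
// in the table further down), so values below 1 mean "faster than baseline".
void print_relative_durations(const std::vector<std::pair<std::string, double>>& timings,
                              double baseline_seconds)
{
    for (const auto& t : timings) {
        std::printf("%-10s %f\n", t.first.c_str(), t.second / baseline_seconds);
    }
}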
I've added an implementation of the benchmark based on LibFlatArray's SoA containers and expression templates[1]. While working on the benchmark, I realized that the vector types ("short_vec") in LibFlatArray were lacking some desirable operations (e.g. masked move), so to reproduce my results you'll have to use the trunk from [2]. I'm very happy that you wrote this benchmark because it's a valuable test bed for performance, programmability, and functionality. Thanks!
You're welcome. Much of the credit goes to the OP, as acknowledged, indirectly, here: https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L6
One key contribution is that the LibFlatArray-based kernels will automatically be vectorized without the user having to touch intrinsics (which tie your code to a specific platform). LibFlatArray supports SSE, AVX, AVX512 (not yet available in consumer products), ARM NEON...
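The point about intrinsics tying code to a platform can be illustrated with a small sketch. The wrapper type below is hypothetical (it is not LibFlatArray's actual short_vec API); the idea is that the kernel is written against the wrapper, and the wrapper can be backed by SSE, AVX, or NEON without the kernel changing.

#include <cstddef>

// Hypothetical 4-wide float vector; a library like LibFlatArray provides the
// real thing, backed by whichever SIMD instruction set the target offers.
struct vec4f {
    float v[4];

    static vec4f load(const float* p) {
        return vec4f{{p[0], p[1], p[2], p[3]}};
    }
    void store(float* p) const {
        p[0] = v[0]; p[1] = v[1]; p[2] = v[2]; p[3] = v[3];
    }
    friend vec4f operator+(vec4f a, vec4f b) {
        return vec4f{{a.v[0] + b.v[0], a.v[1] + b.v[1], a.v[2] + b.v[2], a.v[3] + b.v[3]}};
    }
    friend vec4f operator*(vec4f a, vec4f b) {
        return vec4f{{a.v[0] * b.v[0], a.v[1] * b.v[1], a.v[2] * b.v[2], a.v[3] * b.v[3]}};
    }
};

// The kernel never mentions _mm_* intrinsics, so it stays portable.
void integrate(float* x, const float* vx, float dt, std::size_t n)
{
    vec4f d{{dt, dt, dt, dt}};
    for (std::size_t i = 0; i + 4 <= n; i += 4) {
        (vec4f::load(x + i) + vec4f::load(vx + i) * d).store(x + i);
    }
}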
I've re-run the benchmark a couple of times on my Intel Core i7-6700HQ (Skylake quad-core) to get stable results.
Hmmm. I didn't realize you'd have to run the benchmark several times to get stable results. I guess that reflects my ignorance of how benchmarks should be run. Could you explain how running it a couple of times achieves stable results? (On some occasions I've run the benchmark and got completely unexpected results; I suspect some daemon was stealing cycles from the benchmark, leading to the unexpected numbers.)
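One common way to get stable numbers is to repeat each measurement and keep the minimum (or median), so a daemon stealing cycles only inflates the discarded runs. A minimal sketch, not the benchmark's own timing code:

#include <algorithm>
#include <chrono>
#include <limits>

// Run a kernel several times and keep the fastest wall-clock time; outliers
// caused by other processes only affect the discarded repetitions.
template<typename Kernel>
double min_duration_seconds(Kernel kernel, int repetitions = 5)
{
    double best = std::numeric_limits<double>::max();
    for (int i = 0; i < repetitions; ++i) {
        auto start = std::chrono::steady_clock::now();
        kernel();
        std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}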
Interestingly, your SSE code is ~13% faster than the LibFlatArray code for large particle counts.
Actually, the SSE code was the OP's. As intimated above, using the latest version of the benchmark should make this percentage difference more apparent. For example, the output looks like this:

particle_count=1,024
frames=1,000
minimum duration=0.0369697
comparative performance table:

  method      rel_duration
  ________    ____________
  SoA         0.902566
  Flat        0.907562
  Block       0.963046
  AoS         1
  StdArray    1.15868
  LFA         undefined
  SSE         undefined
  SSE_opt     undefined

The above was done with compiler optimization flag -O0. It changes dramatically with -O2 or -O3.
I'll have to take a look at the assembly to figure out why that is.
Oh, I bet that will be fun ;)
(As a library developer, having such a test case is incredibly valuable, so thanks again!) For fewer particles the LibFlatArray kernel is ~31% faster. I assume this delta would increase with a higher computational intensity, as it's using AVX. On an SSE-only CPU the LibFlatArray code might be a little slower than the hand-optimized SSE code.
particle_count=1.000.000
AoS in 9,21448 seconds
SoA in 5,87921 seconds
SoA flat in 5,81664 seconds
SoA Static in 7,10225 seconds
SoA block in 6,16696 seconds
LibFlatArray SoA in 5,31733 seconds
SoA SSE in 4,79973 seconds
SoA SSE opt in 4,70757 seconds
particle_count=1.024
AoS in 6,10074 seconds
SoA in 6,6032 seconds
SoA flat in 6,70765 seconds
SoA Static in 6,74453 seconds
SoA block in 6,54649 seconds
LibFlatArray SoA in 2,10663 seconds
SoA SSE in 3,53452 seconds
SoA SSE opt in 2,76819 seconds
From the above, the LibFlatArray and SSE methods are the fastest. I'd guess that a new "SoA block SSE" method, which uses the _mm_* intrinsics, would narrow the difference. I'll try to figure out how to do that. I notice:
#include
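For what it's worth, a minimal sketch of the kind of blocked SSE update such a "SoA block SSE" method might use (the function and field names are made up for illustration, not the benchmark's code):

#include <cstddef>
#include <xmmintrin.h>  // _mm_* single-precision SSE intrinsics

// Update a block of x positions from a block of x velocities, four floats at
// a time. Assumes 16-byte aligned pointers and n being a multiple of 4.
void update_block_sse(float* x, const float* vx, float dt, std::size_t n)
{
    const __m128 d = _mm_set1_ps(dt);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 pos = _mm_load_ps(x + i);
        __m128 vel = _mm_load_ps(vx + i);
        pos = _mm_add_ps(pos, _mm_mul_ps(vel, d));
        _mm_store_ps(x + i, pos);
    }
}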