Re: [boost] interest in structure of arrays container?

25 Oct 2016

On 07:50 Fri 21 Oct, Larry Evans wrote:
> ...
> I can't imagine how anything could be faster
> than the soa_emitter_static_t because it uses a tuple of
> std::array<T,particle_count>.  I'd guess that the
> soa_emitter_block_t is only faster by luck (maybe during
> the soa_emitter_block_t run, my machine was not as busy on some other
> stuff).

I think the reason why the different implementation techniques are so
close is that the particle model is memory bound (i.e. it's moving a
lot of data while each particle update involves relatively few
calculations).

The difference becomes larger if you're using only a few particles:
then all particles sit in the upper levels of the cache and the CPU
doesn't have to wait as much for the data. It would also be worthwhile
to try a more complex particle model (e.g. by adding interaction
between the particles). With increased computational intensity
(floating point operations per byte moved), the gap between the
different strategies should widen considerably.
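
To make "computational intensity" concrete, here is a rough back-of-the-envelope sketch. The particle layout (three position and three velocity floats), the update rule (pos += vel * dt), and the names intensity_estimate, simple_update and with_interactions are all assumptions for illustration; the actual benchmark kernel may do more or less work per particle.

```cpp
#include <cassert>
#include <cstddef>

// Rough per-particle cost model; all numbers are illustrative assumptions.
struct intensity_estimate {
    std::size_t flops;        // floating point operations per particle update
    std::size_t bytes_moved;  // bytes read plus bytes written per particle update

    double flops_per_byte() const
    {
        assert(bytes_moved > 0);
        return static_cast<double>(flops) / static_cast<double>(bytes_moved);
    }
};

// Assumed simple model: pos[d] += vel[d] * dt for d in {x, y, z}.
// 3 multiplies + 3 adds; reads 3 pos + 3 vel floats, writes 3 pos floats.
inline intensity_estimate simple_update()
{
    return intensity_estimate{6, (6 + 3) * sizeof(float)};
}

// Assumed richer model: each particle additionally interacts with
// `neighbors` others at `flops_per_pair` operations each. The neighbor
// data is assumed to come from cache, so memory traffic is unchanged.
inline intensity_estimate with_interactions(std::size_t neighbors,
                                            std::size_t flops_per_pair)
{
    intensity_estimate e = simple_update();
    e.flops += neighbors * flops_per_pair;
    return e;
}
```

With 4-byte floats, simple_update() comes out at 6 flops per 36 bytes, i.e. well below one flop per byte, which is the regime where the memory system rather than the ALUs sets the pace. Under the stated assumptions, with_interactions(10, 20) raises that to 206 flops per 36 bytes, and that is where differences in vectorization quality would be expected to show.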

I've added an implementation of the benchmark based on LibFlatArray's
SoA containers and expression templates[1]. While working on the
benchmark, I realized that the vector types ("short_vec") in
LibFlatArray were lacking some desirable operations (e.g. masked
move), so to reproduce my results you'll have to use the trunk from
[2]. I'm very happy that you wrote this benchmark because it's a
valuable test bed for performance, programmability, and functionality.
Thanks!
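
For readers unfamiliar with the term: a masked move copies a lane from source to destination only where the corresponding mask bit is set, and leaves the other lanes untouched. Below is a scalar sketch of those semantics in plain C++. The function name masked_move and its signature are hypothetical and are not the short_vec interface from LibFlatArray; see [2] for the real one.

```cpp
#include <array>
#include <cstddef>

// Scalar reference semantics of a masked move (hypothetical helper):
// dst[i] takes the value of src[i] where mask[i] is true and keeps
// its old value everywhere else.
template<typename T, std::size_t N>
void masked_move(std::array<T, N>& dst,
                 const std::array<T, N>& src,
                 const std::array<bool, N>& mask)
{
    for (std::size_t i = 0; i < N; ++i) {
        if (mask[i]) {
            dst[i] = src[i];
        }
    }
}
```

In vectorized code this kind of operation typically replaces a per-element branch, for example when only some particles in a vector need to be overwritten with new values.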

One key benefit is that the LibFlatArray-based kernels are vectorized
automatically, without the user having to touch intrinsics (which
would tie the code to a specific platform). LibFlatArray supports
SSE, AVX, AVX512 (not yet available in consumer products), ARM NEON...
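
The general idea of writing a kernel against a vector type instead of raw intrinsics can be sketched as follows. toy_vec4 and update_positions are made-up names, implemented here with plain scalar loops; this is not LibFlatArray's short_vec, and the update rule is an assumed example. A real library would provide SSE, AVX or NEON implementations behind the same operators, so the kernel source stays unchanged across platforms.

```cpp
#include <cstddef>

// Hypothetical 4-lane vector type with a scalar fallback implementation.
struct toy_vec4 {
    float lane[4];

    static toy_vec4 load(const float* p)
    {
        toy_vec4 v;
        for (int i = 0; i < 4; ++i) v.lane[i] = p[i];
        return v;
    }

    static toy_vec4 broadcast(float x)
    {
        toy_vec4 v;
        for (int i = 0; i < 4; ++i) v.lane[i] = x;
        return v;
    }

    void store(float* p) const
    {
        for (int i = 0; i < 4; ++i) p[i] = lane[i];
    }
};

inline toy_vec4 operator+(const toy_vec4& a, const toy_vec4& b)
{
    toy_vec4 r;
    for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

inline toy_vec4 operator*(const toy_vec4& a, const toy_vec4& b)
{
    toy_vec4 r;
    for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

// Kernel written once against the vector type: pos += vel * dt over
// contiguous SoA arrays, with a scalar loop for the remaining elements.
inline void update_positions(float* pos, const float* vel, float dt,
                             std::size_t n)
{
    const toy_vec4 vdt = toy_vec4::broadcast(dt);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const toy_vec4 p = toy_vec4::load(pos + i);
        const toy_vec4 v = toy_vec4::load(vel + i);
        (p + v * vdt).store(pos + i);
    }
    for (; i < n; ++i) {
        pos[i] += vel[i] * dt;
    }
}
```

The kernel only sees load, store, + and *, so swapping the scalar bodies for platform intrinsics would not require touching update_positions.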

I've re-run the benchmark a couple of times on my Intel Core i7-6700HQ
(Skylake quad-core) to get stable results. Interestingly, your
optimized SSE code ("SoA SSE opt") is ~13% faster than the
LibFlatArray code for large particle counts. I'll have to take a look
at the assembly to figure out why that is. (As a library developer,
having such a test case is incredibly valuable, so thanks again!) For
fewer particles the LibFlatArray kernel is ~31% faster. I assume that
this delta would increase with a higher computational intensity, since
the LibFlatArray kernel uses AVX. On an SSE-only CPU the LibFlatArray
code might be a little slower than the hand-optimized SSE code.

particle_count=1.000.000
AoS in 9,21448 seconds
SoA in 5,87921 seconds
SoA flat in 5,81664 seconds
SoA Static in 7,10225 seconds
SoA block in 6,16696 seconds
LibFlatArray SoA in 5,31733 seconds
SoA SSE in 4,79973 seconds
SoA SSE opt in 4,70757 seconds

particle_count=1.024
AoS in 6,10074 seconds
SoA in 6,6032 seconds
SoA flat in 6,70765 seconds
SoA Static in 6,74453 seconds
SoA block in 6,54649 seconds
LibFlatArray SoA in 2,10663 seconds
SoA SSE in 3,53452 seconds
SoA SSE opt in 2,76819 seconds
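
The percentages quoted above can be recomputed from these timings (the output uses decimal commas, so 5,31733 means 5.31733 seconds). A small sketch; percent_slower is a hypothetical helper name, and the comparison against the "SoA SSE opt" rows is inferred from the fact that those rows reproduce the quoted figures.

```cpp
#include <cassert>

// Percentage by which the run time `slower` exceeds the run time `faster`.
inline double percent_slower(double slower, double faster)
{
    assert(faster > 0.0);
    return (slower / faster - 1.0) * 100.0;
}

// particle_count=1.000.000: LibFlatArray SoA vs. SoA SSE opt
inline double large_case_delta()
{
    return percent_slower(5.31733, 4.70757);
}

// particle_count=1.024: SoA SSE opt vs. LibFlatArray SoA
inline double small_case_delta()
{
    return percent_slower(2.76819, 2.10663);
}
```

large_case_delta() evaluates to roughly 13 and small_case_delta() to roughly 31, matching the ~13% and ~31% mentioned in the text.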

Cheers
-Andreas

[1] https://github.com/gentryx/soa_experiment/blob/master/soa_compare.benchmark....
[2] https://github.com/gentryx/libflatarray

-- 
==========================================================
Andreas Schäfer
HPC and Supercomputing
Institute for Multiscale Simulation
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-20866
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==========================================================
