On 07:50 Fri 21 Oct, Larry Evans wrote:
> I can't imagine how anything could be faster than the
> soa_emitter_static_t because it uses a tuple of std::array. I'd guess
> that the soa_emitter_block_t is only faster by luck (maybe during the
> soa_emitter_block_t run, my machine was not as busy on some other
> stuff).
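For readers who haven't looked at the benchmark source: a minimal sketch of what "a tuple of std::array" means as an SoA layout, next to the AoS equivalent. The two-field particle, the names (particle_aos, soa_storage, update_soa, ...) and the particle count are assumptions made up for illustration; this is not the actual soa_emitter_static_t code.

```cpp
#include <array>
#include <cstddef>
#include <tuple>

// Hypothetical particle count and fields, chosen for illustration only.
constexpr std::size_t N = 1024;

// AoS: all fields of one particle sit next to each other in memory.
struct particle_aos {
    float pos;
    float vel;
};
using aos_storage = std::array<particle_aos, N>;

// SoA: one fixed-size array per field, bundled in a tuple.
// Sizes are known at compile time, so no allocation or indirection is needed.
using soa_storage = std::tuple<std::array<float, N>, std::array<float, N>>;
constexpr std::size_t POS = 0;  // tuple index of the position array
constexpr std::size_t VEL = 1;  // tuple index of the velocity array

void update_aos(aos_storage& particles, float dt)
{
    for (auto& p : particles) {
        p.pos += p.vel * dt;
    }
}

void update_soa(soa_storage& particles, float dt)
{
    auto& pos = std::get<POS>(particles);
    const auto& vel = std::get<VEL>(particles);
    // Both arrays are traversed with unit stride, which is the access
    // pattern compilers and SIMD units handle best.
    for (std::size_t i = 0; i < N; ++i) {
        pos[i] += vel[i] * dt;
    }
}
```

Both update functions do the same arithmetic; only the memory layout differs, which is why the layout alone shouldn't change the result, just the speed.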
I think the reason why the different implementation techniques are so close is that the particle model is memory bound (i.e. it's moving a lot of data while each particle update involves relatively few calculations). The difference becomes larger if you're using only a few particles: then all particles sit in the upper levels of the cache and the CPU doesn't have to wait as much for the data. It would also be worthwhile to try a more complex particle model (e.g. by adding interaction between the particles). With increased computational intensity (floating point operations per byte moved) the gap between the different strategies should widen considerably.

I've added an implementation of the benchmark based on LibFlatArray's SoA containers and expression templates[1]. While working on the benchmark, I realized that the vector types ("short_vec") in LibFlatArray were lacking some desirable operations (e.g. masked move), so to reproduce my results you'll have to use the trunk from [2]. I'm very happy that you wrote this benchmark because it's a valuable test bed for performance, programmability, and functionality. Thanks!

One key contribution is that the LibFlatArray-based kernels will automatically be vectorized without the user having to touch intrinsics (which automatically tie your code to a specific platform). LibFlatArray supports SSE, AVX, AVX512 (not yet available in consumer products), ARM NEON...

I've re-run the benchmark a couple of times on my Intel Core i7-6700HQ (Skylake quad-core) to get stable results. Interestingly, your SSE code is ~13% faster than the LibFlatArray code for large particle counts. I'll have to take a look at the assembly to figure out why that is. (As a library developer, having such a test case is incredibly valuable, so thanks again!) For fewer particles the LibFlatArray kernel is ~31% faster. I assume that this delta would increase with a higher computational intensity as it's using AVX.
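To make "computational intensity" a bit more concrete, here is a rough back-of-the-envelope sketch. The cost figures are assumptions for an imagined three-component update (pos += vel * dt), not numbers taken from the benchmark, and the names (kernel_cost, intensity, simple_update, heavy_update) are hypothetical.

```cpp
// Hypothetical per-particle cost model. All figures below are assumptions
// chosen for illustration, not measurements from the benchmark.
struct kernel_cost {
    double flops;  // floating point operations per particle update
    double bytes;  // bytes read and written per particle update
};

// Computational intensity: floating point operations per byte moved.
constexpr double intensity(kernel_cost c)
{
    return c.flops / c.bytes;
}

// Assumed simple model: pos += vel * dt for 3 float components.
// 3 multiplies + 3 adds; reads pos and vel, writes pos (3 floats each).
constexpr kernel_cost simple_update{6.0, 3.0 * 3.0 * sizeof(float)};

// Assumed heavier model (e.g. with particle interaction): the same memory
// traffic, but 20 times as much arithmetic per particle.
constexpr kernel_cost heavy_update{120.0, 3.0 * 3.0 * sizeof(float)};
```

With 4-byte floats the simple model comes out well below one operation per byte, so the memory system rather than the vector unit sets the pace. The heavier model keeps the same traffic while doing far more arithmetic, which is where wider SIMD would be expected to pay off.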
On an SSE-only CPU the LibFlatArray code might be a little slower than the hand-optimized SSE code.

particle_count=1.000.000
AoS in 9,21448 seconds
SoA in 5,87921 seconds
SoA flat in 5,81664 seconds
SoA Static in 7,10225 seconds
SoA block in 6,16696 seconds
LibFlatArray SoA in 5,31733 seconds
SoA SSE in 4,79973 seconds
SoA SSE opt in 4,70757 seconds

particle_count=1.024
AoS in 6,10074 seconds
SoA in 6,6032 seconds
SoA flat in 6,70765 seconds
SoA Static in 6,74453 seconds
SoA block in 6,54649 seconds
LibFlatArray SoA in 2,10663 seconds
SoA SSE in 3,53452 seconds
SoA SSE opt in 2,76819 seconds

Cheers
-Andreas

[1] https://github.com/gentryx/soa_experiment/blob/master/soa_compare.benchmark....
[2] https://github.com/gentryx/libflatarray

--
==========================================================
Andreas Schäfer
HPC and Supercomputing
Institute for Multiscale Simulation
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-20866
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==========================================================