On 10/25/2016 01:41 AM, Andreas Schäfer wrote:
On 07:50 Fri 21 Oct, Larry Evans wrote:
I can't imagine how anything could be faster than the soa_emitter_static_t because it uses a tuple of std::array.
I'd guess that the soa_emitter_block_t is only faster by luck (maybe the machine happened to be less busy with other tasks during the soa_emitter_block_t run). I think the reason the different implementation techniques come out so close is that the particle model is memory bound (i.e. it moves a lot of data while each particle update involves relatively few calculations).
The difference becomes larger if you're using only a few particles: then all particles sit in the upper levels of the cache and the CPU doesn't have to wait as much for the data. It would also be worthwhile to try a more complex particle model (e.g. by adding interactions between the particles). With increased computational intensity (floating point operations per byte moved) the delta between the different strategies should grow much larger.
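To make "memory bound" concrete, here is a minimal sketch of a position update with low arithmetic intensity (the field names are illustrative, not the benchmark's actual layout): each particle streams many bytes through the memory system but needs only a handful of floating point operations.

#include <cstddef>
#include <vector>

// SoA position update: per particle this reads 6 floats and writes 3 floats,
// but performs only 6 floating point operations, so for large particle
// counts the memory system, not the ALUs, limits throughput.
void update_positions(std::vector<float>& x, std::vector<float>& y, std::vector<float>& z,
                      const std::vector<float>& vx, const std::vector<float>& vy,
                      const std::vector<float>& vz, float dt)
{
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] += vx[i] * dt;
        y[i] += vy[i] * dt;
        z[i] += vz[i] * dt;
    }
}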
Thanks for the explanation. The latest version of the benchmark, d6ee370606f7f167dedb93e174459c6c7c4d8c19, reports the relative difference of the times: https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L823 So, based on what you say above, I guess that when particle_count: https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L135 increases to the point where the cache overflows, the relative differences between methods should diverge sharply?
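In other words, each method's time is reported relative to a baseline duration, roughly like the following sketch (identifiers are illustrative, not the benchmark's actual code):

#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Print each method's duration divided by a baseline duration (the AoS run
// in the table further down), so values below 1 mean "faster than baseline".
void print_relative_durations(const std::vector<std::pair<std::string, double>>& timings,
                              double baseline_seconds)
{
    for (const auto& t : timings) {
        std::printf("%-10s %f\n", t.first.c_str(), t.second / baseline_seconds);
    }
}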
I've added an implementation of the benchmark based on LibFlatArray's SoA containers and expression templates[1]. While working on the benchmark, I realized that the vector types ("short_vec") in LibFlatArray were lacking some desirable operations (e.g. masked move), so to reproduce my results you'll have to use the trunk from [2]. I'm very happy that you wrote this benchmark because it's a valuable test bed for performance, programmability, and functionality. Thanks!
You're welcome. Much of the credit goes to the OP, as acknowledged, indirectly, here: https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L6
One key contribution is that the LibFlatArray-based kernels will automatically be vectorized without the user having to touch intrinsics (which tie your code to a specific platform). LibFlatArray supports SSE, AVX, AVX512 (not yet available in consumer products), ARM NEON...
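The point about intrinsics tying code to a platform can be illustrated with a small sketch. The wrapper type below is hypothetical (it is not LibFlatArray's actual short_vec API); the idea is that the kernel is written against the wrapper, and the wrapper can be backed by SSE, AVX, or NEON without the kernel changing.

#include <cstddef>

// Hypothetical 4-wide float vector; a library like LibFlatArray provides the
// real thing, backed by whichever SIMD instruction set the target offers.
struct vec4f {
    float v[4];

    static vec4f load(const float* p) {
        return vec4f{{p[0], p[1], p[2], p[3]}};
    }
    void store(float* p) const {
        p[0] = v[0]; p[1] = v[1]; p[2] = v[2]; p[3] = v[3];
    }
    friend vec4f operator+(vec4f a, vec4f b) {
        return vec4f{{a.v[0] + b.v[0], a.v[1] + b.v[1], a.v[2] + b.v[2], a.v[3] + b.v[3]}};
    }
    friend vec4f operator*(vec4f a, vec4f b) {
        return vec4f{{a.v[0] * b.v[0], a.v[1] * b.v[1], a.v[2] * b.v[2], a.v[3] * b.v[3]}};
    }
};

// The kernel never mentions _mm_* intrinsics, so it stays portable.
void integrate(float* x, const float* vx, float dt, std::size_t n)
{
    vec4f d{{dt, dt, dt, dt}};
    for (std::size_t i = 0; i + 4 <= n; i += 4) {
        (vec4f::load(x + i) + vec4f::load(vx + i) * d).store(x + i);
    }
}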
I've re-run the benchmark a couple of times on my Intel Core i7-6700HQ (Skylake quad-core) to get stable results.
Hmmm. I didn't realize you'd have to run the benchmark several times to get stable results. I guess that reflects my ignorance of how benchmarks should be run. Could you explain how running it a couple of times achieves stable results? (On some occasions I've run the benchmark and got completely unexpected results; I suspect some daemon was stealing cycles from the benchmark, leading to the unexpected numbers.)
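One common way to get stable numbers is to repeat each measurement and keep the minimum (or median), so a daemon stealing cycles only inflates the discarded runs. A minimal sketch, not the benchmark's own timing code:

#include <algorithm>
#include <chrono>
#include <limits>

// Run a kernel several times and keep the fastest wall-clock time; outliers
// caused by other processes only affect the discarded repetitions.
template<typename Kernel>
double min_duration_seconds(Kernel kernel, int repetitions = 5)
{
    double best = std::numeric_limits<double>::max();
    for (int i = 0; i < repetitions; ++i) {
        auto start = std::chrono::steady_clock::now();
        kernel();
        std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}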
Interestingly, your SSE code is ~13% faster than the LibFlatArray code for large particle counts.
Actually, the SSE code was the OP's. As intimated above, using the latest version of the benchmark should make this percentage difference more apparent. For example, the output looks like this:

particle_count=1,024
frames=1,000
minimum duration=0.0369697
comparative performance table:

  method      rel_duration
  ________    ____________
  SoA         0.902566
  Flat        0.907562
  Block       0.963046
  AoS         1
  StdArray    1.15868
  LFA         undefined
  SSE         undefined
  SSE_opt     undefined

The above was done with compiler optimization flag -O0. It changes dramatically with -O2 or -O3.
I'll have to take a look at the assembly to figure out why that is.
Oh, I bet that will be fun ;)
(As a library developer, having such a test case is incredibly valuable, so thanks again!) For fewer particles the LibFlatArray kernel is ~31% faster. I assume this delta would increase with a higher computational intensity, as it's using AVX. On an SSE-only CPU the LibFlatArray code might be a little slower than the hand-optimized SSE code.
particle_count=1.000.000
AoS in 9,21448 seconds
SoA in 5,87921 seconds
SoA flat in 5,81664 seconds
SoA Static in 7,10225 seconds
SoA block in 6,16696 seconds
LibFlatArray SoA in 5,31733 seconds
SoA SSE in 4,79973 seconds
SoA SSE opt in 4,70757 seconds
particle_count=1.024
AoS in 6,10074 seconds
SoA in 6,6032 seconds
SoA flat in 6,70765 seconds
SoA Static in 6,74453 seconds
SoA block in 6,54649 seconds
LibFlatArray SoA in 2,10663 seconds
SoA SSE in 3,53452 seconds
SoA SSE opt in 2,76819 seconds
From the above, the LibFlatArray and SSE methods are the fastest. I'd guess that a new "SoA block SSE" method, which uses the _mm_* intrinsics, would narrow the difference. I'll try to figure out how to do that. I notice:
#include
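For what it's worth, a minimal sketch of the kind of blocked SSE update such a "SoA block SSE" method might use (the function and field names are made up for illustration, not the benchmark's code):

#include <cstddef>
#include <xmmintrin.h>  // _mm_* single-precision SSE intrinsics

// Update a block of x positions from a block of x velocities, four floats at
// a time. Assumes 16-byte aligned pointers and n being a multiple of 4.
void update_block_sse(float* x, const float* vx, float dt, std::size_t n)
{
    const __m128 d = _mm_set1_ps(dt);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 pos = _mm_load_ps(x + i);
        __m128 vel = _mm_load_ps(vx + i);
        pos = _mm_add_ps(pos, _mm_mul_ps(vel, d));
        _mm_store_ps(x + i, pos);
    }
}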