On 22:13 Tue 25 Oct, Michael Marcin wrote:
On 10/25/2016 12:22 PM, Larry Evans wrote:
Hmmm. I didn't realize you'd have to run the benchmark several times to get stable results. I guess that reflects my ignorance of how benchmarks should be run.
The code was just a quick example hacked up to show the large differences between different techniques.
If you want to compare similar techniques you'll need a more robust benchmark.
It would be easy to convert it to use https://github.com/google/benchmark, which is quite good.
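For instance, the conversion could look roughly like this (a sketch only, not the actual benchmark code; update_positions is a hypothetical stand-in for the particle kernels compared in soa_compare.benchmark.cpp):

    #include <benchmark/benchmark.h>

    #include <cstddef>
    #include <vector>

    // Hypothetical stand-in for one of the particle update kernels.
    static void update_positions(std::vector<float>& x,
                                 const std::vector<float>& vx, float dt)
    {
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += vx[i] * dt;
        }
    }

    static void BM_update(benchmark::State& state)
    {
        const std::size_t n = static_cast<std::size_t>(state.range(0));
        std::vector<float> x(n, 0.0f);
        std::vector<float> vx(n, 1.0f);
        while (state.KeepRunning()) {
            update_positions(x, vx, 0.01f);
            benchmark::ClobberMemory(); // keep the compiler from discarding the work
        }
        state.SetItemsProcessed(state.iterations() * n);
    }
    // Rerun the kernel for a range of particle counts.
    BENCHMARK(BM_update)->Range(1 << 10, 1 << 20);

    BENCHMARK_MAIN();

The framework picks the iteration count for each run automatically, which takes care of the "run it several times" problem mentioned above.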
When doing performance measurements you have to take into account the most common sources of noise:

1. Other processes might eat up CPU time or memory bandwidth.
2. The OS might decide to move your benchmark from one core to another, so you lose all L1+L2 cache entries. (Solution: thread pinning; a minimal sketch follows below.)
3. Thermal conditions and thermal inertia may affect if/when the CPU increases its clock speed. (Solution: either disable turbo mode or run the benchmark long enough to even out the thermal fluctuations.)

AFAIK Google Benchmark doesn't do thread pinning and cannot affect the turbo mode. LIKWID ( https://github.com/RRZE-HPC/likwid ) can be used to set clock frequencies and pin threads, and can read the performance counters of the CPU. It might be a good idea to use Google Benchmark and LIKWID together.
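For reference, pinning can also be done by hand on Linux (a minimal sketch assuming glibc; running the benchmark under taskset -c 0 or LIKWID's likwid-pin achieves the same from the shell):

    #include <pthread.h>
    #include <sched.h>

    #include <cstdio>

    // Pin the calling thread to one core so the scheduler cannot migrate it
    // mid-run and throw away the warmed-up L1/L2 cache contents.
    static bool pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
    }

    int main()
    {
        if (!pin_to_core(0)) {
            std::fprintf(stderr, "could not pin thread to core 0\n");
            return 1;
        }
        // ... run the benchmark loop here ...
        return 0;
    }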
Could you explain how running the benchmark a couple of times achieves stable results? (Actually, on some occasions I've run the benchmark and got completely unexpected results; I suspect some application daemon was stealing cycles from the benchmark.)
Interestingly, your SSE code is ~13% faster than the LibFlatArray code for large particle counts.
Actually, the SSE code was the OP's.
Actually it originates from:
https://software.intel.com/en-us/articles/creating-a-particle-system-with-st...
Ah, thanks for the info.
From the above, the LibFlatArray and SSE methods are the fastest. I'd guess that a new "SoA block SSE" method, which uses the _mm_* intrinsics, would narrow the difference. I'll try to figure out how to do that. I notice:
#include
doesn't produce a compile error; however, that header doesn't declare the _mm_add_ps used here:
https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L621
Do you know of some package I could install on my Ubuntu OS that makes those SSE functions, such as _mm_add_ps, available?
[snip]
If you're using gcc I think the header
The header should not depend on the compiler, but on the CPU model. Or rather: on the vector ISA supported by the CPU: http://stackoverflow.com/questions/11228855/header-files-for-simd-intrinsics

Cheers
-Andreas

--
==========================================================
Andreas Schäfer
HPC and Supercomputing
Institute for Multiscale Simulation
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-20866
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==========================================================

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your signature to help him gain world domination!
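For what it's worth, no separate package is needed on Ubuntu: the intrinsics headers ship with GCC itself. _mm_add_ps is an SSE1 intrinsic declared in <xmmintrin.h>; <immintrin.h> pulls in everything up to AVX, and GCC additionally offers the catch-all <x86intrin.h>. On x86-64, SSE/SSE2 are part of the baseline ABI, so no extra compiler flag is needed either. Here is a minimal sketch of the kind of "SoA block SSE" update Larry proposes (hypothetical names, not the OP's code):

    #include <xmmintrin.h> // SSE1: __m128, _mm_load_ps, _mm_add_ps, ...

    #include <cstddef>

    // Hypothetical SoA-style kernel: x[i] += vx[i] * dt, four floats at a time.
    // Assumes n is a multiple of 4 and both arrays are 16-byte aligned.
    void update_x(float* x, const float* vx, float dt, std::size_t n)
    {
        const __m128 dt4 = _mm_set1_ps(dt);
        for (std::size_t i = 0; i < n; i += 4) {
            __m128 xi = _mm_load_ps(x + i);
            __m128 vi = _mm_load_ps(vx + i);
            xi = _mm_add_ps(xi, _mm_mul_ps(vi, dt4));
            _mm_store_ps(x + i, xi);
        }
    }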