On 10/26/2016 1:33 PM, Chris Glover wrote:
I guess it's some optimisation from way yonder (something modern compilers do routinely, even on a Monday morning!)... but more than probably irrelevant nowadays...
degski
I might be pessimistic, but I never trust the compiler and generally check what's being output. In this case, FWIW, on MSVC2015 the bit-twiddling version generates code that runs about 25% faster than the mod version. I didn't test gcc or clang.
Using google benchmark:
Code:
static void AlignedMod(benchmark::State& state) {
    while (state.KeepRunning()) {
        for (int i = state.range_x(); i < 128; i += state.range_y()) {
            bool aligned = (i % 16) == 0;
            benchmark::DoNotOptimize(aligned);
        }
    }
}
BENCHMARK(AlignedMod)->ArgPair(1, 1);

static void AlignedAnd(benchmark::State& state) {
    while (state.KeepRunning()) {
        for (int i = state.range_x(); i < 128; i += state.range_y()) {
            bool aligned = ((i - 1) & 15) == 0;
            benchmark::DoNotOptimize(aligned);
        }
    }
}
BENCHMARK(AlignedAnd)->ArgPair(1, 1);
Generated code of the inner loop:
Mod version:

    mov   eax,ebx
    and   eax,8000000Fh
    jge   AlignedMod+50h
    dec   eax
    or    eax,0FFFFFFF0h
    inc   eax
    test  eax,eax
    lea   rcx,[aligned]
    sete  byte ptr [aligned]
    call  07FF73B84A180h
    add   ebx,dword ptr [rdi+1Ch]
    cmp   ebx,80h
    jl    AlignedMod+40h
And version:

    lea   eax,[rbx-1]
    test  al,0Fh
    lea   rcx,[aligned]
    sete  byte ptr [aligned]
    call  07FF73B84A180h
    add   ebx,dword ptr [rdi+1Ch]
    cmp   ebx,80h
    jl    AlignedAnd+40h
Result:

    Benchmark           Time      CPU   Iterations
    ----------------------------------------------
    AlignedMod/1/1    204 ns   203 ns      4072727
    AlignedAnd/1/1    153 ns   154 ns      4977778
It's very dangerous to draw conclusions from benchmarks like this. The operation and the elapsed time are far too small for the results to be distinguishable from noise on the system.

AlignedAnd has a stray '- 1' in it, so it's not even checking whether i is 16-byte aligned. And AlignedMod is taking the mod of a signed integer. Make it use an unsigned integer and some very familiar code gets generated:

    for ( unsigned i = state.range_x(); i < 128; i += state.range_y() )
    00007FF7ACBA316C  mov   ebx,dword ptr [rdi+14h]
    00007FF7ACBA316F  cmp   ebx,80h
    00007FF7ACBA3175  jae   AlignedMod+12h (07FF7ACBA3132h)
    00007FF7ACBA3177  nop   word ptr [rax+rax]
    {     bool aligned = (i % 16) == 0;
    00007FF7ACBA3180  test  bl,0Fh
          benchmark::DoNotOptimize( aligned );
    00007FF7ACBA3183  lea   rcx,[aligned]
    {     bool aligned = (i % 16) == 0;
    00007FF7ACBA3188  sete  byte ptr [aligned]
          benchmark::DoNotOptimize( aligned );
    00007FF7ACBA318D  call  benchmark::internal::UseCharPointer (07FF7ACBA1019h)
    {     for ( unsigned i = state.range_x(); i < 128; i += state.range_y() )
    00007FF7ACBA3192  add   ebx,dword ptr [rdi+1Ch]
    00007FF7ACBA3195  cmp   ebx,80h
    00007FF7ACBA319B  jb    AlignedMod+60h (07FF7ACBA3180h)
    } }
    00007FF7ACBA319D  jmp   AlignedMod+12h (07FF7ACBA3132h)
    }
    00007FF7ACBA319F  mov   rbx,qword ptr [rsp+38h]
    00007FF7ACBA31A4  mov   rsi,qword ptr [rsp+40h]
    00007FF7ACBA31A9  add   rsp,20h
    00007FF7ACBA31AD  pop   rdi
    00007FF7ACBA31AE  ret

I don't agree that it is more obscure. I think it's perfectly natural to understand that the bottom four bits must be zero in order to be a multiple of 16. I recommend the book Hacker's Delight to everyone; it will make you truly appreciate how wonderful integers are.