11 Jul
2017
1:49 p.m.
Phil Endecott wrote:
That hurts 32-bit ARM.
I think that's an issue with whatever compiler you're using, not the architecture; I've just done a quick test with arm-linux-gnueabihf-g++-6 6.3.0 and I get about a 5% speedup by using memcpy.
No, it's an issue with ARM32 not allowing unaligned loads. The memcpy code, at best, uses four single-byte loads instead of one 32-bit load. __builtin_assume_aligned doesn't help. https://godbolt.org/g/iC9X35 I suspect that all architectures that don't have unaligned loads will suffer similarly.

Myself, I'd go with the reinterpret_cast for the time being. It's indeed _technically UB_, but it works.
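
For readers following along, here is a minimal sketch of the two approaches being compared. The code under discussion isn't quoted in this message, so the function names load32_memcpy and load32_cast are hypothetical, and the 32-bit load from a byte buffer is an assumed stand-in for it.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: portable 32-bit load via memcpy.
// Well-defined for any alignment of p. On a target without unaligned
// loads the compiler may have to emit byte loads, as described above.
inline std::uint32_t load32_memcpy(const unsigned char* p)
{
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Hypothetical helper: 32-bit load via reinterpret_cast.
// Technically undefined behaviour (it violates the aliasing rules, and
// also the alignment requirement if p is not suitably aligned), but it
// lets the compiler emit a single word load.
inline std::uint32_t load32_cast(const unsigned char* p)
{
    return *reinterpret_cast<const std::uint32_t*>(p);
}
```

Both return the value in native byte order. The cast version should only ever be handed a pointer that is known to be 4-byte aligned; the memcpy version has no such precondition.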