On 11/07/2017 14:49, Peter Dimov via Boost wrote:
Phil Endecott wrote:
That hurts 32-bit ARM.
I think that's an issue with whatever compiler you're using, not the architecture; I've just done a quick test with arm-linux-gnueabihf-g++-6 6.3.0 and I get about a 5% speedup by using memcpy.
No, it's an issue with ARM32 not allowing unaligned loads. The memcpy code, at best, uses four single-byte loads instead of one 32-bit load. __builtin_assume_aligned doesn't help.
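For readers following along, the two variants under discussion can be sketched roughly as below. This is a minimal illustration, not the code that was benchmarked in the thread: the function names load32_memcpy, load32_assume_aligned and check_loads_agree are hypothetical, and the sketch assumes GCC or Clang for __builtin_assume_aligned. What instructions a given compiler emits for either function depends on the target and flags, so no particular code generation is implied.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: read 32 bits from a pointer of unknown alignment.
// memcpy is the portable way to express this; on a target that forbids
// unaligned loads the compiler may have to lower it to several byte loads.
inline std::uint32_t load32_memcpy(const void* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Hypothetical helper: same read, but with a promise that p is 4-byte
// aligned. Passing a misaligned pointer here is undefined behaviour.
inline std::uint32_t load32_assume_aligned(const void* p) {
#if defined(__GNUC__) || defined(__clang__)
    p = __builtin_assume_aligned(p, 4);  // GCC/Clang builtin; a hint only
#endif
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Sanity check: both helpers return the same value for an aligned buffer.
inline void check_loads_agree() {
    alignas(4) unsigned char buf[4] = {0x11, 0x22, 0x33, 0x44};
    assert(load32_memcpy(buf) == load32_assume_aligned(buf));
}
```

Comparing the generated assembly for both functions (for example with -O2 -S on the 32-bit ARM toolchain in question) is the direct way to see whether the alignment hint changes anything on a particular compiler.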
Be aware that some recent ARM CPUs no longer penalise unaligned loads. As far as I vaguely remember, the Cortex A15 mostly does not, with some exceptions; the Cortex A57 definitely does not. This may explain the confounding results: a Cortex A9 most definitely punishes unaligned loads badly, as do all the lower-end ARM CPUs, even very new ones.

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/