On 18.06.20 at 13:10, Phil Endecott via Boost wrote:
Alexander Grund wrote:
I've seen other SIMD UTF-8 conversions around, and they basically all focus on converting as much ASCII as possible and fall back to one-by-one decoding once a non-ASCII byte is found.
The question is, do they do that because they've determined that it gives the best performance (for some benchmark input), or have they not tried to do more with the SIMD code?
I guess the former, which would also be my intuition. It is easy to detect the first byte of a multi-byte UTF-8 sequence in SIMD, and also easy to bulk-convert single-byte UTF-8 sequences. Once you get to converting a multi-byte sequence, SIMD doesn't make sense anymore. There is too much checking to do: how many bytes to "squash", end of input, shortest form, legal value, ... In summary: once it requires branching, it doesn't make sense to use SIMD anymore.
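To make the split concrete, here is a rough sketch of the pattern described above: a bulk check for all-ASCII blocks, with a branchy scalar fallback for everything else. This is not taken from any of the libraries under discussion. The names utf8_to_utf32 and decode_multibyte are hypothetical, the SSE2 path is only compiled where __SSE2__ is defined, and the widening of each ASCII block is left as a plain loop for brevity.

```cpp
#include <cstddef>
#include <string>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: decode one multi-byte sequence starting at s[i].
// Returns false on malformed input; otherwise stores the code point in
// `out` and advances i past the sequence. Note how many branches it needs.
inline bool decode_multibyte(const std::string& s, std::size_t& i, char32_t& out)
{
    const auto b0 = static_cast<unsigned char>(s[i]);
    std::size_t len;
    char32_t cp;
    char32_t min_cp;  // smallest code point that legitimately needs `len` bytes
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return false;                       // stray continuation or invalid lead byte
    if (s.size() - i < len) return false;    // end of input inside a sequence
    for (std::size_t k = 1; k < len; ++k) {
        const auto b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);           // "squash" six payload bits
    }
    if (cp < min_cp) return false;                            // overlong (not shortest form)
    if (cp > 0x10FFFF) return false;                          // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;           // surrogate, not a legal value
    out = cp;
    i += len;
    return true;
}

// Hypothetical converter: bulk ASCII path plus one-by-one fallback.
inline bool utf8_to_utf32(const std::string& in, std::u32string& out)
{
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
#if defined(__SSE2__)
        // Fast path: 16 bytes at a time while no byte has its high bit set.
        while (in.size() - i >= 16) {
            const __m128i v = _mm_loadu_si128(
                reinterpret_cast<const __m128i*>(in.data() + i));
            if (_mm_movemask_epi8(v) != 0) break;  // block contains non-ASCII
            for (std::size_t k = 0; k < 16; ++k)
                out.push_back(static_cast<unsigned char>(in[i + k]));
            i += 16;
        }
        if (i >= in.size()) break;
#endif
        // Slow path: one code point at a time.
        const auto b = static_cast<unsigned char>(in[i]);
        if (b < 0x80) { out.push_back(b); ++i; continue; }
        char32_t cp;
        if (!decode_multibyte(in, i, cp)) return false;
        out.push_back(cp);
    }
    return true;
}
```

The fast path is a single load and mask test per 16 bytes, whereas decode_multibyte needs a separate branch for each of the checks listed above. That is the part that does not map well onto SIMD lanes.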