Re: [boost] [text] SIMD UTF-8 decoding

18 Jun 2020


On 18.06.20 at 13:10, Phil Endecott via Boost wrote:
...
Alexander Grund wrote:
...
I've seen other SIMD UTF-8 conversions around and they basically all 
focus on converting as much ASCII as possible and fall back to 
one-by-one decoding once a non-ASCII byte is found.
The question is, do they do that because they've determined that
that gives the best performance (for some benchmark input), or
have they not tried to do more with the SIMD code?
I guess the former, which would also be my intuition. It is easy to detect the 
first byte of a multi-byte UTF-8 sequence in SIMD and also easy to bulk 
convert single-byte UTF-8 sequences. Once you get to converting the 
multi-byte sequence, SIMD doesn't make sense anymore. Too much 
checking to do: how many bytes to "squash", end of input, shortest 
form (no overlong encodings), legal value, ...
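
To make the list of checks concrete, here is a rough scalar sketch of 
decoding a single code point. The names Decoded and decode_one are made 
up for this illustration and are not taken from any library discussed in 
this thread. Each early return is one of the data-dependent branches 
mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Result of decoding one code point. `length` is the number of bytes
// consumed, or 0 if the input is truncated or ill-formed.
struct Decoded {
    char32_t code_point;
    std::size_t length;
};

// Hypothetical helper: decode the UTF-8 sequence starting at p[0],
// with n bytes available.
inline Decoded decode_one(const unsigned char* p, std::size_t n) {
    assert(p != nullptr);
    if (n == 0) return {0, 0};
    const unsigned char b0 = p[0];
    if (b0 < 0x80) return {b0, 1};  // single-byte (ASCII)

    // How many bytes to "squash", and the smallest value that
    // legitimately needs that many bytes.
    std::size_t len;
    char32_t cp, min_cp;
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return {0, 0};  // stray continuation byte or invalid lead byte

    if (n < len) return {0, 0};  // end of input in mid-sequence
    for (std::size_t i = 1; i < len; ++i) {
        if ((p[i] & 0xC0) != 0x80) return {0, 0};  // not a continuation byte
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min_cp) return {0, 0};                    // overlong encoding
    if (cp > 0x10FFFF) return {0, 0};                  // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return {0, 0};   // surrogate
    return {cp, len};
}
```

For example, the two bytes C3 A9 decode to U+00E9 with length 2, while 
the overlong pair C0 80 is rejected with length 0.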
So in summary: once it requires branching, it doesn't make sense to use SIMD 
anymore.
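
For illustration only, this is roughly what the branch-free ASCII part 
can look like with SSE2. The function name convert_ascii_prefix is made 
up for this sketch, and I am assuming a 32-bit char32_t and UTF-32 
output. It stops at the first non-ASCII byte so the caller can fall back 
to one-by-one decoding from there. Without SSE2 it degrades to the plain 
scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: widen the leading run of ASCII bytes of `in` into
// `out` (which must have room for n elements) and return how many bytes
// were converted. The caller handles the rest, starting at in[result].
inline std::size_t convert_ascii_prefix(const unsigned char* in,
                                        std::size_t n, char32_t* out) {
    assert(in != nullptr && out != nullptr);
    static_assert(sizeof(char32_t) == 4, "sketch assumes 32-bit char32_t");
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    while (n - i >= 16) {
        const __m128i v =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + i));
        // movemask gathers the top bit of every byte; any set bit means
        // there is a non-ASCII byte somewhere in this block.
        if (_mm_movemask_epi8(v) != 0) break;
        // Zero-extend 16 x 8-bit to 16 x 32-bit in two unpack steps.
        const __m128i lo16 = _mm_unpacklo_epi8(v, zero);
        const __m128i hi16 = _mm_unpackhi_epi8(v, zero);
        __m128i* dst = reinterpret_cast<__m128i*>(out + i);
        _mm_storeu_si128(dst + 0, _mm_unpacklo_epi16(lo16, zero));
        _mm_storeu_si128(dst + 1, _mm_unpackhi_epi16(lo16, zero));
        _mm_storeu_si128(dst + 2, _mm_unpacklo_epi16(hi16, zero));
        _mm_storeu_si128(dst + 3, _mm_unpackhi_epi16(hi16, zero));
        i += 16;
    }
#endif
    // Scalar tail: finishes a short remainder, or walks up to the first
    // non-ASCII byte inside the block that the SIMD loop bailed out on.
    while (i < n && in[i] < 0x80) {
        out[i] = in[i];
        ++i;
    }
    return i;
}
```

The only branch in the SIMD loop is the "is everything ASCII?" test per 
16 bytes, which is why this part vectorizes well and the multi-byte part 
does not.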