[boost] [text] SIMD UTF-8 decoding

15 Jun 2020

Dear All,

I have been looking at the UTF-8 decoding code in the proposed
Boost.Text, as this is a problem I've looked at myself in the past.
I've mentioned an issue with the copyright in another message.
Here are my other observations.

1. The SIMD code is x86-specific.  It doesn't need to be; I think
it could use GCC's vector extensions to do the same thing and be
portable to other SIMD instruction sets.  (Clang provides the same
extensions; I'm not sure what is needed on MSVC/Windows.)
See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

2. The SIMD code only seems to provide a fast path for bytes < 0x80,
falling back to sequential code for everything else.  I guess I was
expecting something more sophisticated.

3. The code used for bytes >= 0x80, and in all cases on non-x86
platforms, is here:
https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_ite...
around lines 400-560.  It implements a state machine, which surprises
me; writing out the bit-testing and shifting explicitly takes much
less code and gives better performance.  The state machine seems to
be about 50% slower than my existing UTF-8 decoding code.

4. There aren't enough comments anywhere in the code I've looked at!
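
As a rough illustration of points 1 to 3, the following is a minimal
sketch.  It is not taken from Boost.Text or from any existing decoder.
It assumes GCC or Clang (for the vector_size attribute), and the names
byte_vec, all_ascii_16, decode_one and decode_utf8 are hypothetical.
It shows a 16-byte ASCII check written with the vector extensions
rather than x86 intrinsics, plus a scalar decoder that tests and
shifts bits explicitly instead of running a state machine.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// GCC/Clang vector extension: 16 unsigned bytes handled as one value.
typedef unsigned char byte_vec __attribute__((vector_size(16)));

// Hypothetical helper: true if all 16 bytes starting at p are < 0x80.
inline bool all_ascii_16(const unsigned char* p)
{
    const byte_vec high_bit = {0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
                               0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80};
    byte_vec v;
    std::memcpy(&v, p, sizeof v);            // unaligned load
    const byte_vec masked = v & high_bit;    // element-wise AND
    std::uint64_t halves[2];
    std::memcpy(halves, &masked, sizeof halves);
    return (halves[0] | halves[1]) == 0;     // no byte had its top bit set
}

// Decodes one code point from [p, end) with explicit bit tests.
// Returns the number of bytes consumed, or 0 for a malformed sequence.
inline std::size_t decode_one(const unsigned char* p, const unsigned char* end,
                              char32_t& out)
{
    if (p == end) return 0;
    auto cont = [](unsigned char b) { return (b & 0xC0) == 0x80; };
    const unsigned char b0 = p[0];
    if (b0 < 0x80) { out = b0; return 1; }
    if ((b0 & 0xE0) == 0xC0) {               // 110xxxxx: two bytes
        if (end - p < 2 || !cont(p[1])) return 0;
        out = (char32_t(b0 & 0x1F) << 6) | char32_t(p[1] & 0x3F);
        return out >= 0x80 ? 2 : 0;          // reject overlong forms
    }
    if ((b0 & 0xF0) == 0xE0) {               // 1110xxxx: three bytes
        if (end - p < 3 || !cont(p[1]) || !cont(p[2])) return 0;
        out = (char32_t(b0 & 0x0F) << 12) | (char32_t(p[1] & 0x3F) << 6)
            | char32_t(p[2] & 0x3F);
        if (out < 0x800) return 0;                    // overlong
        if (out >= 0xD800 && out <= 0xDFFF) return 0; // surrogate
        return 3;
    }
    if ((b0 & 0xF8) == 0xF0) {               // 11110xxx: four bytes
        if (end - p < 4 || !cont(p[1]) || !cont(p[2]) || !cont(p[3])) return 0;
        out = (char32_t(b0 & 0x07) << 18) | (char32_t(p[1] & 0x3F) << 12)
            | (char32_t(p[2] & 0x3F) << 6) | char32_t(p[3] & 0x3F);
        return (out >= 0x10000 && out <= 0x10FFFF) ? 4 : 0;
    }
    return 0;                                // stray continuation or invalid lead
}

// Decodes a whole buffer, taking the 16-byte ASCII path where possible.
// Returns false on the first malformed sequence.
inline bool decode_utf8(const std::string& in, std::u32string& out)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(in.data());
    const unsigned char* end = p + in.size();
    while (p != end) {
        if (end - p >= 16 && all_ascii_16(p)) {
            for (int i = 0; i < 16; ++i) out.push_back(p[i]);
            p += 16;
            continue;
        }
        char32_t cp;
        const std::size_t n = decode_one(p, end, cp);
        if (n == 0) return false;
        out.push_back(cp);
        p += n;
    }
    return true;
}
```

This sketch re-tests the 16-byte window after every non-ASCII code
point, so it only helps on mostly-ASCII input; a more sophisticated
design would classify the non-ASCII bytes in the vector as well.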

Regards, Phil.

Phil Endecott