Re: [boost] [text] SIMD UTF-8 decoding

16 Jun 2020


On Mon, Jun 15, 2020 at 2:06 PM Phil Endecott via Boost
<boost@lists.boost.org> wrote:
...
Dear All,
I have been looking at the UTF-8 decoding code in the proposed
Boost.Text, as this is a problem I've looked at myself in the past.
I've mentioned an issue with the copyright in another message.
Here are my other observations.
1. The SIMD code is x86-specific.  It doesn't need to be; I think
it could use gcc's vector builtins to do the same thing and be
portable to other SIMD implementations.  (Clang provides the same
builtins; I'm not sure about what you need to do on MSVC/Windows.)
See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
That page describes vector-friendly data types and arithmetic
operations.  It does not seem to support the operations actually used
in the code currently in Boost.Text.
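
To make the trade-off concrete, here is a minimal sketch of a 16-byte
"all ASCII?" test written with the GCC/Clang vector extensions. It is
not code from Boost.Text; the names u8x16 and all_ascii_16 are made up
for illustration. The element-wise AND is portable across targets, but
the extensions offer no movemask-style reduction, so this sketch folds
the result through two 64-bit halves instead. It will not compile on
MSVC.

```cpp
#include <cstdint>
#include <cstring>

// GCC/Clang vector extension type: 16 unsigned bytes (not available on MSVC).
typedef unsigned char u8x16 __attribute__((vector_size(16)));

// Hypothetical helper: true if all 16 bytes starting at p are < 0x80.
inline bool all_ascii_16(const unsigned char* p)
{
    u8x16 v;
    std::memcpy(&v, p, sizeof v);  // unaligned load

    const u8x16 high_bit = {0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
                            0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80};
    const u8x16 masked = v & high_bit;  // element-wise AND

    // No movemask equivalent in the extensions: reduce by OR-ing the two
    // 64-bit halves of the vector and testing for zero.
    std::uint64_t halves[2];
    std::memcpy(halves, &masked, sizeof halves);
    return (halves[0] | halves[1]) == 0;
}
```

Whether a compiler turns that reduction into something as cheap as the
x86 movemask instruction is target-dependent and would need measuring.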
...
2. The SIMD code only seems to provide a fast path for bytes < 0x80,
falling back to sequential code for everything else.  I guess I was
expecting something more sophisticated.
The code makes the fast path extra fast, but the slow path, being
quite branchy, is not really amenable to vectorization.  If you have
an implementation that proves that claim false, I'm happy to use it.
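
The shape being described (a wide test for bytes < 0x80, then a
byte-at-a-time fallback) can be sketched without any x86 intrinsics.
This is an illustration only, not the Boost.Text implementation: it
tests 8 bytes per step with a 64-bit word rather than a SIMD register,
and ascii_prefix_length is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: number of leading bytes in [p, p + n) that are < 0x80.
inline std::size_t ascii_prefix_length(const unsigned char* p, std::size_t n)
{
    std::size_t i = 0;

    // Fast path: examine 8 bytes at a time; any set high bit ends the run.
    while (n - i >= 8) {
        std::uint64_t chunk;
        std::memcpy(&chunk, p + i, sizeof chunk);
        if (chunk & 0x8080808080808080ull)
            break;
        i += 8;
    }

    // Slow path: finish one byte at a time, which also locates the exact
    // position of the first non-ASCII byte inside the failing chunk.
    while (i < n && p[i] < 0x80)
        ++i;
    return i;
}
```

A caller would copy or widen the first ascii_prefix_length bytes
directly and hand the remainder to the sequential decoder.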
...
3. The code used for bytes >= 0x80, and in all cases for non-x86,
is here:
https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_ite...
around lines 400-560.  It implements a state machine, which surprises
me; it takes much less code and gives better performance if you write
out the bit-testing and shifting etc. explicitly.  This seems to be
about 50% slower than my existing UTF-8 decoding code.
Could you point me to that code, and let me use your benchmarks to
verify?  I'm happy to do something faster!
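
For readers who have not seen the "explicit bit-testing and shifting"
style, a minimal sketch follows. It is neither Phil's code nor the
Boost.Text state machine, and decode_one is a hypothetical name. It
rejects overlong forms, surrogates, and values above U+10FFFF, but its
error recovery is simplified: it returns one U+FFFD per bad sequence and
does not follow the Unicode "maximal subpart" replacement practice.

```cpp
#include <cstdint>

// Hypothetical helper: decodes one code point from [p, end), advances p
// past the bytes consumed, and returns U+FFFD for ill-formed input.
// Precondition: p != end.
inline char32_t decode_one(const unsigned char*& p, const unsigned char* end)
{
    const char32_t replacement = 0xFFFD;
    const unsigned char b0 = *p++;
    if (b0 < 0x80)
        return b0;  // 0xxxxxxx: ASCII

    int trailing;   // number of continuation bytes expected
    char32_t cp;    // payload bits accumulated so far
    char32_t min;   // smallest value allowed for this length (rejects overlongs)
    if ((b0 & 0xE0) == 0xC0) {         // 110xxxxx
        trailing = 1; cp = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {  // 1110xxxx
        trailing = 2; cp = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {  // 11110xxx
        trailing = 3; cp = b0 & 0x07; min = 0x10000;
    } else {
        return replacement;            // stray continuation or invalid lead byte
    }

    for (int i = 0; i < trailing; ++i) {
        if (p == end || (*p & 0xC0) != 0x80)  // must be 10xxxxxx
            return replacement;
        cp = (cp << 6) | (*p++ & 0x3F);
    }

    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return replacement;
    return cp;
}
```

Any speed comparison against a table-driven state machine would still
need the benchmarks asked for above; this sketch makes no performance
claim.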
...
4. There aren't enough comments anywhere in the code I've looked at!
I only put comments where something unclear or unexpected is
happening.  The intention is that the rest of the code is clear enough
to read on its own.  Particularly in the case of Boost.Text, where
most of the code follows one or more Unicode specifications, I tend to
put a comment indicating where the online description of an algorithm
might be found, and that's it -- except for API docs, of course.

Zach

Zach Laine