On Mon, Jun 15, 2020 at 2:06 PM Phil Endecott via Boost wrote:
Dear All,
I have been looking at the UTF-8 decoding code in the proposed Boost.Text, as this is a problem I've looked at myself in the past. I've mentioned an issue with the copyright in another message. Here are my other observations.
1. The SIMD code is x86-specific. It doesn't need to be; I think it could use gcc's vector builtins to do the same thing and be portable to other SIMD implementations. (Clang provides the same builtins; I'm not sure about what you need to do on MSVC/Windows.) See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
That page describes vector-friendly data types and arithmetic operations. It does not seem to support the operations actually used in the code currently in Boost.Text.
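For concreteness, here is a minimal sketch of what an "all bytes < 0x80" check might look like using only the GCC/Clang vector extensions from that page. It is not the Boost.Text code; the names `byte_vec16` and `ascii_prefix_length` are made up for illustration, and it deliberately avoids any x86-style "movemask" step by reinterpreting the masked vector as two 64-bit words. Whether the operations the real fast path relies on can be expressed this way, and at what cost, is left open.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// GCC/Clang vector extension type: 16 unsigned bytes handled as one value.
// (Hypothetical name; MSVC does not support this attribute.)
typedef unsigned char byte_vec16 __attribute__((vector_size(16)));

// Returns the number of leading bytes in [p, p + n) that are < 0x80.
inline std::size_t ascii_prefix_length(const unsigned char* p, std::size_t n)
{
    byte_vec16 high_bits;
    std::memset(&high_bits, 0x80, sizeof high_bits);  // 0x80 in every lane

    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        byte_vec16 chunk;
        std::memcpy(&chunk, p + i, sizeof chunk);     // unaligned-safe load

        // Element-wise AND: a lane is nonzero iff that byte is >= 0x80.
        const byte_vec16 masked = chunk & high_bits;

        // Portable reduction: view the 16 lanes as two 64-bit words.
        std::uint64_t words[2];
        std::memcpy(words, &masked, sizeof words);
        if ((words[0] | words[1]) != 0)
            break;  // this chunk contains a non-ASCII byte
    }

    // Finish byte by byte; this also pins down the exact position
    // inside a chunk that failed the check above.
    while (i < n && p[i] < 0x80)
        ++i;
    return i;
}
```

A caller would copy the first `ascii_prefix_length(p, n)` bytes straight through as code points and hand the rest to the sequential decoder.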
2. The SIMD code only seems to provide a fast path for bytes < 0x80, falling back to sequential code for everything else. I guess I was expecting something more sophisticated.
The code makes the fast path extra fast, but the slow path, being quite branchy, is not really amenable to vectorization. If you have an implementation that proves that claim false, I'm happy to use it.
3. The code used for bytes >= 0x80, and in all cases for non-x86, is here: https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_ite... around lines 400-560. It implements a state machine, which surprises me; it takes much less code and gives better performance if you write out the bit-testing and shifting etc. explicitly. This seems to be about 50% slower than my existing UTF-8 decoding code.
Could you point me to that code, and let me use your benchmarks to verify? I'm happy to do something faster!
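To make the comparison concrete, a minimal sketch of the "explicit bit-testing and shifting" style described in point 3 might look like the following. This is neither Phil's code nor what Boost.Text does; `decode_one` and `replacement` are hypothetical names, and the error handling (return U+FFFD and consume only the lead byte) is a simplification that does not attempt the Unicode-recommended maximal-subpart replacement. No performance claim is implied; that would need the benchmarks.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t replacement = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

// Decodes one code point starting at p and advances p past it.
// Precondition: p != end.
// On ill-formed input, returns U+FFFD and advances p by exactly one byte.
inline std::uint32_t decode_one(const unsigned char*& p,
                                const unsigned char* end)
{
    const unsigned char b0 = *p++;
    if (b0 < 0x80)
        return b0;                      // 0xxxxxxx: ASCII

    int extra;                          // number of continuation bytes
    std::uint32_t cp;                   // payload bits gathered so far
    std::uint32_t min;                  // smallest value this length may encode
    if ((b0 & 0xE0) == 0xC0) {          // 110xxxxx
        extra = 1; cp = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {   // 1110xxxx
        extra = 2; cp = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {   // 11110xxx
        extra = 3; cp = b0 & 0x07; min = 0x10000;
    } else {
        return replacement;             // stray continuation or invalid lead
    }

    if (end - p < extra)
        return replacement;             // truncated sequence

    for (int i = 0; i < extra; ++i) {
        if ((p[i] & 0xC0) != 0x80)      // continuation must be 10xxxxxx
            return replacement;
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    // Reject overlong forms, surrogates, and values above U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return replacement;

    p += extra;
    return cp;
}
```

Calling this in a loop while `p != end` yields one code point per call, which is the shape a benchmark against the state-machine version would need.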
4. There aren't enough comments anywhere in the code I've looked at!
I only put comments where something unclear or unexpected is happening. The intention is that the rest of the code is clear enough to read on its own. Particularly in the case of Boost.Text, where most of the code follows one or more Unicode specifications, I tend to put a comment indicating where the online description of an algorithm might be found, and that's it -- except for API docs, of course.

Zach