Re: [boost] [text] SIMD UTF-8 decoding

18 Jun 2020

      ...
> I think it has most of what's needed, though it seems that the
> type conversion __builtin_convertvector, which is needed to
> expand e.g. a UTF-8 byte to UTF-32 with zero bytes, is only present
> in newer versions of g++ than I have.
Then it's likely not very useful for now. Maybe later, once that compiler
version is more widespread.
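
For illustration, a minimal sketch of the widening step using the GCC/Clang
vector extensions. It assumes a compiler that provides
__builtin_convertvector; the typedef names and the function name widen16 are
made up for this example and do not come from the code under discussion.

```cpp
#include <cstdint>
#include <cstring>

// 16 lanes of 8 bits and 16 lanes of 32 bits (GCC/Clang vector extensions).
typedef std::uint8_t  u8x16  __attribute__((vector_size(16)));
typedef std::uint32_t u32x16 __attribute__((vector_size(64)));

// Hypothetical helper: zero-extends 16 bytes to 16 UTF-32 code units.
// Only meaningful as a decoder if all 16 input bytes are ASCII.
inline void widen16(const unsigned char* in, char32_t* out) {
    u8x16 bytes;
    std::memcpy(&bytes, in, sizeof bytes);   // unaligned load
    // Element-wise conversion; unsigned lanes are zero-extended.
    u32x16 wide = __builtin_convertvector(bytes, u32x16);
    std::memcpy(out, &wide, sizeof wide);    // unaligned store
}
```

On targets without 512-bit registers the compiler is free to split the wide
vector into several narrower operations.
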
>    // Attempt to decode the subset of UTF-8 with code points < 256.
>    // Format is either 0xxxxxxx          -> 0xxxxxxx
>    //               or 110---xx 10yyyyyy -> xxyyyyyy
>    // The input mustn't start or finish in the middle of a multi-byte
>    // character.
>    // Other inputs produce undefined outputs.
Good code for that special case. But I think "undefined outputs" is not
acceptable. I've seen other SIMD UTF-8 conversions around, and they
basically all focus on converting as much ASCII as possible and fall back
to one-by-one decoding once a non-ASCII byte is found.
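
A rough sketch of that approach, to make the idea concrete. It is not taken
from any of those libraries: the function name utf8_to_utf32 is invented
here, and a plain 64-bit load stands in for a real SIMD register. Eight
bytes are tested at once for a set top bit; anything else goes through a
fully checked scalar decoder, so malformed or truncated input is reported
instead of producing undefined output.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical helper. Returns false on malformed or truncated UTF-8.
inline bool utf8_to_utf32(std::string_view in, std::u32string& out) {
    const auto* p = reinterpret_cast<const unsigned char*>(in.data());
    const unsigned char* end = p + in.size();
    while (p != end) {
        // Fast path: a chunk is pure ASCII if no byte has its top bit set.
        while (end - p >= 8) {
            std::uint64_t chunk;
            std::memcpy(&chunk, p, 8);
            if (chunk & 0x8080808080808080ULL) break;
            for (int i = 0; i < 8; ++i) out.push_back(p[i]);
            p += 8;
        }
        if (p == end) break;
        // Slow path: decode exactly one code point with full checking.
        unsigned char b = *p;
        char32_t cp, min;
        int len;
        if (b < 0x80)                { cp = b;        len = 1; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; min = 0x10000; }
        else return false;                  // stray continuation or bad lead byte
        if (end - p < len) return false;    // sequence runs past end of input
        for (int i = 1; i < len; ++i) {
            if ((p[i] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (p[i] & 0x3F);
        }
        // Reject overlong forms, surrogates and values above U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        out.push_back(cp);
        p += len;
    }
    return true;
}
```

After each scalar step the loop retries the fast path, so a single accented
character in otherwise ASCII text costs only one slow iteration.
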
> That will be quick, but it does lack a few things; it doesn't check if
> it has reached the end of the input and it doesn't do any error checking.
So not really usable either. BUT: compare Boost.Locale, which has a
`decode` and a `decode_valid` function, where the latter assumes valid UTF-8.
However, checking for end-of-input is obviously a must.
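
To illustrate the split for the code-points-below-256 subset quoted above,
here is a minimal scalar sketch. Both function names are hypothetical and
are not the Boost.Locale functions. The "valid" variant skips error checking
but still never reads past the end of the input; the checked variant
reports anything outside the subset.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical: assumes valid input restricted to code points < 256.
// Garbage in gives garbage out, but there is no out-of-bounds read.
inline std::string latin1_from_utf8_valid(std::string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        if (b < 0x80 || i + 1 == in.size()) {
            out.push_back(static_cast<char>(b));  // ASCII, or a truncated tail
        } else {
            unsigned char c = static_cast<unsigned char>(in[++i]);
            // 110---xx 10yyyyyy -> xxyyyyyy
            out.push_back(static_cast<char>(((b & 0x03) << 6) | (c & 0x3F)));
        }
    }
    return out;
}

// Hypothetical: same mapping, but returns nullopt on any other input.
inline std::optional<std::string> latin1_from_utf8_checked(std::string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        if (b < 0x80) { out.push_back(static_cast<char>(b)); continue; }
        // Only lead bytes 0xC2 and 0xC3 encode U+0080..U+00FF.
        if (b != 0xC2 && b != 0xC3) return std::nullopt;
        if (i + 1 == in.size()) return std::nullopt;      // end of input
        unsigned char c = static_cast<unsigned char>(in[++i]);
        if ((c & 0xC0) != 0x80) return std::nullopt;      // not a continuation
        out.push_back(static_cast<char>(((b & 0x03) << 6) | (c & 0x3F)));
    }
    return out;
}
```
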

BTW: Does Boost.Text have functions or overloads where you can specify 
that text is in a specific encoding/normalization?
If not, I think this should be added. Sometimes you get text from an
internal function and already know those things, so you can skip
verification and conversion.
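
Something along these lines is what I have in mind. This is only a sketch
of the idea, not an existing Boost.Text interface: the tag
assume_valid_utf8, the class checked_text and the helper is_valid_utf8 are
all invented for this example. A tag for a known normalization form could
follow the same pattern.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical tag: the caller vouches that the bytes are valid UTF-8.
struct assume_valid_utf8_t { explicit assume_valid_utf8_t() = default; };
inline constexpr assume_valid_utf8_t assume_valid_utf8{};

// Hypothetical validator used by the checking constructor.
inline bool is_valid_utf8(std::string_view s) {
    static constexpr char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = b < 0x80 ? 1 : (b & 0xE0) == 0xC0 ? 2
                        : (b & 0xF0) == 0xE0 ? 3 : (b & 0xF8) == 0xF0 ? 4 : 0;
        if (len == 0 || s.size() - i < len) return false;
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (std::size_t j = 1; j < len; ++j) {
            unsigned char c = static_cast<unsigned char>(s[i + j]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}

class checked_text {
public:
    // Default: verify, and throw if the input is not valid UTF-8.
    explicit checked_text(std::string s) : data_(std::move(s)) {
        if (!is_valid_utf8(data_)) throw std::invalid_argument("invalid UTF-8");
    }
    // Tagged overload: skip verification entirely.
    checked_text(assume_valid_utf8_t, std::string s) noexcept
        : data_(std::move(s)) {}
    const std::string& str() const noexcept { return data_; }
private:
    std::string data_;
};
```

A call site would then read checked_text t(assume_valid_utf8, std::move(s));
which makes the skipped check visible in the code.
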

Alexander Grund