Re: [boost] [text] SIMD UTF-8 decoding

17 Jun 2020

      Zach Laine wrote:
...
On Mon, Jun 15, 2020 at 2:06 PM Phil Endecott via Boost
<boost@lists.boost.org> wrote:
...
1. The SIMD code is x86-specific.  It doesn't need to be; I think
it could use gcc's vector builtins to do the same thing and be
portable to other SIMD implementations.  (Clang provides the same
builtins; I'm not sure about what you need to do on MSVC/Windows.)
See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
That page describes vector-friendly data types and arithmetic
operations.  It does not seem to support the operations actually used
in the code currently in Boost.Text.
I think it has most of what's needed, though it seems that the 
type conversion __builtin_convertvector, which is needed to 
expand e.g. a UTF-8 byte to UTF-32 with zero bytes, is only present 
in newer versions of g++ than I have.
...
...
2. The SIMD code only seems to provide a fast path for bytes < 0x80,
falling back to sequential code for everything else.  I guess I was
expecting something more sophisticated.
The code makes the fast path extra fast, but the slow path, being
quite branchy, is not really amenable to vectorization.  If you have
an implementation that proves that claim false, I'm happy to use it.
Well I've never written any SIMD code before, but I thought I'd 
have a look.  The following is just a proof of concept; it just 
converts 16 bytes of UTF-8 to one-byte-per-character ISO-8859-1. 
(Because, as noted above, I don't have a good way to spread out 
the bytes.)

int utf8_to_latin1(uint8_t* out_p, const uint8_t* in_p)
{
  // Attempt to decode the subset of UTF-8 with code points < 256.
  // Format is either 0xxxxxxx          -> 0xxxxxxx
  //               or 110---xx 10yyyyyy -> xxyyyyyy
  // The input mustn't start or finish in the middle of a multi-byte
  // character.
  // Other inputs produce undefined outputs.

  // Process 16 bytes at a time (for 128-bit SIMD).
  using uint8_x16 = uint8_t  __attribute__((vector_size(16)));

  // Load input:
  uint8_x16 in;
  for (int i = 0; i < 16; ++i) in[i] = in_p[i];

  // Each byte is one of three types:
  uint8_x16 is_onebyte       = (in & 0b10000000) == 0b00000000;
  uint8_x16 is_first_of_two  = (in & 0b11100000) == 0b11000000;
  uint8_x16 is_second_of_two = (in & 0b11000000) == 0b10000000;

  // (If all bytes are is_onebyte we could take a fast path now.)

  // Get the bits that each byte contributes to the result:
  uint8_x16 bits = is_onebyte       ? in
                 : is_first_of_two  ? (in << 6)
                 : is_second_of_two ? (in & 0b00111111)
                 : 0;

  // Make a shifted copy:
  // (Isn't there a better way to do this?)
  uint8_x16 shifts = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0 };
  uint8_x16 shifted = __builtin_shuffle(bits,shifts);

  // Combine the two bytes, where required:
  uint8_x16 combined = is_first_of_two ? (bits | shifted)
                     : bits;

  // Form a shuffle to pack the result, skipping the second bytes;
  // this is non-SIMD.  Any tricks to make this quicker?
  uint8_x16 shuffle;
  int to = 0, from = 0;
  while (from < 16) {
    shuffle[to++] = from++;
    if (is_second_of_two[from]) ++from;
  }

  // Apply the shuffle:
  uint8_x16 packed = __builtin_shuffle(combined,shuffle);

  // Store result.
  for (int i = 0; i < 16; ++i) out_p[i] = packed[i];
  return to;  // Number of characters in result.
}

That seems to work, and does generate vector instructions on ARM64. 
I've no idea if it faster than the alternatives.
...
...
3. The code used for bytes >= 0x80, and in all cases for non-x86,
is here:
https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_ite...
around lines 400-560.  It implements a state machine, which surprises
me; it takes much less code and gives better performance if you write
out the bit-testing and shifting etc. explicitly.  This seems to be
about 50% slower than my existing UTF-8 decoding code.
Could you point me to that code, and let me use your benchmarks to
verify?  I'm happy to do something faster!
Essentially you could just replace the body of the advance() function at 
line 536 of transcode_iterator.hpp with something like this:

char8_t b0 = *(p++);
IF_LIKELY((b0&0x80)==0) return b0;
char8_t b1 = *(p++);
IF_LIKELY((b0&0xe0)==0xc0) return (b1&0x3f) | ((b0&0x1f)<<6);
char8_t b2 = *(p++);
IF_LIKELY((b0&0xf0)==0xe0) return (b2&0x3f) | ((b1&0x3f)<<6) | ((b0&0x0f)<<12);
char8_t b3 = *(p++);
IF_LIKELY((b0&0xf8)==0xf0) return (b3&0x3f) | ((b2&0x3f)<<6) | ((b1&0x3f)<<12) | ((b0&0x07)<<18);

That will be quick, but it does lack a few things; it doesn't check if 
it has reached the end of the input and it doesn't do any error checking.  

Regards, Phil.