[text] SIMD UTF-8 decoding
Dear All,

I have been looking at the UTF-8 decoding code in the proposed Boost.Text, as this is a problem I've looked at myself in the past. I've mentioned an issue with the copyright in another message. Here are my other observations.

1. The SIMD code is x86-specific. It doesn't need to be; I think it could use gcc's vector builtins to do the same thing and be portable to other SIMD implementations. (Clang provides the same builtins; I'm not sure about what you need to do on MSVC/Windows.) See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

2. The SIMD code only seems to provide a fast path for bytes < 0x80, falling back to sequential code for everything else. I guess I was expecting something more sophisticated.

3. The code used for bytes >= 0x80, and in all cases for non-x86, is here: https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_ite... around lines 400-560. It implements a state machine, which surprises me; it takes much less code and gives better performance if you write out the bit-testing and shifting etc. explicitly. This seems to be about 50% slower than my existing UTF-8 decoding code.

4. There aren't enough comments anywhere in the code I've looked at!

Regards, Phil.
On Mon, Jun 15, 2020 at 2:06 PM Phil Endecott via Boost
Dear All,
I have been looking at the UTF-8 decoding code in the proposed Boost.Text, as this is a problem I've looked at myself in the past. I've mentioned an issue with the copyright in another message. Here are my other observations.
1. The SIMD code is x86-specific. It doesn't need to be; I think it could use gcc's vector builtins to do the same thing and be portable to other SIMD implementations. (Clang provides the same builtins; I'm not sure about what you need to do on MSVC/Windows.) See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
That page describes vector-friendly data types and arithmetic operations. It does not seem to support the operations actually used in the code currently in Boost.Text.
2. The SIMD code only seems to provide a fast path for bytes < 0x80, falling back to sequential code for everything else. I guess I was expecting something more sophisticated.
The code makes the fast path extra fast, but the slow path, being quite branchy, is not really amenable to vectorization. If you have an implementation that proves that claim false, I'm happy to use it.
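The "extra fast" path for bytes < 0x80 can be illustrated without any platform intrinsics at all. This is a sketch of my own (the function name is hypothetical, not Boost.Text's): a word-at-a-time (SWAR) scan that finds how far a buffer is pure ASCII, so a caller can copy that prefix through unchanged and switch to a scalar decoder at the first non-ASCII byte.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// Return the number of leading bytes in [p, p+n) that are pure ASCII
// (< 0x80).  Any byte >= 0x80 has its top bit set, so eight bytes can
// be tested at once against a mask of top bits.
size_t ascii_prefix_len(const uint8_t *p, size_t n)
{
    size_t i = 0;
    while (i + 8 <= n) {
        uint64_t w;
        memcpy(&w, p + i, 8);                 // safe unaligned load
        if (w & UINT64_C(0x8080808080808080))
            break;                            // some byte needs real decoding
        i += 8;
    }
    while (i < n && p[i] < 0x80)              // finish the tail byte-by-byte
        ++i;
    return i;
}
```

The same idea vectorizes directly: with 16- or 32-byte registers the "all tested bytes are ASCII" check is a single movemask-style test per block.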
3. The code used for bytes >= 0x80, and in all cases for non-x86, is here: https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_ite... around lines 400-560. It implements a state machine, which surprises me; it takes much less code and gives better performance if you write out the bit-testing and shifting etc. explicitly. This seems to be about 50% slower than my existing UTF-8 decoding code.
Could you point me to that code, and let me use your benchmarks to verify? I'm happy to do something faster!
4. There aren't enough comments anywhere in the code I've looked at!
I only put comments where something unclear or unexpected is happening. The intention is that the rest of the code is clear enough to read on its own. Particularly in the case of Boost.Text, where most of the code follows one or more Unicode specifications, I tend to put a comment indicating where the online description of an algorithm might be found, and that's it -- except for API docs, of course.

Zach
Zach Laine wrote:
On Mon, Jun 15, 2020 at 2:06 PM Phil Endecott via Boost
wrote:

1. The SIMD code is x86-specific. It doesn't need to be; I think it could use gcc's vector builtins to do the same thing and be portable to other SIMD implementations. (Clang provides the same builtins; I'm not sure about what you need to do on MSVC/Windows.) See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
That page describes vector-friendly data types and arithmetic operations. It does not seem to support the operations actually used in the code currently in Boost.Text.
I think it has most of what's needed, though it seems that the type conversion __builtin_convertvector, which is needed to expand e.g. a UTF-8 byte to UTF-32 with zero bytes, is only present in newer versions of g++ than I have.
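For what it's worth, the widening conversion mentioned above can be written with __builtin_convertvector on a new enough compiler (GCC 9 or later, or a reasonably recent Clang). This is my own sketch for the ASCII-only case, not code from Boost.Text, and the function name is hypothetical:

```c
#include <stdint.h>

// GCC/Clang vector extensions; not portable to MSVC without changes.
typedef uint8_t  u8x8  __attribute__((vector_size(8)));
typedef uint32_t u32x8 __attribute__((vector_size(32)));

// Widen 8 ASCII bytes to 8 UTF-32 code points in one vector conversion.
// __builtin_convertvector zero-extends each lane from uint8_t to uint32_t.
void ascii_to_utf32(uint32_t *out, const uint8_t *in)
{
    u8x8 bytes;
    for (int i = 0; i < 8; ++i) bytes[i] = in[i];
    u32x8 wide = __builtin_convertvector(bytes, u32x8);
    for (int i = 0; i < 8; ++i) out[i] = wide[i];
}
```

On hardware without 32-byte registers the compiler splits the conversion into several instructions, so the generated code stays correct either way.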
2. The SIMD code only seems to provide a fast path for bytes < 0x80, falling back to sequential code for everything else. I guess I was expecting something more sophisticated.
The code makes the fast path extra fast, but the slow path, being quite branchy, is not really amenable to vectorization. If you have an implementation that proves that claim false, I'm happy to use it.
Well I've never written any SIMD code before, but I thought I'd have a look. The following is just a proof of concept; it just converts 16 bytes of UTF-8 to one-byte-per-character ISO-8859-1. (Because, as noted above, I don't have a good way to spread out the bytes.)

#include <stdint.h>

int utf8_to_latin1(uint8_t* out_p, const uint8_t* in_p)
{
    // Attempt to decode the subset of UTF-8 with code points < 256.
    // Format is either  0xxxxxxx          ->  0xxxxxxx
    // or                110---xx 10yyyyyy ->  xxyyyyyy
    // The input mustn't start or finish in the middle of a multi-byte
    // character.  Other inputs produce undefined outputs.

    // Process 16 bytes at a time (for 128-bit SIMD).
    using uint8_x16 = uint8_t __attribute__((vector_size(16)));

    // Load input:
    uint8_x16 in;
    for (int i = 0; i < 16; ++i) in[i] = in_p[i];

    // Each byte is one of three types:
    uint8_x16 is_onebyte       = (in & 0b10000000) == 0b00000000;
    uint8_x16 is_first_of_two  = (in & 0b11100000) == 0b11000000;
    uint8_x16 is_second_of_two = (in & 0b11000000) == 0b10000000;

    // (If all bytes are is_onebyte we could take a fast path now.)

    // Get the bits that each byte contributes to the result:
    uint8_x16 bits = is_onebyte       ? in
                   : is_first_of_two  ? (in << 6)
                   : is_second_of_two ? (in & 0b00111111)
                   : 0;

    // Make a shifted copy:
    // (Isn't there a better way to do this?)
    uint8_x16 shifts  = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0 };
    uint8_x16 shifted = __builtin_shuffle(bits, shifts);

    // Combine the two bytes, where required:
    uint8_x16 combined = is_first_of_two ? (bits | shifted) : bits;

    // Form a shuffle to pack the result, skipping the second bytes;
    // this is non-SIMD.  Any tricks to make this quicker?
    uint8_x16 shuffle;
    int to = 0, from = 0;
    while (from < 16) {
        shuffle[to++] = from++;
        if (from < 16 && is_second_of_two[from])  // bounds check, so we
            ++from;                               // never index in[16]
    }

    // Apply the shuffle:
    uint8_x16 packed = __builtin_shuffle(combined, shuffle);

    // Store result.
    for (int i = 0; i < 16; ++i) out_p[i] = packed[i];
    return to;  // Number of characters in result.
}

That seems to work, and does generate vector instructions on ARM64. I've no idea if it is faster than the alternatives.
3. The code used for bytes >= 0x80, and in all cases for non-x86, is here: https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_ite... around lines 400-560. It implements a state machine, which surprises me; it takes much less code and gives better performance if you write out the bit-testing and shifting etc. explicitly. This seems to be about 50% slower than my existing UTF-8 decoding code.
Could you point me to that code, and let me use your benchmarks to verify? I'm happy to do something faster!
Essentially you could just replace the body of the advance() function at line 536 of transcode_iterator.hpp with something like this:

    char8_t b0 = *(p++);
    IF_LIKELY((b0 & 0x80) == 0)    return b0;
    char8_t b1 = *(p++);
    IF_LIKELY((b0 & 0xe0) == 0xc0) return (b1 & 0x3f) | ((b0 & 0x1f) << 6);
    char8_t b2 = *(p++);
    IF_LIKELY((b0 & 0xf0) == 0xe0) return (b2 & 0x3f) | ((b1 & 0x3f) << 6)
                                        | ((b0 & 0x0f) << 12);
    char8_t b3 = *(p++);
    IF_LIKELY((b0 & 0xf8) == 0xf0) return (b3 & 0x3f) | ((b2 & 0x3f) << 6)
                                        | ((b1 & 0x3f) << 12) | ((b0 & 0x07) << 18);

That will be quick, but it does lack a few things; it doesn't check if it has reached the end of the input and it doesn't do any error checking.

Regards, Phil.
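The two gaps Phil concedes (end-of-input and error checking) can be filled without losing the explicit bit-testing structure. This is my own hedged sketch, not Boost.Text's code: the same branch chain, plus a length/continuation check and rejection of overlong forms, surrogates, and out-of-range values, all mapped to U+FFFD.

```c
#include <stdint.h>

// Decode one UTF-8 sequence from *pp, never reading at or past `end`.
// Advances *pp past the bytes consumed; ill-formed input yields U+FFFD.
uint32_t decode_utf8(const uint8_t **pp, const uint8_t *end)
{
    const uint8_t *p = *pp;
    uint8_t b0 = *p++;
    if ((b0 & 0x80) == 0) { *pp = p; return b0; }      // ASCII fast case

    // Continuation count and minimum code point for each sequence length.
    int n; uint32_t cp, min;
    if      ((b0 & 0xe0) == 0xc0) { n = 1; cp = b0 & 0x1f; min = 0x80;    }
    else if ((b0 & 0xf0) == 0xe0) { n = 2; cp = b0 & 0x0f; min = 0x800;   }
    else if ((b0 & 0xf8) == 0xf0) { n = 3; cp = b0 & 0x07; min = 0x10000; }
    else                          { *pp = p; return 0xFFFD; }  // bad lead byte

    if (end - p < n) { *pp = end; return 0xFFFD; }     // truncated input
    for (int i = 0; i < n; ++i) {
        if ((p[i] & 0xc0) != 0x80) { *pp = p + i; return 0xFFFD; }
        cp = (cp << 6) | (p[i] & 0x3f);
    }
    *pp = p + n;
    if (cp < min ||                                    // overlong encoding
        cp > 0x10FFFF ||                               // beyond Unicode range
        (cp >= 0xD800 && cp <= 0xDFFF))                // surrogate
        return 0xFFFD;
    return cp;
}
```

Whether the extra checks cost the claimed speed advantage is exactly the kind of question the benchmark exchange above would settle.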
> I think it has most of what's needed, though it seems that the type conversion __builtin_convertvector, which is needed to expand e.g. a UTF-8 byte to UTF-32 with zero bytes, is only present in newer versions of g++ than I have.

Then it's likely not very useful for now. Maybe later, once that compiler version is more widespread.

> // Attempt to decode the subset of UTF-8 with code points < 256.
> // Format is either 0xxxxxxx -> 0xxxxxxx
> // or 110---xx 10yyyyyy -> xxyyyyyy
> // The input mustn't start or finish in the middle of a multi-byte
> // character.
> // Other inputs produce undefined outputs.

Good code for that special case. But I think "undefined outputs" is not acceptable. I've seen other SIMD UTF-8 conversions around, and they basically all focus on ASCII: converting as much as possible, then falling back to one-by-one decoding once a non-ASCII byte is found.

> That will be quick, but it does lack a few things; it doesn't check if it has reached the end of the input and it doesn't do any error checking.
So not really usable either. BUT: compare Boost.Locale, which has a `decode` and a `decode_valid` function, where the latter assumes valid UTF-8. However, checking for end-of-input is obviously a must.

BTW: Does Boost.Text have functions or overloads where you can specify that text is in a specific encoding/normalization? If not, I think this should be added. Sometimes you get text from an internal function and already know those things, so you can skip verification and conversion.
Alexander Grund wrote:
I've seen other SIMD UTF-8 conversions around and they basically all focus on ASCII converting as much as possible and fallback to one-by-one decoding once a non-ascii is found
The question is, do they do that because they've determined that that gives the best performance (for some benchmark input), or have they not tried to do more with the SIMD code? Regards, Phil.
Am 18.06.20 um 13:10 schrieb Phil Endecott via Boost:
Alexander Grund wrote:
I've seen other SIMD UTF-8 conversions around and they basically all focus on ASCII converting as much as possible and fallback to one-by-one decoding once a non-ascii is found
> The question is, do they do that because they've determined that that gives the best performance (for some benchmark input), or have they not tried to do more with the SIMD code?

I guess the former, which would be my intuition. It is easy to detect the first byte of a multi-byte UTF-8 sequence in SIMD, and also easy to bulk-convert single-byte UTF-8 sequences. Once you get to converting the multi-byte sequence, SIMD doesn't make sense any more: there is too much checking to do (how many bytes to "squash", end-of-input, shortest encoding, legal value, ...).

So, in summary: once it requires branching, it doesn't make sense to use SIMD any more.
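One of the checks listed above, "how many bytes to squash", need not branch at all. As a small aside (my own sketch, not from either library): counting the leading one-bits of the lead byte yields the sequence length directly, which is one reason SIMD classification of lead bytes is cheap even when the rest of the decode is not.

```c
#include <stdint.h>

// Length in bytes of the UTF-8 sequence introduced by lead byte b:
//   0xxxxxxx -> 1, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
// A stray continuation byte (10xxxxxx) or an illegal lead byte
// (11111xxx) returns 0, i.e. invalid.
int utf8_seq_len(uint8_t b)
{
    if (b < 0x80) return 1;                     // ASCII
    // Move the byte to the top of a 32-bit word and invert, so the
    // leading one-bits become leading zeros that __builtin_clz counts.
    int ones = __builtin_clz(~((uint32_t)b << 24));
    return (ones >= 2 && ones <= 4) ? ones : 0;
}
```

A table of 32 entries indexed by the top five bits of the lead byte is a common alternative with the same effect.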
On Mon, 15 Jun 2020 at 14:06, Phil Endecott via Boost
Dear All,
I have been looking at the UTF-8 decoding code in the proposed Boost.Text, as this is a problem I've looked at myself in the past. I've mentioned an issue with the copyright in another message. Here are my other observations.
1. The SIMD code is x86-specific. It doesn't need to be; I think it could use gcc's vector builtins to do the same thing and be portable to other SIMD implementations. (Clang provides the same builtins; I'm not sure about what you need to do on MSVC/Windows.) See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
Bar writing Intel assembler for ml64 (a stand-alone assembler), the only option on Windows x64 is Intel intrinsics ( https://software.intel.com/sites/landingpage/IntrinsicsGuide/# ). Over and above the Intel intrinsics, Microsoft offers 'a number of extensions', which often correspond to some __builtin_ in clang/gcc. I'm not sure whether this is related, but there is also the vectorcall ABI.

degski
participants (4)
- Alexander Grund
- degski
- Phil Endecott
- Zach Laine