Zach Laine wrote:
On Mon, Jun 15, 2020 at 2:06 PM Phil Endecott via Boost
wrote: 1. The SIMD code is x86-specific. It doesn't need to be; I think it could use gcc's vector builtins to do the same thing and be portable to other SIMD implementations. (Clang provides the same builtins; I'm not sure about what you need to do on MSVC/Windows.) See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
That page describes vector-friendly data types and arithmetic operations. It does not seem to support the operations actually used in the code currently in Boost.Text.
I think it has most of what's needed, though it seems that the type conversion __builtin_convertvector, which is needed to expand e.g. a UTF-8 byte to UTF-32 with zero bytes, is only present in newer versions of g++ than I have.
2. The SIMD code only seems to provide a fast path for bytes < 0x80, falling back to sequential code for everything else. I guess I was expecting something more sophisticated.
The code makes the fast path extra fast, but the slow path, being quite branchy, is not really amenable to vectorization. If you have an implementation that proves that claim false, I'm happy to use it.
Well I've never written any SIMD code before, but I thought I'd have a look. The following is just a proof of concept; it just converts 16 bytes of UTF-8 to one-byte-per-character ISO-8859-1. (Because, as noted above, I don't have a good way to spread out the bytes.) int utf8_to_latin1(uint8_t* out_p, const uint8_t* in_p) { // Attempt to decode the subset of UTF-8 with code points < 256. // Format is either 0xxxxxxx -> 0xxxxxxx // or 110---xx 10yyyyyy -> xxyyyyyy // The input mustn't start or finish in the middle of a multi-byte // character. // Other inputs produce undefined outputs. // Process 16 bytes at a time (for 128-bit SIMD). using uint8_x16 = uint8_t __attribute__((vector_size(16))); // Load input: uint8_x16 in; for (int i = 0; i < 16; ++i) in[i] = in_p[i]; // Each byte is one of three types: uint8_x16 is_onebyte = (in & 0b10000000) == 0b00000000; uint8_x16 is_first_of_two = (in & 0b11100000) == 0b11000000; uint8_x16 is_second_of_two = (in & 0b11000000) == 0b10000000; // (If all bytes are is_onebyte we could take a fast path now.) // Get the bits that each byte contributes to the result: uint8_x16 bits = is_onebyte ? in : is_first_of_two ? (in << 6) : is_second_of_two ? (in & 0b00111111) : 0; // Make a shifted copy: // (Isn't there a better way to do this?) uint8_x16 shifts = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0 }; uint8_x16 shifted = __builtin_shuffle(bits,shifts); // Combine the two bytes, where required: uint8_x16 combined = is_first_of_two ? (bits | shifted) : bits; // Form a shuffle to pack the result, skipping the second bytes; // this is non-SIMD. Any tricks to make this quicker? uint8_x16 shuffle; int to = 0, from = 0; while (from < 16) { shuffle[to++] = from++; if (is_second_of_two[from]) ++from; } // Apply the shuffle: uint8_x16 packed = __builtin_shuffle(combined,shuffle); // Store result. for (int i = 0; i < 16; ++i) out_p[i] = packed[i]; return to; // Number of characters in result. } That seems to work, and does generate vector instructions on ARM64. I've no idea if it faster than the alternatives.
3. The code used for bytes >= 0x80, and in all cases for non-x86, is here: https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_ite... around lines 400-560. It implements a state machine, which surprises me; it takes much less code and gives better performance if you write out the bit-testing and shifting etc. explicitly. This seems to be about 50% slower than my existing UTF-8 decoding code.
Could you point me to that code, and let me use your benchmarks to verify? I'm happy to do something faster!
Essentially you could just replace the body of the advance() function at line 536 of transcode_iterator.hpp with something like this: char8_t b0 = *(p++); IF_LIKELY((b0&0x80)==0) return b0; char8_t b1 = *(p++); IF_LIKELY((b0&0xe0)==0xc0) return (b1&0x3f) | ((b0&0x1f)<<6); char8_t b2 = *(p++); IF_LIKELY((b0&0xf0)==0xe0) return (b2&0x3f) | ((b1&0x3f)<<6) | ((b0&0x0f)<<12); char8_t b3 = *(p++); IF_LIKELY((b0&0xf8)==0xf0) return (b3&0x3f) | ((b2&0x3f)<<6) | ((b1&0x3f)<<12) | ((b0&0x07)<<18); That will be quick, but it does lack a few things; it doesn't check if it has reached the end of the input and it doesn't do any error checking. Regards, Phil.