On Mon, Jul 10, 2017 at 5:34 PM, Phil Endecott via Boost
...if I'm right that it is undefined behaviour then it might stop working with some future compiler update somewhere.
The reinterpret_cast<> can be trivially changed to std::memcpy: ... Yes, I believe that's the right thing to do.
That hurts 32-bit ARM. So I am faced with a choice: penalize an existing platform today to benefit some possible future platform? Hmmmm... let me think about that... probably not such a good idea! And the 32-bit ARM users might revolt over such a change. I've heard from a couple of users running Beast on constrained hardware.
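For readers following along, here is a minimal sketch of the kind of change being discussed: loading a machine word through std::memcpy instead of through a reinterpret_cast'ed pointer. The helper names load_word and all_ascii_word are hypothetical and are not taken from Beast's utf8_checker.hpp. Whether the memcpy form costs anything on a particular target (the 32-bit ARM concern above) depends on the compiler and would need to be measured.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not Beast's code): read one word from a byte
// pointer that may be unaligned. Copying through std::memcpy is
// well-defined for any alignment, unlike dereferencing
// reinterpret_cast<std::size_t const*>(p).
inline std::size_t load_word(unsigned char const* p)
{
    std::size_t w;
    std::memcpy(&w, p, sizeof(w));
    return w;
}

// Hypothetical fast-path test: true if the next sizeof(std::size_t)
// bytes all have their high bit clear, i.e. are all ASCII.
inline bool all_ascii_word(unsigned char const* p)
{
    // 0x80 repeated in every byte of the word (0x8080...80).
    constexpr std::size_t mask =
        static_cast<std::size_t>(-1) / 0xFF * 0x80;
    return (load_word(p) & mask) == 0;
}
```

The caller is assumed to guarantee that at least sizeof(std::size_t) bytes are readable at p.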
Note that this is only about 50 lines of code; Beast's utf8_checker.hpp is maybe 5 times as long. Code follows.
Nice, this is great! I like where you are headed with your function, and thanks for investing the time to write it. Perhaps the Beast utf8 validator could be improved; there's nothing more satisfying than removing lines of code!

There's just an eensy teensy problem: the Beast validator is an "online" algorithm. It processes the input sequentially, one chunk at a time, so a code point can be split across a buffer boundary. All of that extra code you see in Beast is designed to handle that case, by saving the bytes at the end of the buffer for the next call (after the validator returns, it will never see that buffer again).

I admit that a surprisingly large amount of code is required just to handle this case. The good news is that those extra lines only execute in the special case where the code point is split. The bulk of the loop works on the parts of the buffer where code points can't possibly be split. And the unit test is exhaustive: it tries all possible code points and split positions.

But who knows? I have never claimed to be a great coder; I consider myself average at best, so it's entirely possible that this could all be done in far fewer lines. Maybe you can update your function to handle this case? I am always happy to accept improvements into Beast. You might need to turn the function into a class to save those bytes at the end.

Thanks
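As a rough illustration of what "turning the function into a class" could look like, here is a minimal sketch of a chunk-at-a-time validator. It is not Beast's implementation and it uses a different technique: instead of saving the trailing bytes, it keeps a small amount of decoder state (how many continuation bytes are still expected and the allowed range of the next one) between calls. The class name utf8_stream_checker and its members are hypothetical, and the sketch has no word-at-a-time fast path.

```cpp
#include <cstddef>

// Hypothetical sketch, not Beast's utf8_checker: validates UTF-8 that
// arrives in arbitrary chunks. A code point split across two chunks is
// handled because the state below persists between calls to write().
class utf8_stream_checker
{
    unsigned need_ = 0;        // continuation bytes still expected
    unsigned char lo_ = 0x80;  // allowed range for the next
    unsigned char hi_ = 0xBF;  // continuation byte

public:
    // Feed the next chunk. Returns false as soon as the input is
    // invalid; the object should not be reused after a failure.
    bool write(unsigned char const* p, std::size_t n)
    {
        for(std::size_t i = 0; i < n; ++i)
        {
            unsigned char const c = p[i];
            if(need_ == 0)
            {
                if(c < 0x80)
                    continue;                 // ASCII
                if(c >= 0xC2 && c <= 0xDF)
                    need_ = 1;                // 2-byte sequence
                else if(c >= 0xE0 && c <= 0xEF)
                {
                    need_ = 2;                // 3-byte sequence
                    if(c == 0xE0) lo_ = 0xA0; // reject overlong forms
                    if(c == 0xED) hi_ = 0x9F; // reject surrogates
                }
                else if(c >= 0xF0 && c <= 0xF4)
                {
                    need_ = 3;                // 4-byte sequence
                    if(c == 0xF0) lo_ = 0x90; // reject overlong forms
                    if(c == 0xF4) hi_ = 0x8F; // reject > U+10FFFF
                }
                else
                    return false;             // invalid lead byte
            }
            else
            {
                if(c < lo_ || c > hi_)
                    return false;             // bad continuation byte
                lo_ = 0x80;                   // later continuation bytes
                hi_ = 0xBF;                   // use the default range
                --need_;
            }
        }
        return true;
    }

    // Call after the last chunk: false if the input ended in the
    // middle of a code point.
    bool finish() const
    {
        return need_ == 0;
    }
};
```

A caller would invoke write() once per received buffer and finish() once at the end of the message; a buffer can be discarded as soon as write() returns.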