13 Jun 2017, 11:33 p.m.
Artyom Beilis wrote:
Now as you have seen there are many possible "non-standard" UTF-8 variants.
What should I accept?
Others have made their case for WTF-8, which has the desirable property of only allowing one encoding for any uint16_t sequence (taken in isolation). My personal choice here is to be more lenient and accept any combination of valid UTF-8 and UTF-8 encoded surrogates (a superset of CESU-8 and WTF-8), but I'm not going to argue very strongly for it over WTF-8, if at all. I think that overlongs should NOT be accepted, including overlong zero.
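
To make the lenient option concrete, here is a minimal sketch of a validator that accepts any mix of valid UTF-8 and 3-byte encoded surrogates while rejecting overlongs, including the overlong zero `C0 80`. The name `is_lenient_utf8` is made up for illustration and is not part of any existing library; a WTF-8 validator would additionally have to reject a high surrogate immediately followed by a low one.

```cpp
#include <cassert>   // only needed for the usage checks shown after the block
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption, not an existing API): returns true if
// `s` is valid UTF-8 in which 3-byte encoded surrogates (U+D800..U+DFFF)
// are tolerated in any combination. Overlong forms are rejected.
inline bool is_lenient_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;    // total bytes in this sequence
        unsigned long cp;   // code point being decoded
        unsigned long min;  // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i < len) return false;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min) return false;       // overlong, which covers C0 80
        if (cp > 0x10FFFF) return false;  // beyond the Unicode range

        // Deliberately no check for 0xD800..0xDFFF: encoded surrogates,
        // paired or lone, are accepted under the lenient rule.
        i += len;
    }
    return true;
}
```

Under these assumptions, `assert(is_lenient_utf8("\xED\xA0\xBD\xED\xB8\x80"))` passes (a CESU-8 style surrogate pair), as does a lone `"\xED\xA0\x80"`, while `assert(!is_lenient_utf8("\xC0\x80"))` confirms that the overlong zero is refused.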