On 09.10.2015 18:41, Peter Dimov wrote:
Andrey Semashev wrote:
WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with them should be the user's explicit choice (e.g. the user should write utf16_to_wtf8 instead of utf16_to_utf8).
In addition to what I wrote earlier, the choices here are not representable in a single U or W letter. When taking UTF-8, you need to decide whether to
- accept codepoints over 10FFFF
- accept codepoints encoded with more bytes than necessary
- accept surrogates
- probably more because Unicode is hard
and then for each rejected byte sequence whether to
- throw
- ignore and skip
- replace with U+FFFD
As long as the code sequences are described by the spec, I consider them valid. We can provide a number of options to influence the conversion process, but the result should be something that can be decoded by a conforming Unicode parser.
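
To make the option space discussed above concrete, here is a minimal sketch of a UTF-8 decoder in which each acceptance choice and the error action is a separate knob. All names here (on_error, decode_options, utf8_to_utf32) are hypothetical and are not taken from any existing or proposed library interface. The sketch assumes that resynchronising one byte at a time after a rejected sequence is acceptable and does not handle the obsolete 5- and 6-byte forms.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical names throughout; an illustration, not a proposed interface.
enum class on_error { throw_exception, skip, replace };

struct decode_options {
    bool accept_above_10ffff = false;  // 4-byte forms encoding values > 0x10FFFF
    bool accept_overlong     = false;  // more bytes than necessary
    bool accept_surrogates   = false;  // 0xD800..0xDFFF (WTF-8/CESU-8 style input)
    on_error error_policy    = on_error::replace;
};

std::u32string utf8_to_utf32(const std::string& in,
                             const decode_options& opt = {})
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;  // stays 0 for bytes that cannot start a sequence
        char32_t cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }

        // Structural check: enough bytes, and all of them continuation bytes.
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (b & 0x3F);
        }

        // Policy checks: each one is an independent decision.
        if (ok) {
            static const char32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
            if (cp < min_for_len[len] && !opt.accept_overlong) ok = false;
            if (cp >= 0xD800 && cp <= 0xDFFF && !opt.accept_surrogates) ok = false;
            if (cp > 0x10FFFF && !opt.accept_above_10ffff) ok = false;
        }

        if (ok) {
            out.push_back(cp);
            i += len;
            continue;
        }

        // Rejected sequence: apply the chosen action, then move on one byte.
        switch (opt.error_policy) {
        case on_error::throw_exception:
            throw std::runtime_error("invalid UTF-8 at byte " + std::to_string(i));
        case on_error::replace:
            out.push_back(U'\uFFFD');
            break;
        case on_error::skip:
            break;
        }
        ++i;
    }
    return out;
}
```

With the defaults above the decoder is strict and emits U+FFFD for each rejected byte. A caller who wants the lenient behaviour has to ask for it, for example by setting accept_surrogates = true before decoding input that may contain encoded lone surrogates, which keeps that decision explicit at the call site.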