Artyom Beilis wrote:
What I meant by that is, for instance:
- is 0xCC 0x81 a valid UTF-8 string?
- is 0x65 0xCC 0x81 0xCC 0x81 a valid UTF-8 string?
Both are valid UTF-8 strings, and both are meaningless on their own, i.e. an accent without a base letter, or the same accent twice.
Being illogical in human terms or representation does not make them illegal UTF-8.
UTF-8 is simple; human language processing is complex.
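
To make the quoted sequences concrete, here is a minimal strict-validation sketch (my own, not from any particular library; the helper name is_valid_utf8 is made up). Both byte strings pass RFC 3629 validation: 0xCC 0x81 decodes to U+0301 COMBINING ACUTE ACCENT standing alone, and 0x65 0xCC 0x81 0xCC 0x81 decodes to 'e' followed by that combining accent twice.

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Strict UTF-8 check: rejects overlong forms, surrogates, and code
// points above U+10FFFF, but knows nothing about combining marks.
bool is_valid_utf8(unsigned char const *s, size_t n) {
    for (size_t i = 0; i < n;) {
        unsigned char b = s[i];
        size_t len; uint32_t cp, min_cp;
        if (b < 0x80)                { len = 1; cp = b;        min_cp = 0;       }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min_cp = 0x80;    }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min_cp = 0x800;   }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min_cp = 0x10000; }
        else return false;                      // invalid lead byte
        if (i + len > n) return false;          // truncated sequence
        for (size_t j = 1; j < len; j++) {
            if ((s[i + j] & 0xC0) != 0x80) return false;  // bad continuation
            cp = (cp << 6) | (s[i + j] & 0x3F);
        }
        if (cp < min_cp) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;                  // out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // lone surrogate
        i += len;
    }
    return true;
}

int main() {
    unsigned char a[] = { 0xCC, 0x81 };                    // U+0301 alone
    unsigned char b[] = { 0x65, 0xCC, 0x81, 0xCC, 0x81 };  // 'e' + U+0301 + U+0301
    std::printf("%d %d\n", is_valid_utf8(a, sizeof a), is_valid_utf8(b, sizeof b));
    // prints "1 1": both sequences pass strict validation
}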
My point here is that strictly valid UTF-8 is the valid multibyte encoding of a valid codepoint sequence, and that the definition of "valid codepoint sequence" may vary with context, such that the sequences above are considered invalid. Drawing the line at the place where code points above U+10FFFF and unpaired surrogates are invalid but the sequences above are valid is an arbitrary decision. Not that this decision is wrong, it isn't. But it may not be what the user needs.

Saying "invalid UTF-8 is just invalid, period" doesn't always work well, although it's a good default. There are cases in which you have to handle specific kinds of invalid UTF-8 (but not arbitrary invalid UTF-8), and having to write UTF-8 encoding/decoding functions for every such instance does not really contribute to either security or correctness. It's better, I posit, to have functions that can be configured to handle various invalid forms of UTF-8 (that is, to accept certain invalid UTF-8, not necessarily to produce it, of course).
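
As a sketch of what I mean by "configured" (again my own illustration, with made-up names decode_options and decode_one): the strictness checks can be policy flags on one decoder rather than a separate decoder per caller. A concrete motivation is Java's modified UTF-8, which emits surrogate halves as three-byte sequences and NUL as the overlong form 0xC0 0x80, both of which a strict decoder must reject:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <optional>

// Policy knobs: which normally-invalid forms this caller accepts.
struct decode_options {
    bool allow_surrogates = false;   // e.g. CESU-8 / Java modified UTF-8 input
    bool allow_overlong   = false;   // e.g. Java's "C0 80" encoding of NUL
};

// Decode one code point starting at *pos, advancing *pos on success;
// returns std::nullopt on input the policy does not permit.
std::optional<uint32_t>
decode_one(unsigned char const *s, size_t n, size_t *pos, decode_options opt) {
    if (*pos >= n) return std::nullopt;
    unsigned char b = s[*pos];
    size_t len; uint32_t cp, min_cp;
    if (b < 0x80)                { len = 1; cp = b;        min_cp = 0;       }
    else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min_cp = 0x80;    }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min_cp = 0x800;   }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min_cp = 0x10000; }
    else return std::nullopt;                      // invalid lead byte
    if (*pos + len > n) return std::nullopt;       // truncated sequence
    for (size_t j = 1; j < len; j++) {
        if ((s[*pos + j] & 0xC0) != 0x80) return std::nullopt;
        cp = (cp << 6) | (s[*pos + j] & 0x3F);
    }
    if (cp < min_cp && !opt.allow_overlong) return std::nullopt;
    if (cp >= 0xD800 && cp <= 0xDFFF && !opt.allow_surrogates) return std::nullopt;
    if (cp > 0x10FFFF) return std::nullopt;        // never acceptable
    *pos += len;
    return cp;
}

int main() {
    unsigned char cesu[] = { 0xED, 0xA0, 0x80 };   // encoded surrogate U+D800
    size_t pos = 0;
    bool strict = decode_one(cesu, sizeof cesu, &pos, {}).has_value();
    decode_options lax_opt; lax_opt.allow_surrogates = true;
    pos = 0;
    bool lax = decode_one(cesu, sizeof cesu, &pos, lax_opt).has_value();
    std::printf("%d %d\n", strict, lax);           // prints "0 1"
}

The caller dealing with, say, Java-serialized strings sets the flags it needs, everyone else keeps the strict defaults, and nobody has to hand-roll yet another decoder.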