Andrey Semashev wrote:
Right. Just don't call it UTF-8 anymore.
I don't know what this means.
I mean as a result you will have a string fn, whose encoding is not UTF-8. As a consequence algorithms that require UTF-8 input cannot be expected to work with this string.
It's invalid UTF-8 and yes, algorithms that require valid UTF-8 will obviously not work with it. The point is that the implementation of these functions needs to encode/decode this not-quite-valid-UTF-8, for which it needs functions that encode/decode this not-quite-valid-UTF-8.
It's an invalid UTF-8 encoding of a valid codepoint sequence.
Yes, but a valid codepoint sequence is not enough to interpret the string.
It's enough. What more would you need?
You mean all string-related code should be prepared for invalid input?
I don't understand this, either.
You said that properly written code should not require string validity. Should such code always be prepared for invalid strings, at any point? If so, this looks like unnecessary overhead to me.
I said that properly written code should not require minimal UTF-8 byte sequences, because properly written code validates the codepoint sequence (after normalizing it, if required), not the UTF-8 byte sequence.
To expand on that, UTF-8 overlong sequences are a source of security issues because of code that does
    external input -> validate as NTBS -> ... -> pass to UTF-8 API -> decoding -> do something
If validation is supposed to reject ../passwords.txt, the attacker encodes each dot as two bytes and gets around the naive NTBS validation, which no longer sees '..' but something else.
But the actual problem with this code is that the validation should be done on the codepoint sequence, not on the byte sequence. And if you do that, you see the dot as a dot (and the slash as a slash and the NUL as a NUL) regardless of whether it's encoded with one byte or four.
Anyway, that was a detour. In practice I can't think of valid cases for accepting overlong sequences except the long zero, and maybe not even then.
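To make the '..' bypass described above concrete, here is a minimal sketch. It is an illustration only, not code from any library: decode_lenient, bytes_contain_dotdot and codepoints_contain_dotdot are hypothetical names, and the decoder is deliberately lenient, accepting overlong sequences (it also does no surrogate or range checks). The point it tries to show is that a check on the raw bytes misses a dot encoded as C0 AE, while the same check on the decoded codepoint sequence sees it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical lenient decoder: turns UTF-8 bytes into code points and
// accepts overlong (non-minimal) sequences. It returns false only for
// structurally malformed input: a bad lead byte, or a missing or bad
// continuation byte.
inline bool decode_lenient(const std::string& in, std::u32string& out)
{
    out.clear();
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i++]);
        char32_t cp;
        int extra; // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return false;
        for (; extra > 0; --extra) {
            if (i >= in.size()) return false;
            unsigned char c = static_cast<unsigned char>(in[i++]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
    }
    return true;
}

// Naive validation on the byte sequence (the "validate as NTBS" step).
inline bool bytes_contain_dotdot(const std::string& s)
{
    return s.find("..") != std::string::npos;
}

// Validation on the decoded codepoint sequence.
inline bool codepoints_contain_dotdot(const std::u32string& s)
{
    return s.find(U"..") != std::u32string::npos;
}
```

With the input "\xC0\xAE\xC0\xAE/passwords.txt" (each dot encoded as the two bytes C0 AE), bytes_contain_dotdot finds nothing, but after decode_lenient the codepoint sequence is ../passwords.txt and codepoints_contain_dotdot catches it. The alternative design, a strict decoder that rejects overlong sequences outright, closes the same hole at the decoding step.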