On 8/01/2020 12:57, Yakov Galka wrote:
> Paths are, almost always, concatenated with ASCII separators (or other valid strings) in-between. Even when concatenating malformed strings directly, the issue isn't there if the result is passed immediately back to the "UTF-16" system.
But the conversion from WTF-8 back to UTF-16 can interpret the joining point as a different character, resulting in a different sequence. Unless I've misread something, this could occur if the first string ended in an unpaired high surrogate and the second started with an unpaired low surrogate (or rather the WTF-8 encodings thereof): the naively concatenated bytes are ill-formed WTF-8 and differ from the well-formed encoding of the joined UTF-16 string. Unlikely, perhaps, but not impossible.
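For concreteness, here's a small C++ sketch of that corner case; the byte values come straight from the WTF-8 encoding rules, while the variable names and the program itself are just mine for illustration:

    // Minimal illustration; the names are mine, the byte values follow the
    // WTF-8 encoding rules.
    #include <cassert>
    #include <cstdio>
    #include <string>

    int main() {
        // WTF-8 encodes a lone surrogate like any other code point:
        //   unpaired high surrogate U+D800 -> ED A0 80
        //   unpaired low surrogate  U+DC00 -> ED B0 80
        std::string ends_with_high("\xED\xA0\x80");
        std::string starts_with_low("\xED\xB0\x80");

        // Naive byte-level concatenation of the two WTF-8 strings.
        std::string naive = ends_with_high + starts_with_low;

        // Concatenating in UTF-16 first gives the surrogate pair
        // <D800 DC00>, i.e. U+10000, whose WTF-8 (and UTF-8) encoding is
        // the single 4-byte sequence F0 90 80 80.
        std::string via_utf16("\xF0\x90\x80\x80");

        // The byte sequences differ, and the naive result is ill-formed
        // WTF-8, so a validating WTF-8 -> UTF-16 converter may reject it
        // or handle the join point specially.
        assert(naive != via_utf16);
        std::printf("naive join: %zu bytes, UTF-16-first join: %zu bytes\n",
                    naive.size(), via_utf16.size());
    }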
>> Although on a related note, I think C++11/17 dropped the ball a bit on the new encoding-specific character types. [...]
> C++11 over-engineered it, and you keep over-engineering it even further. Can't think of a time anybody had to mix ASCII, UTF-8, WTF-8 and EBCDIC strings in one program *at compile time*.
You've just suggested cases where apps will contain both UTF-8 and WTF-8 strings, which would be useful to distinguish at compile time -- both to let overloading select the correct conversion function automatically and to get compile errors if you accidentally pass a WTF-8 string to a function that expects pure UTF-8, or vice versa. The same applies to other encoding pairs. That's why C++20 introduced char8_t: so that you wouldn't accidentally pass UTF-8 strings to functions expecting other character encodings. The idea could even be extended to other forms of two-way data encoding, such as UUEncoding or Base64. I don't think that's over-engineering; it's just basic data conversion and type safety.
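As a hedged sketch of what I mean (wtf8_string below is a hypothetical wrapper of my own, not a standard or proposed type, and the conversion bodies are placeholders), distinct types let overload resolution pick the right converter and turn the UTF-8/WTF-8 mix-up into a compile error in C++20:

    #include <string>

    // Hypothetical strong type for WTF-8 data; the name is illustrative only.
    struct wtf8_string {
        std::string bytes;  // may contain WTF-8 encodings of lone surrogates
    };

    // Overloading on the string type lets the compiler pick the conversion;
    // the bodies are placeholders standing in for real converters.
    std::u16string to_utf16(const std::u8string&) { return u"strict UTF-8"; }
    std::u16string to_utf16(const wtf8_string&)   { return u"lossless WTF-8"; }

    // An interface that requires well-formed UTF-8 only.
    void send_over_wire(const std::u8string&) {}

    int main() {
        std::u8string utf8 = u8"plain text";   // char8_t string, C++20
        wtf8_string   wtf8{"\xED\xA0\x80"};    // lone high surrogate, legal WTF-8

        to_utf16(utf8);        // picks the strict overload
        to_utf16(wtf8);        // picks the lossless overload

        send_over_wire(utf8);
        // send_over_wire(wtf8);  // compile error: wtf8_string is not
        //                        // std::u8string, so the mix-up is caught
    }

The commented-out call is exactly the kind of mistake a distinct type catches for free.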