
On 7/01/2020 14:58, Yakov Galka wrote:
>> So, while unfortunate, v3 made the correct choice. Paths have to be kept in their original encoding between original source (command line, file, or UI) and file API usage, otherwise you can get weird errors when transcoding produces a different byte sequence that appears identical when actually rendered, but doesn't match the filesystem. Transcoding is only safe when you're going to do something with the string other than using it in a file API.

> See above, malformed UTF-16 can be converted to WTF-8 (a UTF-8 superset) and back losslessly. The unprecedented introduction of a platform specific interface into the standard was, still is, and will always be, a horrible mistake.
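To make the round-trip claim concrete, here is a minimal sketch (my own illustration, not the reference WTF-8 implementation) of encoding possibly ill-formed UTF-16 into WTF-8 and back:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Encode a sequence of 16-bit code units -- possibly containing unpaired
// surrogates -- into WTF-8. Valid surrogate pairs are combined into a
// supplementary code point first; lone surrogates are encoded "as if" they
// were scalar values, which is the generalisation WTF-8 permits.
std::string wtf8_encode(const std::vector<std::uint16_t>& units) {
    std::string out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        std::uint32_t cp = units[i];
        // Combine a valid surrogate pair into one supplementary code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < units.size() &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00);
            ++i;
        }
        // Generalised UTF-8 encoding: lone surrogates pass through unchanged.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Decode WTF-8 back to 16-bit code units (input assumed well-formed WTF-8).
std::vector<std::uint16_t> wtf8_decode(const std::string& s) {
    std::vector<std::uint16_t> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b < 0x80)      { cp = b;        len = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
        else               { cp = b & 0x07; len = 4; }
        for (std::size_t j = 1; j < len; ++j)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + j]) & 0x3F);
        i += len;
        if (cp >= 0x10000) {  // re-split supplementary code points into a pair
            cp -= 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<std::uint16_t>(cp));  // incl. lone surrogates
        }
    }
    return out;
}
```

Round-tripping a lone surrogate such as 0xD800 through these two functions returns the original code-unit sequence unchanged. Note, though, that encoding the two halves of a surrogate pair separately and then concatenating the bytes produces a different (ill-formed) sequence than encoding them together, which is exactly the concatenation caveat in the WTF-8 spec.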
Given that WTF-8 is not itself supported by the C++ standard library (while the other formats are), that doesn't seem like a valid argument. You'd have to campaign for it to be added first.

The main problem, though, is that once you start allowing transcoding of any kind, it's a slippery slope to other conversions that can make lossy changes (such as applying different canonicalisation forms, or adding/removing layout codepoints such as RTL markers).

Also, if you read the WTF-8 spec, it notes that it is not legal to directly concatenate two WTF-8 strings (you either have to convert back to UTF-16 first, or apply special handling to the trailing code units of the first string), which immediately renders it a poor choice for a path storage format, and indeed a poor choice for any purpose. (I suspect many people who are using it have conveniently forgotten that part.)

Although on a related note, I think C++11/17 dropped the ball a bit on the new encoding-specific character types. They're definitely an improvement on the prior method, but it would have been better to do something like:

    struct ansi_encoding_t;
    struct utf_encoding_t;

    typedef encoded_char<ansi_encoding_t, 8>  char_t;
    typedef encoded_char<utf_encoding_t, 8>   char8_t;
    typedef encoded_char<utf_encoding_t, 16>  char16_t;

where encoded_char<E, N> has a storage size of exactly N bits (blittable, and otherwise behaves like a standard integer type) but also carries around an arbitrary encoding tag type E. This could be used to distinguish "a string encoded in UTF-8" from "a string encoded in WTF-8" or "a string encoded in EBCDIC". And supplemental libraries could define additional encodings and conversion functions, and algorithms could operate on generic strings of any encoding.
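That encoded_char<E, N> idea can be sketched as follows; everything here (the tag types, storage_for, encoded_string) is hypothetical illustration, not an existing or proposed API:

```cpp
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical encoding tags; supplemental libraries could add their own.
struct ansi_encoding_t;   // platform "ANSI" code page
struct utf_encoding_t;    // Unicode Transformation Format
struct wtf_encoding_t;    // WTF-8 style generalised encoding

// Select an unsigned integer type with exactly N bits of storage.
template <unsigned N> struct storage_for;
template <> struct storage_for<8>  { using type = std::uint8_t;  };
template <> struct storage_for<16> { using type = std::uint16_t; };
template <> struct storage_for<32> { using type = std::uint32_t; };

// A blittable code-unit type: same size and layout as a plain integer,
// but carrying the encoding tag E around in the type system.
template <typename E, unsigned N>
struct encoded_char {
    using storage = typename storage_for<N>::type;
    storage value;

    constexpr encoded_char(storage v = 0) : value(v) {}
    constexpr operator storage() const { return value; }
};

static_assert(sizeof(encoded_char<utf_encoding_t, 8>) == 1,
              "no overhead over a raw code unit");

// Strings of different encodings are now distinct types, so accidentally
// mixing them fails at compile time instead of corrupting data at run time.
template <typename E, unsigned N>
using encoded_string = std::vector<encoded_char<E, N>>;

using utf8_string = encoded_string<utf_encoding_t, 8>;
using wtf8_string = encoded_string<wtf_encoding_t, 8>;

static_assert(!std::is_same<utf8_string, wtf8_string>::value,
              "UTF-8 and WTF-8 strings never interconvert implicitly");
```

Conversion between encodings would then have to go through an explicit function such as a (hypothetical) `utf8_string to_utf8(const wtf8_string&)`, and generic algorithms could be written once over `encoded_string<E, N>` for any encoding tag E.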