On 17/09/2019 16:10, Vicram Rajagopalan wrote:
> I'm not too familiar with dealing with non-ASCII character encodings in argv. Is it portable to assume that the input is UTF-8, regardless of locale?
It is not. I'm probably ignorant of several things in this area myself, but the basic version is:

* On Windows, argv is converted to the current system codepage unless you use the wmain/wWinMain entry points to get wchar_t strings instead. (And you should never use the converted values: the conversion is lossy, so they will only sometimes work.) It will never be UTF-8, but you can rely on it being UTF-16 when using wmain/wWinMain (sketch below).

* On Unixes, argv contains whatever byte sequence the shell/caller put there. This might be the actual filename on disk (if they used tab completion), or something subtly different (if they typed it themselves using some kind of IME), or even a binary blob. In the first two cases it is fairly *likely* to be UTF-8 (especially on modern systems), but that is not guaranteed: the user could be running a non-UTF-8 locale, or be accessing a filesystem created by someone who was. Ideally, treat each argument as an opaque blob that can only be passed to open() etc. and never manipulated as text (sketch below). (Obviously, this is frequently impractical.)

So, on Windows, you must take the wchar_t strings as input, and while you *could* convert them to UTF-8 for internal use, you still have to convert back to UTF-16 to actually do anything with the OS. That's fine if you're doing a lot of string manipulation (including option parsing), but it seems a bit wasteful if you're only using an argument as an opaque filename token. (And if you forget to convert back to UTF-16, the OS may interpret your UTF-8 string as a local-codepage ANSI string, and hilarity ensues.)

Whereas on Linux you can often get away with assuming it's UTF-8, but some valid filenames will break encoder-savvy code, and any string conversion might output a no-longer-valid filename.
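
To make the Windows side concrete, a minimal sketch of the wmain entry point (untested, illustrative output only):

    #include <wchar.h>

    /* On Windows, wmain receives the command line as UTF-16
     * wchar_t strings, bypassing the lossy codepage conversion
     * that a plain main(int, char**) suffers. */
    int wmain(int argc, wchar_t **argv)
    {
        for (int i = 0; i < argc; i++)
            wprintf(L"arg %d: %ls\n", i, argv[i]);
        /* (Printing wide strings to the console correctly is its
         * own can of worms; this only shows the entry point.) */
        return 0;
    }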
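
And the Unix side, treating the argument as an opaque byte string handed straight to open() -- a sketch, assuming argv[1] is a filename:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Treat argv[1] as an opaque byte string: hand it straight to
     * open() without re-encoding or normalising it as text. */
    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd == -1) {
            perror(argv[1]);  /* may print mojibake; so be it */
            return 1;
        }
        /* ... read from fd ... */
        close(fd);
        return 0;
    }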
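
If you do want UTF-8 internally on Windows, the round trip looks roughly like this, using WideCharToMultiByte/MultiByteToWideChar with CP_UTF8. Error handling is elided, the helper names are mine, and the caller frees the results:

    #include <windows.h>
    #include <stdlib.h>

    /* UTF-16 argv entry -> UTF-8 for internal use. */
    static char *utf16_to_utf8(const wchar_t *w)
    {
        int n = WideCharToMultiByte(CP_UTF8, 0, w, -1, NULL, 0, NULL, NULL);
        char *s = malloc(n);
        WideCharToMultiByte(CP_UTF8, 0, w, -1, s, n, NULL, NULL);
        return s;
    }

    /* ...and back to UTF-16 before any OS call. */
    static wchar_t *utf8_to_utf16(const char *s)
    {
        int n = MultiByteToWideChar(CP_UTF8, 0, s, -1, NULL, 0);
        wchar_t *w = malloc(n * sizeof(wchar_t));
        MultiByteToWideChar(CP_UTF8, 0, s, -1, w, n);
        return w;
    }

The point being that anything you open has to go back through utf8_to_utf16() and CreateFileW; handing the UTF-8 string to an ...A function is exactly the hilarity-ensues case above.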
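
As for "some valid filenames will break encoder-savvy code", here's one way to see it: a 0xFF byte can never appear in well-formed UTF-8, yet it is a perfectly legal filename byte on most Unix filesystems. A sketch:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Create a file whose name is valid on disk but is not valid
     * UTF-8; code that insists on decoding filenames as UTF-8
     * will reject or mangle it. */
    int main(void)
    {
        const char *name = "valid-on-disk-\xff.txt";
        int fd = open(name, O_CREAT | O_WRONLY, 0644);
        if (fd != -1) {
            close(fd);
            puts("created a file whose name is not valid UTF-8");
        }
        return 0;
    }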