On Tue, Sep 17, 2019 at 8:17 AM Peter Dimov via Boost wrote:
> Rainer Deyke wrote:
>> Or the user could be running a non-UTF-8 locale, but accessing a filesystem created by somebody who was using UTF-8 - in which case any filenames should be in UTF-8, even if the user's locale disagrees.
> It is because of this last possibility that I recommend treating all command-line arguments as UTF-8 on Unix systems, even if running a non-UTF-8 locale, for all cases where treating them as binary blobs is impractical. Unix filenames are binary blobs, but the de facto standard for interpreting these binary blobs as text is to use UTF-8. [...]
> How does any of this affect the library? It just gives you whatever you passed as `argv`, without needing to interpret it.
Windows is a different story.
Indeed, you can just use UTF-8 (as long as you document this!) everywhere except Windows. On Windows, you need to provide a wchar_t/UTF-16 overload for every char/UTF-8 overload in your lib.

If you want 100% correctness, you are not allowed to arbitrarily convert the wchar_t strings. In particular, you are not allowed to convert them to UTF-8, because one of them may be a filename, and on Windows it is possible to construct filenames that are not properly UTF-16-encoded. That makes the UTF-16 -> UTF-8 conversion lossy if you follow the Unicode guidelines for that conversion -- they say to produce a replacement character (U+FFFD) wherever you encounter the broken UTF-16.

Though such broken-UTF-16-named files are possible to create, they almost never come up in practice. So, if you don't care about this case that prevents 100% correctness, just provide wchar_t overloads, implement each one by converting to UTF-8 and calling your UTF-8 overload, and only define the wchar_t overloads when building on Windows.

Zach