Artyom Beilis wrote:
> On Mon, Jun 12, 2017 at 6:05 PM, Vadim Zeitlin via Boost wrote:
> > On Mon, 12 Jun 2017 17:58:32 +0300 Artyom Beilis via Boost wrote:
> > AB> By definition: you can't handle file names that can't be
> > AB> represented in UTF-8, as no valid UTF-8 representation exists
> > AB> for them.
> >
> > This is a nice principle to have in theory, but very unfortunate in
> > practice, because at least under Unix systems such file names do
> > occur in the wild (maybe less often now than 10 years ago, when UTF-8
> > was less ubiquitous, but it is still hard to believe that the problem
> > has completely disappeared). And there are ways to deal with it:
> > glib, I think, represents such file names using special characters
> > from a PUA, and there are other possible approaches, even if,
> > admittedly, none of them is perfect.
>
> Please note: Under POSIX platforms no conversions are performed and no
> UTF-8 validation is done, as this would be incorrect:
Well... what counts as "correct" on POSIX platforms is a matter of
opinion. If you go with the strict interpretation, then even conversion
from the current locale to UTF-8 must be considered incorrect: strictly
speaking you cannot rely on *anything* except that 0x00 is NUL and 0x2F
is the path separator, which makes any kind of isdigit/toupper/tolower/...
string parsing or processing "incorrect" as well.
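Just to make the escaping idea mentioned above a bit more concrete, here
is roughly what such a scheme could look like. This is only a sketch of
the general idea, not what glib actually does; the U+F700 base and the
function names are made up for the example. Valid UTF-8 is passed through
unchanged and every byte that doesn't fit is mapped to a private-use code
point, so any POSIX byte-string name gets some reversible Unicode
spelling:

#include <cstddef>
#include <string>

// Decode one UTF-8 sequence starting at s[i]. On success store the code
// point in cp and return the sequence length (1..4); return 0 if the
// bytes there are not valid UTF-8 (truncated, overlong, surrogate, or
// out of range).
static std::size_t decode_one(const std::string& s, std::size_t i, char32_t& cp)
{
    unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t len;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return 0;

    if (i + len > s.size())
        return 0;
    for (std::size_t k = 1; k < len; ++k) {
        unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (b & 0x3F);
    }
    static const char32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return len;
}

// Valid UTF-8 is copied through; every offending byte becomes one PUA
// code point (U+F700 + byte value), so the original bytes can be
// recovered by the inverse mapping.
std::u32string escape_posix_name(const std::string& bytes)
{
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ) {
        char32_t cp;
        std::size_t len = decode_one(bytes, i, cp);
        if (len == 0) {
            out.push_back(0xF700 + static_cast<unsigned char>(bytes[i]));
            i += 1;
        } else {
            out.push_back(cp);
            i += len;
        }
    }
    return out;
}

Of course a name that legitimately contains those private-use characters
would then collide with the escapes, which is exactly the "none of them
is perfect" caveat above.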
> The only case is when the Windows wide API returns/creates invalid
> UTF-16 -- which can happen only when invalid UTF-16 surrogate pairs are
> generated -- and those have no valid UTF-8 representation.
>
> On the other hand, deliberately creating invalid UTF-8 is a very
> problematic idea.
Since the UTF-8 conversion is only done on/for Windows, and Windows
doesn't guarantee that all wchar_t paths (or strings in general) are
valid UTF-16, wouldn't it make more sense to just *define* that the
library always uses WTF-8, which allows round-tripping of all possible
16-bit strings? If it's documented that way it shouldn't matter,
especially since users of the library cannot rely on the strings being
valid UTF-8 anyway, at least not in portable applications.

I agree that the over-long zero/NUL encoding that is part of "modified
UTF-8" might still be problematic, though, and therefore WTF-8 might be
the better choice.

That still leaves some files that can theoretically exist on a Windows
system inaccessible (i.e. those with embedded NUL characters in their
names), but those are not reachable through the "usual" Windows APIs
(CreateFileW etc.) either, so this should be acceptable.
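In case it helps to see how small the difference actually is: the whole
WTF-8 trick is that a well-formed surrogate pair is combined and encoded
as a normal 4-byte UTF-8 sequence, while a lone surrogate is simply
encoded as a 3-byte sequence as if it were a scalar value. A rough sketch
of the encoding direction (an illustration of the rule only, not the
library's actual conversion code):

#include <cstdint>
#include <string>

// Append one code point (or lone surrogate) in generalized UTF-8 form.
void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {   // this branch also covers lone surrogates
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Encode a (possibly ill-formed) UTF-16 string to WTF-8.
std::string utf16_to_wtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF &&               // high surrogate...
            i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) // ...followed by low
        {
            u = 0x10000 + ((u - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;                                        // consumed the pair
        }
        append_utf8(out, u);  // unpaired surrogates fall through unchanged
    }
    return out;
}

For well-formed UTF-16 input the result is byte-for-byte identical to
ordinary UTF-8, so documenting the strings as WTF-8 instead of UTF-8 only
changes anything for the ill-formed corner cases, and decoding with the
inverse rule gives back exactly the original 16-bit sequence.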