Artyom Beilis wrote:
> On Mon, Jun 12, 2017 at 6:05 PM, Vadim Zeitlin via Boost wrote:
> > On Mon, 12 Jun 2017 17:58:32 +0300 Artyom Beilis via Boost wrote:
> > AB> By definition: you can't handle file names that can't be
> > AB> represented in UTF-8, as no valid UTF-8 representation exists
> > AB> for them.
> >
> > This is a nice principle to have in theory, but very unfortunate in
> > practice, because at least under Unix systems such file names do
> > occur in the wild (maybe less often now than 10 years ago, when UTF-8
> > was less ubiquitous, but it is still hard to believe that the problem
> > has completely disappeared). And there are ways to deal with it:
> > glib, I think, represents such file names using special characters
> > from a PUA, and there are other possible approaches, even if,
> > admittedly, none of them is perfect.
>
> Please note: Under POSIX platforms no conversions are performed and no
> UTF-8 validation is done, as this would be incorrect:
Well... what counts as "correct" on POSIX platforms is a matter of
opinion. If you go with the strict interpretation, then even conversion
from the current locale to UTF-8 must be considered incorrect: strictly
speaking you cannot rely on *anything* except that 0x00 is NUL and 0x2F
is the path separator, which makes any kind of isdigit/toupper/tolower/...
string parsing or processing "incorrect" as well.
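Just to make the escaping idea mentioned above a bit more concrete, here
is roughly what such a scheme could look like. This is only a sketch of
the general idea, not what glib actually does; the U+F700 base and the
function names are made up for the example. Valid UTF-8 is passed through
unchanged and every byte that doesn't fit is mapped to a private-use code
point, so any POSIX byte-string name gets some reversible Unicode
spelling:

#include <cstddef>
#include <string>

// Decode one UTF-8 sequence starting at s[i]. On success store the code
// point in cp and return the sequence length (1..4); return 0 if the
// bytes there are not valid UTF-8 (truncated, overlong, surrogate, or
// out of range).
static std::size_t decode_one(const std::string& s, std::size_t i, char32_t& cp)
{
    unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t len;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return 0;

    if (i + len > s.size())
        return 0;
    for (std::size_t k = 1; k < len; ++k) {
        unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (b & 0x3F);
    }
    static const char32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return len;
}

// Valid UTF-8 is copied through; every offending byte becomes one PUA
// code point (U+F700 + byte value), so the original bytes can be
// recovered by the inverse mapping.
std::u32string escape_posix_name(const std::string& bytes)
{
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ) {
        char32_t cp;
        std::size_t len = decode_one(bytes, i, cp);
        if (len == 0) {
            out.push_back(0xF700 + static_cast<unsigned char>(bytes[i]));
            i += 1;
        } else {
            out.push_back(cp);
            i += len;
        }
    }
    return out;
}

Of course a name that legitimately contains those private-use characters
would then collide with the escapes, which is exactly the "none of them
is perfect" caveat above.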
> The only case is when the Windows wide API returns/creates invalid
> UTF-16 -- which can happen only when invalid UTF-16 surrogate pairs are
> generated -- and those have no valid UTF-8 representation.
>
> On the other hand, deliberately creating invalid UTF-8 is a very
> problematic idea.
Since the UTF-8 conversion is only done on/for Windows, and Windows
doesn't guarantee that all wchar_t paths (or strings in general) are
valid UTF-16, wouldn't it make more sense to just *define* that the
library always uses WTF-8, which allows round-tripping of all possible
16-bit strings? If it's documented that way it shouldn't matter,
especially since users of the library cannot rely on the strings being
valid UTF-8 anyway, at least not in portable applications.

I agree that the over-long zero/NUL encoding that is part of "modified
UTF-8" might still be problematic, though, and therefore WTF-8 might be
the better choice.

That still leaves some files that can theoretically exist on a Windows
system inaccessible (i.e. those with embedded NUL characters in their
names), but those are not reachable through the "usual" Windows APIs
(CreateFileW etc.) either, so this should be acceptable.
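In case it helps to see how small the difference actually is: the whole
WTF-8 trick is that a well-formed surrogate pair is combined and encoded
as a normal 4-byte UTF-8 sequence, while a lone surrogate is simply
encoded as a 3-byte sequence as if it were a scalar value. A rough sketch
of the encoding direction (an illustration of the rule only, not the
library's actual conversion code):

#include <cstdint>
#include <string>

// Append one code point (or lone surrogate) in generalized UTF-8 form.
void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {   // this branch also covers lone surrogates
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Encode a (possibly ill-formed) UTF-16 string to WTF-8.
std::string utf16_to_wtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF &&               // high surrogate...
            i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) // ...followed by low
        {
            u = 0x10000 + ((u - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;                                        // consumed the pair
        }
        append_utf8(out, u);  // unpaired surrogates fall through unchanged
    }
    return out;
}

For well-formed UTF-16 input the result is byte-for-byte identical to
ordinary UTF-8, so documenting the strings as WTF-8 instead of UTF-8 only
changes anything for the ill-formed corner cases, and decoding with the
inverse rule gives back exactly the original 16-bit sequence.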