----- Original Message -----
From: Peter Dimov
Andrey Semashev wrote:
> WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with them
> should be the user's explicit choice (e.g. the user should write
> utf16_to_wtf8 instead of utf16_to_utf8).
The user doesn't write such things in practice. He writes things like

    string fn = get_file_name();
    fopen( fn.c_str() );

and get_file_name and fopen must decide how to encode/decode UTF-8. So get_file_name gets some wchar_t[] sequence from Windows, which happens to be invalid UTF-16. But Windows doesn't care about UTF-16 validity, and if you pass
Ok... that is an interesting point, relevant to Boost.Nowide but irrelevant to the utf8_codecvt facets.

The only way UTF-16 can be invalid is to contain improperly paired surrogate units. Those can technically be encoded as invalid UTF-8 representing code points in the closed range reserved for surrogates, i.e. boost::nowide::narrow would generate invalid UTF-8 from invalid UTF-16, and boost::nowide::widen would turn that invalid-in-a-very-special-way UTF-8 back into invalid UTF-16. It looks horrifying to me, but it may actually be a solution to such a problem. However, this should never-ever-ever be used outside Boost.Nowide.

And to be honest - IMHO, if programs fail on file names that are encoded in invalid UTF-16 while Windows states that the encoding is UTF-16... then I think they should fail.
> You should also keep in mind that Unicode strings can have multiple
> representations even if using strict UTF-8. So one could argue that using
> strict UTF-8 provides a false sense of security.
This isn't correct - you are mixing up normalization forms with code point representation. Yes, properly localized software should generally use normalized strings. However, a sequence of valid code points has one and only one representation in both UTF-8 and UTF-16. There is no such thing as "strict UTF-8" - there is either UTF-8 or not.

Interesting note: on Mac OS X, HFS+ requires file names to be stored as decomposed (NFD-style) normalized UTF-8 strings.

Artyom