----- Original Message -----
From: Peter Dimov
Andrey Semashev wrote:
> WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with them
> should be the user's explicit choice (e.g. the user should write
> utf16_to_wtf8 instead of utf16_to_utf8).
The user doesn't write such things in practice. He writes things like

    string fn = get_file_name();
    fopen( fn.c_str() );

and get_file_name and fopen must decide how to encode/decode UTF-8. So get_file_name gets some wchar_t[] sequence from Windows, which happens to be invalid UTF-16. But Windows doesn't care about UTF-16 validity, and if you pass
Ok... that is an interesting point, relevant to Boost.Nowide but irrelevant to the utf8_codecvt facets.

The only way UTF-16 can be invalid is to contain improperly paired surrogate units. Those can technically be encoded as invalid UTF-8 representing code points in the closed range reserved for surrogates, i.e. boost::nowide::narrow would generate invalid UTF-8 from invalid UTF-16, and boost::nowide::widen would turn that invalid-in-a-very-special-way UTF-8 back into invalid UTF-16. It looks horrifying to me, but it may actually be a solution to such a problem. However, this should never-ever-ever be used outside Boost.Nowide.

And to be honest - IMHO, if programs fail on file names that are encoded in invalid UTF-16 while Windows states that the encoding is UTF-16... then I think they should fail.
> You should also keep in mind that Unicode strings can have multiple
> representations even if using strict UTF-8. So one could argue that using
> strict UTF-8 provides a false sense of security.
This isn't correct - you are mixing up normalization forms with code point representation. Yes, properly localized software should generally use normalized strings. However, a sequence of valid code points has one and only one representation in both UTF-8 and UTF-16. There is no such thing as "strict UTF-8" - there is either UTF-8 or not.

Interesting note: on Mac OS X, HFS+ requires file names to be stored as decomposed (NFD-style) normalized UTF-8 strings.

Artyom