To be honest, I don't know what the guys who designed <codecvt> were thinking in the first place.
It was done in the early-to-mid 1990s, with primary input coming from Asian national bodies and the now long-gone Unix vendors who had a big presence in that market.
I'm not talking about std::codecvt<> but about the new C++11 <codecvt> header that provides codecvt_utf8 - which is actually useless for char16_t, or for wchar_t on Windows, because there you need codecvt_utf8_utf16 instead. Very unintuitive, and likely to cause lots of trouble in the future. A major flaw of std::codecvt itself is mbstate_t, which isn't well defined, making it impossible to work with stateful encodings or do any composition/decomposition within the facet.
Header <codecvt> isn't what we need, as you point out below.
Boost.Locale provides one, but currently it is a deeply internal and complex part of the library.
The code I wrote for Boost.Nowide, or the one I suggest putting into the Boost.Locale header-only part, is a codecvt that converts between UTF-8 and UTF-16/32 according to the size of the character:
boost::(nowide|locale)::utf8_facet
- UTF-8 to UTF-16 (Windows)
- UTF-8 to UTF-32 (POSIX)
Don't forget UTF-8 to UTF-8 (some embedded systems).
IMO, a critical aspect of all of those, including UTF-8 to UTF-8, is that
they detect all UTF-8 errors, since ill-formed UTF-8 is used as an attack vector.
See Markus Kuhn's https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
It should. Actually, if you want to validate/encode/decode UTF (8/16/32), there is boost::locale::utf::utf_traits that does it for you. It is also a good test to take a look at for Boost.Locale.

Artyom