[nowide] Library Updates and Boost's broken UTF-8 codecvt facet

newer
Boost Library Testing - a modest...

Artyom Beilis

7 Oct 2015 7 Oct '15

2:49 p.m.

Some updated regarding boost.nowide 1. Library moved to github and its format is converted to modular boost layout: https://github.com/artyom-beilis/nowide2. Fixed unsupported std::ios::ate flag by boost::nowide::fstream3. Added some C++11 interfaces to boost::nowide::fstream4. Added integration functionality with boost::filesystem: https://github.com/artyom-beilis/nowide/blob/master/include/boost/nowide/int... And another important update is that I implemented proper utf8 to utf-16/utf-32 codecvt facet https://github.com/artyom-beilis/nowide/blob/master/include/boost/nowide/utf... It implemented as template version working with wchar_t, char16_t and char32_t. Now I explain. There is widely used utf8 codecvt facet in various parts of code: https://github.com/boostorg/detail/blob/master/include/boost/detail/utf8_cod... https://github.com/boostorg/detail/blob/master/include/boost/detail/utf8_cod... However it is buggy and actually broken for 3 reasons: 1. It supports UCS-2 instead of UTF-16 - i.e it does not code properly Unicode characters outside BMP i.e. code points with values abouve 0xFFFF2. It allows invalid code points in UTF-32/UCS-4 i.e. above 10FFFF or that are reserved for surrogate pairs of UTF-163. And actually allows UTF-8 sequences longer than 4 (which is wrong) As a result, for example you can't use boost::filesystem::path with characters like "𝒞" U+1D49E or may actually create wrong encodings trying to read/write filesystem objects. Independently of Boost.Nowide I would like to propose replacement of boost/detail/utf8_codecvt_facet by one thatactually takes in account proper Unicode handling. Artyom Beilis--------------CppCMS - C++ Web Framework: http://cppcms.com/CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/

Show replies by date

Robert Ramey

7 Oct 7 Oct

3:50 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

On 10/7/15 7:49 AM, Artyom Beilis wrote:

...

Independently of Boost.Nowide I would like to propose replacement of boost/detail/utf8_codecvt_facet by one thatactually takes in account proper Unicode handling.

This is a worthy undertaking. Consider putting it somewhere else other than "detail" directory/namespace. This should have been an officially supported boost component years and years ago. Also I'd like to see it look more like a boost library - tests, documentation ,etc. On the other hand, the standard library includes officially sanctioned <codecvt> so maybe this isn't as critical as it used to be. I'm not really trying to make a case for anything specific here, just reacting to years and years of dealing with this festering sore. Robert Ramey

Artyom Beilis

4:14 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

...

________________________________ From: Robert Ramey <ramey@rrsd.com>

...
Independently of Boost.Nowide I would like to propose replacement of boost/detail/utf8_codecvt_facet by one thatactually takes in account proper Unicode handling.

This is a worthy undertaking. Consider putting it somewhere else other than "detail" directory/namespace. This should have been an officially supported boost component years and years ago. Also I'd like to see it look more like a boost library - tests, documentation ,etc.

...

First of all I can add it as header only component to Boost.Locale with its unit test as it would be a proper place - without any review process (also code is partially based on boost-locale's one) And than other libraries should replace the broken one by a new one and remove boost/detail/utf8_codevt* all together. Also if Boost.Nowide gets reviewed/accepted it can be proper place as well because nowide is "lightweight library" when boost.Locale is heavy one.

...

On the other hand, the standard library includes officially sanctioned <codecvt> so maybe this isn't as critical as it used to be.

I'm not really trying to make a case for anything specific here, just reacting to years and years of dealing with this festering sore.

...

Not every library provides this (libstdc++ isn't for example) It is also does some other stuff that does not really fit the context. Major use of utf8_codecvt facet is to "connect" narrow and wide APIs. C++11's <codecvt> is just too long for weight especially when C++03 would be with us for a loooooooong period.

...

Robert Ramey

Artyom Beilis

Robert Ramey

5:37 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

On 10/7/15 9:14 AM, Artyom Beilis wrote:

...

...
________________________________ From: Robert Ramey <ramey@rrsd.com>

...
Independently of Boost.Nowide I would like to propose replacement of boost/detail/utf8_codecvt_facet by one thatactually takes in account proper Unicode handling.

This is a worthy undertaking. Consider putting it somewhere else other than "detail" directory/namespace. This should have been an officially supported boost component years and years ago. Also I'd like to see it look more like a boost library - tests, documentation ,etc.

...
First of all I can add it as header only component to Boost.Locale with its unit test as it would be a proper place - without any review process (also code is partially based on boost-locale's one)

Do this ! Note the test in the serialization library for the current facet. I don't know if it's useful - it has been in the past. I presume you can either move that test in or supply one that's at least as good.

...

And than other libraries should replace the broken one by a new one and remove boost/detail/utf8_codevt* all together.

You'll have to give us some warning - maybe skip a release or maybe just use header re-direction.

...

Not every library provides this (libstdc++ isn't for example)

Hmmm - I'll be damned. I've been including it conditionally on a macro from Boost config. Now that I think about it, since the test matrix uses older compilers it must still be used.

...

It is also does some other stuff that does not really fit the context.

It was written a long time ago when library support was all over the place. It's served us pretty well. I'm sure you can do much better now and I'll be damned disappointed if you don't!

...

Major use of utf8_codecvt facet is to "connect" narrow and wide APIs. C++11's <codecvt> is just too long for weight especially when C++03 would be with us for a loooooooong period.

It will be good to have an option.

...

...
Robert Ramey

Artyom Beilis

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Artyom Beilis

8 Oct 8 Oct

9:52 a.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

...

From: Robert Ramey <ramey@rrsd.com> On 10/7/15 9:14 AM, Artyom Beilis wrote:

...
...
From: Robert Ramey <ramey@rrsd.com>

...
Independently of Boost.Nowide I would like to propose replacement of boost/detail/utf8_codecvt_facet by one thatactually takes in account proper Unicode handling.

This is a worthy undertaking. Consider putting it somewhere else other than "detail" directory/namespace. This should have been an officially supported boost component years and years ago. Also I'd like to see it look more like a boost library - tests, documentation ,etc.

...
First of all I can add it as header only component to Boost.Locale with its unit test as it would be a proper place - without any review process (also code is partially based on boost-locale's one)

Do this !

...

Ok... anybody what to comment besides Robert Ramey? There is also small but significant difference, my UTF-8 <-> UTF-16/32 facets are template based and header only implementation. Regarding the test - I have my own that the old facet obviously wouldn't pass.

...

You'll have to give us some warning - maybe skip a release or maybe just use header re-direction.

...

It can't be simple redirection as it works in little bit different way. I think just keeping old utf8 facet until all library maintainers replace it - and it would likely take more than one release :-) Artyom Beilis

Robert Ramey

12:55 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

On 10/8/15 2:52 AM, Artyom Beilis wrote:

...

There is also small but significant difference, my UTF-8 <-> UTF-16/32 facets are template based and header only implementation.

The current one is also header only and template based

...

Regarding the test - I have my own that the old facet obviously wouldn't pass.

That is not my question. Will your new version pass the current test?

...

...
You'll have to give us some warning - maybe skip a release or maybe just use header re-direction.

...

It can't be simple redirection as it works in little bit different way.

Hmmm - very suspicious. I expect to just be able to include a header and create an instance of the codecvt type and start using it. If I can't do this, we might have a problem.

...

I think just keeping old utf8 facet until all library maintainers replace it - and it would likely take more than one release :-)

Artyom Beilis

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Robert Ramey

12:58 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

On 10/8/15 5:55 AM, Robert Ramey wrote:

...

On 10/8/15 2:52 AM, Artyom Beilis wrote:

...
There is also small but significant difference, my UTF-8 <-> UTF-16/32 facets are template based and header only implementation.

The current one is also header only and template based

whoops - the current requires is partly implemented as a *.cpp file that we each compile into our by inclusion. This kludge came about because it was considered unacceptable that the component wasn't reviewed but no one wanted to take responsibility for making a boost quality library for it. Now I'm wondering if maybe this should be reviewed. Robert Ramey

Artyom Beilis

1:39 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

Boost.Serialization's does not even built on my platform Ubuntu gcc-4.8

...

...

...

...
Regarding the test - I have my own that the old facet obviously wouldn't pass.

That is not my question. Will your new version pass the current test?

http://www.boost.org/development/tests/master/developer/output/teeks99-05a-U... I fixed it by this diff --git a/include/boost/serialization/static_warning.hpp b/include/boost/serialization/static_warning.hpp index cbbcb82..7ca927b 100644 --- a/include/boost/serialization/static_warning.hpp +++ b/include/boost/serialization/static_warning.hpp @@ -96,7 +96,7 @@ struct BOOST_SERIALIZATION_SS {}; #define BOOST_SERIALIZATION_BSW(B, L) \ typedef boost::serialization::BOOST_SERIALIZATION_SS< \ sizeof( boost::serialization::static_warning_test< B, L > ) \ - > BOOST_JOIN(STATIC_WARNING_LINE, L) BOOST_STATIC_ASSERT_UNUSED_ATTRIBUTE; + > BOOST_JOIN(STATIC_WARNING_LINE, L) ; #define BOOST_STATIC_WARNING(B) BOOST_SERIALIZATION_BSW(B, __LINE__) #endif // BOOST_SERIALIZATION_STATIC_WARNING_HPP That test passes with new facet. But I must admit the test is very basic and does not test even 1/2 of what need to be tested in the interface.

Robert Ramey

3:52 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

On 10/8/15 6:39 AM, Artyom Beilis wrote:

...

Boost.Serialization's does not even built on my platform Ubuntu gcc-4.8

...
...
...
...
Regarding the test - I have my own that the old facet obviously wouldn't pass.

That is not my question. Will your new version pass the current test?

http://www.boost.org/development/tests/master/developer/output/teeks99-05a-U...

I meant: https://github.com/boostorg/serialization/blob/develop/test/test_utf8_codecv... Of course I expect that your test should be better - but I would expect it to pass the current test. Unless of course if the current test is wrong - which could well be a problem. If I had my druthers, I'd like to see a library for aiding in the creating of code convert facets including documentation about what they are and how to use them. Examples of usage would be the new utf8codecvt facet we're talking about. Of course that's a large project. And for all I know your library already does this. The way I usually go about this is: a) I have a problem - I need a portable codecvt facet for doing some specific thing - out put utf8 for example. b) I troll the web and boost looking for an idiot proof solution. c) If I find one, I subject the documentation to a cursory examination, and then insert the component into mey app and see if it works. If it works, then I'm golden. I'll likely spend more time with the library in different contexts. If it doesn't I'll just move on. So this is what I'd like to see - It may already be done and I haven't looked at it because I already had a "solution" or maybe it's done in a way which isn't idiot proof enough for me. At one time I did take a cursory look at the documentation. I'm not sure I really have a point here. I've just been unsatisfied with our current solution and hope for something more "complete" Robert Ramey

Artyom Beilis

8:32 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

...

I meant:

https://github.com/boostorg/serialization/blob/develop/test/test_utf8_codecv...

I adopted it and It passes - no problem there.

...

creating of code convert facets including documentation about what they are and how to use them. Examples of usage would be the new utf8codecvt facet we're talking about.

How can I say this gently. codecvt facet is used there because it is "Standard" way of doing things and it is far from being flawless [1] but it exists and it is the ultimate way to convert between encodings in C++. std::locale is complex stuff with many issues by design including its codecvt facet - they are both hard to create and use. Most of users don't really need them - ideally you just run std::locale::global(std::locale("")) And everything just works for any stuff that needs to handle encoding. But in reality it does not. So you need to put workarounds and create stuff like utf8 facet because some standard libraries on some very well known operating system do not support UTF-8 locales. And consider it "confusing" that std::string is accidentally becomes utf-8 encoded string. All the char const *str = u8"привет-שלום" stuff to encode UTF-8 string was born in sin and probably die this way because of some specific vendors that ignore what the world had learned well. So ideally end users should not care about codecvt - that is why originally in nowide there is just a function called boost::nowide::nowide_filesystem() And magic happens. The problem to understand how the magic works you need to learn a lots of things and simple tutorial isn't enough - even entire library like Boost.Nowide isn't always enough. Regards, Artyom Beilis [1] One of the sings it does not allow to implement stateful encodings (such that can compose and decompose some characters the way iconv does)

Vladimir Prus

12:29 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

On 07-Oct-15 8:37 PM, Robert Ramey wrote:

...

Do this !

Note the test in the serialization library for the current facet. I don't know if it's useful - it has been in the past. I presume you can either move that test in or supply one that's at least as good.

...
And than other libraries should replace the broken one by a new one and remove boost/detail/utf8_codevt* all together.

You'll have to give us some warning - maybe skip a release or maybe just use header re-direction.

I would prefer for any such replacement to be still part of boost/detail. Depending on boost.locale just to get one header seems not perfect. - Volodya

Artyom Beilis

12:59 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

...

From: Vladimir Prus <vladimir.prus@gmail.com> On 07-Oct-15 8:37 PM, Robert Ramey wrote:

...
Do this !

Note the test in the serialization library for the current facet. I don't know if it's useful - it has been in the past. I presume you can either move that test in or supply one that's at least as good.

...
And than other libraries should replace the broken one by a new one and remove boost/detail/utf8_codevt* all together.

You'll have to give us some warning - maybe skip a release or maybe just use header re-direction.

I would prefer for any such replacement to be still part of boost/detail. Depending on boost.locale just to get one header seems not perfect.

- Volodya

The current utf8_codecvt can't be used as is and designed to be "copied" to specific libraries (with its own namespace) My codecvt in generic header only one that can and should be used outside boost specific libraries, i.e. if I want to read/write utf8 file using std::wfstream I can install this facet independently of rest of the boost libraries. Details is just like boost-private namespace. So isn't best place, on the other hand Unicode facilities is good place, especially that it is a component independent of entire locale library i.e. header only part similar to <boost/locale/utf.hpp> We can create a "Separate" codecvt library with its own formal review and it would be ready in best case in a year - or just use it in a place unicode processing belongs to. Artyom

Peter Dimov

2:03 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8 codecvt facet

Artyom Beilis wrote:

...

We can create a "Separate" codecvt library with its own formal review and it would be ready in best case in a year...

One option is to put it into utility; another is to use a mini-review if the new codecvt library is an implementation of the standard <codecvt> interface. std::codecvt_utf8 is not quite the same as boost::utf8_codecvt_facet, but on the other hand, from your previous message it seems that your utf8_codecvt_facet is not std::codecvt_utf8 but std::codecvt_utf8_utf16, or perhaps it's the latter when wchar_t is 16 bit and the former when it's 32 bit.

Artyom Beilis

2:54 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8 codecvt facet

----- Original Message -----

...

From: Peter Dimov <lists@pdimov.com> Artyom Beilis wrote:

...
We can create a "Separate" codecvt library with its own formal review and it would be ready in best case in a year...

One option is to put it into utility; another is to use a mini-review if the new codecvt library is an implementation of the standard <codecvt> interface.

std::codecvt_utf8 is not quite the same as boost::utf8_codecvt_facet, but on the other hand, from your previous message it seems that your utf8_codecvt_facet is not std::codecvt_utf8 but std::codecvt_utf8_utf16, or perhaps it's the latter when wchar_t is 16 bit and the former when it's 32 bit.

[BEGIN: Long description regarding <codecvt> ] To be honest I don't know what guys who designed <codecvt> in first place thought of - I feel string influence of broken MS Unicode policies std::codecvt_utf8 is actually quite misleading - it converts between utf8 and ucs-2/ucs-4 i.e. using it under windows with wchar_t you wouldn't get support of utf-16 at all. It basically does what boost::XXX:utf8_codecvt_facet does for std::codecvt_utf8<wchar_t>. Basically broken and useless as UCS-2 is subset of proper encoding. Now <codecvt>'s std::codecvt_mode is clear Microsoftism as for example using UTF-8 BOM is one of the many Unicode crimes Microsoft created - as storing UTF-16 files on disk. Another hilarious stuff is Maxcode = 0x10ffff template parameter for the facet... It is like creating template<double Pi_Value=3.14159> class circle; 0x10FFFF IS max value for Unicode codepoint, not 0xFFFF not anything else. std::codecvt_utf16 is an attempt to build "narrow" utf-16 encoding, just no comment... std::codecvt_utf8_utf16 is actually useful under windows and does what it is supposed to to with wchar_t... but under POSIX platform it is impossible to use std::codecvt_utf8_utf16 with wchar_t because wchar_t is UTF-32... So if you want to install utf8 to wchar_t codecvt facet that represents utf-16 or utf-32 according to platform you need to use if(sizeof(wchar_t) == 2) return new std::codecvt_utf8_utf16<wchar_t>(); else // sizeof(wchar_t) == 4 return new std::codecvt_utf8<wchar_t>(); So all <codecvt> was built wrong under strong Microsoft development policy influences and useless for any cross platform development. So... Boost community - please give yourself a favor Don't use <codecvt> unless you really understand what are you doing. [END: Long description regarding <codecvt> ] If you want to covert utf8 files properly to native wide character like for example for boost::filesystem, boost::serialization or std::fstream you need to use facet that converts to utf-16 or utf-32 according to what wchar_t holds and <codecvt> does not provide one (without platform specific tricks) So I'm not going to implement C++11 <codecvt> because IMHO it is broken by design in first place. Boost.Locale provides one but currently it is deep internal and complex part of library. The code I written for Boost.Nowide or one I suggest to put into Boost.Locale header-only part is codecvt that converts between utf8 and utf-16/32 according to size of character: boost::(nowide|or locale)::utf8_facet<wchar_t> - utf-8 to utf-16 (windows) utf-32 (posix) boost::(nowide|or locale)::utf8_facet<char16_t> - utf-8 to utf-16 on any platform boost::(nowide|or locale)::utf8_facet<char32_t> - utf-8 to utf-32 on any platform That's it. It isn't <codecvt> because C++11 <codecvt> does not actually do the job needed. Artyom Beilis

Robert Ramey

4:07 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8 codecvt facet

On 10/8/15 7:54 AM, Artyom Beilis wrote:

...

----- Original Message -----

[BEGIN: Long description regarding <codecvt> ]

...

So... Boost community - please give yourself a favor Don't use <codecvt> unless you really understand what are you doing.

Well, I use <codecvt> and boost::utf8_codecvt and I definitely don't know what I'm doing. That (and the fact that I don't have any extra time) is the reason for using a library in first place. The whole, locale/facet/codecvt saga is long and very difficult to fathom. To make things worse it has a tortured history of library writers not getting it right. If one looks at the utf_codecvt facet there's lot's of workaround for older compilers and libraries. So it's high time this be rationalized. I think the concept has merit and would do well with a good library and educational documentation to match.

...

[END: Long description regarding <codecvt> ]

If you want to covert utf8 files properly to native wide character like for example for boost::filesystem,

boost::serialization or std::fstream you need to use facet that converts to utf-16 or utf-32 according to what wchar_t holds and <codecvt> does not provide one (without platform specific tricks)

I see that, but we could easily select which codecvt facet depending on the size of the wchar on the specific platform. I dislike libraries which do "too much" in order to "just" work. codecvt library should be a) A tool kit ot create codecvt facets b) some generated examples which will cover what most users need c) a bunch of tutorial information about how codecvt can be used - especially outside of stream i/o d) anything else which is useful. Note I'm aware that this is a huge task to do right - I certainly wouldn't blame anyone for not taking it on.

...

So I'm not going to implement C++11 <codecvt> because IMHO it is broken by design in first place.

Hmm - I'd have to think more about this. If <codecvt> is ill concieved - I'm sure one could propose an alternative.

...

Boost.Locale provides one but currently it is deep internal and complex part of library.

Hmmm - very interesting. Maybe it's a question of factoring out this part and repackaging it in a more digestible form. That would be interesting.

...

The code I written for Boost.Nowide or one I suggest to put into Boost.Locale header-only part is codecvt that converts between utf8 and utf-16/32 according to size of character:

...

boost::(nowide|or locale)::utf8_facet<wchar_t> - utf-8 to utf-16 (windows) utf-32 (posix) boost::(nowide|or locale)::utf8_facet<char16_t> - utf-8 to utf-16 on any platform boost::(nowide|or locale)::utf8_facet<char32_t> - utf-8 to utf-32 on any platform

That's it. It isn't <codecvt> because C++11 <codecvt> does not actually do the job needed.

I'll have to take your word for it.

...

Artyom Beilis

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Robert Ramey

4:10 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8 codecvt facet

Hmm - is there a reason that a better name wouldn't be Boost.Codecvt ?

Peter Dimov

7:14 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

Artyom Beilis wrote:

...

The code I written for Boost.Nowide or one I suggest to put into Boost.Locale header-only part is codecvt that converts between utf8 and utf-16/32 according to size of character:

boost::(nowide|or locale)::utf8_facet<wchar_t> - utf-8 to utf-16 (windows) utf-32 (posix) boost::(nowide|or locale)::utf8_facet<char16_t> - utf-8 to utf-16 on any platform boost::(nowide|or locale)::utf8_facet<char32_t> - utf-8 to utf-32 on any platform

That's it. It isn't <codecvt> because C++11 <codecvt> does not actually do the job needed.

I agree that this makes the most sense. I only brought up <codecvt> because if we used the standard interface and names we wouldn't have needed a full review of the hypothetical libs/codecvt. As this stands, libs/utility seems the best bet, although I'm not overly fond of the practice of putting everything that doesn't fit elsewhere into Utility. :-) But it's better than Detail because it's documented and tested. One could make the case for libs/utf8 which would contain utf8_facet and the "obvious" bool is_valid_utf8( string const & s ); wstring utf8_decode( string const & s ); string utf8_encode( wstring const & s ); but this is already well into full review/bikeshed territory.

Robert Ramey

7:54 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

On 10/8/15 12:14 PM, Peter Dimov wrote:

...

One could make the case for libs/utf8 which would contain utf8_facet and the "obvious"

bool is_valid_utf8( string const & s ); wstring utf8_decode( string const & s ); string utf8_encode( wstring const & s );

but this is already well into full review/bikeshed territory.

I would like to see libs/codecvt which included the above functions Robert Ramey

Artyom Beilis

8:17 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

-------------------------------------------- On Thu, 10/8/15, Peter Dimov <lists@pdimov.com> wrote: Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet To: boost@lists.boost.org Date: Thursday, October 8, 2015, 10:14 PM

...

I agree that this makes the most sense. I only brought up <codecvt> because if we used the standard interface and names we wouldn't have needed a full review of the hypothetical libs/codecvt.

See... lots of stuff in standard library related to Unicode is broken. It wasn't fixed in C++11 and wouldn't be later. Also there is deep problem with Windows API that created Wide API and ignores any standard - both C and C++. i.e.. there are basic files that can't even be opened on Windows using plain C fopen or C++ std::fstream.

...

As this stands, libs/utility seems the best bet, although I'm not overly fond of the practice of putting everything that doesn't fit elsewhere into Utility. :-) But it's better than Detail because it's documented and tested. One could make the case for libs/utf8 which would contain utf8_facet and the "obvious"

bool is_valid_utf8( string const & s ); wstring utf8_decode( string const & s ); string utf8_encode( wstring const & s );

but this is already well into full review/bikeshed territory.

See, all this is already implemented in header only way in Boost.Locale - so no linking required. https://github.com/boostorg/locale/blob/master/include/boost/locale/utf.hpp https://github.com/boostorg/locale/blob/master/include/boost/locale/encoding... So just call boost::locale::conv::utf_to_utf<wchar_t>("Hello World"); Full codecvt_facet for many encodings - inluding UTF-8, ISO-8859-*, Windows-125* are already there as well However there is very useful specific codecvt - that converts between utf8 and wchar_t/char16_t/char32_t that can be implemented in header only without linking with big and complex Boost.Locale library. Also I'm going to make it little bit more generic so you can implement wchar_t/char16_t/char32_t to any stateless encoding easily (I want to improve some stuff withing Boost.Locale as well) So utf8 codecvt facet is INTEGRAL part of Boost.Locale already - it exists there. Just I think I'll make it more accessible to general libraries without requirement of linking and easiler to use by users without need of special locale generation. Ok... I decided what I'm going to do. Next step is for other libraries to adopt this utf8_codecvt facet. Artyom Beilis -------------- CppCMS - C++ Web Framework: http://cppcms.com/ CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/

Robert Ramey

9:57 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

On 10/8/15 1:17 PM, Artyom Beilis wrote:

...

See, all this is already implemented in header only way in Boost.Locale - so no linking required.

good to hear.

...

https://github.com/boostorg/locale/blob/master/include/boost/locale/utf.hpp https://github.com/boostorg/locale/blob/master/include/boost/locale/encoding...

So just call boost::locale::conv::utf_to_utf<wchar_t>("Hello World");

Full codecvt_facet for many encodings - inluding UTF-8, ISO-8859-*, Windows-125* are already there as well

excellent

...

However there is very useful specific codecvt - that converts between utf8 and wchar_t/char16_t/char32_t that can be implemented in header only without linking with big and complex Boost.Locale library.

also good.

...

Also I'm going to make it little bit more generic so you can implement wchar_t/char16_t/char32_t to any stateless encoding easily (I want to improve some stuff withing Boost.Locale as well)

So utf8 codecvt facet is INTEGRAL part of Boost.Locale already - it exists there.

so what are we talking about?

...

Just I think I'll make it more accessible to general libraries without requirement of linking and easiler to use by users without need of special locale generation.

LOL - So what I'm hearing here that there is already the codecvt facet created as part of Boost.Locale and all we need is to enhance the documenation in order to make it more obvious to those of us that need it. I've been hoping all along that we've have something like this. Some time ago I have Boost.Locale a cursory look in search of such a thing. It's quite possible it's always been there and I didn't see it, that's it's "implicit" and obvious to someone who has read and understood he rest of the documentation, or that the library has evolved since I did this or all of the above. In any case I'm very interested in this.

...

Ok... I decided what I'm going to do.

OK - do this and let us know, I'll look it over again and give you some feedback from a novice.

...

Next step is for other libraries to adopt this utf8_codecvt facet.

LOL - you do your part - and the rest will happen. Robert Ramey

...

Artyom Beilis -------------- CppCMS - C++ Web Framework: http://cppcms.com/ CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Peter Dimov

9 Oct 9 Oct

1:32 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

Artyom Beilis wrote:

...

...
bool is_valid_utf8( string const & s ); wstring utf8_decode( string const & s ); string utf8_encode( wstring const & s );

See, all this is already implemented in header only way in Boost.Locale - so no linking required.

https://github.com/boostorg/locale/blob/master/include/boost/locale/utf.hpp https://github.com/boostorg/locale/blob/master/include/boost/locale/encoding...

So just call boost::locale::conv::utf_to_utf<wchar_t>("Hello World");

That's nice. I'd prefer UTF-8 support to not be tied to locales as it's more fundamental, but you have basically anticipated that and made Nowide, which is exactly what I need on a day-to-day basis - support for UTF-8 as the ubiquitous external encoding. Nowide however spells those functions in a different way, nowide::widen and nowide::narrow. One issue is that these functions are not immediately discoverable, in either place. One would expect UTF-8 functions to be in a library having UTF-8 in the name, or for the functions to have utf8 in their names.

Beman Dawes

2:17 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8 codecvt facet

...

To be honest I don't know what guys who designed <codecvt> in first place

It was done in the early and mid 1990's, with primary input coming from Asian national bodies and the now long gone Unix vendors who had a big presence in that market. thought of - I feel string influence of broken MS Unicode policies

...

This was years before Microsoft folks started to participate in the LWG.

...

So I'm not going to implement C++11 <codecvt> because IMHO it is broken by design in first place.

Header <codecvt> isn't what we need, as you point out below.

...

Boost.Locale provides one but currently it is deep internal and complex part of library.

The code I written for Boost.Nowide or one I suggest to put into Boost.Locale header-only part is codecvt that converts between utf8 and utf-16/32 according to size of character:

boost::(nowide|or locale)::utf8_facet<wchar_t> - utf-8 to utf-16 (windows) utf-32 (posix)

Don't forget utf-8 to utf-8 (some embedded systems). IMO, a critical aspect of all of those, including utf-8 to utf-8, is that they detect all utf-8 errors since ill-formed utf-8 is used as an attack vector. See Markus Kuhn's https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt I can contribute a Boost regression test friendly version of Kuhn's malformed tests. --Beman

Peter Dimov

2:41 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8codecvt facet

Beman Dawes wrote:

...

IMO, a critical aspect of all of those, including utf-8 to utf-8, is that they detect all utf-8 errors since ill-formed utf-8 is used as an attack vector.

That is what I alluded to earlier with my bikeshedding comment - I personally find this policy a bit too firm for my taste. Sure, sometimes I do want to reject any invalid UTF-8 with extreme prejudice, but at other times I do not. For instance, when I get a Windows file name, it can well be invalid UTF-16, which when converted will become invalid UTF-8 but which will roundtrip correctly back to its original invalid UTF-16 form and refer to the same file. That's why things like CESU-8 or WTF-8 exist. So I like the "method" argument of locale::conv::utf_to_utf, except that I think that it doesn't offer enough control.

Andrey Semashev

3:03 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8codecvt facet

On 09.10.2015 17:41, Peter Dimov wrote:

...

Beman Dawes wrote:

...
IMO, a critical aspect of all of those, including utf-8 to utf-8, is that they detect all utf-8 errors since ill-formed utf-8 is used as an attack vector.

That is what I alluded to earlier with my bikeshedding comment - I personally find this policy a bit too firm for my taste. Sure, sometimes I do want to reject any invalid UTF-8 with extreme prejudice, but at other times I do not. For instance, when I get a Windows file name, it can well be invalid UTF-16, which when converted will become invalid UTF-8 but which will roundtrip correctly back to its original invalid UTF-16 form and refer to the same file. That's why things like CESU-8 or WTF-8 exist.

So I like the "method" argument of locale::conv::utf_to_utf, except that I think that it doesn't offer enough control.

I think, UTF-8 is UTF-8 (i.e. the character encoding that is described by the standard), and the tool for working with it should adhere to the specification. This includes signalling about invalid code sequences instead of producing them. WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with them should be the user's explicit choice (e.g. the user should write utf16_to_wtf8 instead of utf16_to_utf8).

Peter Dimov

3:20 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8codecvt facet

Andrey Semashev wrote:

...

WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with them should be the user's explicit choice (e.g. the user should write utf16_to_wtf8 instead of utf16_to_utf8).

The user doesn't write such things in practice. He writes things like string fn = get_file_name(); fopen( fn.c_str() ); and get_file_name and fopen must decide how to encode/decode UTF-8. So get_file_name gets some wchar_t[] sequence from Windows, which happens to be invalid UTF-16. But Windows doesn't care for UTF-16 validity and if you pass this same sequence to it, it will be able to open the file. So your choice is whether you make this work, or make this fail. I choose to make it work. The functions would of course never produce invalid UTF-8 when passed a valid input (and will deterministically produce the least-invalid UTF-8 for a given input) but here again the definition of valid may change with time if, f.ex. more code points are added to Unicode beyond the current limit. You should also keep in mind that Unicode strings can have multiple representations even if using strict UTF-8. So one could argue that using strict UTF-8 provides a false sense of security.

Andrey Semashev

3:59 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8codecvt facet

On 09.10.2015 18:20, Peter Dimov wrote:

...

Andrey Semashev wrote:

...
WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with them should be the user's explicit choice (e.g. the user should write utf16_to_wtf8 instead of utf16_to_utf8).

The user doesn't write such things in practice. He writes things like

string fn = get_file_name(); fopen( fn.c_str() );

and get_file_name and fopen must decide how to encode/decode UTF-8. So get_file_name gets some wchar_t[] sequence from Windows, which happens to be invalid UTF-16. But Windows doesn't care for UTF-16 validity and if you pass this same sequence to it, it will be able to open the file. So your choice is whether you make this work, or make this fail. I choose to make it work.

What I'm saying is that get_file_name implementation should not even spell UTF-8 anywhere, as the encoding it has to deal with is not UTF-8. Whatever the original encoding of the file name is (broken UTF-16, obtained from WinAPI, true UTF-8 obtained from network or a file), the target encoding has to match what fopen expects. As I remember, on Windows it is usually not UTF-8 anyway but something like CP1251. But even if you have a UTF-8 (in Windows terms) locale, Windows code conversion algorithm AFAIU actually implements something different so that it can handle invalif UTF-16. Your code should spell that 'something different' and not UTF-8. If it spells UTF-8 then it should fail on invalid code sequences.

...

The functions would of course never produce invalid UTF-8 when passed a valid input (and will deterministically produce the least-invalid UTF-8 for a given input)

There should be no such thing as 'least invalid' or 'almost valid' data. It's either valid or not. The tool should not produce invalid data, period. If you want to successfully convert invalid UTF-16 input to a multibyte encoding then choose that encoding and don't pretend it's UTF-8. Because UTF-8 cannot represent that input data.

...

but here again the definition of valid may change with time if, f.ex. more code points are added to Unicode beyond the current limit.

Unicode versioning is another issue. If it comes to this, we will decide what to do. We may well decide to go with utf8v2 in the naming, if the need for strict v1 conformance is strong enough in some cases.

...

You should also keep in mind that Unicode strings can have multiple representations even if using strict UTF-8. So one could argue that using strict UTF-8 provides a false sense of security.

There are normalization and string collation algorithms to deal with this. What's important is that the input to these and other algorithms is valid. Otherwise all bets are off.

Peter Dimov

4:27 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8codecvt facet

Andrey Semashev wrote:

...

...
string fn = get_file_name(); fopen( fn.c_str() );

What I'm saying is that get_file_name implementation should not even spell UTF-8 anywhere, as the encoding it has to deal with is not UTF-8. Whatever the original encoding of the file name is (broken UTF-16, obtained from WinAPI, true UTF-8 obtained from network or a file), the target encoding has to match what fopen expects.

'fopen' here is a function that decodes 'fn' and calls _wfopen, or CreateFileW, or whatever is appropriate. get_file_name and fopen work in tandem to make it so that the file selected by the first function is opened by the latter. And to do that, they may need to put invalid UTF-8 in 'fn'.

...

There should be no such thing as 'least invalid' or 'almost valid' data.

There exists a legitimate notion of more valid or less valid UTF-8 because it can be invalid in different ways, some more basic than others.

...

There are normalization and string collation algorithms to deal with this. What's important is that the input to these and other algorithms is valid.

This depends on the notion of valid. UTF-8 that encodes codepoints in more bytes than necessary corresponds to a valid codepoint sequence. Strict handling rejects it not because it's invalid Unicode, but because it's not the canonical representation of the codepoint sequence. But the codepoint sequence itself can be non-canonical, and hence code that assumes that "validated" UTF-8 is canonical is wrong. The policy of strict UTF-8 is not a bad idea in general, but it's merely a first line of defense as far as security is concerned. Properly written code should not need it.

Andrey Semashev

7 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8codecvt facet

On 09.10.2015 19:27, Peter Dimov wrote:

...

Andrey Semashev wrote:

...
...
string fn = get_file_name(); fopen( fn.c_str() );

What I'm saying is that get_file_name implementation should not even spell UTF-8 anywhere, as the encoding it has to deal with is not UTF-8. Whatever the original encoding of the file name is (broken UTF-16, obtained from WinAPI, true UTF-8 obtained from network or a file), the target encoding has to match what fopen expects.

'fopen' here is a function that decodes 'fn' and calls _wfopen, or CreateFileW, or whatever is appropriate.

get_file_name and fopen work in tandem to make it so that the file selected by the first function is opened by the latter. And to do that, they may need to put invalid UTF-8 in 'fn'.

Right. Just don't call it UTF-8 anymore.

...

...
There should be no such thing as 'least invalid' or 'almost valid' data.

There exists a legitimate notion of more valid or less valid UTF-8 because it can be invalid in different ways, some more basic than others.

Could you point me to a definition of these degrees of validity? In my understanding the string is valid if it can be decoded by a conforming parser. E.g. it should not contain invalid code points (i.e. those not allowed by the standard) or sequences thereof.

...

...
There are normalization and string collation algorithms to deal with this. What's important is that the input to these and other algorithms is valid.

This depends on the notion of valid. UTF-8 that encodes codepoints in more bytes than necessary corresponds to a valid codepoint sequence.

AFAIU, no, it is not a valid encoding. At least, not according to this: https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings

...

Strict handling rejects it not because it's invalid Unicode, but because it's not the canonical representation of the codepoint sequence. But the codepoint sequence itself can be non-canonical, and hence code that assumes that "validated" UTF-8 is canonical is wrong.

Well, my Unicode kung fu is not very strong, but if the standard only allows minimal encoding then anything that doesn't follow is not conforming and should be rejected. For convenience we could provide a separate tool that would tolerate some deviations from the spec and produce valid UTF-8. But the user should have control over what exact deviations are allowed. Don't call the tool utf_to_utf though, as the name doesn't make sense to me.

...

The policy of strict UTF-8 is not a bad idea in general, but it's merely a first line of defense as far as security is concerned. Properly written code should not need it.

You mean all string-related code should be prepared for invalid input? Seems like too much overhead to me.

Peter Dimov

7:15 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8codecvt facet

Andrey Semashev wrote:

...

On 09.10.2015 19:27, Peter Dimov wrote:

...
...
...
string fn = get_file_name(); fopen( fn.c_str() );

... get_file_name and fopen work in tandem to make it so that the file selected by the first function is opened by the latter. And to do that, they may need to put invalid UTF-8 in 'fn'.

Right. Just don't call it UTF-8 anymore.

I don't know what this means.

...

...
There exists a legitimate notion of more valid or less valid UTF-8 because it can be invalid in different ways, some more basic than others.

Could you point me to a definition of these degrees of validity?

First, you can have invalid multibyte sequences in the input. Second, you can have overlong byte sequences. Third, the encoded codepoint sequence may be invalid, in various ways.

...

...
This depends on the notion of valid. UTF-8 that encodes codepoints in more bytes than necessary corresponds to a valid codepoint sequence.

AFAIU, no, it is not a valid encoding.

It's an invalid UTF-8 encoding of a valid codepoint sequence.

...

You mean all string-related code should be prepared for invalid input?

I don't understand this, either.

Andrey Semashev

7:39 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8codecvt facet

On 09.10.2015 22:15, Peter Dimov wrote:

...

Andrey Semashev wrote:

...
On 09.10.2015 19:27, Peter Dimov wrote:

...
...
...
string fn = get_file_name(); fopen( fn.c_str() );

... get_file_name and fopen work in tandem to make it so that the file selected by the first function is opened by the latter. And to do that, they may need to put invalid UTF-8 in 'fn'.

Right. Just don't call it UTF-8 anymore.

I don't know what this means.

I mean as a result you will have a string fn, whose encoding is not UTF-8. As a consequence algorithms that require UTF-8 input cannot be expected to work with this string.

...

...
...
There exists a legitimate notion of more valid or less valid UTF-8 > because it can be invalid in different ways, some more basic than > others.

Could you point me to a definition of these degrees of validity?

First, you can have invalid multibyte sequences in the input. Second, you can have overlong byte sequences. Third, the encoded codepoint sequence may be invalid, in various ways.

Ok, all these count as just invalid to me.

...

...
...
This depends on the notion of valid. UTF-8 that encodes codepoints in > more bytes than necessary corresponds to a valid codepoint sequence.

AFAIU, no, it is not a valid encoding.

It's an invalid UTF-8 encoding of a valid codepoint sequence.

Yes, but valid codepoint sequence is not enough to interpret the string.

...

...
You mean all string-related code should be prepared for invalid input?

I don't understand this, either.

You said that properly written code should not require string validity. Should such code be always prepared for invalid strings, at any point? If so, this looks like unnecessary overhead to me.

Peter Dimov

8:25 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8codecvt facet

Andrey Semashev wrote:

...

...
...
Right. Just don't call it UTF-8 anymore.

I don't know what this means.

I mean as a result you will have a string fn, whose encoding is not UTF-8. As a consequence algorithms that require UTF-8 input cannot be expected to work with this string.

It's invalid UTF-8 and yes, algorithms that require valid UTF-8 will obviously not work with it. The point is that the implementation of these functions needs to encode/decode this not-quite-valid-UTF-8, for which it needs functions that encode/decode this not-quite-valid-UTF-8.

...

...
It's an invalid UTF-8 encoding of a valid codepoint sequence.

Yes, but valid codepoint sequence is not enough to interpret the string.

It's enough. What more would you need?

...

...
...
You mean all string-related code should be prepared for invalid input?

I don't understand this, either.

You said that properly written code should not require string validity. Should such code be always prepared for invalid strings, at any point? If so, this looks like unnecessary overhead to me.

I said that properly written code should not require minimal UTF-8 byte sequences, because properly written code validates the codepoint sequence (after normalizing it, if required), not the UTF-8 byte sequence. To expand on that, the reason UTF-8 overlong sequences are a source of security issues is because of code that does external input -> validate as NTBS -> ... -> pass to UTF-8 API -> decoding -> do something because if validation is supposed to reject ../passwords.txt, the attacker encodes the dots as two bytes and gets around the naive NTBS validation which no longer sees '..' but something else. But the actual problem with this code is that the validation should be done on the codepoint sequence, not on the byte sequence. And if you do that, you see the dot as a dot (and the slash as a slash and the NUL as a NUL) regardless of whether it's encoded with one byte or four. Anyway, that was a detour. In practice I can't think of valid cases for accepting overlong sequences except the long zero and maybe not even then.

Artyom Beilis

8:07 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8codecvt facet

----- Original Message -----

...

From: Peter Dimov <lists@pdimov.com> Andrey Semashev wrote:

...
WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with them should be the user's explicit choice (e.g. the user should write utf16_to_wtf8 instead of utf16_to_utf8).

The user doesn't write such things in practice. He writes things like

string fn = get_file_name(); fopen( fn.c_str() );

and get_file_name and fopen must decide how to encode/decode UTF-8. So get_file_name gets some wchar_t[] sequence from Windows, which happens to be invalid UTF-16. But Windows doesn't care for UTF-16 validity and if you pass

Ok... that is interesting point relevant to Boost.Nowide however irrelevant to utf8_codecvt facets. The only way UTF-16 can be invalid is to have non-properly paired utf-16 surrogate units. They can technically be encoded to invalid UTF-8 prepresenting code points in closed range reserved to surrogate pairs. i.e. boost::nowide::narrow should generate invalid UTF-8 from invalid UTF-8 and invalid in very special way UTF-8 to invalid UTF-16. It looks horrifying for me but it maybe actually solution for such a problem But this should never-ever-ever be used outside Boost.Nowide And to be honest - IMHO if a program fails on files that encoded in invalid UTF-16 when Windows states that the encoding is UTF-16... than I think they should fail.

...

You should also keep in mind that Unicode strings can have multiple

...

representations even if using strict UTF-8. So one could argue that using strict UTF-8 provides a false sense of security.

This isn't correct - you are missing normalization forms and codepoint representation. Yes properly localized software should generally use normalized strings. However a sequence of valid codepoints has one and only one representation in both UTF-8 and UTF-16. There is no such things as strict UTF-8 - there is either UTF-8 or not. Interesting note: on Mac OS X there is a requirement that strings should be NFC normalized UTF-8 strings. Artyom

Peter Dimov

8:30 p.m.

New subject: [nowide] Library Updates and Boost'sbrokenUTF-8codecvt facet

Artyom Beilis wrote:

...

Ok... that is interesting point relevant to Boost.Nowide however irrelevant to utf8_codecvt facets.

It's only relevant if Nowide (or an equivalent library) uses the UTF-8 facet to perform the conversion.

Peter Dimov

8:34 p.m.

New subject: [nowide] Library Updates and Boost'sbrokenUTF-8codecvt facet

Artyom Beilis wrote:

...

Interesting note: on Mac OS X there is a requirement that strings should be NFC normalized UTF-8 strings.

I think that it uses form D, not form C, but I may be misremembering. I'm also not sure if it enforces this requirement on input, that is, whether it doesn't normalize as needed.

Peter Dimov

3:41 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8codecvt facet

Andrey Semashev wrote:

...

WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with them should be the user's explicit choice (e.g. the user should write utf16_to_wtf8 instead of utf16_to_utf8).

In addition to what I wrote earlier, the choices here are not representable in a single U or W letter. When taking UTF-8, you need to decide whether to - accept codepoints over 10FFFF - accept codepoints encoded with more bytes than necessary - accept surrogates - probably more because Unicode is hard and then for each rejected byte sequence whether to - throw - ignore and skip - replace with U+FFFD

Andrey Semashev

4:05 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8codecvt facet

On 09.10.2015 18:41, Peter Dimov wrote:

...

Andrey Semashev wrote:

...
WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with them should be the user's explicit choice (e.g. the user should write utf16_to_wtf8 instead of utf16_to_utf8).

In addition to what I wrote earlier, the choices here are not representable in a single U or W letter. When taking UTF-8, you need to decide whether to

- accept codepoints over 10FFFF - accept codepoints encoded with more bytes than necessary - accept surrogates - probably more because Unicode is hard

and then for each rejected byte sequence whether to

- throw - ignore and skip - replace with U+FFFD

As long as the code sequences are described by the spec, I consider them valid. We can provide a number of options to influence the conversion process, but the result should be something that can be decoded by a conforming Unicode parser.

Artyom Beilis

8:11 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8codecvt facet

...

Andrey Semashev wrote:

...

...
WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with them should be the user's explicit choice (e.g. the user should write utf16_to_wtf8 instead of utf16_to_utf8).

In addition to what I wrote earlier, the choices here are not representable in a single U or W letter. When taking UTF-8, you need to decide whether to

- accept codepoints over 10FFFF - accept codepoints encoded with more bytes than necessary - accept surrogates

No... all this isn't UTF-8. Period. Codepoints above 10FFFF is like lets assume Pi=3.15.. That is why the C++11 <codecvt> has basic design flaws. (See notes in previous e-mails)

...

- probably more because Unicode is hard

Unicode isn't hard - it is just treated with ignorance by even big organization not talking about average programmers. Artyom

Peter Dimov

8:40 p.m.

New subject: [nowide] Library Updates and Boost'sbrokenUTF-8codecvt facet

Artyom Beilis wrote:

...

Codepoints above 10FFFF is like lets assume Pi=3.15..

No, sorry. This is not at all the same. The reason we're in this mess is precisely because codepoints above 0xFFFF were like pi=3.15. And then it turned out they weren't.

...

...
- probably more because Unicode is hard

Unicode isn't hard - it is just treated with ignorance by even big organization not talking about average programmers.

What I meant by that is for instance - is 0xCC 0x81 a valid UTF-8 string? - is 0x65 0xCC 0x81 0xCC 0x81 a valid UTF-8 string?

Artyom Beilis

9:16 p.m.

New subject: [nowide] Library Updates and Boost'sbrokenUTF-8codecvt facet

----- Original Message -----

...

From: Peter Dimov <lists@pdimov.com> To: boost@lists.boost.org Cc: Sent: Friday, October 9, 2015 11:40 PM Subject: Re: [boost] [nowide] Library Updates and Boost'sbrokenUTF-8codecvt facet

Artyom Beilis wrote:

...
Codepoints above 10FFFF is like lets assume Pi=3.15..

No, sorry. This is not at all the same. The reason we're in this mess is precisely because codepoints above 0xFFFF were like pi=3.15. And then it turned out they weren't.

Yeah but for UTF-16 it is over you can't go past it ;-)

...

...
...
- probably more because Unicode is hard

Unicode isn't hard - it is just treated with ignorance by even big organization not talking about average programmers.

What I meant by that is for instance

- is 0xCC 0x81 a valid UTF-8 string? - is 0x65 0xCC 0x81 0xCC 0x81 a valid UTF-8 string?

Both are valid strings.. and both are meaningless on their own i.e. accent without letter or two same accents. Being illogical in human terms or representation does not make them UTF-8 illegal. UTF-8 is simple, human language processing is complex. Artyom

Peter Dimov

9:38 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

Artyom Beilis wrote:

...

...
What I meant by that is for instance

- is 0xCC 0x81 a valid UTF-8 string? - is 0x65 0xCC 0x81 0xCC 0x81 a valid UTF-8 string?

Both are valid strings.. and both are meaningless on their own i.e. accent without letter or two same accents.

Being illogical in human terms or representation does not make them UTF-8 illegal.

UTF-8 is simple, human language processing is complex.

My point here is that strictly valid UTF-8 is the valid multibyte encoding of a valid codepoint sequence, and that the definition of "valid codepoint sequence" may vary depending on context, such that the above sequences are considered invalid. Drawing a line at the place where codepoints over 10FFFF and single surrogates are invalid but the above sequences are valid is an arbitrary decision. Not that this decision is wrong, it isn't. But it may not be what the user needs. Saying "invalid UTF-8 is just invalid, period" doesn't always work very well, although it's a good default. There are cases in which you have to handle specific kinds of invalid UTF-8 (but not any invalid UTF-8) and having to write UTF-8 encoding/decoding functions for every such instance does not really contribute to either security or correctness. It's better - I posit - to have functions that can be configured to handle various invalid forms of UTF-8 (that is, to accept certain invalid UTF-8, not necessarily to produce it, of course).

Artyom Beilis

7:29 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8 codecvt facet

...

...
To be honest I don't know what guys who designed <codecvt> in first place

It was done in the early and mid 1990's, with primary input coming from Asian national bodies and the now long gone Unix vendors who had a big presence in that market.

I'm not talking about std::codecvt<> but new C++11 codecvt header that provides utf8_codecvt - which actually useless for char16_t or wchar_t on Windows. Because you need to use utf8_utf16_codecvt - very unintuitive and would likely to make lots of troubles in future. Major flaw of std::codecvt is mbstate_t that isn't well defined makeing it impossible to work with stateful encoding or do some composition/decomposition withing the facet.

...

Header <codecvt> isn't what we need, as you point out below.

...
Boost.Locale provides one but currently it is deep internal and complex part of library.

The code I written for Boost.Nowide or one I suggest to put into Boost.Locale header-only part is codecvt that converts between utf8 and utf-16/32 according to size of character:

boost::(nowide|or locale)::utf8_facet<wchar_t> - utf-8 to utf-16

(windows)

...
utf-32 (posix)

Don't forget utf-8 to utf-8 (some embedded systems).

IAFIR std::codecvt<char,char,mbstate_t> requires it would be noconv. Also another requirement is to actually be able to iterate over internal character one at a time which more difficult than for UTF-16.

...

IMO, a critical aspect of all of those, including utf-8 to utf-8, is that

...

they detect all utf-8 errors since ill-formed utf-8 is used as an attack vector.

See Markus Kuhn's https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

It should. Actually if you want to validate/encode/decode UTF (8/16/32) there is boost::locale::utf::utf_traits that does it for yyou Also it is good test to take a look on for boost.locale Artyom

Peter Dimov

9:39 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

Beman Dawes wrote:

...

Don't forget utf-8 to utf-8 (some embedded systems).

What does it do?

Artyom Beilis

8 Oct 8 Oct

3:05 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8 codecvt facet

...

From: Peter Dimov <lists@pdimov.com>

...

Artyom Beilis wrote:

...
We can create a "Separate" codecvt library with its own formal review and it would be ready in best case in a year...

One option is to put it into utility;

Missed that - yes - utility is more proper place. Artyom

Beman Dawes

9 Oct 9 Oct

12:32 p.m.

New subject: [nowide] Library Updates and Boost's brokenUTF-8 codecvt facet

On Thu, Oct 8, 2015 at 10:03 AM, Peter Dimov <lists@pdimov.com> wrote:

...

Artyom Beilis wrote:

We can create a "Separate" codecvt library with its own formal review and

...
it would be ready in best case in a year...

One option is to put it into utility;

...

another is to use a mini-review if the new codecvt library is an implementation of the standard <codecvt> interface.

+1 It seems pretty useless to me if it isn't an implementation of the standard <codecvt> interface.

...

std::codecvt_utf8 is not quite the same as boost::utf8_codecvt_facet, but on the other hand, from your previous message it seems that your utf8_codecvt_facet is not std::codecvt_utf8 but std::codecvt_utf8_utf16, or perhaps it's the latter when wchar_t is 16 bit and the former when it's 32 bit.

I'm assuming Artyom is proposing implementing codecvt<wchar_t,char,mbstate_t> where externT is always UTF-8 encoding, and internT encoding is UTF-32 for 32-bit wchar_t (e.g. many POSIX-like systems), UTF-16 for 16-bit wchar_t (e.g. many Windows-like system), and UTF-8 for 8-bit wchar_t (e.g. some embedded systems). --Beman

Robert Ramey

8 Oct 8 Oct

1:03 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

On 10/8/15 5:29 AM, Vladimir Prus wrote:

...

I would prefer for any such replacement to be still part of boost/detail. Depending on boost.locale just to get one header seems not perfect.

I disagree with this. Putting into detail suggests that it's and implementation detail - when it's more than that. We're not making our libraries dependent on boost.locale - just this one header - just like if we left it in detail. This example illustrates the weakness of our concept of "dependency" between libraries. I've maintained that our efforts to diminish dependencies between libraries are based on fuzzy definition of what it means for libraries to be dependent. Here is a case which illustrates that. - Off topic I know - sorry. Robert Ramey

Ion Gaztañaga

2:36 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

On 07/10/2015 16:49, Artyom Beilis wrote:

...

Some updated regarding boost.nowide 1. Library moved to github and its format is converted to modular boost layout: https://github.com/artyom-beilis/nowide2. Fixed unsupported std::ios::ate flag by boost::nowide::fstream3. Added some C++11 interfaces to boost::nowide::fstream4. Added integration functionality with boost::filesystem: https://github.com/artyom-beilis/nowide/blob/master/include/boost/nowide/int...

Please don't make nowide dependent on Boost filesystem. It adds unneeded dependencies and disallows other "path" types (like std::experimental::path). Couldn't you template it on a "path" class? This will make the utility more general and avoids any unneeded dependency. Best, Ion

Artyom Beilis

3:40 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

Note it isn't really dependency just a shortcut - header only - if you don't include one you don't get one. Nowide is about application of utf8 everywhere policies (see utf8everywhere.org) Note you can't do what I propose with std::experimental::path because there is no place to imbue encoding. [Issues with std::experimental::path] std::experimental::path proposal has problem std::experimantal::path("stuff.txt") on Windows would use native narrow encoding - i.e. local code page and you can't provide UTF-8 name on windows see: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4100.pdf i.e. you need to use fs::u8path to create it from UTF-8... Page 16 - note 4 and 5 - which is contrary to what nowide porposes and But using u8path is problematic on POSIX systems. For POSIX based operating systems with the native narrow encoding not set to UTF-8, a conversion to UTF-32 occurs, followed by a conversion to the current native narrow encoding. Some Unicode characters may have no native character set representation You can change "native" encoding on the fly per process on on runtime It isn't something global, even if current locale LC_ALL=C the utf8 filenames can still be valid u8path. Actually on Linux it does not even have to be in specific encoding a file named "\xFF\xFF.txt" is valid file name but not valid encoding. I think this std needs good and deep changes. [/Issues with std::experimental::path] Artyom Beilis ----- Original Message -----

...

From: Ion Gaztañaga <igaztanaga@gmail.com> To: boost@lists.boost.org Cc: Sent: Thursday, October 8, 2015 5:36 PM Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

On 07/10/2015 16:49, Artyom Beilis wrote:

...
Some updated regarding boost.nowide 1. Library moved to github and its format is converted to modular boost layout: https://github.com/artyom-beilis/nowide2. Fixed unsupported std::ios::ate flag by boost::nowide::fstream3. Added some C++11 interfaces to boost::nowide::fstream4. Added integration functionality with boost::filesystem:

https://github.com/artyom-beilis/nowide/blob/master/include/boost/nowide/int...

Please don't make nowide dependent on Boost filesystem. It adds unneeded dependencies and disallows other "path" types (like std::experimental::path). Couldn't you template it on a "path" class? This will make the utility more general and avoids any unneeded dependency.

Best,

Ion

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Beman Dawes

9 Oct 9 Oct

12:38 p.m.

New subject: [nowide] Library Updates and Boost's broken UTF-8 codecvt facet

On Thu, Oct 8, 2015 at 10:36 AM, Ion Gaztañaga <igaztanaga@gmail.com> wrote:

...

On 07/10/2015 16:49, Artyom Beilis wrote:

...
Some updated regarding boost.nowide 1. Library moved to github and its format is converted to modular boost layout: https://github.com/artyom-beilis/nowide2. Fixed unsupported std::ios::ate flag by boost::nowide::fstream3. Added some C++11 interfaces to boost::nowide::fstream4. Added integration functionality with boost::filesystem:

https://github.com/artyom-beilis/nowide/blob/master/include/boost/nowide/int...

Please don't make nowide dependent on Boost filesystem. It adds unneeded dependencies and disallows other "path" types (like std::experimental::path). Couldn't you template it on a "path" class? This will make the utility more general and avoids any unneeded dependency.

Right. As suggest by Peter Dimov, Utility would be a good place to put it. --Beman

3556

Age (days ago)

3558

Last active (days ago)

List overview

Download

47 comments

7 participants

participants (7)

Andrey Semashev
Artyom Beilis
Beman Dawes
Ion Gaztañaga
Peter Dimov
Robert Ramey
Vladimir Prus