Inconsistent Unicode encoding between Boost and wx on Mac OS X
My project uses both Boost and wxWidgets, and the two produce different Unicode encodings on Mac OS X. Everything works fine on Windows.

Problem: Boost and wx end up encoding strings differently when converting to Unicode on OS X. Here is an example: wx's encoding is the same on Windows and OS X, but Boost's encoding differs between the two platforms. It is probably not a bug, but I am unable to figure out the reason or how to make them work together.

The string:
国際的な一流のホールダー
(I don't know if this foreign-language string will show up fine in your email client.)

Hex dumps of the Unicode encodings of this string:
From Boost on Mac (32-bit):
e5 0 0 0 9b 0 0 0 bd 0 0 0 e9 0 0 0 9a 0 0 0 9b 0 0 0 e7 0 0 0...
From WX on Mac (32-bit):
fd 56 0 0 9b 96 0 0 84 76 0 0 6a 30 0 0 0 4e 0 0....
From WX and Boost on Windows (16-bit):
fd 56 9b 96 84 76 6a 30 0 4e 41 6d 6e 30...
(To be more exact, this is how boost::filesystem encodes a directory name it read from disk, and it's the wxChar** argv value from wxApp for a Unicode build of wx.)

So wx does the same thing on Windows and Mac, and Boost does that same thing on Windows. But Boost encodes it differently on Mac and I have no idea why. My wild guess is that it's due to locale settings, but I don't know how to check for this or fix it.

SG
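For readers who want to reproduce dumps in this style, here is a minimal sketch. It assumes the dumps were taken on little-endian machines and printed without zero padding; the helper name hex_dump_le is hypothetical and is not part of Boost or wx. It works on explicit 16-bit and 32-bit code units rather than wchar_t so the result does not depend on the platform.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Serialise each code unit as little-endian bytes and print them in the
// same unpadded hex style as the dumps in the post ("fd 56 0 0 ...").
// hex_dump_le is a hypothetical helper written for this illustration.
template <typename CharT>
std::string hex_dump_le(const std::basic_string<CharT>& text) {
    std::ostringstream out;
    out << std::hex;
    bool first = true;
    for (CharT unit : text) {
        const std::uint32_t value = static_cast<std::uint32_t>(unit);
        for (std::size_t i = 0; i < sizeof(CharT); ++i) {
            if (!first) out << ' ';
            first = false;
            // Lowest byte first, i.e. little-endian order.
            out << ((value >> (8 * i)) & 0xFFu);
        }
    }
    return out.str();
}
```

Called on the first two characters of the string, hex_dump_le(std::u32string(U"\u56fd\u969b")) gives the start of the 32-bit wx dump and hex_dump_le(std::u16string(u"\u56fd\u969b")) gives the start of the 16-bit Windows dump.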
On Sun, Feb 14, 2010 at 02:50:43AM +0530, Sachin Garg wrote:
My project uses both boost and wxwidgets and unicode encoding by both is different on Mac OSX. Everything works fine on windows.
Problem: Boost and WX do end up encoding the strings differently when converting to unicode on OSX. I am detailing an example:
WX's encoding is same on both windows and osx but Boost's encoding is different on both platforms. It is probably not a bug but I am unable to figure out the reason and how to make them both work together. Hex dumps of Unicode encodings of this string
Unicode has a bunch of different Normalization Forms [1]. A normalization form tells how diacritics and composite codepoints should be composed or decomposed when represented.

The choice of NF is up to the OS; most importantly, OS X and Windows do it differently. The encoding of your strings seems to be the same, they're just composed differently.

Boost likely uses OS functions to convert between encodings, while I assume that wx uses its own internally consistent transcoding.

[1] http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms

-- Lars Viklund | zao@acc.umu.se
Thanks, this explains a lot. Is there some std/boost way to specify which encoding/normalization to use? Or to find out which encoding boost defaults to? I will bring this up on WX list too, but in case there is no 'correct' way to decide which encoding to use, I will still need to make them compatible to make my software work. SG
Sachin Garg wrote:
My project uses both boost and wxwidgets and unicode encoding by both is different on Mac OSX. Everything works fine on windows.
Problem: Boost and WX do end up encoding the strings differently when converting to unicode on OSX. I am detailing an example:
WX's encoding is same on both windows and osx but Boost's encoding is different on both platforms. It is probably not a bug but I am unable to figure out the reason and how to make them both work together.
The string:
国際的な一流のホールダー
(I don't know if this foreign language string will show up fine in your email client).
Hex dumps of Unicode encodings of this string
From Boost on Mac (32-bit):
e5 0 0 0 9b 0 0 0 bd 0 0 0 e9 0 0 0 9a 0 0 0 9b 0 0 0 e7 0 0 0...
You'll need to somehow tell Boost.Filesystem to use UTF-8 as the external encoding. It doesn't seem to be aware that Mac OS X uses UTF-8 and seems to use the default locale instead. I'm not sure how this is done, and I can't find it in the documentation, but path.hpp has a function wpath_traits::imbue, to which one should presumably pass an appropriate UTF-8 locale. I see utf8_codecvt_facet.cpp in the src directory of Filesystem, but it doesn't seem to be used.

All in all, this seems like a bug in Filesystem and you should file a Trac ticket at svn.boost.org.

In the meantime, you could try
boost::filesystem::wpath_traits::imbue( std::locale( ".UTF-8" ) );
and see if it works; I'm not sure what the appropriate way to create a UTF-8 locale on Mac OS X is, but the above might do it.
Peter Dimov wrote:
In the meantime, you could try
boost::filesystem::wpath_traits::imbue( std::locale( ".UTF-8" ) );
Unfortunately, this http://stackoverflow.com/questions/1745045/stdlocale-breakage-on-macos-10-6-... says that UTF-8 locales are broken on Mac OS X, so it probably won't work. Sorry.
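Whether a given named locale can be constructed at all is easy to probe before imbuing anything. The sketch below relies only on the standard behaviour that std::locale's constructor throws std::runtime_error for a name the runtime does not know; the helper name locale_available is hypothetical, and which names exist (".UTF-8", "en_US.UTF-8", and so on) is platform-specific.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Returns true if the C++ runtime can construct a locale with this name.
// std::locale's constructor throws std::runtime_error for unknown names.
// locale_available is a hypothetical helper written for this illustration.
bool locale_available(const std::string& name) {
    try {
        std::locale probe(name.c_str());
        (void)probe;  // only the successful construction matters
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

A caller could test locale_available(".UTF-8") first and fall back to another approach when it returns false. Note that a locale constructing successfully does not by itself show that its codecvt facet converts UTF-8 correctly.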
Peter, I am guessing that the problem is not the OS X locale, which could be UTF-8 or UTF-16 or X (I am calling this the 'source' encoding).

The problem *seems* to be that Boost converts that X into UTF-16 (or UTF-32) for storing it in wpath, and that this uses a normalization which is incompatible with the normalization wx uses for its UTF-16 (or UTF-32). (Let's call this the 'destination' encoding.)

It seems that wpath_traits::imbue is the way to specify the 'source' encoding, but maybe I need to change the 'destination' encoding?

I am guessing all this from Lars's post above in this thread, so he might correct me if I am off track :-)

SG
Sachin Garg wrote:
Peter, I am guessing that the problem is not the OSX locale which could be UTF-8 or UTF-16 or X (I am calling this 'source' encoding).
The encoding of OS X's path names is always UTF-8. It does not depend on the OS locale.
Problem *seems* to be that boost converts that X into UTF16 (or UTF32) for storing it in wpath and that is using a normalization which is incompatible with normalization used by wx for its UTF16 (or 32).
I don't think that this has anything to do with normalization. Looking at the source, Filesystem by default uses the global locale to convert between narrow and wide paths, and the default locale in your case seems to perform no conversion. That is, it just takes every 8-bit char and converts it to a 32-bit wchar_t. This, of course, doesn't work when the source is UTF-8, regardless of its normalization.
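The difference between the two dumps can be reproduced without Boost or wx. The sketch below contrasts byte-by-byte widening (what a non-converting locale effectively does) with real UTF-8 decoding. Both function names are hypothetical, and the decoder is a bare-bones illustration that assumes well-formed input and performs no validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Every byte becomes one wide character, so a 3-byte UTF-8 sequence
// turns into 3 separate code units (e5, 9b, bd in the Boost dump).
std::vector<std::uint32_t> widen_bytes(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    for (unsigned char b : bytes) out.push_back(b);
    return out;
}

// Minimal UTF-8 decoder for illustration only: assumes well-formed
// input and does not reject overlong or truncated sequences.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;       // number of continuation bytes
        std::uint32_t cp = lead;     // plain ASCII by default
        if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
        else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
        ++i;
        for (std::size_t k = 0; k < extra && i < bytes.size(); ++k, ++i) {
            // Each continuation byte contributes its low 6 bits.
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

For the bytes e5 9b bd, widen_bytes yields the three values 0xe5, 0x9b, 0xbd seen in the Boost dump, while decode_utf8 yields the single code point 0x56fd seen in the wx dump.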
It looks like you are correct, thanks :-)

I added this to my code and it fixes the problem:

std::locale global_loc = std::locale();
// The argument 1 tells the locale not to manage (delete) this facet, so utf8_facet must outlive loc.
boost::filesystem::detail::utf8_codecvt_facet utf8_facet(1);
std::locale loc(global_loc, &utf8_facet);
boost::filesystem::wpath_traits::imbue(loc);

Taken from: http://archives.free.net.ph/message/20071110.132150.af9dc620.en.html

And it seems this has been an issue since at least 2007; I can't tell why more people haven't been hurt by this. This solution works on my computer, but I don't know whether it will work if the OS uses some other locale.

Thanks again,
SG
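The same idea, copying an existing locale and swapping in a UTF-8 codecvt facet, can be tried with the standard library alone. The sketch below is not the Boost.Filesystem API: it substitutes std::codecvt_utf8 (deprecated since C++17, used here only as a stand-in for Boost's utf8_codecvt_facet), and the helper names with_utf8_codecvt and to_wide are hypothetical.

```cpp
#include <codecvt>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Copy `base` and replace only its wchar_t <-> char codecvt facet with a
// UTF-8 one. The locale takes ownership of the heap-allocated facet.
// std::codecvt_utf8 is a stand-in for Boost's utf8_codecvt_facet here.
std::locale with_utf8_codecvt(const std::locale& base) {
    return std::locale(base, new std::codecvt_utf8<wchar_t>);
}

// Convert narrow bytes to wide characters using whatever codecvt facet
// the given locale carries (hypothetical helper for illustration).
std::wstring to_wide(const std::string& bytes, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    if (bytes.empty()) return std::wstring();
    const facet_type& facet = std::use_facet<facet_type>(loc);
    // The wide result never has more code units than the input has bytes.
    std::wstring wide(bytes.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    const std::codecvt_base::result r = facet.in(
        state, bytes.data(), bytes.data() + bytes.size(), from_next,
        &wide[0], &wide[0] + wide.size(), to_next);
    if (r != std::codecvt_base::ok) {
        throw std::runtime_error("conversion to wide characters failed");
    }
    wide.resize(static_cast<std::size_t>(to_next - &wide[0]));
    return wide;
}
```

With to_wide(bytes, with_utf8_codecvt(std::locale::classic())), the three bytes e5 9b bd come out as the single wide character 0x56fd. Everything other than the codecvt facet is inherited from the base locale.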
It seems that I jumped to some incorrect conclusions, having not read the post too closely. At least someone has learned something new about Unicode today. :) -- Lars Viklund | zao@acc.umu.se
Thanks for trying to help :-) SG
Sachin Garg wrote:
And it seems this is an issue since at least 2007,
Filing an issue on http://svn.boost.org generally increases the chances that something will be fixed. Of course, please try to provide a detailed description. - Volodya
I have filed a bug, please add any information if my description isn't complete enough. https://svn.boost.org/trac/boost/ticket/3928 SG
Peter Dimov wrote:
In the meantime, you could try
boost::filesystem::wpath_traits::imbue( std::locale( ".UTF-8" ) );
Unfortunately, this
http://stackoverflow.com/questions/1745045/stdlocale-breakage-on-macos-10-6-...
says that UTF-8 locales are broken on Mac OS X, so it probably won't work. Sorry.

What about making a locale from the C locale, replacing the codecvt facet in that locale with the UTF-8 one from Boost, and then imbuing the new locale? Would it look like the C locale but with a UTF-8 codecvt facet? I don't know if it would work.
Patrick
Sachin Garg
On Tue, Feb 16, 2010 at 3:09 AM, Beman Dawes wrote:
I've updated the Boost SVN trunk with a fix. Please give it a try and report any problems.
Thanks.

I have never built from trunk. I am guessing it will only require checking out the trunk, and that the rest of the procedure will be the same as with release downloads.

Will test it soon (might not be able to check before next weekend).

Thanks again, for this fix and the fix to bug #3884.

SG
I tested the trunk, the bug is fixed. Unicode in wpath works fine on OSX now. SG
participants (6)
- Beman Dawes
- Lars Viklund
- Patrick Horgan
- Peter Dimov
- Sachin Garg
- Vladimir Prus