[Locale] inconsistent results for utf-8 collation
Hello!
When comparing the following UTF-8 string pairs using Boost.Locale (any
backend) at the "identical" level (accents are relevant) and a UTF-8
locale (I tried de_DE.utf-8) on Debian Testing (boost 1.49), I get a
result that does not make sense to me.
"Muller" is considered less than "Müller" (as expected), but "Muller 2"
is considered more than "Müller 1", despite the different result for the
names alone.
Do I have a bug in my code, in the underlying libraries, or in my
expectations?
#include
----- Original Message -----
From: Patrick Ohly
To: boost-users@lists.boost.org
Cc:
Sent: Wednesday, August 29, 2012 10:27 AM
Subject: [Boost-users] [Locale] inconsistent results for utf-8 collation
Hello!
When comparing the following UTF-8 string pairs using Boost.Locale (any backend) at the "identical" level (accents are relevant) and a UTF-8 locale (I tried de_DE.utf-8) on Debian Testing (boost 1.49), I get a result that does not make sense to me.
"Muller" is considered less than "Müller" (as expected), but "Muller 2" is considered more than "Müller 1", despite the different result for the names alone.
Do I have a bug in my code, in the underlying libraries, or in my expectations?
Collation != lexicographical comparison.

It is not a mistake that you get the same results for all backends: icu, posix and std. Take even the OS C API strcoll and you'll see the same behavior, for the same reason.

The point is that the difference between "B" and "A" is more important than the difference between "ü" and "u", i.e. it first sorts "Muller B" and "Müller A" without accents, and only if they are identical at that level does it then sort according to the accents.

Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/
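To make the ordering described above concrete, here is a minimal sketch of a two-level comparison. It is a toy model only, not the Unicode Collation Algorithm and not what Boost.Locale or ICU actually run. The helper names (base_letter, primary_key, two_level_compare) and the tiny accent table are assumptions made up for illustration.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: map a few accented letters to their base letter.
// A real collator uses locale-specific weight tables instead.
char32_t base_letter(char32_t c) {
    switch (c) {
        case U'\u00E4': return U'a';  // a with diaeresis
        case U'\u00F6': return U'o';  // o with diaeresis
        case U'\u00FC': return U'u';  // u with diaeresis
        default:        return c;
    }
}

// Primary key: the whole string with accents stripped.
std::u32string primary_key(const std::u32string& s) {
    std::u32string key(s);
    std::transform(key.begin(), key.end(), key.begin(), base_letter);
    return key;
}

// Returns a negative value, zero, or a positive value.
int two_level_compare(const std::u32string& a, const std::u32string& b) {
    // Level 1: compare the complete strings with accents ignored.
    int primary = primary_key(a).compare(primary_key(b));
    if (primary != 0) {
        return primary;
    }
    // Level 2: only when the strings tie at level 1 do accents decide.
    // Here the unaccented letter sorts first because its code point is lower.
    return a.compare(b);
}
```

With this sketch, two_level_compare(U"Muller", U"M\u00FCller") is negative, because the strings tie at level 1 and the accent breaks the tie. two_level_compare(U"Muller 2", U"M\u00FCller 1") is positive, because "2" versus "1" already decides at level 1 and the accent is never consulted.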
Artyom Beilis
When comparing the following UTF-8 string pairs using Boost.Locale (any backend) at the "identical" level (accents are relevant) and a UTF-8 locale (I tried de_DE.utf-8) on Debian Testing (boost 1.49), I get a result that does not make sense to me. [...] Collations != Lexicographical Comparison.
It is not a mistake that you get the same results for all backends: icu, posix and std.
Take even the OS C API strcoll and you'll see the same behavior, for the same reason.
strcoll() indeed reports the same result. However, it is uncertain at which level it operates. "Muller" and "Müller" are different, so it is not the primary level.
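A small way to probe strcoll() directly is sketched below. The helper name collate_in is hypothetical, and whether a locale such as "de_DE.UTF-8" is installed depends on the system, so the function reports failure instead of assuming it. In the "C" locale strcoll() compares like strcmp(), which gives a byte-wise baseline to contrast with the collated result.

```cpp
#include <clocale>
#include <cstring>
#include <string>

// Hypothetical helper: compare a and b with strcoll() under the named locale.
// Returns false if the locale cannot be set; otherwise stores -1, 0 or +1
// in sign. The previous LC_COLLATE setting is restored before returning.
bool collate_in(const char* locale_name, const char* a, const char* b, int& sign) {
    // Query (without changing) the current collation locale and copy it,
    // because the returned pointer may be invalidated by the next call.
    const char* current = std::setlocale(LC_COLLATE, nullptr);
    std::string saved = current ? current : "C";

    if (std::setlocale(LC_COLLATE, locale_name) == nullptr) {
        return false;  // locale not available on this system
    }
    int result = std::strcoll(a, b);
    std::setlocale(LC_COLLATE, saved.c_str());

    // Normalize to -1, 0, +1; only the sign of strcoll() is meaningful.
    sign = (result > 0) - (result < 0);
    return true;
}
```

Calling collate_in("C", "Muller 2", "M\xC3\xBCller 1", sign) yields a negative sign, since the byte for "u" is smaller than the first UTF-8 byte of "ü". Calling it with "de_DE.UTF-8", where that locale is installed, is how the positive result reported in this thread can be checked.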
The point is that the difference between "B" and "A" is more important than the difference between "ü" and "u"
i.e. it first sorts "Muller B" and "Müller A" without accents, and only if they are identical at that level does it then sort according to the accents.
I was using the "identical" level with the expectation that this would make differences in accents as relevant as differences between base characters. I now understand that this is not how the Unicode collation algorithm works. Thanks for pointing that out.

Bye, Patrick
participants (2)
- Artyom Beilis
- Patrick Ohly