[Locale] inconsistent results for utf-8 collation

29 Aug 2012

Hello!

When I compare the following UTF-8 string pairs with Boost.Locale (any
backend) at the "identical" level (where accents are relevant) in a UTF-8
locale (I tried de_DE.utf-8) on Debian Testing (Boost 1.49), I get a
result that does not make sense to me.

"Muller" is considered less than "Müller" (as expected), but "Muller 2"
is considered greater than "Müller 1", even though the names alone
compare the other way around.

Do I have a bug in my code, in the underlying libraries, or in my
expectations?

#include <locale.h>

#include <boost/locale.hpp>
#include <boost/assign/std/vector.hpp>
#include <boost/foreach.hpp>
#include <boost/assign/list_of.hpp>
#include <boost/algorithm/string/join.hpp>
#include <boost/tuple/tuple.hpp>
#include <iostream>
#include <locale>
#include <string>
#include <vector>

int main(int argc, char **argv)
{
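    // Initialize the C library locale from the environment.
    setlocale(LC_ALL, "");

    // List the available backends, then select one (argv[2], default "icu").
    std::cout << "backends: " <<
        boost::join(boost::locale::localization_backend_manager::global().get_all_backends(),
                    ", ") << std::endl;
    boost::locale::localization_backend_manager::global().select(argc > 2 ? argv[2] : "icu");
    // Generate the locale under test (argv[1], default "de_DE.UTF-8").
    std::locale loc = boost::locale::generator()(argc > 1 ? argv[1] : "de_DE.UTF-8");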

    typedef boost::tuple<std::string, std::string> string_pair_t;
    std::vector<string_pair_t> pairs =
        boost::assign::tuple_list_of("Muller", "Müller")
        ("Muller 2", "Müller 1")
        ("Muller B", "Müller A");
    // Compare each pair at the "identical" level and print the sign of the result.
    BOOST_FOREACH (const string_pair_t &pair, pairs) {
        const std::string &a = boost::get<0>(pair),
            &b = boost::get<1>(pair);
        int cmp = std::use_facet<boost::locale::collator<char> >(loc).
            compare(boost::locale::collator_base::identical, a, b);
        std::cout <<
            a << " and " << b <<
            " are " <<
            (cmp == 0 ? "identical" : "different") <<
            " (" <<
            (cmp < 0 ? '<' :
                   cmp > 0 ? '>' : '=') <<
            ")" << std::endl;
    }

    return 0;
}

The output on my system:

$ /tmp/mueller de_DE.utf-8 icu
backends: icu, posix, std
Muller and Müller are different (<)
Muller 2 and Müller 1 are different (>)
Muller B and Müller A are different (>)
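
One possible reading of this output, offered as an assumption and not as a confirmed diagnosis of Boost.Locale or ICU: collators that follow the Unicode Collation Algorithm compare in levels. Base letters and digits (primary weights) are compared across the whole string first, and accents (secondary weights) only break ties when every primary weight is equal. The sketch below is a toy model of that idea, not the library's implementation. The names toy_elements and toy_compare are hypothetical, the only accented letter it knows is "ü", and primary weights are plain ASCII codes.

```cpp
#include <cassert>   // only needed for the assertions in the usage note
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One toy collation element: primary weight (base letter), secondary weight (accent).
typedef std::pair<char, int> element_t;

// Split a UTF-8 string into toy collation elements.
// Assumption: "ü" (bytes 0xC3 0xBC) is the only accented letter handled.
std::vector<element_t> toy_elements(const std::string &s)
{
    std::vector<element_t> out;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (i + 1 < s.size() && s[i] == '\xC3' && s[i + 1] == '\xBC') {
            out.push_back(element_t('u', 1)); // base 'u' plus an accent
            ++i;                              // skip the second byte of "ü"
        } else {
            out.push_back(element_t(s[i], 0)); // unaccented: secondary weight 0
        }
    }
    return out;
}

// Lexicographic three-way comparison of two weight sequences.
template <typename T>
int three_way(const std::vector<T> &x, const std::vector<T> &y)
{
    if (x < y) return -1;
    if (y < x) return 1;
    return 0;
}

// Two-level comparison: all primary weights first, accents only as a tie-breaker.
int toy_compare(const std::string &a, const std::string &b)
{
    const std::vector<element_t> ea = toy_elements(a), eb = toy_elements(b);
    std::vector<char> pa, pb; // primary weights
    std::vector<int> sa, sb;  // secondary weights
    for (std::size_t i = 0; i < ea.size(); ++i) {
        pa.push_back(ea[i].first);
        sa.push_back(ea[i].second);
    }
    for (std::size_t i = 0; i < eb.size(); ++i) {
        pb.push_back(eb[i].first);
        sb.push_back(eb[i].second);
    }
    const int primary = three_way(pa, pb);
    if (primary != 0)
        return primary;       // a base-letter or digit difference decides
    return three_way(sa, sb); // otherwise the accents decide
}
```

Under this toy model, toy_compare("Muller", "Müller") is negative because the strings differ only by an accent, while toy_compare("Muller 2", "Müller 1") and toy_compare("Muller B", "Müller A") are positive because the trailing "2" versus "1" and "B" versus "A" already differ at the primary level. Those signs match the three lines of output above.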

Bye, Patrick

Patrick Ohly
