Re: [Boost-users] find japanese character with boost regex++

13 Dec 2003

I have just discovered the incredible boost and regex++ libraries, but I have
encountered some difficulties...

I read that regex++ handles Japanese encodings, particularly through wide
characters (wchar_t). The regex++ FAQ describes character classes, e.g.
[[:space:]], which define a set of characters sharing a property. I have been
looking for something like a [[:Japanese characters:]] class. I have a text
containing many unusual characters along with Japanese ones (Hiragana,
Katakana, Kanji, everything!), and I want to find the Japanese sentences so
that I can translate them and replace them in the text. So I need a way to
identify a Japanese sentence. Something like a function
const bool isJap(const wchar_t) const would be fine.

So if anybody has any ideas or links, I would appreciate it! Thanks!

~~~~~~~~~~~~~~~~~~~~~~~

Two options:

1) You can hack the traits class used by boost.regex:

Create your own traits class that inherits from boost::regex_traits and
which implements the following member functions:

   uint32_t lookup_classname(const char_type* first, const char_type* last) const;
   bool is_class(char_type c, uint32_t f)const;

The first maps your character-class name to a constant; the second checks
whether a character is a member of that class.  Choose a value for your
constant that isn't already in use by regex_traits.

Finally use reg_expression<wchar_t, your_traits_class<wchar_t> > rather than
boost::wregex.

2) Just use a character range - most Japanese characters are confined to a
specific character range (I forget what it is, but the info is publicly
available via the Unicode std).
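
For reference, a minimal sketch of the range approach in plain C++, without
any regex library. The ranges are the Unicode blocks for Hiragana
(U+3040-U+309F), Katakana (U+30A0-U+30FF) and CJK Unified Ideographs
(U+4E00-U+9FFF). The names is_japanese and find_japanese_runs are made up for
this example, and using char32_t rather than wchar_t is an assumption. Note
that the ideograph block is shared with Chinese, and that punctuation,
half-width forms and the rarer ideograph extensions are not covered.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: true if the code point lies in one of three Unicode
// blocks commonly used for Japanese text. Assumption: these three blocks are
// enough for the caller; other blocks are deliberately left out.
bool is_japanese(char32_t c) {
    return (c >= 0x3040 && c <= 0x309F)   // Hiragana
        || (c >= 0x30A0 && c <= 0x30FF)   // Katakana
        || (c >= 0x4E00 && c <= 0x9FFF);  // CJK Unified Ideographs (Kanji)
}

// Hypothetical helper: collects [begin, end) index pairs, one for each
// maximal run of consecutive characters accepted by is_japanese.
std::vector<std::pair<std::size_t, std::size_t>>
find_japanese_runs(const std::u32string& text) {
    std::vector<std::pair<std::size_t, std::size_t>> runs;
    std::size_t i = 0;
    while (i < text.size()) {
        if (!is_japanese(text[i])) {
            ++i;
            continue;
        }
        const std::size_t start = i;
        while (i < text.size() && is_japanese(text[i])) {
            ++i;
        }
        runs.emplace_back(start, i);
    }
    return runs;
}
```

The same predicate could presumably serve as the body of an is_class
override in option 1. Each returned pair can be passed to
std::u32string::substr (as start and end - start) to pull out the text to
translate.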

John.

John Maddock