Find Japanese characters with Boost Regex++
Hi, I have just discovered the incredible Boost and Regex++ libraries, but I have encountered some difficulties. I read that Japanese encodings are handled in Regex++, especially when using wide (wchar_t) characters. The Regex++ FAQ presents the character-class system, e.g. [[:space:]], for defining a set of characters that share a property. I have been looking for something like a [[:Japanese:]] class. I have a text containing many unusual characters, including Japanese ones (Hiragana, Katakana, Kanji, everything!), and I want to find the Japanese sentences in order to translate them and replace them in the text. I therefore need a way to identify a Japanese sentence; a function like bool isJap(wchar_t c) would be fine. So if somebody has any ideas or some links, I would appreciate it! Thanks! Schmid
Do you need to use regexes? I've not tried Boost.Regex yet, so I cannot help there. Is your text just ASCII and Japanese, or do you need to distinguish other languages as well? If it is just ASCII and Japanese, you could define a Japanese character as anything that is not ASCII (beware the Shift-JIS encoding, though, as the second byte of a double-byte character can fall in the ASCII range). If your data is Unicode, it should also be easy to treat European characters as non-Japanese. Darren
~~~~~~~~~~~~~~~~~~~~~~~
Two options:
1) You can hack the traits class used by Boost.Regex: create your own traits class that inherits from boost::regex_traits and implements the following member functions:
uint32_t lookup_classname(const char_type* first, const char_type* last) const;
bool is_class(char_type c, uint32_t f) const;
The first transforms your character-class name into a constant; the latter checks whether a character is a member of that class. Choose a value for your constant that isn't already in use by regex_traits.
Finally, use reg_expression
1) You can hack the traits class used by boost.regex:
Are the existing character-classes following a standard, or are you open to patches to extend them? It might be nice to have at least: [:hiragana:] [:katakana:] [:hankaku_katakana:] [:wide_alpha:] [:wide_num:] [:wide_alphanum:] Defining the set of Japanese kanji would be harder. Darren
Are the existing character-classes following a standard, or are you open to patches to extend them?
Yes, they follow the POSIX and ECMAScript standards, giving: "alnum", "alpha", "cntrl", "digit", "graph", "lower", "print", "punct", "space", "upper", "xdigit", "blank", "word", "unicode".
It might be nice to have at least: [:hiragana:] [:katakana:] [:hankaku_katakana:]
Isn't that just [[:hiragana:][:katakana:]]?
[:wide_alpha:] [:wide_num:] [:wide_alphanum:]
There should be no need for those - [[:alpha:]] will detect wide character alphabetic characters perfectly well (provided the locale isn't "C").
Defining the set of Japanese kanji would be harder.
How are they defined? It might be best to add a facility to add new character classes as a list of characters and ranges to include, something like: register_character_class("myname", "d-f"); Then we add all the Unicode block ranges as standard for wide character regexes. John.
Are the existing character-classes following a standard, or are you open to patches to extend them?
Yes, they follow the POSIX and ECMA script standards to give: ...
It might be nice to have at least: [:hiragana:] [:katakana:] [:hankaku_katakana:]
isn't that just [[:hiragana:][:katakana:]] ?
"hankaku" is half-width characters. In Shift-JIS they are encoded with a single byte. In unicode they are encoded in a different block to normal katakana.
[:wide_alpha:] [:wide_num:] [:wide_alphanum:]
There should be no need for those - [[:alpha:]] will detect wide character alphabetic characters perfectly well (provided the locale isn't "C").
That sounds like it has potential for problems, so I wonder if we're talking about the same thing; by "wide" I mean the character occupies the same amount of screen space as a kanji, which is twice as wide as an ASCII character, e.g. "Ａ１" rather than "A1". However, my main use case is not so much detecting with a regex as converting them to ASCII; e.g. given a list of email addresses, some of which the user has typed in with their Japanese IME still switched on, I'll convert to ASCII and then run each email address through a regex.
Defining the set of Japanese kanji would be harder.
How are they defined?
The Japanese kanji can be thought of as a subset of the Chinese characters, so the issue is where the subset ends, and there are various definitions. The Joyo kanji are approximately 2,000 common ones, but people's names often use others, and academics use more. I've not looked, but it is possible the Joyo kanji are scattered around Unicode. Simplest would be to define [:kanji:] as all Chinese characters; anyone needing to distinguish Japanese from Chinese could then use a lookup table.
It might be best to add a facility to add new character classes as a list of characters and ranges to include, something like:
register_character_class("myname", "d-f");
Then we add all the Unicode block ranges as standard for wide character regexes.
Sounds good. Do you mean in an extra include file, e.g. "regex/unicode_classes.h" ? Darren
It might be nice to have at least: [:hiragana:] [:katakana:] [:hankaku_katakana:]
isn't that just [[:hiragana:][:katakana:]] ?
"hankaku" is half-width characters. In Shift-JIS they are encoded with a single byte. In unicode they are encoded in a different block to normal katakana.
Sorry I misread your original.
[:wide_alpha:] [:wide_num:] [:wide_alphanum:]
There should be no need for those - [[:alpha:]] will detect wide character alphabetic characters perfectly well (provided the locale isn't "C").
That sounds like it has potential for problems so I wonder if we're talking about the same thing; by wide I mean the character occupies the same amount of screen space as a kanji, which is twice as wide as the ASCII character. E.g. "Ａ１" rather than "A1".
OK, do you mean what Unicode calls "Full Width" rather than "Half Width"?
However my main use case is not so much detecting with regex as converting them to ASCII; e.g. given a list of email addresses, some of which the user has typed in with their Japanese IME still switched on. I'll convert to ASCII then run the email address through a regex.
Defining the set of Japanese kanji would be harder.
How are they defined?
The Japanese kanji can be thought of as a subset of Chinese, so the issue is where the subset ends, and there are various definitions. Joyo kanji is approx 2000 common ones, but people's names often use others, and academics use more. I've not looked but it is possible the Joyo kanji are scattered around Unicode.
Simplest would be to define [:kanji:] as all Chinese characters, and anyone needing to distinguish Japanese from Chinese could then use a lookup table.
OK, I'm beginning to regret asking :-)
It might be best to add a facility to add new character classes as a list of
characters and ranges to include, something like:
register_character_class("myname", "d-f");
Then we add all the Unicode block ranges as standard for wide character regexes.
Sounds good. Do you mean in an extra include file, e.g. "regex/unicode_classes.h" ?
To be honest I haven't decided, I guess in the spirit of "only pay for what you use" that would be the best way. John.
participants (3)
- Darren Cook
- John Maddock
- jschmid