Find Japanese characters with Boost Regex++
Hi, I have just discovered the incredible Boost and Regex++ libraries, but I have encountered some difficulties. I read that Japanese encodings are handled in Regex++, especially when using wide (wchar_t) characters. The Regex++ FAQ presents the character-class system, e.g. [[:space:]], for defining a set of characters that share a property. I have been looking for something like a [[:Japanese:]] class. I have a text containing many unusual characters, including Japanese ones (Hiragana, Katakana, Kanji, everything!), and I want to find the Japanese sentences in order to translate them and replace them in the text. I therefore need a way to identify a Japanese sentence; a function like bool isJap(wchar_t c) would be fine. So if somebody has any ideas or some links, I would appreciate it! Thanks! Schmid
Do you need to use regexes? I've not tried Boost.Regex yet, so I cannot help there. Is your text just ASCII and Japanese, or do you need to distinguish other languages as well? If it is just ASCII and Japanese, you could define a Japanese character as anything that is not ASCII (beware the Shift-JIS encoding, though, as the second byte of a double-byte character can fall in the ASCII range). If your data is Unicode, it should also be easy to treat European characters as non-Japanese. Darren
~~~~~~~~~~~~~~~~~~~~~~~
Two options:
1) You can hack the traits class used by Boost.Regex: create your own traits class that inherits from boost::regex_traits and implements the following member functions:
uint32_t lookup_classname(const char_type* first, const char_type* last) const;
bool is_class(char_type c, uint32_t f) const;
The first transforms your character-class name into a constant; the latter checks whether a character is a member of that class. Choose a value for your constant that isn't already in use by regex_traits.
Finally, use reg_expression
1) You can hack the traits class used by boost.regex:
Are the existing character-classes following a standard, or are you open to patches to extend them? It might be nice to have at least: [:hiragana:] [:katakana:] [:hankaku_katakana:] [:wide_alpha:] [:wide_num:] [:wide_alphanum:] Defining the set of Japanese kanji would be harder. Darren
Are the existing character-classes following a standard, or are you open to patches to extend them?
Yes, they follow the POSIX and ECMAScript standards, giving: "alnum", "alpha", "cntrl", "digit", "graph", "lower", "print", "punct", "space", "upper", "xdigit", "blank", "word", "unicode".
It might be nice to have at least: [:hiragana:] [:katakana:] [:hankaku_katakana:]
Isn't that just [[:hiragana:][:katakana:]]?
[:wide_alpha:] [:wide_num:] [:wide_alphanum:]
There should be no need for those - [[:alpha:]] will detect wide character alphabetic characters perfectly well (provided the locale isn't "C").
Defining the set of Japanese kanji would be harder.
How are they defined? It might be best to add a facility to add new character classes as a list of characters and ranges to include, something like: register_character_class("myname", "d-f"); Then we add all the Unicode block ranges as standard for wide character regexes. John.
Are the existing character-classes following a standard, or are you open to patches to extend them?
Yes, they follow the POSIX and ECMA script standards to give: ...
It might be nice to have at least: [:hiragana:] [:katakana:] [:hankaku_katakana:]
isn't that just [[:hiragana:][:katakana:]] ?
"hankaku" is half-width characters. In Shift-JIS they are encoded with a single byte. In unicode they are encoded in a different block to normal katakana.
[:wide_alpha:] [:wide_num:] [:wide_alphanum:]
There should be no need for those - [[:alpha:]] will detect wide character alphabetic characters perfectly well (provided the locale isn't "C").
That sounds like it has potential for problems, so I wonder if we're talking about the same thing; by "wide" I mean the character occupies the same amount of screen space as a kanji, which is twice as wide as an ASCII character, e.g. "Ａ１" rather than "A1". However, my main use case is not so much detecting with a regex as converting them to ASCII; e.g. given a list of email addresses, some of which the user has typed in with their Japanese IME still switched on, I'll convert to ASCII and then run each email address through a regex.
Defining the set of Japanese kanji would be harder.
How are they defined?
The Japanese kanji can be thought of as a subset of the Chinese characters, so the issue is where the subset ends, and there are various definitions. The Joyo kanji are approximately 2,000 common ones, but people's names often use others, and academics use more. I've not looked, but it is possible the Joyo kanji are scattered around Unicode. Simplest would be to define [:kanji:] as all Chinese characters; anyone needing to distinguish Japanese from Chinese could then use a lookup table.
It might be best to add a facility to add new character classes as a list of characters and ranges to include, something like:
register_character_class("myname", "d-f");
Then we add all the Unicode block ranges as standard for wide character regexes.
Sounds good. Do you mean in an extra include file, e.g. "regex/unicode_classes.h" ? Darren
It might be nice to have at least: [:hiragana:] [:katakana:] [:hankaku_katakana:]
isn't that just [[:hiragana:][:katakana:]] ?
"hankaku" is half-width characters. In Shift-JIS they are encoded with a single byte. In unicode they are encoded in a different block to normal katakana.
Sorry I misread your original.
[:wide_alpha:] [:wide_num:] [:wide_alphanum:]
There should be no need for those - [[:alpha:]] will detect wide character alphabetic characters perfectly well (provided the locale isn't "C").
That sounds like it has potential for problems so I wonder if we're talking about the same thing; by wide I mean the character occupies the same amount of screen space as a kanji, which is twice as wide as the ASCII character. E.g. "Ａ１" rather than "A1".
OK, do you mean what Unicode calls "Full Width" rather than "Half Width"?
However my main use case is not so much detecting with regex as converting them to ASCII; e.g. given a list of email addresses, some of which the user has typed in with their Japanese IME still switched on. I'll convert to ASCII then run the email address through a regex.
Defining the set of Japanese kanji would be harder.
How are they defined?
The Japanese kanji can be thought of as a subset of Chinese, so the issue is where the subset ends, and there are various definitions. Joyo kanji is approx 2000 common ones, but people's names often use others, and academics use more. I've not looked but it is possible the Joyo kanji are scattered around Unicode.
Simplest would be to define [:kanji:] as all Chinese characters, and anyone needing to distinguish Japanese from Chinese could then use a lookup table.
OK, I'm beginning to regret asking :-)
It might be best to add a facility to add new character classes as a list of
characters and ranges to include, something like:
register_character_class("myname", "d-f");
Then we add all the Unicode block ranges as standard for wide character regexes.
Sounds good. Do you mean in an extra include file, e.g. "regex/unicode_classes.h" ?
To be honest I haven't decided, I guess in the spirit of "only pay for what you use" that would be the best way. John.
participants (3)
- Darren Cook
- John Maddock
- jschmid