Are the existing character-classes following a standard, or are you open to patches to extend them?
Yes, they follow the POSIX and ECMAScript standards to give: ...
It might be nice to have at least: [:hiragana:] [:katakana:] [:hankaku_katakana:]
Isn't that just [[:hiragana:][:katakana:]]?
"Hankaku" means half-width characters. In Shift-JIS they are encoded with a single byte. In Unicode they are encoded in a different block from normal katakana.
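To make the block distinction concrete, here is a rough sketch of classifying a single code point by Unicode block. The names `KanaKind` and `classify_kana` are made up for illustration and are not part of any regex library. The ranges are the Hiragana block (U+3040..U+309F), the Katakana block (U+30A0..U+30FF), and the halfwidth katakana portion of the Halfwidth and Fullwidth Forms block (U+FF61..U+FF9F, which also covers the halfwidth punctuation and sound marks).

```cpp
// Hypothetical helper: which kana block does a code point belong to?
enum class KanaKind { None, Hiragana, Katakana, HankakuKatakana };

constexpr KanaKind classify_kana(char32_t c) {
    if (c >= 0x3040 && c <= 0x309F) return KanaKind::Hiragana;         // Hiragana block
    if (c >= 0x30A0 && c <= 0x30FF) return KanaKind::Katakana;         // Katakana block
    if (c >= 0xFF61 && c <= 0xFF9F) return KanaKind::HankakuKatakana;  // halfwidth katakana
    return KanaKind::None;
}
```

Because the half-width forms sit in their own range, a class built only from the Hiragana and Katakana blocks would not match them.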
[:wide_alpha:] [:wide_num:] [:wide_alphanum:]
There should be no need for those - [[:alpha:]] will detect wide character alphabetic characters perfectly well (provided the locale isn't "C").
That sounds like it has potential for problems, so I wonder if we're talking about the same thing. By wide I mean the character occupies the same amount of screen space as a kanji, which is twice as wide as the ASCII character. E.g. "Ａ１" rather than "A1". However, my main use case is not so much detecting them with a regex as converting them to ASCII; e.g. given a list of email addresses, some of which the user has typed in with their Japanese IME still switched on, I'll convert to ASCII and then run the email address through a regex.
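For the conversion use case, a minimal sketch, assuming the input has already been decoded to UTF-32: the fullwidth forms U+FF01..U+FF5E mirror ASCII U+0021..U+007E at a fixed offset of 0xFEE0, and the ideographic space U+3000 can be mapped to an ordinary space. The function names `to_ascii_width` and `fullwidth_to_ascii` are invented for this example.

```cpp
#include <string>

// Map one fullwidth code point to its ASCII counterpart;
// anything outside the fullwidth ASCII range passes through unchanged.
constexpr char32_t to_ascii_width(char32_t c) {
    if (c >= 0xFF01 && c <= 0xFF5E)   // fullwidth '!' .. '~'
        return c - 0xFEE0;            // U+FF01 - 0xFEE0 == U+0021
    if (c == 0x3000)                  // ideographic (fullwidth) space
        return U' ';
    return c;
}

// Apply the mapping to every code point of a UTF-32 string.
std::u32string fullwidth_to_ascii(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) out.push_back(to_ascii_width(c));
    return out;
}
```

The result could then be narrowed to a byte string and passed to the email regex. Kana and kanji are left alone, so an address that still contains them would simply fail to match.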
Defining the set of Japanese kanji would be harder.
How are they defined?
The Japanese kanji can be thought of as a subset of Chinese characters, so the issue is where the subset ends, and there are various definitions. The Joyo kanji are approximately 2000 common ones, but people's names often use others, and academics use more. I've not looked, but it is possible the Joyo kanji are scattered around Unicode. Simplest would be to define [:kanji:] as all Chinese characters, and anyone needing to distinguish Japanese from Chinese could then use a lookup table.
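A rough sketch of that two-step idea, with hypothetical names (`is_cjk_ideograph`, `is_listed_kanji`): a broad range test covering the CJK Unified Ideographs block (U+4E00..U+9FFF) and Extension A (U+3400..U+4DBF), plus a caller-supplied lookup table for whatever subset counts as "Japanese". The later extension blocks are left out here for brevity, and the table contents are the caller's responsibility.

```cpp
#include <set>

// Broad test: any Chinese character in the two main ideograph blocks.
constexpr bool is_cjk_ideograph(char32_t c) {
    return (c >= 0x4E00 && c <= 0x9FFF)    // CJK Unified Ideographs
        || (c >= 0x3400 && c <= 0x4DBF);   // CJK Unified Ideographs Extension A
}

// Narrow test: an ideograph that also appears in a caller-supplied table
// (for example, a Joyo kanji list loaded from elsewhere).
inline bool is_listed_kanji(char32_t c, const std::set<char32_t>& table) {
    return is_cjk_ideograph(c) && table.count(c) != 0;
}
```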
It might be best to add a facility to add new character classes as a list of characters and ranges to include, something like:
register_character_class("myname", "d-f");
Then we add all the Unicode block ranges as standard for wide character regexes.
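As an illustration only, such a facility might look something like the sketch below. The `CharClassRegistry` type and its `matches` member are assumptions made for this example. Only the `register_character_class` name and the "d-f" style spec come from the proposal above, and nothing here reflects an existing library interface.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using CharRange = std::pair<char32_t, char32_t>;   // inclusive [first, second]

class CharClassRegistry {
public:
    // Parse a spec of single characters and "a-b" ranges, e.g. U"d-fxz",
    // and store the resulting ranges under `name`.
    void register_character_class(const std::string& name,
                                  const std::u32string& spec) {
        std::vector<CharRange> ranges;
        for (std::size_t i = 0; i < spec.size(); ++i) {
            if (i + 2 < spec.size() && spec[i + 1] == U'-') {
                ranges.emplace_back(spec[i], spec[i + 2]);   // range "a-b"
                i += 2;
            } else {
                ranges.emplace_back(spec[i], spec[i]);       // single character
            }
        }
        classes_[name] = std::move(ranges);
    }

    // True if `c` falls in any range registered under `name`.
    bool matches(const std::string& name, char32_t c) const {
        auto it = classes_.find(name);
        if (it == classes_.end()) return false;
        for (const CharRange& r : it->second)
            if (c >= r.first && c <= r.second) return true;
        return false;
    }

private:
    std::map<std::string, std::vector<CharRange>> classes_;
};
```

With something like this, the standard Unicode blocks could be registered up front, e.g. `reg.register_character_class("hiragana", U"\u3040-\u309F")`, and user code could add its own classes the same way.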
Sounds good. Do you mean in an extra include file, e.g. "regex/unicode_classes.h"?

Darren