On 30.10.19 16:56, Zach Laine via Boost-users wrote:
> On Wed, Oct 30, 2019 at 8:03 AM Rainer Deyke via Boost-users <boost-users@lists.boost.org> wrote:
>> In summary, I think you should support NFD text types. Either in addition to FCC or instead of it.
> NFD is not an unreasonable choice, though I don't know why you'd want to do a search-replace that changes all the accents from acute to grave (is that a real use-case, or just a for-instance?).
The specific example is just hypothetical, but wanting to operate on diacritics and base characters separately is real enough. Better examples: checking that Chinese pinyin syllables have their tone markers on the correct vowel. Or collecting statistics on the use of diacritics in a text. Or testing if a font has all of the glyphs needed to render a text. Or replacing a diacritic that's on my keyboard layout with another one that's not. Or even just collation.
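To make the statistics example concrete, here's a rough, untested sketch (plain ICU character properties, nothing to do with any particular Boost.Text interface, and the names are made up): once the text is in NFD, every diacritic is its own code point, so the whole job is a classify-and-count loop.

#include <map>
#include <string>
#include <unicode/uchar.h>

// Sketch only, not Boost.Text API: tally the combining marks in a run of
// code points that is already in NFD.  NFD splits every diacritic out
// into its own code point, so this is just a loop over code points.
std::map<char32_t, std::size_t> diacritic_counts(std::u32string const & nfd_text)
{
    std::map<char32_t, std::size_t> counts;
    for (char32_t cp : nfd_text) {
        int8_t const cat = u_charType(static_cast<UChar32>(cp));
        if (cat == U_NON_SPACING_MARK || cat == U_COMBINING_SPACING_MARK ||
            cat == U_ENCLOSING_MARK) {
            ++counts[cp];
        }
    }
    return counts;
}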
> Unfortunately, the fast-path of the collation algorithm implementation requires FCC, which is why ICU uses it, and one of the main reasons why I picked it. If we had NFD strings, we'd have to normalize them to FCC first, if I'm not mistaken. (Though I should verify that with a test.)
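(For reference, my understanding is that FCC is what ICU's Normalizer2 produces when the "nfc" data is used in contiguous-composition mode, so the NFD-to-FCC pass you describe would presumably look roughly like the sketch below. This is plain ICU, not anything Boost.Text itself provides, and I haven't tested it.)

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Sketch only: normalize input (e.g. NFD text) to FCC using ICU.
// FCC is ICU's "nfc" data applied in contiguous-composition mode.
icu::UnicodeString to_fcc(icu::UnicodeString const & text)
{
    UErrorCode ec = U_ZERO_ERROR;
    icu::Normalizer2 const * fcc = icu::Normalizer2::getInstance(
        nullptr, "nfc", UNORM2_COMPOSE_CONTIGUOUS, ec);
    icu::UnicodeString result = fcc->normalize(text, ec);
    // Real code would check ec after both calls.
    return result;
}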
I find that surprising, since FCC more than any other normalization form mixes precomposed and decomposed characters. But I will say this for FCC: at least it's easy to transcode from FCC to NFD. It could even be done in a fairly straightforward iterator adapter.

-- 
Rainer Deyke (rainerd@eldwood.com)
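P.S. To make the "straightforward iterator adapter" claim a bit more concrete: since FCC text satisfies the FCD condition, replacing each code point with its full canonical decomposition, with no reordering at all, should already give NFD. Written eagerly with ICU (names made up, untested), the per-code-point step would be something like the following; the same logic could sit behind a lazy iterator adapter.

#include <string>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <unicode/utf16.h>

// Sketch only: eager FCC -> NFD transcoding.  Correctness depends on the
// input satisfying FCD, which FCC guarantees: decomposing each code point
// in place, without reordering, already yields canonical order.
std::u32string fcc_to_nfd(std::u32string const & fcc_text)
{
    UErrorCode ec = U_ZERO_ERROR;
    icu::Normalizer2 const * nfd = icu::Normalizer2::getNFDInstance(ec);
    std::u32string out;
    icu::UnicodeString decomp;
    for (char32_t cp : fcc_text) {
        if (nfd->getDecomposition(static_cast<UChar32>(cp), decomp)) {
            // Copy the decomposition's code points (it is stored as UTF-16).
            for (int32_t i = 0; i < decomp.length();) {
                UChar32 c = decomp.char32At(i);
                out.push_back(static_cast<char32_t>(c));
                i += U16_LENGTH(c);
            }
        } else {
            out.push_back(cp);  // No decomposition mapping: already NFD.
        }
    }
    return out;
}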