On 30.10.19 16:56, Zach Laine via Boost-users wrote:
> On Wed, Oct 30, 2019 at 8:03 AM Rainer Deyke via Boost-users <boost-users@lists.boost.org> wrote:
>> In summary, I think you should support NFD text types. Either in addition to FCC or instead of it.
> NFD is not an unreasonable choice, though I don't know why you'd want to do a search-replace that changes all the accents from acute to grave (is that a real use-case, or just a for-instance?).
The specific example is just hypothetical, but wanting to operate on diacritics and base characters separately is real enough. Better examples: checking that Chinese pinyin syllables have their tone markers on the correct vowel. Or collecting statistics on the use of diacritics in a text. Or testing if a font has all of the glyphs needed to render a text. Or replacing a diacritic that's on my keyboard layout with another one that's not. Or even just collation.
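To make the statistics example concrete, here's a rough, untested sketch (plain ICU character properties, nothing to do with any particular Boost.Text interface, and the names are made up): once the text is in NFD, every diacritic is its own code point, so the whole job is a classify-and-count loop.

#include <map>
#include <string>
#include <unicode/uchar.h>

// Sketch only, not Boost.Text API: tally the combining marks in a run of
// code points that is already in NFD.  NFD splits every diacritic out
// into its own code point, so this is just a loop over code points.
std::map<char32_t, std::size_t> diacritic_counts(std::u32string const & nfd_text)
{
    std::map<char32_t, std::size_t> counts;
    for (char32_t cp : nfd_text) {
        int8_t const cat = u_charType(static_cast<UChar32>(cp));
        if (cat == U_NON_SPACING_MARK || cat == U_COMBINING_SPACING_MARK ||
            cat == U_ENCLOSING_MARK) {
            ++counts[cp];
        }
    }
    return counts;
}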
> Unfortunately, the fast-path of the collation algorithm implementation requires FCC, which is why ICU uses it, and one of the main reasons why I picked it. If we had NFD strings, we'd have to normalize them to FCC first, if I'm not mistaken. (Though I should verify that with a test.)
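(For reference, my understanding is that FCC is what ICU's Normalizer2 produces when the "nfc" data is used in contiguous-composition mode, so the NFD-to-FCC pass you describe would presumably look roughly like the sketch below. This is plain ICU, not anything Boost.Text itself provides, and I haven't tested it.)

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Sketch only: normalize input (e.g. NFD text) to FCC using ICU.
// FCC is ICU's "nfc" data applied in contiguous-composition mode.
icu::UnicodeString to_fcc(icu::UnicodeString const & text)
{
    UErrorCode ec = U_ZERO_ERROR;
    icu::Normalizer2 const * fcc = icu::Normalizer2::getInstance(
        nullptr, "nfc", UNORM2_COMPOSE_CONTIGUOUS, ec);
    icu::UnicodeString result = fcc->normalize(text, ec);
    // Real code would check ec after both calls.
    return result;
}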
I find that surprising, since FCC more than any other normalization form mixes precomposed and decomposed characters. But I will say this for FCC: at least it's easy to transcode from FCC to NFD. It could even be done in a fairly straightforward iterator adapter.

-- 
Rainer Deyke (rainerd@eldwood.com)
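P.S. To make the "straightforward iterator adapter" claim a bit more concrete: since FCC text satisfies the FCD condition, replacing each code point with its full canonical decomposition, with no reordering at all, should already give NFD. Written eagerly with ICU (names made up, untested), the per-code-point step would be something like the following; the same logic could sit behind a lazy iterator adapter.

#include <string>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <unicode/utf16.h>

// Sketch only: eager FCC -> NFD transcoding.  Correctness depends on the
// input satisfying FCD, which FCC guarantees: decomposing each code point
// in place, without reordering, already yields canonical order.
std::u32string fcc_to_nfd(std::u32string const & fcc_text)
{
    UErrorCode ec = U_ZERO_ERROR;
    icu::Normalizer2 const * nfd = icu::Normalizer2::getNFDInstance(ec);
    std::u32string out;
    icu::UnicodeString decomp;
    for (char32_t cp : fcc_text) {
        if (nfd->getDecomposition(static_cast<UChar32>(cp), decomp)) {
            // Copy the decomposition's code points (it is stored as UTF-16).
            for (int32_t i = 0; i < decomp.length();) {
                UChar32 c = decomp.char32At(i);
                out.push_back(static_cast<char32_t>(c));
                i += U16_LENGTH(c);
            }
        } else {
            out.push_back(cp);  // No decomposition mapping: already NFD.
        }
    }
    return out;
}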