On Wed, Oct 30, 2019 at 8:03 AM Rainer Deyke via Boost-users < boost-users@lists.boost.org> wrote:
On 26.10.19 18:41, Zach Laine via Boost-users wrote:
NFC, very close to FCC, is more popular, due to its compactness. I picked
the normalization form with the most readily available time and space optimizations, and then stuck to just that one -- the alternative is many text types with different normalizations having to interoperate, which sounds like hell.
I can understand that, all other things being equal, the more compact form might be preferable. I mean, if you know nothing about Unicode normalization forms other than that one is more compact than the other, then you might as well pick the more compact one, right?
But all other things are clearly /not/ equal, or you would just use NFC. And the difference in compactness between NFC and NFD is completely trivial. I challenge you to find any real-world text where the difference in size between NFC and NFD is big enough that I should care about it, both in absolute and relative terms.
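One way to put numbers on this for a given text is to normalize it both ways and compare the encoded sizes. The following is only a rough sketch using Python's standard unicodedata module. The helper name utf8_sizes and the sample sentence are made up for illustration and are not real-world data.

```python
import unicodedata

def utf8_sizes(text):
    """Return (nfc_bytes, nfd_bytes): the UTF-8 size of text in each form."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return len(nfc.encode("utf-8")), len(nfd.encode("utf-8"))

# Hypothetical sample; substitute whatever corpus you actually care about.
sample = "Voil\u00e0 un caf\u00e9 tr\u00e8s serr\u00e9"
nfc_size, nfd_size = utf8_sizes(sample)
print("NFC bytes:", nfc_size)
print("NFD bytes:", nfd_size)
print("NFD/NFC ratio:", nfd_size / nfc_size)
```

Running this over a representative corpus, rather than a single sentence, would show whether the size difference matters for a particular application.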
I consider FCC a non-solution to a non-problem. The advantage of NFC over NFD is not compactness, but compatibility with interfaces that expect NFC. Since FCC does not provide that advantage, there is no reason to choose FCC over NFD. On the other hand, there are several good reasons for choosing NFD over FCC. Aside from the obvious one - compatibility with interfaces that expect NFD - there's also cleaner, simpler code with fewer surprises. For example, it is a completely straightforward operation to replace all acute accents in an NFD text with grave accents or to remove acute accents entirely, whereas the FCC equivalent requires effectively transcoding to NFD.
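To illustrate the accent example, here is a minimal sketch in Python using the standard unicodedata module. It assumes "acute accent" means the combining character U+0301 and "grave accent" means U+0300. The function names are hypothetical. Python's unicodedata does not offer FCC, so the last lines use a precomposed (NFC) string as a stand-in for a composed form.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT
GRAVE = "\u0300"  # COMBINING GRAVE ACCENT

def acute_to_grave(nfd_text):
    """Replace every combining acute accent in NFD text with a grave accent."""
    return nfd_text.replace(ACUTE, GRAVE)

def strip_acute(nfd_text):
    """Remove every combining acute accent from NFD text."""
    return nfd_text.replace(ACUTE, "")

nfd = unicodedata.normalize("NFD", "caf\u00e9")  # 'e' followed by U+0301
print(unicodedata.normalize("NFC", acute_to_grave(nfd)))
print(strip_acute(nfd))

# On a precomposed string the same replace finds nothing to change,
# because the accent is folded into the single code point U+00E9.
composed = unicodedata.normalize("NFC", "caf\u00e9")
print(acute_to_grave(composed) == composed)
```

With decomposed text the operation is a plain substring replace. With composed text, each precomposed character would first have to be decomposed, which is the transcoding step mentioned above.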
In summary, I think you should support NFD text types. Either in addition to FCC or instead of it.
NFD is not an unreasonable choice, though I don't know why you'd want to do a search-replace that changes all the accents from acute to grave (is that a real use-case, or just a for-instance?). Unfortunately, the fast-path of the collation algorithm implementation requires FCC, which is why ICU uses it, and one of the main reasons why I picked it. If we had NFD strings, we'd have to normalize them to FCC first, if I'm not mistaken. (Though I should verify that with a test.)

Zach