Re: [Boost-users] Interest in a Unicode library for Boost?

30 Oct 2019


On Wed, Oct 30, 2019 at 8:03 AM Rainer Deyke via Boost-users <boost-users@lists.boost.org> wrote:
...
...
On 26.10.19 18:41, Zach Laine via Boost-users wrote:
...
NFC, very close to FCC, is more popular, due to its compactness.  I picked
the normalization form with the most readily available time and space
optimizations, and then stuck to just that one -- the alternative is many
text types with different normalizations having to interoperate, which
sounds like hell.
I can understand that, all other things being equal, the more compact
form might be preferable.  I mean, if you know nothing about Unicode
normalization forms other than that one is more compact than the other,
then you might as well pick the more compact one, right?
But all other things are clearly /not/ equal, or you would just use NFC.
  And the difference in compactness between NFC and NFD is completely
trivial.  I challenge you to find any real-world text where the
difference in size between NFC and NFD is big enough that I should care
about it, both in absolute and relative terms.
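
One way to check the size claim on your own text is to normalize it both ways and compare encoded lengths. The following is only a sketch using Python's standard unicodedata module; the helper name utf8_sizes and the sample sentence are made up for illustration and are not part of any library discussed in this thread.

```python
import unicodedata


def utf8_sizes(text):
    """Return (NFC size, NFD size) of text, measured in UTF-8 bytes."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return len(nfc.encode("utf-8")), len(nfd.encode("utf-8"))


# Hypothetical sample text containing two precomposed e-acute characters.
sample = "Le caf\u00e9 est ferm\u00e9."
nfc_size, nfd_size = utf8_sizes(sample)
print("NFC bytes:", nfc_size)
print("NFD bytes:", nfd_size)
print("relative growth:", (nfd_size - nfc_size) / nfc_size)
```

Running the same function over a real corpus would give the absolute and relative differences asked about above.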
I consider FCC a non-solution to a non-problem.  The advantage of NFC
over NFD is not compactness, but compatibility with interfaces that
expect NFC.  Since FCC does not provide that advantage, there is no
reason to choose FCC over NFD.  On the other hand, there are several
good reasons for choosing NFD over FCC.  Aside from the obvious one -
compatibility with interfaces that expect NFD - there's also cleaner,
simpler code with fewer surprises.  For example, it is a completely
straightforward operation to replace all acute accents in an NFD text
with grave accents or to remove acute accents entirely, whereas the FCC
equivalent requires effectively transcoding to NFD.
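
The accent example can be sketched as follows. This assumes Python's standard unicodedata module, which offers NFC and NFD but not FCC, so NFC stands in here as the precomposed form; the function names are hypothetical.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT
GRAVE = "\u0300"  # COMBINING GRAVE ACCENT


def acute_to_grave_nfd(nfd_text):
    """Replace every combining acute accent with a combining grave accent.

    Assumes the input is already in NFD, where each accent is its own
    code point. Both marks share combining class 230, so the canonical
    ordering is preserved by the substitution.
    """
    return nfd_text.replace(ACUTE, GRAVE)


def strip_acute_nfd(nfd_text):
    """Remove every combining acute accent from NFD text."""
    return nfd_text.replace(ACUTE, "")


nfd = unicodedata.normalize("NFD", "caf\u00e9")  # 'e' followed by U+0301
nfc = unicodedata.normalize("NFC", "caf\u00e9")  # single precomposed U+00E9

print(ascii(acute_to_grave_nfd(nfd)))
print(ascii(strip_acute_nfd(nfd)))
# On precomposed text the same search finds nothing to replace.
print(acute_to_grave_nfd(nfc) == nfc)
```

On the precomposed input the accent is fused into U+00E9, so a plain code-point search misses it, which is the reason a composed form needs a decomposition step first.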
In summary, I think you should support NFD text types.  Either in
addition to FCC or instead of it.
NFD is not an unreasonable choice, though I don't know why you'd want to do
a search-replace that changes all the accents from acute to grave (is that
a real use-case, or just a for-instance?).  Unfortunately, the fast-path of
the collation algorithm implementation requires FCC, which is why ICU uses
it, and one of the main reasons why I picked it.  If we had NFD strings,
we'd have to normalize them to FCC first, if I'm not mistaken.  (Though I
should verify that with a test.)

Zach