Re: [boost] [review] [text] Text formal review

16 Jun 2020

On Sun, Jun 14, 2020 at 7:25 AM Rainer Deyke via Boost
<boost@lists.boost.org> wrote:
...
On 14.06.20 01:25, Zach Laine via Boost wrote:
...
On Fri, Jun 12, 2020 at 4:15 PM Rainer Deyke via Boost
<boost@lists.boost.org> wrote:
...
A memmapped string_view /is/ a contiguous sequence of char.  I don't see
the difference.
The difference is mutability.  There's no perf concern with erasing
the first element of a string_view, if that's not even a supported
operation.
A /lot/ of strings, probably the vast majority, will never be mutated.
Ok, then those should more appropriately be string_views.
...
And for the rest, the majority will only be mutated by appending.
That does not help, unless the capacity is so large that a
reallocation is unnecessary.
...
Erasing the first element is a nice to have but expensive and rarely
used feature.  If you find yourself doing that a lot, then you probably
do want a rope.
Any mutation might cause a reallocation.  I named one of the
worst-case operations rhetorically, but appending is also bad if it
causes that reallocation.  It's not a question of what kind of
mutating operation you're doing, but whether you're mutating or not.
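
To make the capacity point concrete, here is a minimal sketch using plain std::string. The helper name append_must_reallocate is hypothetical and is not part of any library discussed in this thread.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if appending `extra` characters cannot fit in
// the current buffer, so the string has to allocate a larger buffer and
// copy its contents over. Any string_view into the old buffer would then
// dangle.
inline bool append_must_reallocate(const std::string& s, std::size_t extra) {
    return s.size() + extra > s.capacity();
}
```

Appending is cheap only while this returns false, for example after a sufficiently large reserve().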
...
...
...
I hadn't thought through the interface in detail.  I just saw that this
was a feature of the text layer, and thought it would be nice to have in
the unicode layer, because I don't want to use the text layer (in its
current form).
I don't need a detailed interface.  Pseudocode would be fine too.
insert_nfd(string, position, thing_to_insert)
// Insert 'thing_to_insert' into 'string' at 'position'.  Both 'string'
// and 'thing_to_insert' are required to be in NFD.  The area around the
// insertion is renormalized to NFD.
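
A rough sketch of what the pseudocode above could look like, under loud assumptions: text is held as UTF-32 in a std::u32string, and the ccc() function is a tiny hypothetical stand-in for the real Unicode canonical combining class table (it knows only three combining marks). Because both inputs are already fully decomposed, the only repair needed is canonical reordering of the combining marks that touch the two seams. This is not the interface of the library under review.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the Unicode canonical combining class lookup.
// A real implementation would consult the full Unicode Character Database.
inline int ccc(char32_t cp) {
    switch (cp) {
        case 0x0323: return 220;  // COMBINING DOT BELOW
        case 0x0301: return 230;  // COMBINING ACUTE ACCENT
        case 0x0308: return 230;  // COMBINING DIAERESIS
        default:     return 0;    // treated as a starter in this sketch
    }
}

// Insert `ins` into `s` at code point index `pos`. Both are assumed to be
// in NFD already; only the runs of combining marks around the two seams
// are put back into canonical order.
inline void insert_nfd(std::u32string& s, std::size_t pos,
                       const std::u32string& ins) {
    s.insert(pos, ins);

    auto reorder_run_at = [&s](std::size_t seam) {
        std::size_t first = seam;
        std::size_t last = seam;
        // Grow the range to cover every non-starter adjacent to the seam.
        while (first > 0 && ccc(s[first - 1]) != 0) --first;
        while (last < s.size() && ccc(s[last]) != 0) ++last;
        // Canonical ordering is a stable sort by combining class.
        std::stable_sort(s.begin() + first, s.begin() + last,
                         [](char32_t a, char32_t b) { return ccc(a) < ccc(b); });
    };

    reorder_run_at(pos);               // seam before the inserted text
    reorder_run_at(pos + ins.size());  // seam after the inserted text
}
```

For example, appending COMBINING DOT BELOW to "a" + COMBINING ACUTE ACCENT has to move the new mark in front of the accent, since class 220 sorts before class 230.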
I see -- no surprises here.  As I said, I like this idea a lot!
However, see below.
...
...
...
Having to renormalize at API boundaries can be prohibitively expensive.
Sure.  Anything can be prohibitively expensive in some context.  If
that's the case in a particular program, I think it is likely to be
unacceptable to use text::operator+(string_view) as well, since that
also does on-the-fly normalization.
Hopefully only on the string_view and the area immediately surrounding
the insertion.
No, that's why I picked string_view, and not text_view.  text_view
insertion does not normalize the incoming text, but string_view
insertion does.  This is in keeping with the philosophy:

- At program I/O boundaries (not all API boundaries), convert to UTF-8 and FCC.
- Internal interfaces that take UTF-8/FCC will not transcode or normalize.
- Internal interfaces that take non-UTF-8/FCC will transcode and
normalize as needed.

text::operator+(string_view sv) does not know the normalization of sv,
so it normalizes.  The alternative is clunky -- you have to make a new
string somewhere to normalize into, and then use operator+() on the
result.
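
The split between the two overloads can be sketched roughly as follows. Everything in namespace sketch is hypothetical: normalize() is a counting stand-in that does no real normalization, and the classes only mimic the shape of the idea, not the actual library types. A real implementation would also have to fix up the seam between the two pieces, which is left out here.

```cpp
#include <string>
#include <string_view>

namespace sketch {

// Hypothetical stand-in for a real normalizer. It only counts calls so the
// cost difference between the two overloads is observable.
inline int normalize_calls = 0;
inline std::string normalize(std::string_view sv) {
    ++normalize_calls;
    return std::string(sv);
}

class text;

// A view that can only be obtained from a text, so its contents are known
// to be normalized already.
class text_view {
public:
    std::string_view bytes() const { return sv_; }
private:
    friend class text;
    explicit text_view(std::string_view sv) : sv_(sv) {}
    std::string_view sv_;
};

class text {
public:
    text() = default;
    explicit text(std::string_view raw) : s_(normalize(raw)) {}

    // Unknown normalization: pay for normalizing the incoming bytes.
    text& operator+=(std::string_view raw) {
        s_ += normalize(raw);
        return *this;
    }

    // Known normalization: append without normalizing the incoming text.
    text& operator+=(text_view tv) {
        s_.append(tv.bytes());
        return *this;
    }

    text_view view() const { return text_view(s_); }
    const std::string& str() const { return s_; }

private:
    std::string s_;
};

}  // namespace sketch
```

The type of the argument carries the "already normalized" guarantee, so the caller never has to pass a flag or normalize into a temporary by hand.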
...
...
Someone, somewhere, has to pay
that cost if you want to use two chunks of text in
encoding/normalization A and B.  You might be able to keep working in
A for some text and keep working in B separately for other text, but I
think code that works like that is going to be hard to reason about,
and will be as common as code that freely mixes wstring and string
(and I mean not only at program boundaries).  That is, not very
common.
Which is why I want to avoid just that.
Your suggestions:
void f() {
    // renormalizes to FCC
    text::text t = api_function_that_returns_nfd();
    do_something_with(t);
    string s;
    text::normalize_to_nfd(t.extract(), back_inserter(s));
    api_function_that_accepts_nfd(s);
}
My suggestion:
void f() {
    text::text<nfd, std::string> t = api_function_that_returns_nfd();
    do_something_with(t);
    api_function_that_accepts_nfd(t.extract());
}
Right, I get it.  I just think you're leaving out the lack of
interoperability with text::text<nfc, std::wstring>, etc.  That's not
a trivial concern.
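
The interoperability concern can be illustrated with a small sketch. The names nf, basic_text, and can_add are hypothetical and only model the proposed text<nfd, std::string> shape. Once the normalization form and the underlying string are template parameters, each combination is a distinct type, and mixing them fails to compile unless conversions are written for every pair.

```cpp
#include <string>
#include <type_traits>
#include <utility>

enum class nf { c, d, fcc };

// Hypothetical model of a text type parameterized on normalization form
// and underlying string type.
template <nf Form, class String = std::string>
class basic_text {
public:
    // The caller promises that `s` is already normalized to Form.
    explicit basic_text(String s) : s_(std::move(s)) {}

    const String& extract() const { return s_; }

    // Only texts of the exact same form and string type can be combined.
    // Seam renormalization is omitted in this sketch.
    friend basic_text operator+(const basic_text& a, const basic_text& b) {
        return basic_text(a.s_ + b.s_);
    }

private:
    String s_;
};

// Detects whether `A + B` compiles.
template <class A, class B, class = void>
struct can_add : std::false_type {};

template <class A, class B>
struct can_add<A, B,
               std::void_t<decltype(std::declval<A>() + std::declval<B>())>>
    : std::true_type {};
```

With these definitions, two basic_text<nf::d> values add, but a basic_text<nf::d> and a basic_text<nf::c, std::wstring> do not.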

If you have code that needs to stay NFD as in your example, you should
be able to use std::string and insert_nfd() and friends.  This is yet
another case where a perf tradeoff forces you to write a bit more
code.  That does not seem onerous to me.

Zach