On Fri, Jun 12, 2020 at 4:15 PM Rainer Deyke via Boost <boost@lists.boost.org> wrote:
On 12.06.20 21:56, Zach Laine via Boost wrote:
(And no, unencoded_rope would not be a better choice. I can't memmap ropes, but I can memmap string_views.)
You can easily memmap string_views 2GB at a time, though. That is a simple workaround for this corner case you have mentioned, but it is truly a corner case compared to the much more common case of using strings for holding contiguous sequences of char. Contiguous sequences of char really should not be anywhere near 2GB, for efficiency reasons.
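On POSIX, for instance, the workaround is just mapping windows of the file and wrapping each one in a string_view. A minimal sketch (page alignment of the offset and all error handling are elided):

    #include <cstddef>
    #include <string_view>
    #include <sys/mman.h>
    #include <sys/types.h>

    // View one read-only window of a large file as chars.  The offset
    // must be page-aligned, and mmap()'s failure return is not checked.
    std::string_view map_window(int fd, off_t offset, std::size_t len)
    {
        void * p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, offset);
        return {static_cast<char const *>(p), len};
    }

Walking the whole file is then a loop over such windows, each comfortably under 2GB.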
A memmapped string_view /is/ a contiguous sequence of char. I don't see the difference.
The difference is mutability. There's no perf concern with erasing the first element of a string_view, since that's not even a supported operation.
I don't really get what you mean about the runtime cost. Could you be more explicit?
Somewhere in the implementation of operator[] and operator(), there has to be a branch on index < 0 (or >= 0) in order for that negative index trick to work, which the compiler can't always optimize away. Branches are often affordable but they're not free.
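Concretely, the accessor has to do something like this (a sketch of the technique, not the library's actual code):

    #include <cstddef>
    #include <string_view>

    // Python-style negative indexing: every access pays a sign test,
    // and the compiler cannot always hoist or eliminate it.
    inline char at(std::string_view sv, std::ptrdiff_t i)
    {
        if (i < 0)                                        // the branch
            i += static_cast<std::ptrdiff_t>(sv.size());  // in question
        return sv[static_cast<std::size_t>(i)];
    }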
Ah, I see, thanks. Would it make you feel better if negative indexing were only used when getting substrings?
One thing that appears to be missing is normalization-preserving append/insert/erase operations. These are available in the text layer, but that means being tied to the specific text classes provided by that layer.
Hm. I had not considered making these available as algorithms, and I generally like the approach. But could you be more specific? In particular, do you mean that insert() would take a container C and do C.insert(), then renormalize? This is the approach used by C++20's erase() and erase_if() free functions. Or did you mean something else?
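Something along these lines, say (a rough sketch; renormalize_around() is a hypothetical placeholder for whatever renormalization primitive the Unicode layer would expose):

    #include <cstddef>
    #include <string_view>

    // Normalization-preserving insert as a free function, in the style
    // of C++20's std::erase()/std::erase_if().  renormalize_around() is
    // hypothetical, standing in for a real renormalization algorithm.
    template<typename Container>
    void insert(Container & c, std::size_t offset, std::string_view snippet)
    {
        c.insert(c.begin() + offset, snippet.begin(), snippet.end());
        // A real implementation would widen [offset, offset + size) to
        // the nearest normalization-stable boundaries before fixing up.
        renormalize_around(c, offset, offset + snippet.size());  // hypothetical
    }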
I hadn't thought through the interface in detail. I just saw that this was a feature of the text layer, and thought it would be nice to have in the unicode layer, because I don't want to use the text layer (in its current form).
I don't need a detailed interface. Pseudocode would be fine too.
Text layer: overall, I don't like it.
On one hand, there is the gratuitous restriction to FCC. Why can't other normalization forms be supported, given that the unicode layer supports them?
Here is the philosophy: If you have a template parameter for UTF-encoding and/or normalization form, you have an interoperability problem in your code. Multiple text<...>'s may exist for which there is no convenient or efficient interop story. If you instead convert all of your text to one UTF+normalization that you use throughout your code, you can do all your work in that one scheme and transcode and/or renormalize at the program input/output boundaries.
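In code, the intended shape is something like this (a sketch only; to_utf8() and normalize_fcc() are hypothetical stand-ins, not real Boost.Text names):

    #include <string>
    #include <string_view>

    // Hypothetical stand-ins for the library's transcoding and
    // normalization algorithms:
    std::string to_utf8(std::u16string_view utf16);
    void normalize_fcc(std::string & utf8);

    // Pick one scheme (say UTF-8 + FCC), convert once at the program
    // boundary, and use only that representation internally.
    std::string read_input(std::u16string_view platform_input)
    {
        std::string utf8 = to_utf8(platform_input);  // hypothetical
        normalize_fcc(utf8);                         // hypothetical
        return utf8;  // everything downstream assumes UTF-8 + FCC
    }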
Having to renormalize at API boundaries can be prohibitively expensive.
Sure. Anything can be prohibitively expensive in some context. If that's the case in a particular program, I think it is likely to be unacceptable to use text::operator+(string_view) as well, since that also does on-the-fly normalization. Someone, somewhere, has to pay that cost if you want to use two chunks of text in encoding/normalization A and encoding/normalization B.

You might be able to keep working in A for some text and in B for other text, but I think code that works like that is going to be hard to reason about, and will be about as common as code that freely mixes wstring and string (and I mean not only at program boundaries). That is, not very common. The majority case is that texts have to be able to interop arbitrarily within your program, so you need to pay the conversion cost somewhere eventually anyway.

FWIW, I'm planning to write standardization papers for the Unicode layer stuff for C++23, and the text stuff in the C++26 timeframe. My hope is that we will adopt my text design here into Boost in plenty of time to see whether it is actually as workable as I claim. I'm open to the idea of being wrong about its design, and to changing it to a template if the non-template design turns out to be problematic.
Because, again, I want there to be trivial interop. Having text<text::string> and text<std::string> serves what purpose exactly? That is, I have never seen a compelling use case for needing both at the same time. I'm open to persuasion, of course.
The advantage of text<std::string> is API interop with functions that accept std::string arguments.
Sure. That exists now, though it does require a copy. It could also be done via a move if I replace text::string with std::string within text::text, which I expect to do as a result of this review.
I'm not sure what the advantage of text<boost::text::string> is. But if we accept that boost::text::rope (which would just be text<boost::text::unencoded_rope> in my scheme)
That does not work. Strings and ropes have different APIs.
is useful, then it logically follows that text<boost::text::string> could also be useful.
That's what I don't get. Could you explain how text<A> and text<B> are useful in a specific case? "Could also be useful" is not sufficient motivation to me. I understand the impulse, but I think that veers into over-generality in a way that I have found to be problematic over and over in my career.
.../the_unicode_layer/searching.html: the note at the end of the page is wrong, assuming you implemented the algorithms correctly. The concerns for searching NFD strings are similar to the concerns for searching FCC strings.
In both FCC and NFD:

- There is a distinction between A+grave+acute and A+acute+grave, because they are not canonically equivalent.
- A+grave is a partial grapheme match for A+grave+acute.
- A+acute is not a partial grapheme match for A+grave+acute.
- A+grave is not a partial grapheme match for A+acute+grave.
- A+acute is a partial grapheme match for A+acute+grave.

But:

- A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.
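That last point is easy to demonstrate at the code point level (a self-contained example, with std::u32string standing in for a code point sequence):

    #include <cassert>
    #include <string>

    int main()
    {
        // A + grave + acute, in each normalization form:
        std::u32string nfd = U"A\u0300\u0301";  // NFD: fully decomposed
        std::u32string fcc = U"\u00C0\u0301";   // FCC: A+grave composed
        // A code-point-level search for 'A' half-matches the grapheme
        // in NFD, but in FCC there is no bare 'A' code point to match.
        assert(nfd.find(U'A') == 0);
        assert(fcc.find(U'A') == std::u32string::npos);
    }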
Hm. That note was added because the specific case mentioned fails to work for NFD, but works for FCC and NFC. I think the grapheme matching analysis above addresses something different from what the note is talking about: the note is concerned with code-point-level searches producing (perhaps) surprising results. The grapheme-based search does not, so which code points are matched by a particular grapheme does not seem to be relevant. Perhaps I'm missing something?
I am talking about code point level matches here. ("Partial grapheme match" means "matches some code points within a grapheme, but not the whole grapheme".)
Ah, I see. Thanks. I'll update the note.

Zach