On Fri, Jun 12, 2020 at 12:36 PM Rainer Deyke via Boost wrote:
My name is Rainer Deyke. My experience with the problem domain is mostly as a user of other Unicode libraries, ICU in particular. Here is my formal review of Boost.Text.
- What is your evaluation of the library's:
* Design
String layer: overall, a solution looking for a problem.
The problem this library was designed to solve was the problem that I hate std::string. :) It solves this problem for me quite nicely by being a minimal alternative to same. That being said, as I mentioned earlier in this thread, the string layer (except for unencoded_rope) can all just go away if that is reviewers' preference.
The use of 'int' as the size_type in string/string_view is a time bomb waiting to go off. I don't /typically/ have strings longer than 2GB, but I do use some text-based file formats for which there is no /inherent/ reason why the individual files can't exceed 2GB.
There is also no inherent reason why max_size() is less than SIZE_MAX, and yet it is on every implementation popular today. To be clear, my point is that typical use is going to be well within either of these for mutable string data. Anything atypical can use something else.
(And no, unencoded_rope would not be a better choice. I can't memmap ropes, but I can memmap string_views.)
You can easily memmap string_views 2GB at a time, though. That is a simple workaround for this corner case you have mentioned, but it is truly a corner case compared to the much more common case of using strings for holding contiguous sequences of char. Contiguous sequences of char really should not be anywhere near 2GB, for efficiency reasons.
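For what it's worth, a windowed memmap along those lines might look roughly like this (a POSIX-only sketch with minimal error handling; a real version would also have to cope with records that straddle a window boundary):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <algorithm>
    #include <cstddef>
    #include <string_view>

    // Visit a large file one 2 GB window at a time, exposing each window
    // as a string_view over the mapped bytes.
    template <typename Fn>
    void for_each_window(char const * path, Fn fn)
    {
        int const fd = ::open(path, O_RDONLY);
        if (fd < 0)
            return;
        struct stat st;
        ::fstat(fd, &st);
        off_t const window = off_t(1) << 31;  // 2 GB; a multiple of the page size
        for (off_t offset = 0; offset < st.st_size; offset += window) {
            std::size_t const len =
                std::min<off_t>(window, st.st_size - offset);
            void * p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, offset);
            if (p == MAP_FAILED)
                break;
            fn(std::string_view(static_cast<char const *>(p), len));
            ::munmap(p, len);
        }
        ::close(fd);
    }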
I like the use of negative indices for indexing from the end in Python, but I am ambivalent about using the same feature in C++. None of the other types I regularly use in C++ work like that, and the runtime cost involved is a lot more noticeable in C++.
I don't really get what you mean about the runtime cost. Could you be more explicit?
Also, using the same type for counting-from-end indices and counting-from-beginning indices seems unsafe. A separate type for counting-from-end would be safer and faster, at the cost of being more syntactically heavy.
Interesting. Do you mean something like a strong typedef for "index" and "negative-index", or something else? Some example code might be instructive.
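For illustration, the strong-typedef direction might look something like this (hypothetical names, not anything from the library); overloading on the separate type removes the runtime sign check and makes the intent explicit at the call site:

    #include <cstddef>
    #include <string>

    // A distinct type for "offset measured back from the end".
    struct from_end_t
    {
        std::ptrdiff_t offset;  // distance back from one-past-the-end
    };

    constexpr from_end_t from_end(std::ptrdiff_t offset) { return {offset}; }

    char char_at(std::string const & s, std::ptrdiff_t i)
    {
        return s[static_cast<std::size_t>(i)];
    }

    char char_at(std::string const & s, from_end_t i)
    {
        return s[s.size() - static_cast<std::size_t>(i.offset)];
    }

    // Usage: char_at(s, 0) is the first char; char_at(s, from_end(1)) is the last.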
My first impression of unencoded_rope_view was that it's useless. Then I saw the undo/redo example for segmented_vector, and I have since amended my opinion to "mostly useless". To clarify:
- In the most common use for strings, where the string is created once, used, and discarded without ever being modified, unencoded_rope is worse than a contiguous string: slower lookup, slower traversal, and worse data locality.
Correct. That's not its use case.
- The most common forms of "modification", in my experience, are most efficiently done by reconstructing the entire string, which again makes unencoded_rope useless. (And concatenation, I guess, for which unencoded_rope is actually great.)
Well, that really depends on the length of the string, the number and kind of future edits, etc.
- Even if my modifications are primarily of the type for which unencoded_rope is optimized, there would need to be a fairly high modification-to-read-operation ratio for unencoded_rope to be a net win.
Agreed. If you use it in cases for which it is designed, it works better than if you don't. I'm not trying to be snarky; my point is that it is a niche type -- it is useful anywhere you would like COW semantics, or anywhere you would like to edit very long strings -- and you should not use it outside of its niche. How long "very long" is is best revealed by measuring performance. For most repeated editing cases, I expect it to be far shorter than 2GB -- the limit of text::string.
- But the fact that copies on unencoded_ropes are cheap, and the fact that these copies remain cheap when one copy is modified, makes unencoded_rope very useful for implementing undo/redo in a text editor operating on very large text files.
Exactly. That is only one of its two primary use cases, though, as I indicated above. It also has an even more niche case, in that it is easily used in a threadsafe manner. My point is that it has its uses, even if you don't typically need them yourself.
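For illustration, the kind of snapshot-based undo/redo that cheap copies enable might be sketched like this (generic code; Rope stands for any cheap-to-copy rope type, and no particular Boost.Text member functions are assumed):

    #include <cstddef>
    #include <utility>
    #include <vector>

    template <typename Rope>
    class undo_history
    {
    public:
        explicit undo_history(Rope initial)
        {
            states_.push_back(std::move(initial));
        }

        Rope const & current() const { return states_[index_]; }

        // Record a new state.  Storing a whole rope per edit is viable
        // because rope copies share their segments instead of duplicating them.
        void commit(Rope next)
        {
            states_.erase(states_.begin() + index_ + 1, states_.end()); // drop redo states
            states_.push_back(std::move(next));
            ++index_;
        }

        void undo() { if (index_ > 0) --index_; }
        void redo() { if (index_ + 1 < states_.size()) ++index_; }

    private:
        std::vector<Rope> states_;
        std::size_t index_ = 0;
    };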
Unicode layer: overall, very useful.
One thing that appears to be missing is normalization-preserving append/insert/erase operations. These are available in the text layer, but that means being tied to the specific text classes provided by that layer.
Hm. I had not considered making these available as algorithms, and I generally like the approach. But could you be more specific? In particular, do you mean that insert() would take a container C and do C.insert(), then renormalize? This is the approach used by C++20's erase() and erase_if() free functions. Or did you mean something else?
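To make the question concrete, the shape being discussed might be something like the following (a sketch only; the renormalization step is passed in as a callable so that no particular Boost.Text entry point is assumed, and a real implementation would renormalize only the affected neighborhood rather than the whole string):

    #include <string>

    // Free-function insert in the spirit of C++20's std::erase()/std::erase_if():
    // do the raw mutation, then restore the normalization invariant.
    template <typename String, typename Renormalize>
    void normalized_insert(String & s,
                           typename String::size_type at,
                           String const & to_insert,
                           Renormalize renormalize)
    {
        s.insert(at, to_insert);   // may break normalization at the boundaries
        renormalize(s);            // e.g. re-run the normalization algorithm here
    }

    // Usage sketch:
    //   normalized_insert(str, pos, other,
    //                     [](std::string & s) { /* normalize s in place */ });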
Requiring Unicode text to be in Stream-Safe Format is another time bomb waiting to go off, but it's also a usability issue. The library should provide an algorithm to put Unicode text in Stream-Safe Format, and should automatically apply that algorithm whenever text is normalized. This would make it safe to use Boost.Text on data from an untrusted source so long as the data is normalized first, which you have to do with untrusted data anyway.
This seems like a good idea to me. I went back and forth over whether or not to supply the SSF algorithm, since it's not an official Unicode algorithm, but adding it to text's normalization step would be reason enough to do so.
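For reference, the stream-safe process itself is small. Here is a simplified sketch: it counts combining classes of the code points directly, whereas the full UAX #15 process counts the non-starters in each character's canonical decomposition, and the combining-class lookup is passed in rather than taken from any particular library:

    #include <vector>

    // Break up runs of more than 30 non-starters by inserting U+034F
    // COMBINING GRAPHEME JOINER, per the Stream-Safe Text Format of UAX #15.
    template <typename Ccc>
    std::vector<char32_t> stream_safe(std::vector<char32_t> const & in, Ccc ccc)
    {
        std::vector<char32_t> out;
        out.reserve(in.size());
        int nonstarters = 0;
        for (char32_t cp : in) {
            if (ccc(cp) != 0) {
                if (nonstarters == 30) {
                    out.push_back(U'\x034F');  // CGJ is a starter; it resets the run
                    nonstarters = 0;
                }
                ++nonstarters;
            } else {
                nonstarters = 0;
            }
            out.push_back(cp);
        }
        return out;
    }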
Text layer: overall, I don't like it.
On one hand, there is the gratuitous restriction to FCC. Why can't other normalization forms be supported, given that the Unicode layer supports them?
Here is the philosophy: If you have a template parameter for UTF-encoding and/or normalization form, you have an interoperability problem in your code. Multiple text<...>'s may exist for which there is no convenient or efficient interop story. If you instead convert all of your text to one UTF+normalization that you use throughout your code, you can do all your work in that one scheme and transcode and/or renormalize at the program input/output boundaries. Practically, the world seems to have standardized on UTF-8, NFC (or is doing so rapidly). I'm not thrilled that I have to use FCC instead of NFC, but it's close. The reason for using FCC is that it removes the need to normalize to NFD before doing collation. Collation is expensive enough without adding memory allocation and NFD normalization to it. I'm experimenting with adding extra tailorings to Boost.Text's collation tables to make collation compatible with NFC. I'm not yet sure if that can be made to work.
On the other hand, there is the restriction to the base types from the string layer. Why can't 'text' be a class template taking the underlying container class as an argument? (Including std::string?)
Because, again, I want there to be trivial interop. Having text<text::string> and text<std::string> serves what purpose exactly? That is, I have never seen a compelling use case for needing both at the same time. I'm open to persuasion, of course.
I have a lot of code, much of it in third-party libraries (such as Boost!), that uses std::string as a basic vocabulary type. Converting between Boost.Text strings and std::strings at all API boundaries is not viable. If I can't use the text layer with std::string, then I don't want it.
I understand. See my previous comments about my lack of preciousness about text::string. :)
* Implementation
I didn't look at it in detail.
* Documentation
Overall, very clear and useful.
The warning about undefined behavior if the data is not in Stream-Safe Format should be at the top of every page to which it applies, in big and bold text, on a red background, ideally blinking. Or, the library could take measures to make it reasonably safe for handling untrusted data.
The SSF assumption is explicitly allowed in the Unicode standard, and it's less onerous than not checking array-bounds access in operator[] in one's array-like types. Buffer overflows are really common, and SSF violations are not. That being said, I can add the SSF-conformance algorithm as mentioned above.
.../the_unicode_layer/collation.html: Unicode Technical Note #5 describes an /extension/ to the Unicode collation algorithm which /also/ allows it to handle FCC /in addition to/ NFD. If your collation algorithm doesn't handle NFD correctly, then there is something seriously wrong with it, and I expect that it will also fail on the corner cases of FCC.
It does not fail for FCC, but does fail for NFC. I honestly have not tried it with NFD yet. I will do so.
Describing NFD as "memory-hogging" is FUD, as the actual difference in memory usage between NFD and FCC is trivial for most real-world data. (I suppose Korean Hangul would be a major exception to this.)
Ok. I can certainly remove that from the docs.
.../the_unicode_layer/searching.html: the note at the end of the page is wrong, assuming you implemented the algorithms correctly. The concerns for searching NFD strings are similar to the concerns for searching FCC strings.
In both FCC and NFD:
- There is a distinction between A+grave+acute and A+acute+grave, because they are not canonically equivalent.
- A+grave is a partial grapheme match for A+grave+acute.
- A+acute is not a partial grapheme match for A+grave+acute.
- A+grave is not a partial grapheme match for A+acute+grave.
- A+acute is a partial grapheme match for A+acute+grave.
But:
- A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.
Hm. That note was added because the specific case mentioned fails to work for NFD, but works for FCC and NFC. I think the grapheme matching analysis above addresses something different from what the note is talking about -- the note is concerned with code-point-level searching producing (perhaps) surprising results. The grapheme-based search does not, so which CPs are matched by a particular grapheme does not seem to be relevant. Perhaps I'm missing something?
* Tests
I found it hard to get a good overview of the tests, seeing as a lot of them are dynamically generated test files.
Yes, and it's difficult for me, too. There are a lot of files in there. As Emil suggested on another thread, I'm going to partition the tests into generated and non-. Unfortunately, that will happen after the review.
One thing I noticed is that there are extensive normalization tests for the four basic Unicode normalization forms, but none for FCC.
Yeah, that's because they don't exist, as far as I know. There are no such tests in ICU or on the Unicode website that I can find. That being said, since everything related to collation and the Text layer uses FCC, and it all works, I think that sufficient coverage exists. If it makes you feel any more sure, the algorithm I use for normalization is just an adaptation of ICU's code, and the NFC and FCC code paths differ in only a couple of places, controlled by a bool passed to the normalization algorithm -- it's 99.9% the same code.
In the text_.cpp test file, there are some raw Unicode code points without either the actual character or its name, making it hard to see what's going on.
That was to make it clear to me which code point is being used. :) To each his own, I guess.
In the text_.cpp test file, in the normalization tests, there are some edge cases that do not seem to be tested:
- Erasing a combining character such that another combining character merges into the previous base character. (Example: removing the cedilla in an A+cedilla+ring_above sequence, yielding the precomposed A_with_ring_above character.)
- Inserting a combining character such that this causes a precomposed character to decompose. (Example: inserting a cedilla after A_with_ring_above, yielding an A+cedilla+ring_above sequence.)
- Inserting a combining character that requires existing combining characters to be reordered.
Thanks! I'll add those.
"empty" is misspelled as "emtpy" in text_.cpp.
Ah, thanks.
* Usefulness
- Did you attempt to use the library?
I was not able to get the library to build, so I was not able to test it. But it does look like it should be a good ICU replacement for my purposes, assuming it doesn't have any serious bugs.
Huh. What was the build problem? Was it simply that it takes forever to build all the tests?

Zach