Re: [boost] [review] [text] Text formal review

12 Jun 2020

      My name is Rainer Deyke.  My experience of the problem domain is mostly 
as a user of other unicode libraries, ICU in particular.  Here is my 
formal review of Boost.Text.
...
- What is your evaluation of the library's:
 * Design
String layer: overall, a solution looking for a problem.

The use of 'int' as the size_type in string/string_view is a time
bomb waiting to go off.  I don't /typically/ have strings longer than
2GB, but I do use some text-based file formats for which there is no
/inherent/ reason why the individual files can't exceed 2GB.  (And no, 
unencoded_rope would not be a better choice.  I can't memmap ropes, but 
I can memmap string_views.)

I like the use of negative indices for indexing from the end in
Python, but I am ambivalent about using the same feature in C++.  None
of the other types I regularly use in C++ work like that, and the
runtime cost involved is a lot more noticeable in C++.  Also, using the
same type of counting-from-end indices and counting-from-beginning
indices seems unsafe.  A separate type for counting-from-end would be
safer and faster, at the cost of being more syntactically heavy.

My first impression for unencoded_rope_view was that it's useless.
Then I saw the undo/redo example for segmented_vector, and I have since
amended my opinion to "mostly useless".  To clarify:
   - In the most common use for strings, where the string is created
once, used, and discarded without ever being modified, unencoded_rope is
worse than a contiguous string: slower lookup, slower traversal, and 
worse data locality.
   - The most common forms of "modification", in my experience, are
most efficiently done by reconstructing the entire string, which again
makes unencoded_rope useless.  (And concatenation, I guess, for which 
unencoded_rope actually great.)
   - Even if my modifications are primarily of the type for which
unencoded_rope is optimized, there would need to be a fairly high
modification-to-read-operation ratio for unencoded_string to be a net win.
   - But the fact that copies on unencoded_ropes are cheap, and the
fact that these copies remain cheap when one copy is modified, makes 
unencoded_rope very useful for implementing undo/redo in a text
editor operating on very large text files.

Unicode layer: overall, very useful.

One things that appears to be missing is a normalization-preserving 
append/insert/erase operators.  These are available in the text layer, 
but that means being tied to the specific text classes provided by that 
layer.

Requiring unicode text to be in Stream-Safe Format is another time bomb 
waiting to go off, but it's also usability issue.  The library should 
provide an algorithm to put unicode text in Stream-Safe Format, and 
should automatically apply that algorithm whenever text is normalized. 
This would make it safe to use Boost.Text on data from an untrusted 
source so long as the data is normalized first, which you have to do 
with untrusted data anyway.

Text layer: overall, I don't like it.

On one hand, there is the gratuitous restriction to FCC.  Why can't 
other normalization forms be supported, given that the unicode layer 
supports them?

On the other hand, there is the restriction to the base types from the 
string layer.  Why can't 'text' be a class template taking the 
underlying container class as an argument?  (Including std::string?)

I have a lot of code, much of it in third-party libraries (such as 
Boost!), that uses std::string as a basic vocabulary type.  Converting 
between Boost.Text strings and std::strings at all API boundaries is not 
viable.  If I can't use the text layer with std::string, then I don't 
want it.
...
* Implementation
I didn't look at it in detail.
...
* Documentation
Overall, very clear and useful.

The warning about undefined behavior if the data is not in Stream-Safe 
Format should be at the top of every page to which it applies, in big 
and bold text, on a red background, ideally blinking.  Or, the library 
could take measures to handle it reasonably safe for handling untrusted 
data.

.../the_unicode_layer/collation.html: Unicode Technical Note #5
describes an /extension/ to the unicode collation algorithm which /also/
allows it to handle FCC /in addition to/ NFD.  If your collation
algorithm doesn't handle NFD correctly, then there is something
seriously wrong with it, and I except that it will also fail on the
corner cases of FCC.

Describing NFD as "memory-hogging" is FUD, as the actual difference in 
memory usage between NFD and FCC is trivial for most real-world data. 
(I suppose Korean Hangul would be a major exception to this.)

.../the_unicode_layer/searching.html: the note at the end of the
page is wrong, assuming you implemented the algorithms correctly.  The
concerns for searching NFD strings are similar to the concerns for
searching FCC strings.

In both FCC and NFD:
   - There is a distinction between A+grave+acute and A+acute+grave,
because they are not canonically equivalent.
   - A+grave is a partial grapheme match for A+grave+acute.
   - A+acute is not a partial grapheme match for A+grave+acute.
   - A+grave is not a partial grapheme match for A+acute+grave.
   - A+acute is a partial grapheme match for A+acute+grave.
But:
   - A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.
...
* Tests
I found it hard to get a good overview of the tests, seeing as a lot of
them are dynamically generated test files.

One thing I noticed is that there are extensive normalization tests for 
the four basic unicode normalization forms, but none for FCC.

In the text_.cpp test file, there are some raw unicode code points 
without either the actual character or its name, making it hard to see 
what's going on.

In the text_.cpp test file, in the normalization tests, there are some 
edge cases that do not seem to be tested:
   - Erasing a combining character such that another combining character 
merges into the previous base character.  (Example: removing the cedilla 
in a A+cedilla+ring_above sequence, yielding the precomposed 
A_with_ring_above character.)
   - Inserting a combining character such that this causes a precomposed 
character to decompose.  (Example: inserting a cedilla after 
A_with_ring_above, yielding a A+cedilla+ring_above.)
   - Inserting a combining character that requires existing combining 
characters to be reordered.

"empty" is misspelled as "emtpy" in text_.cpp.
...
* Usefulness - Did you attempt to use the library?
I was not able to get the library to build, so I was not able to test 
it.  But it does look like it should be a good ICU replacement for my 
purposes, assuming it doesn't have any serious bugs.

My vote: ACCEPT the unicode layer, but only the unicode layer.  ABSTAIN 
for the other layers, with the caveat that if they are accepted, the 
issues I have raised should be addressed.

-- 
Rainer Deyke (rainerd@eldwood.com)