On 12.06.20 21:56, Zach Laine via Boost wrote:
(And no, unencoded_rope would not be a better choice. I can't memmap ropes, but I can memmap string_views.)
You can easily memmap string_views 2GB at a time, though. That is a simple workaround for this corner case you have mentioned, but it is truly a corner case compared to the much more common case of using strings for holding contiguous sequences of char. Contiguous sequences of char really should not be anywhere near 2GB, for efficiency reasons.
A memmapped string_view /is/ a contiguous sequence of char. I don't see the difference.
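For what it's worth, the mapping itself is just a pointer plus a length; here is a minimal POSIX sketch (map_window is my own illustrative helper, with error handling reduced to returning an empty view):

#include <sys/mman.h>
#include <sys/types.h>
#include <cstddef>
#include <string_view>

// Map a read-only window of an already-open file (POSIX) and view it as a
// contiguous sequence of char.  The offset handed to mmap must be a multiple
// of the page size.
inline std::string_view map_window(int fd, off_t page_aligned_offset,
                                   std::size_t length)
{
    void * p = ::mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd,
                      page_aligned_offset);
    if (p == MAP_FAILED)
        return {};
    return std::string_view(static_cast<char const *>(p), length);
}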
I like the use of negative indices for indexing from the end in Python, but I am ambivalent about using the same feature in C++. None of the other types I regularly use in C++ work like that, and the runtime cost involved is a lot more noticeable in C++.
I don't really get what you mean about the runtime cost. Could you be more explicit?
Somewhere in the implementation of operator[] and operator(), there has to be a branch on index < 0 (or >= 0) in order for that negative index trick to work, which the compiler can't always optimize away. Branches are often affordable but they're not free.
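To illustrate the branch I mean, here is a hypothetical string-view-like type with Python-style negative indexing (not the library's actual implementation):

#include <cstddef>

// Hypothetical string-view-like type with Python-style negative indexing.
struct neg_index_view
{
    char const * data_;
    std::ptrdiff_t size_;

    char operator[](std::ptrdiff_t i) const noexcept
    {
        // This is the branch in question: it runs on every access unless the
        // compiler can prove the sign of i at the call site.
        if (i < 0)
            i += size_;
        return data_[i];
    }
};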
Also, using the same type of counting-from-end indices and counting-from-beginning indices seems unsafe. A separate type for counting-from-end would be safer and faster, at the cost of being more syntactically heavy.
Interesting. Do you mean something like a strong typedef for "index" and "negative-index", or something else?
Yes, something like that.
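Roughly along these lines (names are mine, purely illustrative): a separate strong type for from-the-end indices turns the sign check into overload resolution, so there is no runtime branch and the two kinds of index cannot be mixed by accident:

#include <cstddef>

// Hypothetical strong type: an index counted from the end of the sequence.
struct from_end
{
    std::size_t offset;  // 1 = last element, 2 = second to last, ...
};

struct indexed_view
{
    char const * data_;
    std::size_t size_;

    // From the beginning: no branch.
    char operator[](std::size_t i) const noexcept { return data_[i]; }

    // From the end: chosen by overload resolution at compile time, so there
    // is still no runtime check on the sign of the index.
    char operator[](from_end i) const noexcept { return data_[size_ - i.offset]; }
};

With that, v[from_end{1}] is the last character, and a plain negative integer no longer has a special meaning that silently kicks in at runtime.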
One thing that appears to be missing is normalization-preserving append/insert/erase operations. These are available in the text layer, but that means being tied to the specific text classes provided by that layer.
Hm. I had not considered making these available as algorithms, and I generally like the approach. But could you be more specific? In particular, do you mean that insert() would take a container C and do C.insert(), then renormalize? This is the approach used by C++20's erase() and erase_if() free functions. Or did you mean something else?
I hadn't thought through the interface in detail. I just saw that this was a feature of the text layer, and thought it would be nice to have in the unicode layer, because I don't want to use the text layer (in its current form).
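As a rough sketch of the shape I have in mind, in the spirit of C++20's std::erase()/std::erase_if() free functions (normalize_nfc is a stand-in for whatever normalization entry point the unicode layer provides, not an actual library function):

#include <cstddef>
#include <string>

// Assumed to exist; stand-in for the unicode layer's normalization call.
void normalize_nfc(std::string & utf8);

// Free-function insert: mutate the container, then restore the
// normalization invariant.
inline void insert_normalized(std::string & s, std::size_t pos,
                              std::string const & utf8_to_insert)
{
    s.insert(pos, utf8_to_insert);
    // A smarter implementation would renormalize only the code points
    // around the insertion point instead of the whole string.
    normalize_nfc(s);
}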
Requiring unicode text to be in Stream-Safe Format is another time bomb waiting to go off, but it's also a usability issue. The library should provide an algorithm to put unicode text in Stream-Safe Format, and should automatically apply that algorithm whenever text is normalized. This would make it safe to use Boost.Text on data from an untrusted source so long as the data is normalized first, which you have to do with untrusted data anyway.
This seems like a good idea to me. I went back and forth over whether or not to supply the SSF algorithm, since it's not an official Unicode algorithm, but adding it to text's normalization step would be reason enough to do so.
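For reference, the conformance step is small. A simplified sketch (canonical_combining_class() is assumed rather than tied to any real API, and UAX #15's rule about counting the non-starters of each code point's canonical decomposition is glossed over):

#include <vector>

// Assumed to exist somewhere: the Unicode canonical combining class of cp.
int canonical_combining_class(char32_t cp);

// Simplified Stream-Safe Format enforcement (UAX #15): never allow more than
// 30 consecutive non-starters; break long runs with U+034F COMBINING GRAPHEME
// JOINER, which is itself a starter and so resets the run.
std::vector<char32_t> to_stream_safe(std::vector<char32_t> const & in)
{
    std::vector<char32_t> out;
    int run = 0;  // consecutive non-starters seen so far
    for (char32_t cp : in) {
        if (canonical_combining_class(cp) != 0) {
            if (run == 30) {
                out.push_back(0x034F);  // insert CGJ before the 31st non-starter
                run = 0;
            }
            ++run;
        } else {
            run = 0;
        }
        out.push_back(cp);
    }
    return out;
}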
Text layer: overall, I don't like it.
On one hand, there is the gratuitous restriction to FCC. Why can't other normalization forms be supported, given that the unicode layer supports them?
Here is the philosophy: If you have a template parameter for UTF-encoding and/or normalization form, you have an interoperability problem in your code. Multiple text<...>'s may exist for which there is no convenient or efficient interop story. If you instead convert all of your text to one UTF+normalization that you use throughout your code, you can do all your work in that one scheme and transcode and/or renormalize at the program input/output boundaries.
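Restated as code, with hypothetical helper names, that philosophy amounts to this: one scheme (say UTF-8 + FCC) everywhere inside the program, with conversion only where data crosses the program boundary.

#include <string>

// Hypothetical stand-ins for the unicode layer's transcoding and
// normalization entry points.
std::string transcode_to_utf8(std::u16string const & utf16);
std::string normalize_fcc(std::string utf8);

// Conversion happens once, at the input boundary; everything downstream
// assumes UTF-8 + FCC.
std::string ingest(std::u16string const & from_external_api)
{
    return normalize_fcc(transcode_to_utf8(from_external_api));
}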
Having to renormalize at API boundaries can be prohibitively expensive.
Because, again, I want there to be trivial interop. Having text<text::string> and text<std::string> serves what purpose exactly? That is, I have never seen a compelling use case for needing both at the same time. I'm open to persuasion, of course.
The advantage of text<std::string> is API interop with functions that accept std::string arguments. I'm not sure what the advantage of text<boost::text::string> is. But if we accept that boost::text::rope (which would just be text<boost::text::unencoded_rope> in my scheme) is useful, then it logically follows that text
The SSF assumption is explicitly allowed in the Unicode standard, and it's less onerous than not checking array-bounds access in operator[] in one's array-like types. Buffer overflows are really common, and SSF violations are not. That being said, I can add the SSF-conformance algorithm as mentioned above.
Unintentional SSF violations are rare. Intentional SSF violations can be used as an attack vector, if "undefined behavior" translates to "memory error".
.../the_unicode_layer/searching.html: the note at the end of the page is wrong, assuming you implemented the algorithms correctly. The concerns for searching NFD strings are similar to the concerns for searching FCC strings.
In both FCC and NFD:
- There is a distinction between A+grave+acute and A+acute+grave, because they are not canonically equivalent.
- A+grave is a partial grapheme match for A+grave+acute.
- A+acute is not a partial grapheme match for A+grave+acute.
- A+grave is not a partial grapheme match for A+acute+grave.
- A+acute is a partial grapheme match for A+acute+grave.

But:
- A is a partial grapheme match for A+grave+acute in NFD, but not in FCC (see the code point sketch below).
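To make that last bullet concrete, as raw code point sequences (U+0300 is COMBINING GRAVE ACCENT, U+0301 is COMBINING ACUTE ACCENT):

// A + grave + acute in the two normalization forms.
constexpr char32_t a_grave_acute_nfd[] = {0x0041, 0x0300, 0x0301}; // A, grave, acute
constexpr char32_t a_grave_acute_fcc[] = {0x00C0, 0x0301};         // A-with-grave, acute

// A code-point-level search for U+0041 ("A") matches part of the grapheme in
// the NFD form, but not in the FCC form, where the base letter has been
// composed into U+00C0.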
Hm. That note was added because the specific case mentioned fails to work for NFD, but works for FCC and NFC. I think the grapheme matching analysis above addresses something different from what the note is talking about -- the note is concerned with code-point-level searches producing (perhaps) surprising results. The grapheme-based search does not, so which CPs are matched by a particular grapheme does not seem to be relevant. Perhaps I'm missing something?
I am talking about code point level matches here. ("Partial grapheme match" means "matches some code points within a grapheme, but not the whole grapheme".)
I was not able to get the library to build, so I was not able to test it. But it does look like it should be a good ICU replacement for my purposes, assuming it doesn't have any serious bugs.
Huh. What was the build problem? Was it simply that it takes forever to build all the tests?
The configuration step failed because it tries to compile and run test programs in order to gather information about my environment, and I was running in a cross-compile context, which prevents CMake from running the programs that it compiles. Probably not too hard to work around on my part by simply not using a cross-compile context.

-- 
Rainer Deyke (rainerd@eldwood.com)