On Fri, Jun 12, 2020 at 12:36 PM Rainer Deyke via Boost wrote:
My name is Rainer Deyke. My experience with the problem domain is mostly as a user of other Unicode libraries, ICU in particular. Here is my formal review of Boost.Text.
- What is your evaluation of the library's:
* Design
String layer: overall, a solution looking for a problem.
The problem this library was designed to solve was the problem that I hate std::string. :) It solves this problem for me quite nicely by being a minimal alternative to same. That being said, as I mentioned earlier in this thread, the string layer (except for unencoded_rope) can all just go away if that is reviewers' preference.
The use of 'int' as the size_type in string/string_view is a time bomb waiting to go off. I don't /typically/ have strings longer than 2GB, but I do use some text-based file formats for which there is no /inherent/ reason why the individual files can't exceed 2GB.
There is also no inherent reason why max_size() is less than SIZE_MAX, and yet it is on every implementation popular today. To be clear, my point is that typical use is going to be well within either of these for mutable string data. Anything atypical can use something else.
(And no, unencoded_rope would not be a better choice. I can't memmap ropes, but I can memmap string_views.)
You can easily memmap string_views 2GB at a time, though. That is a simple workaround for this corner case you have mentioned, but it is truly a corner case compared to the much more common case of using strings for holding contiguous sequences of char. Contiguous sequences of char really should not be anywhere near 2GB, for efficiency reasons.
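For what it's worth, a windowed memmap along those lines might look roughly like this (a POSIX-only sketch with minimal error handling; a real version would also have to cope with records that straddle a window boundary):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <algorithm>
    #include <cstddef>
    #include <string_view>

    // Visit a large file one 2 GB window at a time, exposing each window
    // as a string_view over the mapped bytes.
    template <typename Fn>
    void for_each_window(char const * path, Fn fn)
    {
        int const fd = ::open(path, O_RDONLY);
        if (fd < 0)
            return;
        struct stat st;
        ::fstat(fd, &st);
        off_t const window = off_t(1) << 31;  // 2 GB; a multiple of the page size
        for (off_t offset = 0; offset < st.st_size; offset += window) {
            std::size_t const len =
                std::min<off_t>(window, st.st_size - offset);
            void * p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, offset);
            if (p == MAP_FAILED)
                break;
            fn(std::string_view(static_cast<char const *>(p), len));
            ::munmap(p, len);
        }
        ::close(fd);
    }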
I like the use of negative indices for indexing from the end in Python, but I am ambivalent about using the same feature in C++. None of the other types I regularly use in C++ work like that, and the runtime cost involved is a lot more noticeable in C++.
I don't really get what you mean about the runtime cost. Could you be more explicit?
Also, using the same type for counting-from-end indices and counting-from-beginning indices seems unsafe. A separate type for counting-from-end would be safer and faster, at the cost of being more syntactically heavy.
Interesting. Do you mean something like a strong typedef for "index" and "negative-index", or something else? Some example code might be instructive.
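For illustration, the strong-typedef direction might look something like this (hypothetical names, not anything from the library); overloading on the separate type removes the runtime sign check and makes the intent explicit at the call site:

    #include <cstddef>
    #include <string>

    // A distinct type for "offset measured back from the end".
    struct from_end_t
    {
        std::ptrdiff_t offset;  // distance back from one-past-the-end
    };

    constexpr from_end_t from_end(std::ptrdiff_t offset) { return {offset}; }

    char char_at(std::string const & s, std::ptrdiff_t i)
    {
        return s[static_cast<std::size_t>(i)];
    }

    char char_at(std::string const & s, from_end_t i)
    {
        return s[s.size() - static_cast<std::size_t>(i.offset)];
    }

    // Usage: char_at(s, 0) is the first char; char_at(s, from_end(1)) is the last.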
My first impression of unencoded_rope_view was that it's useless. Then I saw the undo/redo example for segmented_vector, and I have since amended my opinion to "mostly useless". To clarify:
- In the most common use for strings, where the string is created once, used, and discarded without ever being modified, unencoded_rope is worse than a contiguous string: slower lookup, slower traversal, and worse data locality.
Correct. That's not its use case.
- The most common forms of "modification", in my experience, are most efficiently done by reconstructing the entire string, which again makes unencoded_rope useless. (And concatenation, I guess, for which unencoded_rope is actually great.)
Well, that really depends on the length of the string, the number and kind of future edits, etc.
- Even if my modifications are primarily of the type for which unencoded_rope is optimized, there would need to be a fairly high modification-to-read-operation ratio for unencoded_rope to be a net win.
Agreed. If you use it in cases for which it is designed, it works better than if you don't. I'm not trying to be snarky; my point is that it is a niche type -- it is useful anywhere you would like COW semantics, or anywhere you would like to edit very long strings -- and you should not use it outside of its niche. How long "very long" is is best revealed by measuring performance. For most repeated editing cases, I expect it to be far shorter than 2GB -- the limit of text::string.
- But the fact that copies on unencoded_ropes are cheap, and the fact that these copies remain cheap when one copy is modified, makes unencoded_rope very useful for implementing undo/redo in a text editor operating on very large text files.
Exactly. That is only one of its two primary use cases, though, as I indicated above. It also has an even more niche case, in that it is easily used in a threadsafe manner. My point is that it has its uses, even if you don't typically need them yourself.
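For illustration, the kind of snapshot-based undo/redo that cheap copies enable might be sketched like this (generic code; Rope stands for any cheap-to-copy rope type, and no particular Boost.Text member functions are assumed):

    #include <cstddef>
    #include <utility>
    #include <vector>

    template <typename Rope>
    class undo_history
    {
    public:
        explicit undo_history(Rope initial)
        {
            states_.push_back(std::move(initial));
        }

        Rope const & current() const { return states_[index_]; }

        // Record a new state.  Storing a whole rope per edit is viable
        // because rope copies share their segments instead of duplicating them.
        void commit(Rope next)
        {
            states_.erase(states_.begin() + index_ + 1, states_.end()); // drop redo states
            states_.push_back(std::move(next));
            ++index_;
        }

        void undo() { if (index_ > 0) --index_; }
        void redo() { if (index_ + 1 < states_.size()) ++index_; }

    private:
        std::vector<Rope> states_;
        std::size_t index_ = 0;
    };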
Unicode layer: overall, very useful.
One thing that appears to be missing is normalization-preserving append/insert/erase operations. These are available in the text layer, but that means being tied to the specific text classes provided by that layer.
Hm. I had not considered making these available as algorithms, and I generally like the approach. But could you be more specific? In particular, do you mean that insert() would take a container C and do C.insert(), then renormalize? This is the approach used by C++20's erase() and erase_if() free functions. Or did you mean something else?
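To make the question concrete, the shape being discussed might be something like the following (a sketch only; the renormalization step is passed in as a callable so that no particular Boost.Text entry point is assumed, and a real implementation would renormalize only the affected neighborhood rather than the whole string):

    #include <string>

    // Free-function insert in the spirit of C++20's std::erase()/std::erase_if():
    // do the raw mutation, then restore the normalization invariant.
    template <typename String, typename Renormalize>
    void normalized_insert(String & s,
                           typename String::size_type at,
                           String const & to_insert,
                           Renormalize renormalize)
    {
        s.insert(at, to_insert);   // may break normalization at the boundaries
        renormalize(s);            // e.g. re-run the normalization algorithm here
    }

    // Usage sketch:
    //   normalized_insert(str, pos, other,
    //                     [](std::string & s) { /* normalize s in place */ });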
Requiring Unicode text to be in Stream-Safe Format is another time bomb waiting to go off, but it's also a usability issue. The library should provide an algorithm to put Unicode text in Stream-Safe Format, and should automatically apply that algorithm whenever text is normalized. This would make it safe to use Boost.Text on data from an untrusted source so long as the data is normalized first, which you have to do with untrusted data anyway.
This seems like a good idea to me. I went back and forth over whether or not to supply the SSF algorithm, since it's not an official Unicode algorithm, but adding it to text's normalization step would be reason enough to do so.
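For reference, the stream-safe process itself is small. Here is a simplified sketch: it counts combining classes of the code points directly, whereas the full UAX #15 process counts the non-starters in each character's canonical decomposition, and the combining-class lookup is passed in rather than taken from any particular library:

    #include <vector>

    // Break up runs of more than 30 non-starters by inserting U+034F
    // COMBINING GRAPHEME JOINER, per the Stream-Safe Text Format of UAX #15.
    template <typename Ccc>
    std::vector<char32_t> stream_safe(std::vector<char32_t> const & in, Ccc ccc)
    {
        std::vector<char32_t> out;
        out.reserve(in.size());
        int nonstarters = 0;
        for (char32_t cp : in) {
            if (ccc(cp) != 0) {
                if (nonstarters == 30) {
                    out.push_back(U'\x034F');  // CGJ is a starter; it resets the run
                    nonstarters = 0;
                }
                ++nonstarters;
            } else {
                nonstarters = 0;
            }
            out.push_back(cp);
        }
        return out;
    }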
Text layer: overall, I don't like it.
On one hand, there is the gratuitous restriction to FCC. Why can't other normalization forms be supported, given that the Unicode layer supports them?
Here is the philosophy: If you have a template parameter for UTF-encoding and/or normalization form, you have an interoperability problem in your code. Multiple text<...>'s may exist for which there is no convenient or efficient interop story. If you instead convert all of your text to one UTF+normalization that you use throughout your code, you can do all your work in that one scheme and transcode and/or renormalize at the program input/output boundaries. Practically, the world seems to have standardized on UTF-8, NFC (or is doing so rapidly). I'm not thrilled that I have to use FCC instead of NFC, but it's close. The reason for using FCC is that it removes the need to normalize to NFD before doing collation. Collation is expensive enough without adding memory allocation and NFD normalization to it. I'm experimenting with adding extra tailorings to Boost.Text's collation tables to make collation compatible with NFC. I'm not yet sure if that can be made to work.
On the other hand, there is the restriction to the base types from the string layer. Why can't 'text' be a class template taking the underlying container class as an argument? (Including std::string?)
Because, again, I want there to be trivial interop. Having text<text::string> and text<std::string> serves what purpose exactly? That is, I have never seen a compelling use case for needing both at the same time. I'm open to persuasion, of course.
I have a lot of code, much of it in third-party libraries (such as Boost!), that uses std::string as a basic vocabulary type. Converting between Boost.Text strings and std::strings at all API boundaries is not viable. If I can't use the text layer with std::string, then I don't want it.
I understand. See my previous comments about my lack of preciousness about text::string. :)
* Implementation
I didn't look at it in detail.
* Documentation
Overall, very clear and useful.
The warning about undefined behavior if the data is not in Stream-Safe Format should be at the top of every page to which it applies, in big and bold text, on a red background, ideally blinking. Or, the library could take measures to make it reasonably safe for handling untrusted data.
The SSF assumption is explicitly allowed in the Unicode standard, and it's less onerous than not checking array-bounds access in operator[] in one's array-like types. Buffer overflows are really common, and SSF violations are not. That being said, I can add the SSF-conformance algorithm as mentioned above.
.../the_unicode_layer/collation.html: Unicode Technical Note #5 describes an /extension/ to the Unicode collation algorithm which /also/ allows it to handle FCC /in addition to/ NFD. If your collation algorithm doesn't handle NFD correctly, then there is something seriously wrong with it, and I expect that it will also fail on the corner cases of FCC.
It does not fail for FCC, but does fail for NFC. I honestly have not tried it with NFD yet. I will do so.
Describing NFD as "memory-hogging" is FUD, as the actual difference in memory usage between NFD and FCC is trivial for most real-world data. (I suppose Korean Hangul would be a major exception to this.)
Ok. I can certainly remove that from the docs.
.../the_unicode_layer/searching.html: the note at the end of the page is wrong, assuming you implemented the algorithms correctly. The concerns for searching NFD strings are similar to the concerns for searching FCC strings.
In both FCC and NFD:
- There is a distinction between A+grave+acute and A+acute+grave, because they are not canonically equivalent.
- A+grave is a partial grapheme match for A+grave+acute.
- A+acute is not a partial grapheme match for A+grave+acute.
- A+grave is not a partial grapheme match for A+acute+grave.
- A+acute is a partial grapheme match for A+acute+grave.
But:
- A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.
Hm. That note was added because the specific case mentioned fails to work for NFD, but works for FCC and NFC. I think the grapheme matching analysis above addresses something different from what the note is talking about -- the note is concerned with code-point-level searching producing (perhaps) surprising results. The grapheme-based search does not, so which CPs are matched by a particular grapheme does not seem to be relevant. Perhaps I'm missing something?
* Tests
I found it hard to get a good overview of the tests, seeing as a lot of them are dynamically generated test files.
Yes, and it's difficult for me, too. There are a lot of files in there. As Emil suggested on another thread, I'm going to partition the tests into generated and non-. Unfortunately, that will happen after the review.
One thing I noticed is that there are extensive normalization tests for the four basic Unicode normalization forms, but none for FCC.
Yeah, that's because they don't exist, as far as I know. There are no such tests in ICU or on the Unicode website that I can find. That being said, since everything related to collation and the Text layer uses FCC, and it all works, I think that sufficient coverage exists. If it makes you feel any more sure, the algorithm I use for normalization is just an adaptation of ICU's code, and the NFC and FCC code paths differ in only a couple of places, controlled by a bool passed to the normalization algorithm -- it's 99.9% the same code.
In the text_.cpp test file, there are some raw Unicode code points without either the actual character or its name, making it hard to see what's going on.
That was to make it clear to me which code point is being used. :) To each his own, I guess.
In the text_.cpp test file, in the normalization tests, there are some edge cases that do not seem to be tested:
- Erasing a combining character such that another combining character merges into the previous base character. (Example: removing the cedilla in an A+cedilla+ring_above sequence, yielding the precomposed A_with_ring_above character.)
- Inserting a combining character such that this causes a precomposed character to decompose. (Example: inserting a cedilla after A_with_ring_above, yielding an A+cedilla+ring_above sequence.)
- Inserting a combining character that requires existing combining characters to be reordered.
Thanks! I'll add those.
"empty" is misspelled as "emtpy" in text_.cpp.
Ah, thanks.
* Usefulness
- Did you attempt to use the library?
I was not able to get the library to build, so I was not able to test it. But it does look like it should be a good ICU replacement for my purposes, assuming it doesn't have any serious bugs.
Huh. What was the build problem? Was it simply that it takes forever to build all the tests?

Zach