My name is Rainer Deyke. My experience of the problem domain is mostly as a user of other unicode libraries, ICU in particular. Here is my formal review of Boost.Text.
- What is your evaluation of the library's: * Design
String layer: overall, a solution looking for a problem. The use of 'int' as the size_type in string/string_view is a time bomb waiting to go off. I don't /typically/ have strings longer than 2GB, but I do use some text-based file formats for which there is no /inherent/ reason why the individual files can't exceed 2GB. (And no, unencoded_rope would not be a better choice. I can't memmap ropes, but I can memmap string_views.) I like the use of negative indices for indexing from the end in Python, but I am ambivalent about using the same feature in C++. None of the other types I regularly use in C++ work like that, and the runtime cost involved is a lot more noticeable in C++. Also, using the same type of counting-from-end indices and counting-from-beginning indices seems unsafe. A separate type for counting-from-end would be safer and faster, at the cost of being more syntactically heavy. My first impression for unencoded_rope_view was that it's useless. Then I saw the undo/redo example for segmented_vector, and I have since amended my opinion to "mostly useless". To clarify: - In the most common use for strings, where the string is created once, used, and discarded without ever being modified, unencoded_rope is worse than a contiguous string: slower lookup, slower traversal, and worse data locality. - The most common forms of "modification", in my experience, are most efficiently done by reconstructing the entire string, which again makes unencoded_rope useless. (And concatenation, I guess, for which unencoded_rope actually great.) - Even if my modifications are primarily of the type for which unencoded_rope is optimized, there would need to be a fairly high modification-to-read-operation ratio for unencoded_string to be a net win. - But the fact that copies on unencoded_ropes are cheap, and the fact that these copies remain cheap when one copy is modified, makes unencoded_rope very useful for implementing undo/redo in a text editor operating on very large text files. Unicode layer: overall, very useful. One things that appears to be missing is a normalization-preserving append/insert/erase operators. These are available in the text layer, but that means being tied to the specific text classes provided by that layer. Requiring unicode text to be in Stream-Safe Format is another time bomb waiting to go off, but it's also usability issue. The library should provide an algorithm to put unicode text in Stream-Safe Format, and should automatically apply that algorithm whenever text is normalized. This would make it safe to use Boost.Text on data from an untrusted source so long as the data is normalized first, which you have to do with untrusted data anyway. Text layer: overall, I don't like it. On one hand, there is the gratuitous restriction to FCC. Why can't other normalization forms be supported, given that the unicode layer supports them? On the other hand, there is the restriction to the base types from the string layer. Why can't 'text' be a class template taking the underlying container class as an argument? (Including std::string?) I have a lot of code, much of it in third-party libraries (such as Boost!), that uses std::string as a basic vocabulary type. Converting between Boost.Text strings and std::strings at all API boundaries is not viable. If I can't use the text layer with std::string, then I don't want it.
* Implementation
I didn't look at it in detail.
* Documentation
Overall, very clear and useful. The warning about undefined behavior if the data is not in Stream-Safe Format should be at the top of every page to which it applies, in big and bold text, on a red background, ideally blinking. Or, the library could take measures to handle it reasonably safe for handling untrusted data. .../the_unicode_layer/collation.html: Unicode Technical Note #5 describes an /extension/ to the unicode collation algorithm which /also/ allows it to handle FCC /in addition to/ NFD. If your collation algorithm doesn't handle NFD correctly, then there is something seriously wrong with it, and I except that it will also fail on the corner cases of FCC. Describing NFD as "memory-hogging" is FUD, as the actual difference in memory usage between NFD and FCC is trivial for most real-world data. (I suppose Korean Hangul would be a major exception to this.) .../the_unicode_layer/searching.html: the note at the end of the page is wrong, assuming you implemented the algorithms correctly. The concerns for searching NFD strings are similar to the concerns for searching FCC strings. In both FCC and NFD: - There is a distinction between A+grave+acute and A+acute+grave, because they are not canonically equivalent. - A+grave is a partial grapheme match for A+grave+acute. - A+acute is not a partial grapheme match for A+grave+acute. - A+grave is not a partial grapheme match for A+acute+grave. - A+acute is a partial grapheme match for A+acute+grave. But: - A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.
* Tests
I found it hard to get a good overview of the tests, seeing as a lot of them are dynamically generated test files. One thing I noticed is that there are extensive normalization tests for the four basic unicode normalization forms, but none for FCC. In the text_.cpp test file, there are some raw unicode code points without either the actual character or its name, making it hard to see what's going on. In the text_.cpp test file, in the normalization tests, there are some edge cases that do not seem to be tested: - Erasing a combining character such that another combining character merges into the previous base character. (Example: removing the cedilla in a A+cedilla+ring_above sequence, yielding the precomposed A_with_ring_above character.) - Inserting a combining character such that this causes a precomposed character to decompose. (Example: inserting a cedilla after A_with_ring_above, yielding a A+cedilla+ring_above.) - Inserting a combining character that requires existing combining characters to be reordered. "empty" is misspelled as "emtpy" in text_.cpp.
* Usefulness - Did you attempt to use the library?
I was not able to get the library to build, so I was not able to test it. But it does look like it should be a good ICU replacement for my purposes, assuming it doesn't have any serious bugs. My vote: ACCEPT the unicode layer, but only the unicode layer. ABSTAIN for the other layers, with the caveat that if they are accepted, the issues I have raised should be addressed. -- Rainer Deyke (rainerd@eldwood.com)