[review] [text] Text formal review
Dear Boost community,

The formal review of Zach Laine's Unicode library, Text, begins today (June 11th) and ends on June 20th.

The library provides three layers:

- The string layer, a set of types that constitute "a better std::string"
- The Unicode layer, consisting of the Unicode algorithms and data
- The text layer, a set of types like the string layer types, but providing transparent Unicode support

Code: https://github.com/tzlaine/text
Docs: https://tzlaine.github.io/text

The master branch of the repository will be frozen for the duration of the review. Text requires C++14 or higher, and supports GCC 5+, Clang 3.6+, and MSVC 2017.

Zach initially presented the library at C++Now in 2018. He also intends to standardize a library solution for Unicode based on this work.

Talk: https://www.youtube.com/watch?v=944GjKxwMBo
Slides: https://tinyurl.com/boost-text-slides

We encourage your participation in this review. At a minimum, kindly state:

- Whether you believe the library should be accepted into Boost
  * Any conditions for acceptance
- Your name and knowledge of the problem domain

You are strongly encouraged to also provide additional information:

- What is your evaluation of the library's:
  * Design
  * Implementation
  * Documentation
  * Tests
  * Usefulness
- Did you attempt to use the library? If so:
  * Which compiler(s)
  * What was the experience? Any problems?
- How much effort did you put into your evaluation?

Even if you are not submitting a formal review, your feedback on the library is welcome.

Glen
On 11/06/2020 19:37, Glen Fernandes via Boost wrote:
The library provides three layers:
- The string layer, a set of types that constitute "a better std::string"
- The Unicode layer, consisting of the Unicode algorithms and data
- The text layer, a set of types like the string layer types, but providing transparent Unicode support
Firstly, I'd like to say that proposing a new string implementation is probably one of the most masochistic things that you can do in C++. Even more than proposing a result<T> type.
On Fri, Jun 12, 2020 at 6:40 AM Niall Douglas via Boost wrote:
Zach, could you take this opportunity to compare your choice of string design with the string designs implemented by each of the following libraries please?
- LLVM strings, string refs, twines etc.
- Abseil's strings, string pieces.
- Folly's strings, string pieces and ranges.
- CopperSpice's CsString.
There's also boost::json::string, which resembles std::string except that:

* It is not a class template.
* Most of the function definitions are in a library TU rather than a header.
* It uses boost::json::storage_ptr instead of Allocator or std::pmr::memory_resource*.
* Some function signatures have been collapsed or adjusted for simplicity, in ways that would be desirable for std::string except that they would break ABI.

json::string:
https://github.com/CPPAlliance/json/blob/f47fbeb1a54b832221a35c17912febb7a2c...

json::storage_ptr:
http://vinniefalco.github.io/doc/json/json/usage/allocators.html
http://vinniefalco.github.io/doc/json/json/ref/boost__json__storage_ptr.html

Regards
On Fri, Jun 12, 2020 at 8:40 AM Niall Douglas via Boost wrote:
On 11/06/2020 19:37, Glen Fernandes via Boost wrote:
The library provides three layers:
- The string layer, a set of types that constitute "a better std::string"
- The Unicode layer, consisting of the Unicode algorithms and data
- The text layer, a set of types like the string layer types, but providing transparent Unicode support
Firstly, I'd like to say that proposing a new string implementation is probably one of the most masochistic things that you can do in C++. Even more than proposing a result<T> type. So, I take a bow to you Mr. Laine, and I salute your bravery.

I'll put aside the Unicode and Text layers for now, and just consider the String layer. I have to admit, I'm not keen on the string layer. Firstly, I hate the naming. Everything there ought to get more descriptive naming. But more importantly, in how the design of the string implementation has been broken up, there's just something about it which doesn't sit right with me. You seem to me to have prioritised certain use cases over others in ways which would not be my own choices, i.e. I don't think the balance of tradeoffs is right in there.

For example, I wouldn't have chosen an atomically reference counted rope design the way you have at all: I'd have gone for a fusion of your static string builder with constexpr-possible (i.e. non-atomic) reference counted fragments, using expression templates to lazily canonicalise the string depending on sink (e.g. if the sink is cout, use a gather i/o sequence instead of creating a new string). That sort of thing.
Zach, could you take this opportunity to compare your choice of string design with the string designs implemented by each of the following libraries please?
- LLVM strings, string refs, twines etc.
- Abseil's strings, string pieces.
- Folly's strings, string pieces and ranges.
- CopperSpice's CsString.
I feel like I am forgetting at least another two. But, point is, I'd like to know why you chose a different design to each of the above, in those situations where you did so.
No, because I don't honestly care about the string layer that much any more. It was originally a major reason -- the reason, really -- for the library at the outset. Now it's mostly cruft. If people object to it enough (and it seems they will), I can certainly remove it entirely, except for unencoded_rope, which is needed in rope. Replacing boost::text::string with std::string within boost::text::text is straightforward, and will have no visible effect on uses of text::text, except for extract() and replace(). The only reason I left the string bits of the library in place when I changed the focus to be Unicode-centric is that it was less work to do so.
I'll nail my own colours to the mast on this topic: I've thought about this long and hard over many, many years, and I've personally arrived at the opinion that C needs to gain an integral string object built into the language, which builds on top of an integral variably sized array object (NOT current C VLAs). Said same built-in string object would also be available to C++, by definition.
I have arrived at this opinion because I don't think that ANY library solution can have the right balance of tradeoffs between all the competing factors. I think that only a built-in object to the language itself can deliver the perfect string object, because only the compiler can deliver a balance of optimisability with developer convenience.
I won't go into any more detail, as this is a review of the Text C++ library. And I know I've already discussed my opinion on SG16 where you Zach were present, so you've heard all my thoughts on this already. However, if you were feeling keen, I'd like to know if you could think of any areas where language changes would aid implementing better strings in C++?
I think the big thing for me would be to have language-level support for discriminating between char * strings and string literals. String literals are special in certain ways that are useful to take advantage of: 1) there is no need to copy them, since they're in ROM; 2) they are encoded by the compiler into the execution encoding used in phase 5 of translation. This second one is pretty important to detect in some cases, like making a printf-like upgrade to std::format() "just work". Zach
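For readers following Zach's point, here is a minimal sketch of the closest approximation C++ offers today (the function name is made up for illustration): a reference-to-array parameter captures a literal's length at compile time, but it cannot prove the argument is actually a literal in ROM -- which is exactly the gap a language feature would close.

    #include <cstddef>

    // Hypothetical illustration, not part of any proposal or library.
    template <std::size_t N>
    constexpr std::size_t literal_length(const char (&)[N])
    {
        return N - 1; // N includes the terminating '\0'
    }

    static_assert(literal_length("hello") == 5,
                  "length captured at compile time");

    // ...but any named char array binds just as well, so this technique
    // cannot distinguish true string literals:
    constexpr char buffer[6] = {'w', 'o', 'r', 'l', 'd', '\0'};
    static_assert(literal_length(buffer) == 5, "not actually a literal");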
On Fri, Jun 12, 2020 at 6:40 AM Niall Douglas via Boost <boost@lists.boost.org> wrote:
Said same built-in string object would also be available to C++, by definition.
C++ already has such an object: it's called std::string, and the implementation can do what it wants with it. To use it in C, all you need to do is compile your C program with a C++ compiler.
My name is Rainer Deyke. My experience of the problem domain is mostly as a user of other unicode libraries, ICU in particular. Here is my formal review of Boost.Text.
- What is your evaluation of the library's: * Design
String layer: overall, a solution looking for a problem.

The use of 'int' as the size_type in string/string_view is a time bomb waiting to go off. I don't /typically/ have strings longer than 2GB, but I do use some text-based file formats for which there is no /inherent/ reason why the individual files can't exceed 2GB. (And no, unencoded_rope would not be a better choice. I can't memmap ropes, but I can memmap string_views.)

I like the use of negative indices for indexing from the end in Python, but I am ambivalent about using the same feature in C++. None of the other types I regularly use in C++ work like that, and the runtime cost involved is a lot more noticeable in C++. Also, using the same type for counting-from-end indices and counting-from-beginning indices seems unsafe. A separate type for counting-from-end would be safer and faster, at the cost of being more syntactically heavy.

My first impression of unencoded_rope_view was that it's useless. Then I saw the undo/redo example for segmented_vector, and I have since amended my opinion to "mostly useless". To clarify:

- In the most common use for strings, where the string is created once, used, and discarded without ever being modified, unencoded_rope is worse than a contiguous string: slower lookup, slower traversal, and worse data locality.
- The most common forms of "modification", in my experience, are most efficiently done by reconstructing the entire string, which again makes unencoded_rope useless. (And concatenation, I guess, for which unencoded_rope is actually great.)
- Even if my modifications are primarily of the type for which unencoded_rope is optimized, there would need to be a fairly high modification-to-read-operation ratio for unencoded_rope to be a net win.
- But the fact that copies of unencoded_ropes are cheap, and the fact that these copies remain cheap when one copy is modified, makes unencoded_rope very useful for implementing undo/redo in a text editor operating on very large text files.

Unicode layer: overall, very useful.

One thing that appears to be missing is normalization-preserving append/insert/erase operations. These are available in the text layer, but that means being tied to the specific text classes provided by that layer.

Requiring Unicode text to be in Stream-Safe Format is another time bomb waiting to go off, but it's also a usability issue. The library should provide an algorithm to put Unicode text in Stream-Safe Format, and should automatically apply that algorithm whenever text is normalized. This would make it safe to use Boost.Text on data from an untrusted source so long as the data is normalized first, which you have to do with untrusted data anyway.

Text layer: overall, I don't like it.

On one hand, there is the gratuitous restriction to FCC. Why can't other normalization forms be supported, given that the Unicode layer supports them? On the other hand, there is the restriction to the base types from the string layer. Why can't 'text' be a class template taking the underlying container class as an argument (including std::string)? I have a lot of code, much of it in third-party libraries (such as Boost!), that uses std::string as a basic vocabulary type. Converting between Boost.Text strings and std::strings at all API boundaries is not viable. If I can't use the text layer with std::string, then I don't want it.
* Implementation
I didn't look at it in detail.
* Documentation
Overall, very clear and useful. The warning about undefined behavior if the data is not in Stream-Safe Format should be at the top of every page to which it applies, in big and bold text, on a red background, ideally blinking. Or, the library could take measures to handle untrusted data reasonably safely.

.../the_unicode_layer/collation.html: Unicode Technical Note #5 describes an /extension/ to the Unicode collation algorithm which /also/ allows it to handle FCC /in addition to/ NFD. If your collation algorithm doesn't handle NFD correctly, then there is something seriously wrong with it, and I expect that it will also fail on the corner cases of FCC. Describing NFD as "memory-hogging" is FUD, as the actual difference in memory usage between NFD and FCC is trivial for most real-world data. (I suppose Korean Hangul would be a major exception to this.)

.../the_unicode_layer/searching.html: the note at the end of the page is wrong, assuming you implemented the algorithms correctly. The concerns for searching NFD strings are similar to the concerns for searching FCC strings. In both FCC and NFD:

- There is a distinction between A+grave+acute and A+acute+grave, because they are not canonically equivalent.
- A+grave is a partial grapheme match for A+grave+acute.
- A+acute is not a partial grapheme match for A+grave+acute.
- A+grave is not a partial grapheme match for A+acute+grave.
- A+acute is a partial grapheme match for A+acute+grave.

But:

- A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.
* Tests
I found it hard to get a good overview of the tests, seeing as a lot of them are dynamically generated test files. One thing I noticed is that there are extensive normalization tests for the four basic Unicode normalization forms, but none for FCC.

In the text_.cpp test file, there are some raw Unicode code points without either the actual character or its name, making it hard to see what's going on.

In the text_.cpp test file, in the normalization tests, there are some edge cases that do not seem to be tested:

- Erasing a combining character such that another combining character merges into the previous base character. (Example: removing the cedilla in an A+cedilla+ring_above sequence, yielding the precomposed A_with_ring_above character.)
- Inserting a combining character such that this causes a precomposed character to decompose. (Example: inserting a cedilla after A_with_ring_above, yielding an A+cedilla+ring_above.)
- Inserting a combining character that requires existing combining characters to be reordered.

"empty" is misspelled as "emtpy" in text_.cpp.
* Usefulness - Did you attempt to use the library?
I was not able to get the library to build, so I was not able to test it. But it does look like it should be a good ICU replacement for my purposes, assuming it doesn't have any serious bugs.

My vote: ACCEPT the Unicode layer, but only the Unicode layer. ABSTAIN for the other layers, with the caveat that if they are accepted, the issues I have raised should be addressed.

-- Rainer Deyke (rainerd@eldwood.com)
On Fri, Jun 12, 2020 at 12:36 PM Rainer Deyke via Boost wrote:
My name is Rainer Deyke. My experience of the problem domain is mostly as a user of other unicode libraries, ICU in particular. Here is my formal review of Boost.Text.
- What is your evaluation of the library's: * Design
String layer: overall, a solution looking for a problem.
The problem this library was designed to solve was the problem that I hate std::string. :) It solves this problem for me quite nicely by being a minimal alternative to same. That being said, as I mentioned earlier in this thread, the string layer (except for unencoded_rope) can all just go away if that is reviewers' preference.
The use of 'int' as the size_type in string/string_view is a time bomb waiting to go off. I don't /typically/ have strings longer than 2GB, but I do use some text-based file formats for which there is no /inherent/ reason why the individual files can't exceed 2GB.
There is also no inherent reason why max_size() is less than SIZE_MAX, and yet it is on every implementation popular today. To be clear, my point is that typical use is going to be well within either of these limits for mutable string data. Anything atypical can use something else.
(And no, unencoded_rope would not be a better choice. I can't memmap ropes, but I can memmap string_views.)
You can easily memmap string_views 2GB at a time, though. That is a simple workaround for this corner case you have mentioned, but it is truly a corner case compared to the much more common case of using strings for holding contiguous sequences of char. Contiguous sequences of char really should not be anywhere near 2GB, for efficiency reasons.
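A minimal POSIX sketch of the windowing workaround Zach describes, mapping a large file once and walking it in views small enough for an int-sized size_type; all names are illustrative and error handling is omitted:

    #include <algorithm>
    #include <climits>
    #include <cstddef>
    #include <string_view>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    template <class F>
    void for_each_window(char const * path, F process)
    {
        int fd = ::open(path, O_RDONLY);
        struct stat st {};
        ::fstat(fd, &st);
        auto const * base = static_cast<char const *>(
            ::mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
        auto const total = static_cast<std::size_t>(st.st_size);
        for (std::size_t off = 0; off < total; off += INT_MAX) {
            std::size_t const len = std::min<std::size_t>(INT_MAX, total - off);
            // Each window is below 2GB, so it fits any int-sized view.
            process(std::string_view(base + off, len));
        }
        ::munmap(const_cast<char *>(base), st.st_size);
        ::close(fd);
    }

Note that these windows split at arbitrary byte offsets; a real version would back each boundary up to a safe split point (e.g. a code point or record boundary).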
I like the use of negative indices for indexing from the end in Python, but I am ambivalent about using the same feature in C++. None of the other types I regularly use in C++ work like that, and the runtime cost involved is a lot more noticeable in C++.
I don't really get what you mean about the runtime cost. Could you be more explicit?
Also, using the same type for counting-from-end indices and counting-from-beginning indices seems unsafe. A separate type for counting-from-end would be safer and faster, at the cost of being more syntactically heavy.
Interesting. Do you mean something like a strong typedef for "index" and "negative-index", or something else? Some example code might be instructive.
My first impression of unencoded_rope_view was that it's useless. Then I saw the undo/redo example for segmented_vector, and I have since amended my opinion to "mostly useless". To clarify:

- In the most common use for strings, where the string is created once, used, and discarded without ever being modified, unencoded_rope is worse than a contiguous string: slower lookup, slower traversal, and worse data locality.
Correct. That's not its use case.
- The most common forms of "modification", in my experience, are most efficiently done by reconstructing the entire string, which again makes unencoded_rope useless. (And concatenation, I guess, for which unencoded_rope is actually great.)
Well, that really depends on the length of the string, the number and kind of future edits, etc.
- Even if my modifications are primarily of the type for which unencoded_rope is optimized, there would need to be a fairly high modification-to-read-operation ratio for unencoded_rope to be a net win.
Agreed. If you use it in cases for which it is designed, it works better than if you don't. I'm not trying to be snarky; my point is that it is a niche type -- it is useful anywhere you would like COW semantics, or anywhere you would like to edit very long strings -- and you should not use it outside of its niche. How long "very long" is, is best revealed by measuring performance. For most repeated editing cases, I expect it to be far shorter than 2GB -- the limit of text::string.
- But the fact that copies on unencoded_ropes are cheap, and the fact that these copies remain cheap when one copy is modified, makes unencoded_rope very useful for implementing undo/redo in a text editor operating on very large text files.
Exactly. That is only one of its two primary use cases, though, as I indicated above. It also has an even more niche case, in that it is easily used in a threadsafe manner. My point is that it has its uses, even if you don't typically need them yourself.
Unicode layer: overall, very useful.
One thing that appears to be missing is normalization-preserving append/insert/erase operations. These are available in the text layer, but that means being tied to the specific text classes provided by that layer.
Hm. I had not considered making these available as algorithms, and I generally like the approach. But could you be more specific? In particular, do you mean that insert() would take a container C and do C.insert(), then renormalize? This is the approach used by C++20's erase() and erase_if() free functions. Or did you mean something else?
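For concreteness, here is a sketch of the free-function shape Zach is describing, with a stub standing in for the real renormalization work (nothing here is actual Boost.Text API):

    #include <iterator>
    #include <string>

    // Stub: a real version would renormalize the code points around the edit.
    template <class Container, class Iter>
    void renormalize_around(Container &, Iter) {}

    template <class Container, class Range>
    void insert_normalized(Container & c,
                           typename Container::iterator at,
                           Range const & r)
    {
        // Plain container insert, then a local repair -- the same shape as
        // C++20's std::erase()/std::erase_if() free functions.
        auto pos = c.insert(at, std::begin(r), std::end(r));
        renormalize_around(c, pos);
    }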
Requiring Unicode text to be in Stream-Safe Format is another time bomb waiting to go off, but it's also a usability issue. The library should provide an algorithm to put Unicode text in Stream-Safe Format, and should automatically apply that algorithm whenever text is normalized. This would make it safe to use Boost.Text on data from an untrusted source so long as the data is normalized first, which you have to do with untrusted data anyway.
This seems like a good idea to me. I went back and forth over whether or not to supply the SSF algorithm, since it's not an official Unicode algorithm, but adding it to text's normalization step would be reason enough to do so.
Text layer: overall, I don't like it.
On one hand, there is the gratuitous restriction to FCC. Why can't other normalization forms be supported, given that the unicode layer supports them?
Here is the philosophy: If you have a template parameter for UTF encoding and/or normalization form, you have an interoperability problem in your code. Multiple text<...>'s may exist for which there is no convenient or efficient interop story. If you instead convert all of your text to one UTF + normalization that you use throughout your code, you can do all your work in that one scheme and transcode and/or renormalize at the program input/output boundaries.

Practically, the world seems to have standardized on UTF-8 and NFC (or is doing so rapidly). I'm not thrilled that I have to use FCC instead of NFC, but it's close. The reason for using FCC is that it removes the need to normalize to NFD before doing collation. Collation is expensive enough without adding memory allocation and NFD normalization to it.

I'm experimenting with adding extra tailorings to Boost.Text's collation tables to make collation compatible with NFC. I'm not yet sure if that can be made to work.
On the other hand, there is the restriction to the base types from the string layer. Why can't 'text' be a class template taking the underlying container class as an argument? (Including std::string?)
Because, again, I want there to be trivial interop. Having text<text::string> and text<std::string> serves what purpose exactly? That is, I have never seen a compelling use case for needing both at the same time. I'm open to persuasion, of course.
I have a lot of code, much of it in third-party libraries (such as Boost!), that uses std::string as a basic vocabulary type. Converting between Boost.Text strings and std::strings at all API boundaries is not viable. If I can't use the text layer with std::string, then I don't want it.
I understand. See my previous comments about my lack of preciousness about text::string. :)
* Implementation
I didn't look at it in detail.
* Documentation
Overall, very clear and useful.
The warning about undefined behavior if the data is not in Stream-Safe Format should be at the top of every page to which it applies, in big and bold text, on a red background, ideally blinking. Or, the library could take measures to handle untrusted data reasonably safely.
The SSF assumption is explicitly allowed in the Unicode standard, and it's less onerous than not checking array-bounds access in operator[] in one's array-like types. Buffer overflows are really common, and SSF violations are not. That being said, I can add the SSF-conformance algorithm as mentioned above.
.../the_unicode_layer/collation.html: Unicode Technical Note #5 describes an /extension/ to the Unicode collation algorithm which /also/ allows it to handle FCC /in addition to/ NFD. If your collation algorithm doesn't handle NFD correctly, then there is something seriously wrong with it, and I expect that it will also fail on the corner cases of FCC.
It does not fail for FCC, but does fail for NFC. I honestly have not tried it with NFD yet. I will do so.
Describing NFD as "memory-hogging" is FUD, as the actual difference in memory usage between NFD and FCC is trivial for most real-world data. (I suppose Korean Hangul would be a major exception to this.)
Ok. I can certainly remove that from the docs.
.../the_unicode_layer/searching.html: the note at the end of the page is wrong, assuming you implemented the algorithms correctly. The concerns for searching NFD strings are similar to the concerns for searching FCC strings.
In both FCC and NFD:
- There is a distinction between A+grave+acute and A+acute+grave, because they are not canonically equivalent.
- A+grave is a partial grapheme match for A+grave+acute.
- A+acute is not a partial grapheme match for A+grave+acute.
- A+grave is not a partial grapheme match for A+acute+grave.
- A+acute is a partial grapheme match for A+acute+grave.
But:
- A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.
Hm. That note was added because the specific case mentioned fails to work for NFD, but works for FCC and NFC. I think the grapheme matching analysis above addresses something different from what the note is talking about -- the note is concerned with code-point-level searches producing (perhaps) surprising results. The grapheme-based search does not, so which CPs are matched by a particular grapheme does not seem to be relevant. Perhaps I'm missing something?
* Tests
I found it hard to get a good overview of the tests, seeing as a lot of them are dynamically generated test files.
Yes, and it's difficult for me, too. There are a lot of files in there. As Emil suggested on another thread, I'm going to partition the tests into generated and non-. Unfortunately, that will happen after the review.
One thing I noticed is that there are extensive normalization tests for the four basic unicode normalization forms, but none for FCC.
Yeah, that's because they don't exist, as far as I know. There are no such tests in ICU or on the Unicode website that I can find. That being said, since everything related to collation and the Text layer uses FCC, and it all works, I think that sufficient coverage exists. If it makes you feel any more sure, the algorithm I use for normalization is just an adaptation of ICU's code, and the NFC and FCC code paths differ in only a couple of places, controlled by a bool passed to the normalization algorithm -- it's 99.9% the same code.
In the text_.cpp test file, there are some raw unicode code points without either the actual character or its name, making it hard to see what's going on.
That was to make it clear to me which code point is being used. :) To each his own, I guess.
In the text_.cpp test file, in the normalization tests, there are some edge cases that do not seem to be tested: - Erasing a combining character such that another combining character merges into the previous base character. (Example: removing the cedilla in a A+cedilla+ring_above sequence, yielding the precomposed A_with_ring_above character.) - Inserting a combining character such that this causes a precomposed character to decompose. (Example: inserting a cedilla after A_with_ring_above, yielding a A+cedilla+ring_above.) - Inserting a combining character that requires existing combining characters to be reordered.
Thanks! I'll add those.
"empty" is misspelled as "emtpy" in text_.cpp.
Ah, thanks.
* Usefulness - Did you attempt to use the library?
I was not able to get the library to build, so I was not able to test it. But it does look like it should be a good ICU replacement for my purposes, assuming it doesn't have any serious bugs.
Huh. What was the build problem? Was it simply that it takes forever to build all the tests? Zach
On 12.06.20 21:56, Zach Laine via Boost wrote:
(And no, unencoded_rope would not be a better choice. I can't memmap ropes, but I can memmap string_views.)
You can easily memmap string_views 2GB at a time, though. That is a simple workaround for this corner case you have mentioned, but it is truly a corner case compared to the much more common case of using strings for holding contiguous sequences of char. Contiguous sequences of char really should not be anywhere near 2GB, for efficiency reasons.
A memmapped string_view /is/ a contiguous sequence of char. I don't see the difference.
I like the use of negative indices for indexing from the end in Python, but I am ambivalent about using the same feature in C++. None of the other types I regularly use in C++ work like that, and the runtime cost involved is a lot more noticeable in C++.
I don't really get what you mean about the runtime cost. Could you be more explicit?
Somewhere in the implementation of operator[] and operator(), there has to be a branch on index < 0 (or >= 0) in order for that negative index trick to work, which the compiler can't always optimize away. Branches are often affordable but they're not free.
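A minimal illustration of the branch Rainer means (not Boost.Text's actual implementation):

    #include <cassert>

    struct negative_indexable
    {
        char const * data_;
        int size_;

        char operator[](int i) const
        {
            if (i < 0)      // the sign test in question: often predictable,
                i += size_; // but not free when the index is a runtime value
            assert(0 <= i && i < size_);
            return data_[i];
        }
    };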
Also, using the same type for counting-from-end indices and counting-from-beginning indices seems unsafe. A separate type for counting-from-end would be safer and faster, at the cost of being more syntactically heavy.
Interesting. Do you mean something like a strong typedef for "index" and "negative-index", or something else?
Yes, something like that.
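A sketch of that idea, with hypothetical names: the sign test moves from runtime into the type system, so both subscript operators are branch-free.

    #include <cassert>

    struct from_end
    {
        int offset; // 1 is the last element, 2 the second-to-last, ...
    };

    struct indexed_string
    {
        char const * data_;
        int size_;

        char operator[](int i) const { return data_[i]; }
        char operator[](from_end e) const { return data_[size_ - e.offset]; }
    };

    // usage: s[2] is the third char; s[from_end{1}] is the last char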
One thing that appears to be missing is normalization-preserving append/insert/erase operations. These are available in the text layer, but that means being tied to the specific text classes provided by that layer.
Hm. I had not considered making these available as algorithms, and I generally like the approach. But could you be more specific? In particular, do you mean that insert() would take a container C and do C.insert(), then renormalize? This is the approach used by C++20's erase() and erase_if() free functions. Or did you mean something else?
I hadn't thought through the interface in detail. I just saw that this was a feature of the text layer, and thought it would be nice to have in the unicode layer, because I don't want to use the text layer (in its current form).
Requiring Unicode text to be in Stream-Safe Format is another time bomb waiting to go off, but it's also a usability issue. The library should provide an algorithm to put Unicode text in Stream-Safe Format, and should automatically apply that algorithm whenever text is normalized. This would make it safe to use Boost.Text on data from an untrusted source so long as the data is normalized first, which you have to do with untrusted data anyway.
This seems like a good idea to me. I went back and forth over whether or not to supply the SSF algorithm, since it's not an official Unicode algorithm, but adding it to text's normalization step would be reason enough to do so.
Text layer: overall, I don't like it.
On one hand, there is the gratuitous restriction to FCC. Why can't other normalization forms be supported, given that the unicode layer supports them?
Here is the philosophy: If you have a template parameter for UTF-encoding and/or normalization form, you have an interoperability problem in your code. Multiple text<...>'s may exist for which there is no convenient or efficient interop story. If you instead convert all of your text to one UTF+normalization that you use throughout your code, you can do all your work in that one scheme and transcode and/or renormalize at the program input/output boundaries.
Having to renormalize at API boundaries can be prohibitively expensive.
Because, again, I want there to be trivial interop. Having text<text::string> and text<std::string> serves what purpose exactly? That is, I have never seen a compelling use case for needing both at the same time. I'm open to persuasion, of course.
The advantage of text<std::string> is API interop with functions that accept std::string arguments. I'm not sure what the advantage of text<boost::text::string> is. But if we accept that boost::text::rope (which would just be text<boost::text::unencoded_rope> in my scheme) is useful, then it logically follows that text<std::string> could also be useful.
The SSF assumption is explicitly allowed in the Unicode standard, and it's less onerous than not checking array-bounds access in operator[] in one's array-like types. Buffer overflows are really common, and SSF violations are not. That being said, I can add the SSF-conformance algorithm as mentioned above.
Unintentional SSF violations are rare. Intentional SSF violations can be used as an attack vector, if "undefined behavior" translates to "memory error".
.../the_unicode_layer/searching.html: the note at the end of the page is wrong, assuming you implemented the algorithms correctly. The concerns for searching NFD strings are similar to the concerns for searching FCC strings.
In both FCC and NFD:
- There is a distinction between A+grave+acute and A+acute+grave, because they are not canonically equivalent.
- A+grave is a partial grapheme match for A+grave+acute.
- A+acute is not a partial grapheme match for A+grave+acute.
- A+grave is not a partial grapheme match for A+acute+grave.
- A+acute is a partial grapheme match for A+acute+grave.
But:
- A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.
Hm. That note was added because the specific case mentioned fails to work for NFD, but works for FCC and NFC. I think the grapheme matching analysis above addresses something different from what the note is talking about -- the note is concerned with code-point-level searching results producing (perhaps) surprising results. The grapheme-based search does not, so which CPs are matched by a particular grapheme does not seem to be relevant. Perhaps I'm missing something?
I am talking about code point level matches here. ("Partial grapheme match" means "matches some code points within a grapheme, but not the whole grapheme".)
I was not able to get the library to build, so I was not able to test it. But it does look like it should be a good ICU replacement for my purposes, assuming it doesn't have any serious bugs.
Huh. What was the build problem? Was it simply that it takes forever to build all the tests?
The configuration step failed because it tries to compile and run test programs in order to gather information about my environment, and I was running in a cross-compile context which prevents CMake from running the programs that it compiles. Probably not too hard to work around on my part by simply not using a cross-compile context. -- Rainer Deyke (rainerd@eldwood.com)
On Fri, Jun 12, 2020 at 4:15 PM Rainer Deyke via Boost wrote:
On 12.06.20 21:56, Zach Laine via Boost wrote:
(And no, unencoded_rope would not be a better choice. I can't memmap ropes, but I can memmap string_views.)
You can easily memmap string_views 2GB at a time, though. That is a simple workaround for this corner case you have mentioned, but it is truly a corner case compared to the much more common case of using strings for holding contiguous sequences of char. Contiguous sequences of char really should not be anywhere near 2GB, for efficiency reasons.
A memmapped string_view /is/ a contiguous sequence of char. I don't see the difference.
The difference is mutability. There's no perf concern with erasing the first element of a string_view, if that's not even a supported operation.
I don't really get what you mean about the runtime cost. Could you be more explicit?
Somewhere in the implementation of operator[] and operator(), there has to be a branch on index < 0 (or >= 0) in order for that negative index trick to work, which the compiler can't always optimize away. Branches are often affordable but they're not free.
Ah, I see, thanks. Would it make you feel better if negative indexing were only used when getting substrings?
One thing that appears to be missing is normalization-preserving append/insert/erase operations. These are available in the text layer, but that means being tied to the specific text classes provided by that layer.
Hm. I had not considered making these available as algorithms, and I generally like the approach. But could you be more specific? In particular, do you mean that insert() would take a container C and do C.insert(), then renormalize? This is the approach used by C++20's erase() and erase_if() free functions. Or did you mean something else?
I hadn't thought through the interface in detail. I just saw that this was a feature of the text layer, and thought it would be nice to have in the unicode layer, because I don't want to use the text layer (in its current form).
I don't need a detailed interface. Pseudocode would be fine too.
Text layer: overall, I don't like it.
On one hand, there is the gratuitous restriction to FCC. Why can't other normalization forms be supported, given that the unicode layer supports them?
Here is the philosophy: If you have a template parameter for UTF-encoding and/or normalization form, you have an interoperability problem in your code. Multiple text<...>'s may exist for which there is no convenient or efficient interop story. If you instead convert all of your text to one UTF+normalization that you use throughout your code, you can do all your work in that one scheme and transcode and/or renormalize at the program input/output boundaries.
Having to renormalize at API boundaries can be prohibitively expensive.
Sure. Anything can be prohibitively expensive in some context. If that's the case in a particular program, I think it is likely to be unacceptable to use text::operator+(string_view) as well, since that also does on-the-fly normalization. Someone, somewhere, has to pay that cost if you want to use two chunks of text in encoding/normalization A and B. You might be able to keep working in A for some text and keep working in B separately for other text, but I think code that works like that is going to be hard to reason about, and will be as common as code that freely mixes wstring and string (and I mean not only at program boundaries). That is, not very common.

However, that is a minority of cases. The majority case is that texts have to be able to interop within your program arbitrarily, and so you need to pay the conversion cost somewhere eventually anyway.

FWIW, I'm planning to write standardization papers for the Unicode layer stuff for C++23, and the text stuff in the C++26 timeframe. My hope is that we will adopt my text design here into Boost in plenty of time to see whether it is actually as workable as I claim. I'm open to the idea of being wrong about its design and changing it to a template if a nontemplate design turns out to be problematic.
Because, again, I want there to be trivial interop. Having text<text::string> and text<std::string> serves what purpose exactly? That is, I have never seen a compelling use case for needing both at the same time. I'm open to persuasion, of course.
The advantage of text<std::string> is API interop with functions that accept std::string arguments.
Sure. That exists now, though it does require a copy. It could also be done via a move if I replace text::string with std::string within text::text, which I expect to do as a result of this review.
I'm not sure what the advantage of text<boost::text::string> is. But if we accept that boost::text::rope (which would just be text<boost::text::unencoded_rope> in my scheme)
That does not work. Strings and ropes have different APIs.
is useful, then it logically follows that text<std::string> could also be useful.
That's what I don't get. Could you explain how text<A> and text<B> are useful in a specific case? "Could also be useful" is not sufficient motivation to me. I understand the impulse, but I think that veers into over-generality in a way that I have found to be problematic over and over in my career.
.../the_unicode_layer/searching.html: the note at the end of the page is wrong, assuming you implemented the algorithms correctly. The concerns for searching NFD strings are similar to the concerns for searching FCC strings.
In both FCC and NFD:
- There is a distinction between A+grave+acute and A+acute+grave, because they are not canonically equivalent.
- A+grave is a partial grapheme match for A+grave+acute.
- A+acute is not a partial grapheme match for A+grave+acute.
- A+grave is not a partial grapheme match for A+acute+grave.
- A+acute is a partial grapheme match for A+acute+grave.
But:
- A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.
Hm. That note was added because the specific case mentioned fails to work for NFD, but works for FCC and NFC. I think the grapheme matching analysis above addresses something different from what the note is talking about -- the note is concerned with code-point-level searching results producing (perhaps) surprising results. The grapheme-based search does not, so which CPs are matched by a particular grapheme does not seem to be relevant. Perhaps I'm missing something?
I am talking about code point level matches here. ("Partial grapheme match" means "matches some code points within a grapheme, but not the whole grapheme".)
Ah, I see. Thanks. I'll update the note. Zach
On 14.06.20 01:25, Zach Laine via Boost wrote:
On Fri, Jun 12, 2020 at 4:15 PM Rainer Deyke via Boost wrote:
A memmapped string_view /is/ a contiguous sequence of char. I don't see the difference.
The difference is mutability. There's no perf concern with erasing the first element of a string_view, if that's not even a supported operation.
A /lot/ of strings, probably the vast majority, will never be mutated. And for the rest, the majority will only be mutated by appending. Erasing the first element is a nice-to-have but expensive and rarely used feature. If you find yourself doing that a lot, then you probably do want a rope.
Somewhere in the implementation of operator[] and operator(), there has to be a branch on index < 0 (or >= 0) in order for that negative index trick to work, which the compiler can't always optimize away. Branches are often affordable but they're not free.
Ah, I see, thanks. Would it make you feel better if negative indexing were only used when getting substrings?
That does address the performance problem, so yes.
I hadn't thought through the interface in detail. I just saw that this was a feature of the text layer, and thought it would be nice to have in the unicode layer, because I don't want to use the text layer (in its current form).
I don't need a detailed interface. Pseudocode would be fine too.
insert_nfd(string, position, thing_to_insert)
// Insert 'thing_to_insert' into 'string' at 'position'. Both 'string'
// and 'thing_to_insert' are required to be in NFD. The area around the
// insertion is renormalized to NFD.
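Rendered as a hypothetical C++ declaration (names and parameter types are illustrative, not a proposed Boost.Text signature):

    #include <string>
    #include <string_view>

    // Precondition: 'str' and 'to_insert' are valid UTF-8 in NFD.
    // Postcondition: 'str' is still in NFD; only the code points adjacent
    // to the insertion point are re-examined and renormalized.
    void insert_nfd(std::string & str,
                    std::string::size_type position,
                    std::string_view to_insert);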
Having to renormalize at API boundaries can be prohibitively expensive.
Sure. Anything can be prohibitively expensive in some context. If that's the case in a particular program, I think it is likely to be unacceptable to use text::operator+(string_view) as well, since that also does on-the-fly normalization.
Hopefully only on the string_view and the area immediately surrounding the insertion.
Someone, somewhere, has to pay that cost if you want to use two chunks of text in encoding/normalization A and B. You might be able to keep working in A for some text and keep working in B separately for other text, but I think code that works like that is going to be hard to reason about, and will be as common as code that freely mixes wstring and string (and I mean not only at program boundaries). That is, not very common.
Which is why I want to avoid just that.
Your suggestions:
void f() {
    // renormalizes to fcc
    text::text t = api_function_that_returns_nfd();
    do_something_with(t);
    string s;
    text::normalize_to_nfd(t.extract(), back_inserter(s));
    api_function_that_accepts_nfd(s);
}
My suggestion:
void f() {
    text::text<nfd> t = api_function_that_returns_nfd();
    do_something_with(t);
    api_function_that_accepts_nfd(t.extract());
}
That's what I don't get. Could you explain how text<A> and text<B> are useful in a specific case?
text<nfd> is useful in exactly the case shown above: when the surrounding APIs traffic in NFD, no renormalization is needed at the boundaries. Likewise, text<std::string> is useful for interoperating with the large body of existing code that uses std::string as its vocabulary type.
On Sun, Jun 14, 2020 at 7:25 AM Rainer Deyke via Boost wrote:
On 14.06.20 01:25, Zach Laine via Boost wrote:
On Fri, Jun 12, 2020 at 4:15 PM Rainer Deyke via Boost wrote:
A memmapped string_view /is/ a contiguous sequence of char. I don't see the difference.
The difference is mutability. There's no perf concern with erasing the first element of a string_view, if that's not even a supported operation.
A /lot/ of strings, probably the vast majority, will never be mutated.
Ok, then those should more appropriately be string_views.
And for the rest, the majority will only be mutated by appending.
That does not help, unless the capacity is so large that a reallocation is unnecessary.
Erasing the first element is a nice to have but expensive and rarely used feature. If you find yourself doing that a lot, then you probably do want a rope.
Any mutation might cause a reallocation. I named one of the worst-case operations rhetorically, but appending is also bad if it causes that reallocation. It's not a question of what kind of mutating operation you're doing, but whether you're mutating or not.
I hadn't thought through the interface in detail. I just saw that this was a feature of the text layer, and thought it would be nice to have in the unicode layer, because I don't want to use the text layer (in its current form).
I don't need a detailed interface. Pseudocode would be fine too.
insert_nfd(string, position, thing_to_insert)
// Insert 'thing_to_insert' into 'string' at 'position'. Both 'string'
// and 'thing_to_insert' are required to be in NFD. The area around the
// insertion is renormalized to NFD.
I see -- no surprises here. As I said, I like this idea a lot! However, see below.
Having to renormalize at API boundaries can be prohibitively expensive.
Sure. Anything can be prohibitively expensive in some context. If that's the case in a particular program, I think it is likely to be unacceptable to use text::operator+(string_view) as well, since that also does on-the-fly normalization.
Hopefully only on the string_view and the area immediately surrounding the insertion.
No, that's why I picked string_view, and not text_view. text_view insertion does not normalize the incoming text, but string_view insertion does. This is in keeping with the philosophy:

- At program I/O boundaries (not all API boundaries), convert to UTF-8 and FCC.
- Internal interfaces that take UTF-8/FCC will not transcode or normalize.
- Internal interfaces that take non-UTF-8/FCC will transcode and normalize as needed.

text::operator+(string_view sv) does not know the normalization of sv, so it normalizes. The alternative is clunky -- you have to make a new string somewhere to normalize into, and then use operator+() on the result.
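A sketch of those three rules in use; this has not been compiled against Boost.Text, and the exact signatures (operator+ on these argument types) are assumptions based on the discussion:

    void boundary(boost::text::text & t,
                  boost::text::string_view untrusted, // unknown normalization
                  boost::text::text_view trusted)     // known UTF-8/FCC
    {
        t = t + untrusted; // transcoded/renormalized on the way in
        t = t + trusted;   // no transcoding, no renormalization
    }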
Someone, somewhere, has to pay that cost if you want to use two chunks of text in encoding/normalization A and B. You might be able to keep working in A for some text and keep working in B separately for other text, but I think code that works like that is going to be hard to reason about, and will be as common as code that freely mixes wstring and string (and I mean not only at program boundaries). That is, not very common.
Which is why I want to avoid just that.
Your suggestions:
void f() {
    // renormalizes to fcc
    text::text t = api_function_that_returns_nfd();
    do_something_with(t);
    string s;
    text::normalize_to_nfd(t.extract(), back_inserter(s));
    api_function_that_accepts_nfd(s);
}
My suggestion:
void f() {
    text::text<nfd> t = api_function_that_returns_nfd();
    do_something_with(t);
    api_function_that_accepts_nfd(t.extract());
}
Right, I get it. I just think you're leaving out the lack of interoperability with text::text<A> that using text::text<B> elsewhere in the program implies.
On Fri, Jun 12, 2020 at 10:36 AM Rainer Deyke via Boost <boost@lists.boost.org> wrote:
String layer: overall, a solution looking for a problem.
I tend to agree with this but I had assumed there is a reason why it exists. Zach indicated that he doesn't care much about it, so it probably should be deleted, but I don't feel strongly about this because there are plenty of C++ APIs which define their own string type, and I don't see this as a problem. In C++, we have many of everything.
On Fri, Jun 12, 2020 at 10:36 AM Rainer Deyke via Boost wrote:
String layer: overall, a solution looking for a problem.
I created boost::json::string to use instead of std::string for a couple of reasons:

1. It uses a type-erased, reference-counted memory resource instead of an Allocator template parameter
2. The class is laid out a certain way to greatly minimize its size
3. The class layout is optimized to keep the size of the enclosing variant (the JSON DOM value type) small
4. The API is the same for all C++ versions (unlike std::string)
5. It is not a class template

I would like to know what are the exact _technical_ benefits of Boost.Text's String layer, beyond "because I don't like std::string".

Thanks
On Sat, Jun 13, 2020 at 6:42 PM Vinnie Falco via Boost wrote:
On Fri, Jun 12, 2020 at 10:36 AM Rainer Deyke via Boost wrote:
String layer: overall, a solution looking for a problem.
I created boost::json::string to use instead of std::string for a couple of reasons:
1. It uses a type-erased, reference counted memory resource instead of Allocator template parameter
2. The class is laid out a certain way to greatly minimize its size
3. The class layout is optimized to keep the size of the enclosing variant (the JSON DOM value type) small
4. The API is the same for all C++ versions (unlike std::string)
5. It is not a class template
I would like to know what are the exact _technical_ benefits of Boost.Text's String layer, beyond "because I don't like std::string"
At least some of the bullets above indicate that you need a string type tailored to the performance characteristics of your library (which is fine). The performance/interface requirements of Boost.Text are part of a more general problem: people will use different string-like types for various reasons; can they use their string in Boost.JSON? It should be possible to parametrize Boost.JSON with the string type.

This does not necessarily have to compromise Boost.JSON's goal of avoiding templates. You can provide a templated library, provide your string type, and provide an explicit instantiation of your library template for your string type. This way users can include cheap headers and link with the precompiled sources. But other users can choose to use your library as a template. Template-ness would still be visible at the ABI level, but maybe that is not a problem?

Regards,
&rzej;
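A sketch of the arrangement Andrzej describes (all names hypothetical, not Boost.JSON's actual layout): the class stays a template, but the library ships one precompiled instantiation, so users of the default never instantiate it themselves.

    #include <string>
    #include <utility>

    // ---- value.hpp: declarations only, cheap to include ----
    template <class String>
    class basic_value
    {
    public:
        void assign(String s);
    private:
        String str_;
    };

    using value = basic_value<std::string>;          // the blessed default
    extern template class basic_value<std::string>; // suppress implicit
                                                    // instantiation in users

    // ---- value.cpp: compiled once into the library ----
    template <class String>
    void basic_value<String>::assign(String s) { str_ = std::move(s); }

    template class basic_value<std::string>; // the one explicit instantiation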
On Mon, 15 Jun 2020 at 06:19, Andrzej Krzemienski via Boost-users <boost-users@lists.boost.org> wrote:
At least some of the bullets above indicate that you need a string type tailored to the performance characteristics of your library (which is fine).
Is fine, internally. The use of a custom string in 'the app' is not gonna improve speed as much as a better algorithm.
On Mon, Jun 15, 2020 at 4:18 AM Andrzej Krzemienski wrote:
can they use their string in Boost.JSON?
No. The design of the library explicitly avoids allowing customizability.
It should be possible to parametrize Boost.JSON with the string type.
This has been explored in other libraries, and I don't like the result. Here is nlohmann JSON's value type:

template<
    template<typename U, typename V, typename... Args> class ObjectType = std::map,
    template<typename U, typename... Args> class ArrayType = std::vector,
    class StringType = std::string,
    class BooleanType = bool,
    class NumberIntegerType = std::int64_t,
    class NumberUnsignedType = std::uint64_t,
    class NumberFloatType = double,
    template<typename U> class AllocatorType = std::allocator,
    template<typename T, typename SFINAE = void> class JSONSerializer = adl_serializer
>
class basic_json;
On Mon, Jun 15, 2020 at 4:18 AM Andrzej Krzemienski wrote:
It should be possible to parametrize Boost.JSON with the string type.
Another problem with doing this is that it breaks the invariant of Boost.JSON's DOM value type:

* All child elements in a DOM will inherit the allocator of the root element.

If we allow user-defined strings, first they probably won't work with storage_ptr (Boost.JSON's custom allocator container interface), and second they can break the invariant. My goal is for boost::json::value to be suited as a vocabulary type. Allowing customization works against that. An example of an existing useless customization point would be char traits in basic_string and/or basic_string_view. I don't want to create more.

Thanks
This is a review for the Boost.Text library, submitted a day late (but
hopefully not a dollar short! (U.S. colloquialism, don't mind me!)).
The library has 3 somewhat related but (somewhat?) separable
sub-libraries. In "building block" order, these are:
- A string layer (a new std::string)
- A unicode layer (algorithms and data)
- A text layer (string, but if it gave a single flying crap about Unicode)
There are 4 (3?) types to care about in the string layer:
unencoded_rope, string, segmented_vector (and string_builder...?).
There are not many new types in the unicode layer save for things that
help the algorithms/data do the things well and report findings from
algorithm calls.
There are 2 types to care about in the text layer: text and rope.
We will start with the lowest building block layer. This will likely
not be a typical review: most others have called out in the
documentation and other places things that have failed, so I will
focus primarily on the utility and design of the layers and what they
can bring to the table.
======
Layer 0
======
[[string]]
It gets rid of char_traits (yay!) but then also throws out the
allocator (ewwww!). ... That's it.
This type does not affect my view of the library because it can be
(mostly) safely ignored. It can be nuked from orbit and nothing of
value will be lost. I would actually recommend std::string be used
underneath, because why do this to the ecosystem for the (N+1)th time?
[[unencoded_rope]][[segmented_vector]]
These two data structures are FAR more spicy and incredibly
interesting. They provide different guarantees of insertion and
erasure complexity. Of course, neither has allocators built in, so I
can't really customize how this works without hijacking global new and
delete, but Zach has made clear his distaste for the allocator world
and having recently built several standard containers I don't blame
him.
Nevertheless, both of these data structures are being talked about
together because they provide the same type of functionality:
unencoded_rope is just specialized for char storage. Notably,
segmented_vector has an insert for (iterator, value_type) while
unencoded_rope seemed to be missing that and only wanted to deal with
"strings"/ranges, rather than single elements. This made me applying
my fun text-wrapper on unencoded_rope mildly annoying because
single-insert was just not present:
phd::text::basic_text
On Sun, Jun 21, 2020 at 7:16 AM JeanHeyd Meneide via Boost wrote:
...neither have allocators built in so I can't really customize how this works without hijacking global new and delete, but Zach has made clear his distaste for the allocator world ... - Allocators are shit! ... "who died and made you King of my string...memory allocation? ...
The inability to control how and where memory is allocated is a significant obstacle to using ANY container in a concurrent network server. In every case, the first step in improving the performance of such a program is to reduce the number of calls to allocate memory. Often, simply reducing memory allocations is sufficient to give the program the desired performance.

A typical asynchronous network program follows a predictable cycle of operations in a thread:

1. Data is received from the remote peer
2. Memory is allocated to perform the operation
3. The operation is performed
4. Allocated memory is freed
5. Control of the thread is returned to the I/O subsystem

It is in steps 2 and 4 where a custom allocator is used to optimize the I/O cycle. The techniques are simple:

1. Reuse a previously allocated block of memory
2. Utilize the stack for storage (preferred when the storage requirements for the operation are bounded)

Without the ability to control the allocator, these optimizations are not possible with the containers in Boost.Text.

I have not looked deeply enough into Boost.Text, nor am I knowledgeable enough about Unicode, to write a competent review. However, were I to write such a review I would REJECT the library for the sole reason that it does not support allocators and thus cannot be used optimally in a network program.

If the author decides to add allocator support, I suggest rather than using an Allocator template parameter, to instead use `boost::container::memory_resource&` as a constructor parameter (or `std::pmr::memory_resource&` if available). This allows the container to be implemented as an ordinary class instead of a class template.

Thanks
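A minimal sketch of the two techniques Vinnie lists, using std::pmr (which Boost.Text's containers could not participate in without allocator support):

    #include <memory_resource>
    #include <string>

    void handle_request()
    {
        // Technique 2: bounded storage lives on the stack...
        char buf[4096];
        // ...and allocation falls back to the heap only on overflow.
        std::pmr::monotonic_buffer_resource mr(buf, sizeof(buf));

        // Every allocation in this I/O cycle comes from 'mr'; deallocation
        // is a no-op, and the region is recycled when 'mr' is destroyed.
        std::pmr::string headers(&mr);
        headers += "Content-Type: text/plain\r\n";

        // ... perform the operation, write the response ...
    } // step 4 of the cycle: all memory released at once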
On Sun, 21 Jun 2020 at 18:32, Vinnie Falco via SG16 wrote:
If the author decides to add allocator support, I suggest rather than using an Allocator template parameter, to instead use `boost::container::memory_resource&` as a constructor parameter (or `std::pmr::memory_resource&` if available). This allows the container to be implemented as an ordinary class instead of a class template.
The unfortunate bit in doing that is that a memory_resource doesn't provide hooks for deciding how it propagates, so you'll be nailing that question down to one answer. It sure looks to me that there's more than one answer to that question. If you're using an allocator, make it a template parameter.

Some proponents of polymorphic allocators tell their audiences that with the advent of polymorphic allocators, allocators as template parameters become unnecessary. Based on my experience in this area, they are mistaken. Or perhaps in other words: a memory_resource is not an allocator. It's a mechanism for acquiring and releasing memory, but that's not all an allocator is; there's more to it than that.
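The hooks Ville is referring to are the propagate_on_* traits, which containers consult through std::allocator_traits; a bare memory_resource has no equivalent knob. A minimal illustration, with a made-up allocator:

    #include <cstddef>
    #include <memory>
    #include <type_traits>

    template <class T>
    struct pinned_allocator
    {
        using value_type = T;

        // Chosen per-allocator, not fixed by the container:
        using propagate_on_container_copy_assignment = std::false_type;
        using propagate_on_container_move_assignment = std::false_type;
        using propagate_on_container_swap            = std::false_type;

        pinned_allocator() = default;
        template <class U>
        pinned_allocator(pinned_allocator<U> const &) {}

        T * allocate(std::size_t n) { return std::allocator<T>{}.allocate(n); }
        void deallocate(T * p, std::size_t n)
        { std::allocator<T>{}.deallocate(p, n); }
    };

    template <class T, class U>
    bool operator==(pinned_allocator<T> const &, pinned_allocator<U> const &)
    { return true; }
    template <class T, class U>
    bool operator!=(pinned_allocator<T> const &, pinned_allocator<U> const &)
    { return false; }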
On Sun, Jun 21, 2020 at 8:54 AM Ville Voutilainen wrote:
a memory_resource doesn't provide hooks for deciding how it propagates, so you'll be nailing that question down to one answer. It sure looks to me that there's more than one answer to that question.
A benefit of targeting Boost is that a library can experiment with making such decisions without etching it in stone (as would happen if it was a Standard C++ component). For example as I have done with Boost.JSON, which provides a novel "storage_ptr": a smart pointer container to a reference-counted memory_resource, and has some opinions on exactly how it is propagated to children: http://vinniefalco.github.io/doc/json/json/usage/allocators.html Thanks
On Sun, Jun 21, 2020 at 9:16 AM JeanHeyd Meneide via SG16 wrote:
This is a review for the Boost.Text library, submitted a day late (but hopefully not a dollar short! (U.S. colloquialism, don't mind me!)).
The library has 3 somewhat related but (somewhat?) separable sub-libraries. In "building block" order, these are:
- A string layer (a new std::string)
- A unicode layer (algorithms and data)
- A text layer (string, but if it gave a single flying crap about Unicode)
[snip]
====== Layer 2 ======
This is the layer I am -- on a library design level and a personal philosophy level -- the most opposed to.
But my answer is still to accept it (well, modulo it being based on the above string type. Please just use std::string).
That seems to be the consensus.
[[ text ]] [[ rope ]] While these containers can be evaluated individually, other reviews have picked up a great deal of pickings at them and so I won't bother. There was some grumbling about how a rope-like data structure is not interesting enough to be included and I will just quietly wave that off as "my use case is the only use case that matters and therefore I don't care about other people's invariants or needs".
There are many implicitly (and explicitly) stated and maintained opinions in this layer:
- UTF-8 is the way, truth, and life.
- Unicode is the only encoding that matters ever, for all time, in perpetuity.
- Allocators are shit!
- NFC is probably the best we can do here for varying reasons.
- Who needs non-contiguous storage anyways?
- Who needs non-SBO storage, anyways?
These are all opinions, many of which are present in the design of the text container. And they allow this text container to ship. But that lack of flexibility -- while okay for Qt or Apple's CoreText or whatever other platform-specific hoo-ha you want to get involved with -- does not help. In fact, it cuts them off: more than one person during Meeting C++ spoke to me of Boost.Text and said it could not meet their needs because it maintained encoding or normalization invariants that did not interoperate with their existing system.

Storage is also an issue: while "I use boost::text::string underneath" is fine and dandy, many systems (next to none, maybe?) are going to speak in "text" or its related string type. They will want the underlying container to speak to. For duck-type purposes, it works. But for everyone else, it fails.
Since the string layer uses an `int` for its size and capacity, it is lopsidedly incompatible with existing STL's implementations of string, to the point that a reinterpret_cast -- however evil -- is not suitable for transporting a reference-without-copy into these APIs.
When text::text changes to use std::string, the size_type will naturally have to change to size_t. I intend to change the size_type of the others to be internally consistent. In for a penny, in for a pound.
God bless string_view and its friends, because it allows us to at least continue to talk to some APIs, since the text type guarantees contiguous storage. This means that at the boundaries of an application -- or even as a plugin to a wider ecosystem -- I am paying a (sometimes obscene) cost to interoperate between std::string/llvm::SmallString/unicode_code_unit_sequence and all the other things people have developed to sit between them and what they believe their string needs are. And while it is whack that so many of these classes exist, they do.
That lack of interoperability -- and once again, the lack of an allocator template parameter -- hampers this library from COMPLETELY DOMINATING the string scene. It will always be used as a solution, maybe even 80% of the time. Those seeking more will have to figure out how to build their own UTF16 containers, or their own special-encoded containers, with very little support from the text library (save for some transcoding functions they can leverage, but only from specific Unicode encodings).
I hope and expect that the idea floated earlier in the review -- of adding Unicode-invariant-maintaining free functions -- will make implementing your own types all but trivial. Zach
participants (10):
- Andrzej Krzemienski
- degski
- Emil Dotchevski
- Glen Fernandes
- JeanHeyd Meneide
- Niall Douglas
- Rainer Deyke
- Ville Voutilainen
- Vinnie Falco
- Zach Laine