[review] [text] Text review begins June 11
The Boost formal review of Zach Laine's Unicode library, Text, commences on June 11th and concludes June 20th. We welcome the participation of Boost developers and users, and the C++ community. Quote: This library includes three layers: - The string layer, a set of types that constitute "a better std::string" - The Unicode layer, consisting of the Unicode algorithms and data - The text layer, a set of types like the string layer types, but providing transparent Unicode support Code: https://github.com/tzlaine/text Docs: https://tzlaine.github.io/text Text requires C++14 or higher, and supports GCC 5+, Clang 3.6+, and MSVC 2017. Zach initially presented the library at C++Now in 2018. He also intends to standardize a library solution for Unicode based on this work. Talk: https://www.youtube.com/watch?v=944GjKxwMBo Slides: https://tinyurl.com/boost-text-slides Try the library today, and please consider submitting a review, or even contributing to the discussion next week. Glen
On Tue, Jun 2, 2020 at 7:44 PM Glen Fernandes via Boost-users < boost-users@lists.boost.org> wrote:
The Boost formal review of Zach Laine's Unicode library, Text, commences on June 11th and concludes June 20th. We welcome the participation of Boost developers and users, and the C++ community.
Here is my review of Text in the form of a list of notes/questions, in no particular order. Reading through the documentation, everything makes sense. The motivation for each type and operation is easy to understand. I welcome the use of signed int as index/size type, but I wonder if this would interact in some negative way when mixed with similar types from the standard library. I don't think so, but who knows. I don't like the multi-page format of the documentation. I'd prefer a single HTML page for easy searching. What is the purpose of the throw_on_encoding_error type here? https://github.com/tzlaine/text/blob/b8bb1a1101e7cf53b460671a3ead19516e7636f... I searched and couldn't find any place in the documentation that explains how ErrorHandler is used. Looking at the source code, it's just a function, but its interface should be documented. Generally I dislike interfaces with configurable error handling but it appears that in the case of the conversion iterators this is warranted: both replacement character and throwing are useful. I wonder though if it would be better to delete this template parameter and always output 0xfffd, and in addition provide an iterator adaptor type which can be used to wrap e.g. an output iterator, and which throws if it sees 0xfffd being output. Both .c_str() and .data() should be added to string for compatibility. Also, it is not true that s.begin() is equivalent to s.c_str(), because in debug iterators should not be pointers. Perhaps &*s.begin() is equivalent, but this should not be valid on an empty string (assert), null termination notwithstanding. The note about passing ropes by value, I think, is redundant. Logically, copying a rope gets you an independent copy, the fact that internally the copies share state is an implementation detail. Therefore, the documentation can just specify that a rope is as thread-safe as an int. Speaking of thread-safety, where are the thread-safety guarantees of rope documented? Where are the relevant tests? Speaking of tests, I suggest separating the generated test sources in a subdirectory so things like test/string.cpp are easy to find. Not to mention github.com truncates the test directory to 1000 files. The grammar seems wrong here; 'I don't know if you've ever written an undo system that can do, undo, or redo any command you can think of, in less than 40 lines of code, but there one is." "Every text editor you've ever used that handles large files gracefully has done something to break the file into chunks" What about the buffer gap algorithm? I think Borland editors used this way back in the day. :) In conclusion: I spent a couple of hours studying the documentation and parts of the source code. I vote to ACCEPT Text into Boost.
On 3/06/2020 18:22, Emil Dotchevski wrote:
Therefore, the documentation can just specify that a rope is as thread-safe as an int.
I strongly dislike the phrase "as thread-safe as an int" because it's often applied to structures containing more than one pointer-size member (eg. shared_ptr), and there are architectures (notably x86) where an int is considerably more thread-safe than this could be, because they have a (architecture, not language) guarantee of coherence lacking in larger values (or in multi-member structs regardless of size). (Granted, portable code cannot assume pointer-size values are coherent, but often a given architecture can be assumed.) I don't recall where intrusive_ptr (the core of rope) falls on this spectrum; it is using an atomic refcount but there's still the separate operations of copying the pointer and updating the refcount. I think it's safe if done in the correct order, however; but not otherwise.
The grammar seems wrong here; 'I don't know if you've ever written an undo system that can do, undo, or redo any command you can think of, in less than 40 lines of code, but there one is."
It sounds correct to me; however it probably would be slightly better to put "in less than 40 lines of code" in parentheses or dashes rather than comma clauses.
On Wed, Jun 3, 2020 at 1:22 AM Emil Dotchevski via Boost-users
On Tue, Jun 2, 2020 at 7:44 PM Glen Fernandes via Boost-users
wrote: The Boost formal review of Zach Laine's Unicode library, Text, commences on June 11th and concludes June 20th. We welcome the participation of Boost developers and users, and the C++ community.
Here is my review of Text in the form of a list of notes/questions, in no particular order.
Thanks for the review, Emil.
Reading through the documentation, everything makes sense. The motivation for each type and operation is easy to understand.
I welcome the use of signed int as index/size type, but I wonder if this would interact in some negative way when mixed with similar types from the standard library. I don't think so, but who knows.
In some cases, that is sure to be the case. text::string needs to exist only so that I can experiment with integrating something like text::text into it, to show how we might add to std::string if std::text ever becomes a thing. So, I don't expect the signed/unsigned interop issue to come up much, if at all -- I don't expect people to use text::string much directly.
I don't like the multi-page format of the documentation. I'd prefer a single HTML page for easy searching.
There's a lot of it! In previous Boost reviews, I've gotten the opposite of this feedback, FWIW.
What is the purpose of the throw_on_encoding_error type here? https://github.com/tzlaine/text/blob/b8bb1a1101e7cf53b460671a3ead19516e7636f...
I searched and couldn't find any place in the documentation that explains how ErrorHandler is used. Looking at the source code, it's just a function, but its interface should be documented.
Right! That's an oversight. I'll add documentation on its use and interface.
Generally I dislike interfaces with configurable error handling but it appears that in the case of the conversion iterators this is warranted: both replacement character and throwing are useful. I wonder though if it would be better to delete this template parameter and always output 0xfffd, and in addition provide an iterator adaptor type which can be used to wrap e.g. an output iterator, and which throws if it sees 0xfffd being output.
This is possible, and if others agree I can certainly make this change.
Both .c_str() and .data() should be added to string for compatibility. Also, it is not true that s.begin() is equivalent to s.c_str(), because in debug iterators should not be pointers. Perhaps &*s.begin() is equivalent, but this should not be valid on an empty string (assert), null termination notwithstanding.
Well, I don't have any non-pointer debug iterators to consider, and an empty text::string is null-terminated. .c_str() .data() and .begin() truly are synonyms in this case. I see no need to have more than one.
The note about passing ropes by value, I think, is redundant. Logically, copying a rope gets you an independent copy, the fact that internally the copies share state is an implementation detail. Therefore, the documentation can just specify that a rope is as thread-safe as an int.
I don't know about that. That is, I don't know if it's actually as threadsafe as an int. I'm simply trying to remind folks that if you pass a rope via const& you're not going to get the thread-safety you want. I've seen a lot of shared_ptrs passed that way, and I'm trying to short-circuit the internal decision making that might lead to the same choice when passing a rope.
Speaking of thread-safety, where are the thread-safety guarantees of rope documented? Where are the relevant tests?
The documentation of threadsafety is here: https://tzlaine.github.io/text/doc/html/boost_text__proposed_/the_string_lay... You have to read down a bit. The tests are in rope_threadsafety.cpp, which I have run with and without TSan.
Speaking of tests, I suggest separating the generated test sources in a subdirectory so things like test/string.cpp are easy to find. Not to mention github.com truncates the test directory to 1000 files.
This seems like a good idea -- it would certainly make things easier for me too! I'll do this.
The grammar seems wrong here; 'I don't know if you've ever written an undo system that can do, undo, or redo any command you can think of, in less than 40 lines of code, but there one is."
It parses for me, and apparently Gavin, but I agree it is a bit awkward. I'll reword it.
"Every text editor you've ever used that handles large files gracefully has done something to break the file into chunks" What about the buffer gap algorithm? I think Borland editors used this way back in the day. :)
I have no idea. I *think* this confirms what I was trying to say, though -- that your editor is dealing with sections of your large file (and not the whole thing) for efficiency. The rope approach I use in Boost.Text is just one possible implementation of this high-level approach. Zach
participants (4)
-
Emil Dotchevski
-
Gavin Lambert
-
Glen Fernandes
-
Zach Laine