On Fri, Jun 12, 2020 at 8:40 AM Niall Douglas via Boost
On 11/06/2020 19:37, Glen Fernandes via Boost wrote:
The library provides three layers:
- The string layer, a set of types that constitute "a better std::string"
- The Unicode layer, consisting of the Unicode algorithms and data
- The text layer, a set of types like the string layer types, but providing transparent Unicode support
Firstly, I'd like to say that proposing a new string implementation is probably one of the most masochistic things that you can do in C++. Even more than proposing a result type. So, I take a bow to you Mr. Laine, and I salute your bravery.

I'll put aside the Unicode and Text layers for now, and just consider the String layer. I have to admit, I'm not keen on the string layer. Firstly, I hate the naming. Everything there ought to get more descriptive naming. But more importantly, there's something about how the string implementation has been broken up and designed which doesn't sit right with me. You seem to me to have prioritised certain use cases over others in ways I would not have chosen, i.e. I don't think the balance of tradeoffs is right in there.

For example, I wouldn't have chosen an atomically reference counted rope design the way you have at all: I'd have gone for a fusion of your static string builder with constexpr-possible (i.e. non-atomic) reference counted fragments, using expression templates to lazily canonicalise the string depending on sink (e.g. if the sink is cout, use a gather i/o sequence instead of creating a new string). That sort of thing.
Zach, could you take this opportunity to compare your choice of string design with the string designs implemented by each of the following libraries please?
- LLVM strings, string refs, twines etc.
- Abseil's strings, string pieces.
- Folly's strings, string pieces and ranges.
- CopperSpice's CsString.
I feel like I am forgetting at least another two. But, point is, I'd like to know why you chose a different design to each of the above, in those situations where you did so.
No, because I don't honestly care about the string layer that much any more. It was originally a major reason -- the reason, really -- for the library at the outset. Now it's mostly cruft. If people object to it enough (and it seems they will), I can certainly remove it entirely, except for unencoded_rope, which is needed in rope. Replacing boost::text::string with std::string within boost::text::text is straightforward, and will have no visible effect on uses of text::text, except for extract() and replace(). The only reason I left the string bits of the library in place when I changed the focus to be Unicode-centric is that it was less work to do so.
I'll nail my own colours to the mast on this topic: I've thought about this long and hard over many, many years, and I've personally arrived at the opinion that C needs to gain an integral string object built into the language, which builds on top of an integral variably sized array object (NOT current C VLAs). That same built-in string object would also be available to C++, by definition.
I have arrived at this opinion because I don't think that ANY library solution can have the right balance of tradeoffs between all the competing factors. I think that only a built-in object to the language itself can deliver the perfect string object, because only the compiler can deliver a balance of optimisability with developer convenience.
I won't go into any more detail, as this is a review of the Text C++ library. And I know I've already discussed my opinion on SG16 where you, Zach, were present, so you've heard all my thoughts on this already. However, if you were feeling keen, I'd like to know if you could think of any areas where language changes would aid implementing better strings in C++?
I think the big thing for me would be to have language-level support for discriminating between char * strings and string literals. String literals are special in certain ways that are useful to take advantage of:
1) they do not need to be copied, since they're in ROM;
2) they are encoded by the compiler into the execution encoding used in phase 5 of translation.
This second one is pretty important to detect in some cases, like making a printf-like upgrade to std::format() "just work".

Zach
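For reference, here is roughly how far a library can get today without language support. This is a sketch of a well-known overloading trick, with hypothetical names (source_kind, classify) that are not from Boost.Text or the standard library. It separates "array of const char with a known bound" from "pointer to char", but it cannot tell a real string literal from any other const char array, which is exactly the gap described above.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical names, for illustration only.
enum class source_kind { array, pointer };

// Selected for arrays of const char with a known bound.
// String literals land here, but so does any other such array.
template <std::size_t N>
constexpr source_kind classify(const char (&)[N]) noexcept {
    return source_kind::array;
}

// Selected only when the argument really is a pointer. Taking const T&
// prevents array-to-pointer decay, so arrays never reach this overload.
template <class T, std::enable_if_t<std::is_pointer_v<T>, int> = 0>
constexpr source_kind classify(const T&) noexcept {
    return source_kind::pointer;
}

// A string literal is seen as an array.
static_assert(classify("abc") == source_kind::array);

// A pointer is seen as a pointer, even if it points at a literal.
constexpr const char* pointer_to_literal = "abc";
static_assert(classify(pointer_to_literal) == source_kind::pointer);

// The limitation: this array is not a string literal, yet it is
// indistinguishable from one as far as overload resolution is concerned.
constexpr char not_a_literal[] = {'a', 'b', 'c', '\0'};
static_assert(classify(not_a_literal) == source_kind::array);
```

The pointer overload is written as a constrained template on purpose. A plain `const char*` overload would win over the array template for string literals, because array-to-pointer conversion still ranks as an exact match and non-templates are preferred on a tie.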