Re: [boost] [review] [text] Text formal review

12 Jun 2020


On Fri, Jun 12, 2020 at 8:40 AM Niall Douglas via Boost
<boost@lists.boost.org> wrote:
...
On 11/06/2020 19:37, Glen Fernandes via Boost wrote:
...
The library provides three layers:
 - The string layer, a set of types that constitute "a better std::string"
 - The Unicode layer, consisting of the Unicode algorithms and data
 - The text layer, a set of types like the string layer types, but
providing transparent Unicode support
Firstly, I'd like to say that proposing a new string implementation is
probably one of the most masochistic things that you can do in C++. Even
more than proposing a result<T, E> type. So, I take a bow to you Mr.
Laine, and I salute your bravery.
I'll put aside the Unicode and Text layers for now, and just consider
the String layer. I have to admit, I'm not keen on the string layer.
Firstly I hate the naming. Everything there ought to get more
descriptive naming. But more importantly, how the design of the string
implementation has been broken up and designed, there's just something
about it which doesn't sit right with me. You seem to me to have
prioritised certain use cases over others which would not be my own
choices, i.e. I don't think the balance of tradeoffs is right in there.
For example, I wouldn't have chosen an atomically reference counted rope
design the way you have at all: I'd have gone for a fusion of your
static string builder with constexpr-possible (i.e. non-atomic)
reference counted fragments, using expression templates to lazily
canonicalise the string depending on sink (e.g. if the sink is cout, use
gather i/o sequence instead of creating a new string). That sort of thing.
Zach, could you take this opportunity to compare your choice of string
design with the string designs implemented by each of the following
libraries please?
- LLVM strings, string refs, twines etc.
- Abseil's strings, string pieces.
- Folly's strings, string pieces and ranges.
- CopperSpice's CsString.
I feel like I am forgetting at least another two. But, point is, I'd
like to know why you chose a different design to each of the above, in
those situations where you did so.
No, because I don't honestly care about the string layer that much any
more.  It was originally a major reason -- the reason, really -- for
the library at the outset.  Now it's mostly cruft.  If people object
to it enough (and it seems they will), I can certainly remove it
entirely, except for unencoded_rope, which is needed in rope.
Replacing boost::text::string with std::string within
boost::text::text is straightforward, and will have no visible effect
on uses of text::text, except for extract() and replace().  The only
reason I left the string bits of the library in place when I changed
the focus to be Unicode-centric is that it was less work to do so.
...
I'll nail my own colours to the mast on this topic: I've thought about
this long and hard over many many years, and I've personally arrived at
the opinion that C needs to gain an integral string object built into
the language, which builds on top of an integral variably sized array
object (NOT current C VLAs). Said same built-in string object would also
be available to C++, by definition.
I have arrived at this opinion because I don't think that ANY library
solution can have the right balance of tradeoffs between all the
competing factors. I think that only a built-in object to the language
itself can deliver the perfect string object, because only the compiler
can deliver a balance of optimisability with developer convenience.
I won't go into any more detail, as this is a review of the Text C++
library. And I know I've already discussed my opinion on SG16 where you,
Zach, were present, so you've heard all my thoughts on this already.
However, if you were feeling keen, I'd like to know if you could think
of any areas where language changes would aid implementing better
strings in C++?
I think the big thing for me would be to have language-level support
for discriminating between char * strings and string literals.  String
literals are special in certain ways that are useful to take advantage
of:  1) they do not need to be copied, since they're in ROM; 2) they
are encoded by the compiler into the execution encoding used in phase
5 of translation.  This second one is pretty important to detect in
some cases, like making a printf-like upgrade to std::format() "just
work".
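As an illustration of why this needs language support, here is a small
sketch (C++17 assumed) of the closest a library can get today. The names
arg_kind and classify are hypothetical. The function can tell an array
from a pointer, but it has no way to tell a string literal from any other
char array, which is the gap described above.

```cpp
#include <type_traits>

// Hypothetical illustration; not an API of Boost.Text or the standard library.
enum class arg_kind { char_array, char_pointer };

// Taking T const& keeps the array type intact (no decay to a pointer),
// so the template can see whether it was handed an array or a pointer.
template <class T>
constexpr arg_kind classify(T const&) {
    if constexpr (std::is_array_v<T>)
        return arg_kind::char_array;    // string literals land here...
    else
        return arg_kind::char_pointer;  // ...and plain char * lands here.
}

// A literal is seen as an array of known size.
static_assert(classify("literal") == arg_kind::char_array);
```

A mutable buffer such as `char buf[8] = "buffer";` is also reported as
char_array, so neither the "no copy needed" property nor the "encoded by
the compiler" property can be assumed from the type alone.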

Zach


Zach Laine