[boost] Boost.Text review

18 Jun 2020

      This is my review of Zach Laine's proposed Boost.Text.

Zach is offering several things here: Unicode, a "better string",
a trie, and more. I'm pleased that these components are all
individually exposed to users and are independent of each other,
rather than some being hidden as implementation details.

I've spent a couple of days looking at a few parts of the library.
I'll discuss at the Unicode stuff first.

Boost gained support for Unicode functionality when Boost.Locale
was accepted almost a decade ago, but I was not particularly
impressed by that library and have not used it in my own work.
Does the current proposal do better? I think it does, and
though it's not perfect I think it would be worth accepting but
subject to a concern about copyright.

See another thread for the copyright discussion. Basically,
if I understand correctly, Zach's code includes tables that are
derived from Unicode data and the licence for that data requires
a copyright attribution. The Boost licence does not require
attribution and I support that. So if I've understood the
situation correctly - and I am not a lawyer! - this means that
sadly Boost must REJECT the affected code.

I'll say some more about the other aspects of the Unicode code,
just in case the copyright issue turns out to be surmountable.

I've tried just a couple of things. Firstly, I tried the UTF-8
to UTF-32 conversion. This seems to work, and the API is
reasonable. Performance is not bad, but it could be better
(see another thread); that could be fixed later. I think it
would be worth having just a UTF library in Boost!

Secondly I looked at the line-splitting algorithm, i.e.

template<typename CPIter, typename Sentinel, typename Extent,
         typename CPExtentFunc>
  unspecified lines(CPIter first, Sentinel last, Extent max_extent,
                    CPExtentFunc cp_extent, bool break_overlong_lines = true);

Again this seems to work, and the API is reasonable. Reference
documentation is somewhat lacking though; in the description of
lines(), the requirements for CPExtentFunc need to be described,
and the "unspecified" return type needs to document that it includes
members including iter and hard_line_break. I think there are
probably similar omissions in much of the reference documentation.

I'm unconvinced by the use of the word "sentinel" here. A sentinel
is a value with a special meaning, e.g. a terminating null; what
you have here is an end-iterator. It's good that you're allowing
the begin and end iterators to have different types, but I think
you should name that type something like CPEndIter.

The implementation of line splitting seems to call its word
length callback more often than necessary, i.e. when a word
overflows a line, that word is subsequently measured a second
time. That should be easily fixable.

Moving on to other parts of the proposal, I'm less happy with
text::string. The first thing I noticed was the very strange
decision to use int, rather than (signed?) size_t, for indexing;
yes I do work with chunks of text larger than 2 GB! I would
REJECT text::string for this alone.

I also dislike the use of operator() to extract substrings; the
meaning just isn't obvious:

auto x = y(4);

Is that the 4th character of y? Or the substring from 4 to the
end? No, it's the substring from the start to 3. I'd need to
add a comment to explain what it's doing. But actually there's
another problem with that line: x is a non-owning view of that
substring. I think that methods that return a view (or span,
etc.) should have a name that makes that clear, e.g. "subview".
(This wouldn't have been a problem before auto.)  This came up in
the static_string review, where the substr() method was originally
proposed to return a view; that was renamed subview() with
substr() returning a copied string.

I don't mind that the string isn't templated on the char type,
but I do miss the allocator support; I've used strings with arena
allocators (for performance) and with shared memory allocators.
I know allocator support is a lot of work but it is necessary, IMO.

text::text seems to be the obvious combination of text::string
with unicode functionality, and I've not looked at that.

I have had a look at the trie; I have sometimes wondered to what
extent a trie can be implemented in C++ that works as much as
possible like a standard container so this is interesting.

The docs should mention briefly why trie is here, i.e. where it
is used, just in anticipation of that obvious question from users.

trie::operator[] returns an "optional ref". But it's not like
a std::optional; as well as operator* it has conversion operators
to access the value. Zach notes that this is "wonky" when the
value type is bool, but it also seems to break for me when it's
int:

boost::trie::trie_map<std::string, std::string> counts;
auto x = counts["Hello"];
if (x) std::cout << "present\n"; else std::cout << "absent\n";
// Prints absent, OK.

boost::trie::trie_map<std::string, int> int_counts;  // Now int.
auto y = int_counts["Hello"];
if (y) std::cout << "present\n"; else std::cout << "absent\n";

// ../text/include/boost/text/trie.hpp:78:
// boost::trie::optional_ref<T, Const>::operator T&() &
// [with T = int; bool Const = false]: Assertion `t_' failed.

operator T&() is being called by if(y), rather than
operator bool(). There aren't has_value() and value() methods
that I can use to work around this.

Like a few other things in this proposal, Zach has implemented
trie::operator[] how he wishes the standard library worked.
As another example, dereferencing a trie_map iterator doesn't
return a pair<KEY,VALUE> as a normal associative container
would, but rather a struct with members called key and value.

At worst, this will hit problems that the standard solutions
avoid (as above). At best, it will make the library more
difficult to learn for users who are familiar with the standard
library (including its problems), and make generic code less
likely to work with it.

As a test I wrote a simple program to count word frequencies:

using counts_t = std::map<std::string,int>;
counts_t counts;

while (std::cin.good()) {
  std::string s;
  std::cin >> s;
  counts[s]++;
}

for (auto i: counts) {
  std::cout << i.first << " = " << i.second << "\n";
}

It would be great if I could just change it to:

using counts_t = boost::trie_map<std::string,int>;

but I can't do that; I need to change things:

struct count { int value = 0; };  // Wrap the int.
boost::trie::trie_map<std::string, count> counts;

while (std::cin.good()) {
  std::string s;
  std::cin >> s;

  // operator[] doesn't insert, so:
  auto r = counts[s];
  if (r) r->value++;
  else counts.insert(s,count{1});
}

for (auto i: counts) {
  std::cout << i.key << " = " << i.value.value << "\n";
}

That works; how about performance? Trie should perform
well when the stored strings frequently share common prefixes,
so I've tested with a large web server log file; this has
timestamps and URLs which share common prefixes. The results:

std::map: run time 1.2 seconds, memory 1007 pages
trie_map: run time 1.8 seconds, memory 4868 pages

So that's rather disappointing.

Some general comments:

Some people care about whether libraries are header-only or not,
and the docs should mention this, and possibly say which
functionality can be expected to work header-only.

I got quite a few missing field initializer and unused parameter
warnings; please fix these if at all possible. (gcc 6.3, -W -Wall).
I think you should re-order the docs so that the "Hello World"
example is much earlier.

Some functions clearly need to embed large Unicode tables; it
would be useful if the docs mentioned roughly how much the
executable size is increased by.

Some source files are executable.

Conclusions:

I fear that the only part of this that Boost should accept is
the UTF transcoding. The implementation is not perfect but we
could work on that. The other Unicode stuff would be good to
have but the copyright issue is a blocker.

Regards, Phil.

[boost] Boost.Text review

Phil Endecott