Re: [Boost-users] Interest in a Unicode library for Boost?

28 Oct 2019


On Mon, Oct 28, 2019 at 3:35 AM David Demelier via Boost-users <boost-users@lists.boost.org> wrote:
...
On 26/10/2019 at 03:11, Zach Laine via Boost-users wrote:
...
About 14 months ago I posted the same thing.  There was significant work
that needed to be done to Boost.Text (the proposed library), and I was a
bit burned out.
Now I've managed to make the necessary changes, and I feel the library
is ready for review, if there is interest.
This library, in part, is something I want to standardize.
It started as a better string library for namespace "std2", with minimal
Unicode support.  Though "std2" will almost certainly never happen now,
those string types are still in there, and the library has grown to also
include all the Unicode features most users will ever need.
Github: https://github.com/tzlaine/text
Online docs: https://tzlaine.github.io/text
I've read the intro on why is std::string so bad and I have to disagree
with many points.
1. The Fat Interface
In which way is std::string bloated? Of course some functions are probably
there as synonyms, but to say it's bloated is kind of false. Just look at
the numerous functions of Java's String instead [0].
Comparing std::string to Java's string class is not doing std::string any
favors.
...
And I
2. The Missing Unicode Support
Yes, many newcomers may be surprised to see that a string "é" has a size
of 2 bytes (assuming UTF-8). But it's also the case of UTF-16 strings
which may have surrogate pairs...
UTF-8 is the way to go and is stored efficiently. One could argue that we
should have some utf8 iterators or things like that. But std::string is
still a good candidate for string manipulations.
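
To make the size point concrete, here is a small sketch. The helper name count_code_points is made up for illustration and assumes well-formed UTF-8; it is not part of std:: or of Boost.Text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-8 string by skipping
// continuation bytes (those of the form 10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is done.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// "é" (U+00E9) as UTF-8, spelled as explicit bytes so the example does
// not depend on the encoding of the source file.
const std::string e_acute_utf8 = "\xC3\xA9";

// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs a
// surrogate pair (two 16-bit code units) to represent it.
const std::u16string emoji_utf16 = u"\U0001F600";
```

With these definitions, e_acute_utf8.size() reports 2 bytes for one code point, and emoji_utf16.size() reports 2 code units for one code point, so neither encoding gives a "one element per character" guarantee.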
I agree that UTF-8 is the way to go (and as I think you've seen, the
library reflects that).  However, UTF-8 encoding is only part of the
story.  There is also normalization.  If you use UTF-8-in-std::strings,
normalization will not be enforced.  (Neither will UTF-8 encoding, but
that's less of a problem if you always intend to produce replacement
characters for broken UTF-8.)  Most users will want a type that enforces
normalization as a class invariant.  Those that do not have the tools --
the algorithms and iterators in the Unicode layer -- to do that in a
std::string if they want.
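
As a rough illustration of the normalization issue, using only std::string and no Boost.Text calls: the same user-perceived "é" can be stored precomposed or decomposed, and a plain byte-wise comparison treats the two as different. The names below are invented for the example.

```cpp
#include <cassert>
#include <string>

// Precomposed form (as in NFC): U+00E9 LATIN SMALL LETTER E WITH ACUTE.
const std::string composed = "\xC3\xA9";

// Decomposed form (as in NFD): U+0065 'e' followed by
// U+0301 COMBINING ACUTE ACCENT.
const std::string decomposed = "e\xCC\x81";

// std::string equality compares bytes only; it knows nothing about
// canonical equivalence between the two forms above.
bool same_bytes(const std::string& a, const std::string& b)
{
    return a == b;
}
```

Both strings are valid UTF-8 and render as the same character, yet same_bytes(composed, decomposed) is false. A type that keeps its contents in one normalization form avoids this mismatch; with std::string, the caller has to normalize before comparing.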
...
3. Miscellaneous Limitations
Not thread-safe being an issue? Thank god it is not. Imagine the
overhead of a threadsafe version of a string. The purpose of a library
is not to be threadsafe on every object. This has to be on the user side.
I don't think all string types should be threadsafe, but having a
threadsafe option is nice.  That was not an explicit goal of adding ropes,
but it is a nice side-effect of the choice I made for how to implement the
ropes in Boost.Text.
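
This message does not describe how the ropes are implemented, so the following is only a generic sketch, not Boost.Text's design. It shows one common reason a rope can be safe to share across threads: if segments are immutable and reference-counted, "modifying" a rope builds a new one and never touches data that another thread may be reading. The class toy_rope and its members are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy rope: a sequence of immutable, shared segments.
// This is an illustration only, not the Boost.Text rope.
class toy_rope {
public:
    toy_rope() = default;

    explicit toy_rope(std::string s)
    {
        segments_.push_back(
            std::make_shared<const std::string>(std::move(s)));
    }

    // Returns a new rope with s appended. *this is left unchanged and
    // the existing segments are shared, not copied.
    toy_rope append(std::string s) const
    {
        toy_rope result(*this);
        result.segments_.push_back(
            std::make_shared<const std::string>(std::move(s)));
        return result;
    }

    // Flatten the segments into a single std::string.
    std::string str() const
    {
        std::string out;
        for (const auto& seg : segments_)
            out += *seg;
        return out;
    }

private:
    std::vector<std::shared_ptr<const std::string>> segments_;
};
```

Because append() is const and the segments are const, several threads can read or extend copies of the same rope concurrently; std::shared_ptr's reference count is updated atomically. The cost is paid only by users who choose this type, which matches the idea of thread safety as an option rather than a property of every string.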
...
That said, I really hope for better Unicode support in std:: in the
near future. Your library is well designed and the API is clean, I hope it
could be added in Boost :-).
Thanks, me too. :)

Zach