Re: [boost] [Boost-users] Interest in a Unicode library for Boost?

1 Nov 2019

Mathias pointed out that I sent this just to him.  So I'm replying again to
get this onto the list.  Sorry for the noise.

On Fri, Nov 1, 2019 at 11:09 AM Zach Laine <whatwasthataddress@gmail.com>
wrote:
...
On Fri, Nov 1, 2019 at 6:35 AM Mathias Gaunard <
mathias.gaunard@ens-lyon.org> wrote:
...
On Sat, 26 Oct 2019 at 02:11, Zach Laine via Boost-users
<boost-users@lists.boost.org> wrote:
...
About 14 months ago I posted the same thing.  There was significant
work that needed to be done to Boost.Text (the proposed library), and I was
a bit burned out.
...
Now I've managed to make the necessary changes, and I feel the library
is ready for review, if there is interest.
...
This library, in part, is something I want to standardize.
It started as a better string library for namespace "std2", with
minimal Unicode support.  Though "std2" will almost certainly never happen
now, those string types are still in there, and the library has grown to
also include all the Unicode features most users will ever need.
...
Github: https://github.com/tzlaine/text
Online docs: https://tzlaine.github.io/text
I would start by removing the superlative statements about Unicode
being "hard" or "crazy".
It's not that complicated compared to the actual hard problems that
software engineers solve every day. The only thing is that people
misunderstand what the scope of Unicode is: it's not just an encoding,
it's a database and a set of algorithms (relying on said database)
to facilitate natural text processing of arbitrary scripts, and it makes
compromises to integrate with existing industry practices prior to all
those scripts being brought together under the same umbrella.
Right. Unicode encodes all natural languages that anyone has taken the
time to put into Unicode.  I stand by the implication that natural
languages are crazy.
...
Now the string/container/memory management, this is quite irrelevant.
That sort of stuff has nothing to do with Unicode and I certainly do
not want some Unicode library to mess with the way I am organizing how
my data is stored in memory.
Your rope etc. containers belong in a completely independent library.
So then maybe don't use those parts?  They're independent; you don't have
to use them to use the Unicode algorithms.
...
What's important is providing an efficient Unicode character database,
and implementing the algorithms in a way that is generic, working for
arbitrary ranges and being able to be lazily evaluated (i.e. range
adaptors).
I already did all that work more than 10 years ago as a two-month GSoC
project, though there are some limitations since at that time ranges
and range adaptors were still fairly new ideas for C++. It does
however provide a generic framework to define arbitrary algorithms
that can be evaluated either lazily or eagerly.
Clearly you are more capable than I am.  It took me a lot longer to do
than 2 months.  Why did you never submit this for a Boost review?  You were
thinking about it, ~10 years ago, but you never did....
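
To make the lazy-versus-eager distinction in this exchange concrete, here is a minimal C++17 sketch. It is not code from Boost.Text or from the GSoC library; `lazy_utf8_decoder` and `decode_all` are hypothetical names, and the decoder assumes well-formed UTF-8 (a real one must validate its input).

```cpp
#include <iterator>
#include <optional>
#include <string>

// Hypothetical illustration type (not from Boost.Text or the GSoC library).
// Decodes UTF-8 from any iterator pair, one code point per call.
// Assumes well-formed input; a real decoder must validate.
template <typename It>
class lazy_utf8_decoder {
public:
    lazy_utf8_decoder(It first, It last) : it_(first), last_(last) {}

    // Returns the next code point, or std::nullopt at the end of input.
    std::optional<char32_t> next() {
        if (it_ == last_) return std::nullopt;
        const unsigned char lead = static_cast<unsigned char>(*it_++);
        // Number of continuation bytes implied by the lead byte.
        int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        // Payload bits of the lead byte: 7, 5, 4, or 3 of them.
        char32_t cp = static_cast<char32_t>(
            extra == 0 ? lead : (lead & (0x3F >> extra)));
        for (; extra > 0 && it_ != last_; --extra) {
            const unsigned char cont = static_cast<unsigned char>(*it_++);
            cp = (cp << 6) | (cont & 0x3Fu);
        }
        return cp;
    }

private:
    It it_;
    It last_;
};

// Eager evaluation built on the lazy decoder: drain it into a container.
template <typename Range>
std::u32string decode_all(const Range& r) {
    lazy_utf8_decoder<decltype(std::begin(r))> dec(std::begin(r), std::end(r));
    std::u32string out;
    while (auto cp = dec.next()) out.push_back(*cp);
    return out;
}
```

The lazy form does no work until `next()` is called, so a caller can stop early; the eager form is the same algorithm run to completion. Because both are templates over an iterator pair, they are not tied to any particular container.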
...
To be honest I can't say I find your library to be much of an
improvement, at least in terms of usability, since the programming
interface seems more constrained (why don't things work with arbitrary
ranges rather than these "text" containers)
They do, of course.  I'm not sure why it is you think otherwise.
...
and verbose (just look at
the code to do transcoding with iterators),
Are you referring to the verbosity of:
char const * some_utf8 = /* ... */ ;
out = std::ranges::copy(boost::text::as_utf32(some_utf8), out);
, or:
out = boost::text::transcode_utf_8_to_32(utf8_first, utf8_last, out);
, or something else?
...
the set of features is
quite small,
That is quite intentional.  I want to standardize *basic* Unicode
support.  I feel that what I have in Boost.Text is the basic set that users
will need, just to support languages or formatting conventions that are not
common in their favorite environment.  For instance, today there is no
standard way of taking UTF-8 and turning it into UTF-16, or vice versa;
this library is intended to work at that level.  That is, it is intended to
fill in needless gaps in Unicode support that exist in C++ -- gaps that no
other major language besides C has.  It is specifically not intended to
replace all ICU functionality.  Do you have specific things in mind that
you think ~90% of Unicode-aware C++ users will need?  Note that I did not
say 100%.
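
As an illustration of the level of functionality being discussed, here is a minimal hand-written UTF-8 to UTF-16 conversion in C++17. It is a sketch only, not the Boost.Text implementation: `append_utf16` and `utf8_to_utf16` are hypothetical names, and the error policy (one U+FFFD per offending byte) is a simplifying assumption rather than the replacement practice the Unicode Standard recommends.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of Boost.Text): append one code point as
// UTF-16, using a surrogate pair for code points above U+FFFF.
inline void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Hypothetical helper: convert UTF-8 to UTF-16. Each byte that cannot start
// a valid sequence is replaced by U+FFFD (an assumed, simplified policy).
inline std::u16string utf8_to_utf16(std::string_view in) {
    constexpr char16_t replacement = 0xFFFD;
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min = 0;  // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else { out.push_back(replacement); ++i; continue; }

        bool ok = i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (b & 0x3Fu);
        }
        // Reject truncated, overlong, surrogate, and out-of-range sequences.
        if (!ok || cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(replacement);
            ++i;
            continue;
        }
        append_utf16(out, cp);
        i += len;
    }
    return out;
}
```

Most of the code is validation rather than bit-shuffling, which is part of why a vetted library facility is preferable to each project writing its own.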
...
and that the database itself is not even accessible,
That's also intentional.  Another goal of the library is to make Unicode
as simple as possible for naive users who just want to do the basics.  If I
find requests for any new feature that has a compelling use case, I'll add
that.
...
and
last I remember your implementation was ridiculously bloated in size.
I don't consider 1.5MB for a database containing all human languages in
widespread use on computers to be a ridiculous size, but YMMV.
...
It also doesn't provide the ability to do fast substring search, which
you'd typically do by searching for a substring at the character
encoding level and then eliminating matches that do not fall on a
satisfying boundary, instead suggesting to do the search at the
grapheme level which is much slower, and the facility to test for
boundary isn't provided anyway.
I honestly don't know what you mean here.  If you use the text::text or
text::string types, those are just contiguous sequences of bits, like a
std::vector or std::string.  text::text exposes iterators to those bits
which can be used to get grapheme, code point, and/or UTF-8 byte views of
the underlying data.  If you are using something other than those two
types, you presumably have access to your own bits in your
own representation.  What prevents you from doing whatever substring search
you like, via std::search(), std::ranges::includes(), or something else?
Boost.Text is not intended as a string algorithms library.
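
The search strategy described in the quoted text (match at the encoding level, then discard matches that do not fall on a boundary) can be sketched in C++17 with `std::search` and a caller-supplied predicate. Everything named here is hypothetical and none of it is Boost.Text API. In particular, `toy_is_boundary` only recognizes the combining marks U+0300..U+036F and is a stand-in for a real grapheme boundary test, which needs the Unicode character database.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <vector>

// Toy stand-in for a grapheme boundary test (an assumption of this sketch).
// Reports "not a boundary" at pos when the bytes at pos encode a combining
// mark in U+0300..U+036F, i.e. CC 80..CC BF or CD 80..CD AF in UTF-8.
inline bool toy_is_boundary(std::string_view s, std::size_t pos) {
    if (pos + 1 >= s.size()) return true;  // too few bytes left for a mark
    const unsigned char b0 = static_cast<unsigned char>(s[pos]);
    const unsigned char b1 = static_cast<unsigned char>(s[pos + 1]);
    const bool combining = (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) ||
                           (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);
    return !combining;
}

// Hypothetical helper: byte-level search over UTF-8, keeping only matches
// whose start and end offsets both satisfy is_boundary.
template <typename BoundaryPred>
std::vector<std::size_t> find_on_boundaries(std::string_view haystack,
                                            std::string_view needle,
                                            BoundaryPred is_boundary) {
    std::vector<std::size_t> hits;
    if (needle.empty()) return hits;
    auto it = haystack.begin();
    while (true) {
        it = std::search(it, haystack.end(), needle.begin(), needle.end());
        if (it == haystack.end()) break;
        const std::size_t first = static_cast<std::size_t>(it - haystack.begin());
        const std::size_t last = first + needle.size();
        if (is_boundary(haystack, first) && is_boundary(haystack, last))
            hits.push_back(first);
        ++it;  // continue searching after the start of this match
    }
    return hits;
}
```

For the UTF-8 bytes of "e", U+0301, space, "e", a search for "e" matches at byte offsets 0 and 4, but the first match is followed by a combining mark, so only offset 4 survives the filter.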
...
I'm pretty sure I made similar comments in the past, but I don't feel
like any of them has been addressed.
I think you're referring to this email you sent in the Boost.Text
interest thread from 14 months ago:
"""
The Unicode library I did as a SoC project in 2009 was significantly
smaller than that and if I recall correctly it has more data than the one
in your library.
Clearly some work can be done here to better optimize the database size.
"""
I did make it a bit smaller.  The other comments are new.
Zach

Zach Laine
