On Sat, 26 Oct 2019 at 02:11, Zach Laine via Boost-users
About 14 months ago I posted a similar announcement. There was significant work that still needed to be done on Boost.Text (the proposed library), and I was a bit burned out.
Now I've managed to make the necessary changes, and I feel the library is ready for review, if there is interest.
This library, in part, is something I want to standardize.
It started as a better string library for namespace "std2", with minimal Unicode support. Though "std2" will almost certainly never happen now, those string types are still in there, and the library has grown to also include all the Unicode features most users will ever need.
Github: https://github.com/tzlaine/text
Online docs: https://tzlaine.github.io/text
I would start by removing the superlative statements about Unicode being "hard" or "crazy". It's not that complicated compared to the actual hard problems that software engineers solve every day. The real issue is that people misunderstand the scope of Unicode: it's not just an encoding, it's a database and a set of algorithms (relying on said database) that facilitate natural-text processing of arbitrary scripts, and it makes compromises to integrate with industry practices that predate all those scripts being brought together under the same umbrella.

As for the string/container/memory management, that is quite irrelevant. That sort of thing has nothing to do with Unicode, and I certainly do not want a Unicode library to dictate how my data is organized in memory. Your rope and similar containers belong in a completely independent library. What matters is providing an efficient Unicode character database, and implementing the algorithms in a way that is generic, working on arbitrary ranges and able to be lazily evaluated (i.e. as range adaptors).

I already did all that work more than 10 years ago as a two-month GSoC project, though it has some limitations, since ranges and range adaptors were still fairly new ideas for C++ at the time. It does, however, provide a generic framework for defining arbitrary algorithms that can be evaluated either lazily or eagerly.

To be honest, I can't say I find your library to be much of an improvement, at least in terms of usability: the programming interface seems more constrained (why don't things work with arbitrary ranges rather than these "text" containers?) and verbose (just look at the code to do transcoding with iterators), the feature set is quite small, the database itself is not even accessible, and last I remember your implementation was ridiculously bloated in size.
It also doesn't provide the ability to do fast substring search, which you'd typically implement by searching for the substring at the character-encoding level and then eliminating matches that do not fall on a satisfying boundary. Instead it suggests doing the search at the grapheme level, which is much slower, and the facility to test for a boundary isn't provided anyway. I'm pretty sure I made similar comments in the past, but I don't feel any of them have been addressed.