Interest in Unicode library for Boost?

Zach Laine

23 Sep 2018

4:45 a.m.

I've been working on a Unicode library for submission to Boost, with an eye toward standardizing robust Unicode support for C++.

It started as a better string library for namespace "std2", with minimal Unicode support. Though "std2" may never happen, those string types are still in there, and the library has grown to also include all the Unicode features most users will ever need.

You can find the Github page here: https://github.com/tzlaine/text

You can find the online docs here: https://tzlaine.github.io/text

If you care about portable Unicode support, or even addressing the embarrassment of being the only major production language with next to no Unicode support, please have a look and provide feedback.

I gave a talk about this at C++Now in May, though it's a bit out of date, as the library was not then finished. It's three hours, so, y'know, maybe skip it. For completeness' sake:

https://www.youtube.com/watch?v=944GjKxwMBo&index=7&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH
https://www.youtube.com/watch?v=GJ2xMAqCZL8&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH&index=8

Zach


Andrey Semashev

23 Sep

9:57 a.m.

On 9/23/18 7:45 AM, Zach Laine via Boost wrote:

...

I've been working on a Unicode library for submission to Boost, with an eye toward standardizing robust Unicode support for C++.

It started as a better string library for namespace "std2", with minimal Unicode support. Though "std2" may never happen, those string types are still in there, and the library has grown to also include all the Unicode features most users will ever need.

You can find the Github page here: https://github.com/tzlaine/text

You can find the online docs here: https://tzlaine.github.io/text

If you care about portable Unicode support, or even addressing the embarrassment of being the only major production language with next to no Unicode support, please have a look and provide feedback.

I gave a talk about this at C++Now in May, though it's a bit out of date, as the library was not then finished. It's three hours, so, y'know, maybe skip it. For completeness' sake:

https://www.youtube.com/watch?v=944GjKxwMBo&index=7&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH https://www.youtube.com/watch?v=GJ2xMAqCZL8&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH&index=8

I think a Unicode library is very much needed in Boost. Out of curiosity, it looks like you implemented Unicode algorithms yourself. Why not use a specialized library, like ICU?

Vinnie Falco

3:17 p.m.

On Sun, Sep 23, 2018 at 2:57 AM Andrey Semashev via Boost <boost@lists.boost.org> wrote:

...

Why not use a specialized library, like ICU?

The moment I see that a potential library or application uses ICU I give it a hard pass. Regards

Zach Laine

3:37 p.m.

On Sun, Sep 23, 2018 at 4:57 AM Andrey Semashev via Boost < boost@lists.boost.org> wrote:

...

On 9/23/18 7:45 AM, Zach Laine via Boost wrote:

I think a Unicode library is very much needed in Boost.

Out of curiosity, it looks like you implemented Unicode algorithms yourself. Why not use a specialized library, like ICU?

It's partly a question of the size of ICU, which is several megabytes, whereas Boost.Text is only 1.2-2MB depending on your compiler.

I built HEAD of ICU just now, and here are the resulting .so's:

-rwxrwxr-x 1 tzlaine tzlaine 26M Sep 23 10:29 ./lib/libicudata.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 3.6M Sep 23 10:28 ./lib/libicui18n.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 65K Sep 23 10:28 ./lib/libicuio.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 66K Sep 23 10:28 ./lib/libiculx.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 234K Sep 23 10:28 ./lib/libicutu.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 2.2M Sep 23 10:28 ./lib/libicuuc.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 5.3K Sep 23 10:28 ./stubdata/libicudata.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 83K Sep 23 10:28 ./tools/ctestfw/libicutest.so.62.1

So, I don't know how many of those you need, but if you require data (and you do!), 26MB is a lot. Note that I put collation data into headers, so your runtime memory footprint might be much larger than 1.2-2MB, but the minimum requirement is still only that small. Requiring the user to pay more than this minimum is a classic "Don't pay for what you don't use" violation.

Another thing is that ICU allocates memory all over the place, in some cases needlessly.

ICU also has IMO a poor (too complicated and confusing) API; there are way too many types and functions, and the types that are emphasized are often the wrong ones, like UTF-16 strings. The algorithms should be C++-style algorithms if this is something we're going to standardize.

Zach

Stefan Seefeld

3:55 p.m.

On 9/23/18 11:37 AM, Zach Laine via Boost wrote:

...

On Sun, Sep 23, 2018 at 4:57 AM Andrey Semashev via Boost < boost@lists.boost.org> wrote:

...
On 9/23/18 7:45 AM, Zach Laine via Boost wrote:

I think a Unicode library is very much needed in Boost.

Out of curiosity, it looks like you implemented Unicode algorithms yourself. Why not use a specialized library, like ICU?

It's partly a question of the size of ICU, which is several megabytes, whereas Boost.Text is only 1.2-2MB depending on your compiler.

Ideally, a "Unicode library for Boost" would offer an API, and the question of what backend is used would be an implementation detail.

While I'm very enthusiastic about proper Unicode support being added to C++, I have a hard time with the tendency in the Boost community to reinvent wheels, i.e. the NIH syndrome. A good API / library design should allow me to plug in existing implementations (for standard functionality that's already implemented many times before), as a matter of code reuse and maintainability.

Best,
Stefan

-- ...ich hab' noch einen Koffer in Berlin...

Daniela Engert

4:46 p.m.

On 23.09.2018 at 17:55, Stefan Seefeld via Boost wrote:

...

On 9/23/18 11:37 AM, Zach Laine via Boost wrote:

...
On Sun, Sep 23, 2018 at 4:57 AM Andrey Semashev via Boost < boost@lists.boost.org> wrote:

...
On 9/23/18 7:45 AM, Zach Laine via Boost wrote:

I think a Unicode library is very much needed in Boost. Why not use a specialized library, like ICU?

It's partly a question of the size of ICU, which is several megabytes, whereas Boost.Text is only 1.2-2MB depending on your compiler.

Ideally, a "Unicode library for Boost" would offer an API, and the question of what backend is used would be an implementation detail.

Right. For example, Windows 10 comes with ICU built in, ready for consumption: https://docs.microsoft.com/en-us/windows/desktop/intl/international-componen...

Ciao
Dani

Zach Laine

6:42 p.m.

On Sun, Sep 23, 2018 at 10:55 AM Stefan Seefeld via Boost < boost@lists.boost.org> wrote:

...

On 9/23/18 11:37 AM, Zach Laine via Boost wrote:

...
On Sun, Sep 23, 2018 at 4:57 AM Andrey Semashev via Boost < boost@lists.boost.org> wrote:

...
On 9/23/18 7:45 AM, Zach Laine via Boost wrote:

I think a Unicode library is very much needed in Boost.

Out of curiosity, it looks like you implemented Unicode algorithms yourself. Why not use a specialized library, like ICU?

It's partly a question of the size of ICU, which is several megabytes, whereas Boost.Text is only 1.2-2MB depending on your compiler.

Ideally, a "Unicode library for Boost" would offer an API, and the question of what backend is used would be an implementation detail. While I'm very enthusiastic about proper Unicode support being added to C++, I have a hard time with the tendency in the Boost community to reinvent wheels, i.e. the NIH syndrome. A good API / library design should allow me to plug in existing implementations (for standard functionality that's already implemented many times before), as a matter of code reuse and maintainability.

I agree with this in the abstract. In this case, I don't know of any back end that would work except for ICU. As for having been implemented many times, I'm not aware of any other implementations of all the named Unicode algorithms besides ICU.

My hope is that my implementation is more palatable to most users than the ICU one. It certainly is for me.

Zach

Andrey Semashev

5:57 p.m.

On 9/23/18 6:37 PM, Zach Laine via Boost wrote:

...

On Sun, Sep 23, 2018 at 4:57 AM Andrey Semashev via Boost < boost@lists.boost.org> wrote:

...
On 9/23/18 7:45 AM, Zach Laine via Boost wrote:

I think a Unicode library is very much needed in Boost.

Out of curiosity, it looks like you implemented Unicode algorithms yourself. Why not use a specialized library, like ICU?

It's partly a question of the size of ICU, which is several megabytes, whereas Boost.Text is only 1.2-2MB depending on your compiler.

I built HEAD of ICU just now, and here are the resulting .so's:

-rwxrwxr-x 1 tzlaine tzlaine 26M Sep 23 10:29 ./lib/libicudata.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 3.6M Sep 23 10:28 ./lib/libicui18n.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 65K Sep 23 10:28 ./lib/libicuio.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 66K Sep 23 10:28 ./lib/libiculx.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 234K Sep 23 10:28 ./lib/libicutu.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 2.2M Sep 23 10:28 ./lib/libicuuc.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 5.3K Sep 23 10:28 ./stubdata/libicudata.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 83K Sep 23 10:28 ./tools/ctestfw/libicutest.so.62.1

So, I don't know how many of those you need, but if you require data (and you do!), 26MB is a lot. Note that I put collation data into headers, so your runtime memory footprint might be much larger than 1.2-2MB, but the minimum requirement is still only that small. Requiring the user to pay more than this minimum is a classic "Don't pay for what you don't use" violation.

Runtime memory footprint is actually more important. If I have 10 processes running on the machine that use ICU then I'm only paying its price once while in your case I would be paying it 10 times. Given that ICU is rather well adopted, this is not an unrealistic benefit. So, if not using ICU you may want to consider if at least some of the runtime data can be put in constant sections of a shared library.
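To make the suggestion about constant sections concrete, here is a minimal sketch of a lookup table that a typical toolchain can place in read-only data. The record layout, table contents, and function name are hypothetical and are not taken from Boost.Text. The assumption is that a const table with constant initializers and no pointers needs no load-time relocations, so when it is compiled into a shared library its pages can usually be mapped read-only and shared between processes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical record layout; Boost.Text's real tables are organized differently.
struct property_range {
    std::uint32_t first;    // first code point in the range (inclusive)
    std::uint32_t last;     // last code point in the range (inclusive)
    std::uint8_t property;  // made-up property id; 0 is reserved for "none"
};

// In a real build this definition would live in a single .cpp file of a
// shared library, with only a declaration in the public header. The table
// holds plain integers and no pointers, so it needs no relocations and can
// typically stay in a read-only section.
const property_range property_table[] = {
    {0x0041, 0x005A, 1},  // Latin capital letters
    {0x0061, 0x007A, 2},  // Latin small letters
    {0x0300, 0x036F, 3},  // combining diacritical marks
};

const std::size_t property_table_size =
    sizeof(property_table) / sizeof(property_table[0]);

// Binary search over the sorted, non-overlapping ranges.
// Returns 0 when the code point is not covered by any range.
std::uint8_t lookup_property(std::uint32_t cp) noexcept {
    std::size_t lo = 0;
    std::size_t hi = property_table_size;
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (cp < property_table[mid].first) {
            hi = mid;
        } else if (cp > property_table[mid].last) {
            lo = mid + 1;
        } else {
            return property_table[mid].property;
        }
    }
    return 0;
}
```

By contrast, data defined in headers is compiled into every binary that includes it, which is the per-process cost described above.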

...

ICU also has IMO a poor (too complicated and confusing) API; there are way too many types and functions, and the types that are emphasized are often the wrong ones, like UTF-16 strings. The algorithms should be C++-style algorithms if this is something we're going to standardize.

Its API could be wrapped inside your library so that users never have to interface with it directly. Nevertheless, thanks for the answer, and I still think a Unicode library like yours is very much needed.

Mathias Gaunard

9:02 p.m.

On 23 September 2018 at 16:37, Zach Laine via Boost <boost@lists.boost.org> wrote:

...

It's partly a question of the size of ICU, which is several megabytes, whereas Boost.Text is only 1.2-2MB depending on your compiler.

The Unicode library I did as a SoC project in 2009 was significantly smaller than that and if I recall correctly it has more data than the one in your library. Clearly some work can be done here to better optimize the database size.

Peter Dimov

3:40 p.m.

Zach Laine wrote:

...

You can find the online docs here: https://tzlaine.github.io/text

I find the "string" layer a hard sell.

First, realistically, nobody is going to use it over std::string, especially when its selling point is "we make your code not compile by removing functions from std::string".

Second, some of the removed functions are part of the Sequence requirements. Hard to see the benefits of that removal; string<Ch> and vector<Ch> being compatible on a concept level is useful.

This of course in no way diminishes the utility of the library. If its opinionated `string` is part of the price of admission, so be it. I'm just saying. :-)

Zach Laine

3:49 p.m.

On Sun, Sep 23, 2018 at 10:40 AM Peter Dimov via Boost < boost@lists.boost.org> wrote:

...

Zach Laine wrote:

...
You can find the online docs here: https://tzlaine.github.io/text

I find the "string" layer a hard sell. First, realistically, nobody is going to use it over std::string, especially when its selling point is "we make your code not compile by removing functions from std::string". Second, some of the removed functions are part of the Sequence requirements. Hard to see the benefits of that removal; string<Ch> and vector<Ch> being compatible on a concept level is useful.

string is not and probably never will be a SequenceContainer, but I take your point about text::string being a breaking change. The original impetus for the whole library was a rethink of 'std::string' for a possible 'std2::string'. 'std2' is probably DOA, given LEWG's over-my-dead-body reaction to the idea. So, the string layer stuff is still there, but its usefulness is now probably restricted to its interoperation with unencoded_rope.

...

This of course in no way diminishes the utility of the library. If its opinionated `string` is part of the price of admission, so be it. I'm just saying. :-)

Other string types, including std::string, are interoperable with most of Boost.Text, via concept-accepting overloads in the string- and text-layer types. Also, you can get away with never using text::string at all if you want, for instance if you only use the text-layer types and/or the Unicode layer.

Zach
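As a rough illustration of what a "concept-accepting" interface can look like, the sketch below takes any range of char-like code units, so std::string and std::vector<char> both work. The function name is hypothetical and this is not Boost.Text's actual interface; it also assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper, not part of Boost.Text. Counts the code points in a
// range of UTF-8 code units by counting every byte that is not a
// continuation byte (bit pattern 10xxxxxx). Assumes well-formed UTF-8.
template <typename CharRange>
std::size_t count_code_points(const CharRange& r) {
    std::size_t n = 0;
    for (auto it = std::begin(r); it != std::end(r); ++it) {
        const unsigned char b = static_cast<unsigned char>(*it);
        if ((b & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

Because the only requirement on the argument is that std::begin and std::end work on it, the same function serves callers who never adopt a library-specific string type.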

Peter Dimov

3:54 p.m.

Zach Laine wrote:

...

So, the string layer stuff is still there, but its usefulness is now probably restricted to its interoperation with unencoded_rope.

Also, `string` stealing the buffer of `string_builder`, which can't be implemented in terms of std::string. Either way, I still think that removing push_back is taking things a bit too far.
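A minimal sketch of the buffer-stealing idea follows, using two hypothetical classes that are not Boost.Text's own. It relies on the string type being able to adopt a heap buffer it did not allocate, which std::string offers no interface for.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Hypothetical string type that can adopt an existing heap buffer.
class owned_string {
public:
    owned_string() = default;
    // Take ownership of buf without copying its contents.
    owned_string(std::unique_ptr<char[]> buf, std::size_t size) noexcept
        : data_(std::move(buf)), size_(size) {}
    const char* data() const noexcept { return data_.get(); }
    std::size_t size() const noexcept { return size_; }

private:
    std::unique_ptr<char[]> data_;
    std::size_t size_ = 0;
};

// Hypothetical builder that accumulates characters and then hands its
// buffer to an owned_string.
class string_builder_sketch {
public:
    void append(const char* s, std::size_t n) {
        if (n == 0) return;
        if (size_ + n > cap_) {
            // Grow geometrically, but always by enough to fit the new data.
            const std::size_t new_cap = std::max(cap_ * 2, size_ + n);
            std::unique_ptr<char[]> grown(new char[new_cap]);
            if (size_ != 0) std::memcpy(grown.get(), data_.get(), size_);
            data_ = std::move(grown);
            cap_ = new_cap;
        }
        std::memcpy(data_.get() + size_, s, n);
        size_ += n;
    }
    const char* data() const noexcept { return data_.get(); }
    std::size_t size() const noexcept { return size_; }

    // Move the buffer into a string; the builder is left empty.
    owned_string release() noexcept {
        const std::size_t n = size_;
        size_ = 0;
        cap_ = 0;
        return owned_string(std::move(data_), n);
    }

private:
    std::unique_ptr<char[]> data_;
    std::size_t size_ = 0;
    std::size_t cap_ = 0;
};
```

After release(), the string's data() pointer is the same one the builder was writing into, so no characters are copied at the hand-off.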

Peter Dimov

4:10 p.m.

...

Either way, I still think that removing push_back is taking things a bit too far.

What is the difference between `text::string_view` and `std::string_view`? (We already have `boost::string_view` in Utility too.)

Zach Laine

6:44 p.m.

On Sun, Sep 23, 2018 at 11:11 AM Peter Dimov via Boost < boost@lists.boost.org> wrote:

...

...
Either way, I still think that removing push_back is taking things a bit too far.

What is the difference between `text::string_view` and `std::string_view`? (We already have `boost::string_view` in Utility too.)

At the highest level of abstraction, there is no distinction. As you look at implementation details, though, you see that text::string_view is stripped down in the same way that text::string is, and that it interoperates with the text-layer types gracefully.

Zach

Zach Laine

6:39 p.m.

On Sun, Sep 23, 2018 at 10:55 AM Peter Dimov via Boost < boost@lists.boost.org> wrote:

...

Zach Laine wrote:

...
So, the string layer stuff is still there, but its usefulness is now probably restricted to its interoperation with unencoded_rope.

Also, `string` stealing the buffer of `string_builder`, which can't be implemented in terms of std::string.

True, but I don't know how to implement a string_builder that interoperates with std::string (without standardizing it, of course).

...

Either way, I still think that removing push_back is taking things a bit too far.

Fair enough. As crazy as it sounds, I had to add resize() at some point too. I may have gone too far, as you say. :)

However, it would still be my preference that SequenceContainer support front(), back(), and push_back() as algorithms, not members. I see no value in dragging that member API around with us for every new SequenceContainer we introduce -- at least not for new code. Having those functions as algorithms still allows them to be used generically.

Zach
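To show what front(), back(), and push_back() as algorithms might look like, here is a small sketch. The namespace and function names are hypothetical, and the only assumptions about the container are that std::begin and std::end work on it and that it has a member insert(position, value), which holds for both std::string and std::vector.

```cpp
#include <iterator>
#include <string>
#include <utility>
#include <vector>

namespace seq_sketch {  // hypothetical namespace, for illustration only

// First element of a non-empty sequence.
template <typename Container>
auto front(Container& c) -> decltype(*std::begin(c)) {
    return *std::begin(c);
}

// Last element of a non-empty sequence with bidirectional iterators.
template <typename Container>
auto back(Container& c) -> decltype(*std::prev(std::end(c))) {
    return *std::prev(std::end(c));
}

// Append one element using only the container's insert(position, value).
template <typename Container, typename T>
void push_back(Container& c, T&& value) {
    c.insert(std::end(c), std::forward<T>(value));
}

}  // namespace seq_sketch
```

Generic code written against these functions would work with a container that drops the member versions, as long as it keeps iterators and insert().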

Robert Ramey

4:04 p.m.

On 9/22/18 9:45 PM, Zach Laine via Boost wrote:

...

I've been working on a Unicode library for submission to Boost, with an eye toward standardizing robust Unicode support for C++.

Hmmm isn't there a lot of overlap with Boost.Locale: https://www.boost.org/doc/libs/1_48_0/libs/locale/doc/html/charset_handling....

Also, in boost detail there's a UTF facet which has been in use for many, many years. What would be the relationship with that?

Robert Ramey

Zach Laine

7:07 p.m.

On Sun, Sep 23, 2018 at 11:04 AM Robert Ramey via Boost < boost@lists.boost.org> wrote:

...

On 9/22/18 9:45 PM, Zach Laine via Boost wrote:

...
I've been working on a Unicode library for submission to Boost, with an eye toward standardizing robust Unicode support for C++.

Hmmm isn't there a lot of overlap with Boost.Locale:

https://www.boost.org/doc/libs/1_48_0/libs/locale/doc/html/charset_handling....

Not that I can tell. They both operate using UTF encodings, but as I understand it Boost.Locale concerns itself heavily (exclusively?) with iostreams, whereas Boost.Text has no relation to iostreams except for a few stream inserters.

...

Also, in boost detail there's a UTF facet which has been in use for many, many years. What would be the relationship with that?

None at all. I should note that there are two or three different UTF-8 <-> UTF-32 standalone transcoding implementations in Boost, but as far as I can tell (I only tried with two of them) none of them reports encoding errors in the manner (a replacement character, not an exception) and at the locations within the code unit stream recommended by Unicode.

Zach
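For readers unfamiliar with the error-handling style mentioned here, the following is a minimal sketch of a UTF-8 to UTF-32 decoder that never throws and instead emits U+FFFD for ill-formed input. It is not Boost.Text's implementation, and the function name is hypothetical. It follows my reading of the Unicode Standard's recommended practice of substituting one U+FFFD per maximal ill-formed subpart, using the table of well-formed UTF-8 byte sequences to restrict the second byte after lead bytes E0, ED, F0, and F4.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 into code points, replacing each
// ill-formed subsequence with U+FFFD instead of throwing.
std::vector<char32_t> decode_utf8_lossy(const std::string& in) {
    std::vector<char32_t> out;
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        ++i;
        if (b0 < 0x80) {  // ASCII
            out.push_back(b0);
            continue;
        }

        int needed = 0;                       // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;   // allowed range for next byte
        char32_t cp = 0;
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            needed = 1;
            cp = b0 & 0x1F;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            needed = 2;
            cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0;  // excludes overlong forms
            if (b0 == 0xED) hi = 0x9F;  // excludes surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            needed = 3;
            cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90;  // excludes overlong forms
            if (b0 == 0xF4) hi = 0x8F;  // excludes values above U+10FFFF
        } else {
            // Stray continuation byte, or a byte that can never start a
            // well-formed sequence (C0, C1, F5..FF).
            out.push_back(0xFFFD);
            continue;
        }

        bool ok = true;
        for (int k = 0; k < needed; ++k) {
            if (i >= n) {  // input ended in the middle of a sequence
                ok = false;
                break;
            }
            const unsigned char b = static_cast<unsigned char>(in[i]);
            if (b < lo || b > hi) {
                // Do not consume b; it is re-examined as a possible lead byte.
                ok = false;
                break;
            }
            cp = (cp << 6) | (b & 0x3F);
            ++i;
            lo = 0x80;  // only the first continuation byte is restricted
            hi = 0xBF;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

The key design point is that a byte which breaks a sequence is not swallowed: decoding resumes at that byte, so a valid character immediately after an error is still recovered.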

