Interest in a Unicode library for Boost?


Zach Laine

26 Oct 2019

1:11 a.m.

About 14 months ago I posted the same thing. There was significant work that needed to be done to Boost.Text (the proposed library), and I was a bit burned out.

Now I've managed to make the necessary changes, and I feel the library is ready for review, if there is interest.

This library, in part, is something I want to standardize.

It started as a better string library for namespace "std2", with minimal Unicode support. Though "std2" will almost certainly never happen now, those string types are still in there, and the library has grown to also include all the Unicode features most users will ever need.

Github: https://github.com/tzlaine/text
Online docs: https://tzlaine.github.io/text

If you care about portable Unicode support, or even addressing the embarrassment of being the only major production language with next to no Unicode support, please have a look and provide feedback.

I gave a talk about this at C++Now in May 2018, and now it's a bit out of date, as the library was not then finished. It's three hours, so, y'know, maybe skip it. For completeness' sake:

https://www.youtube.com/watch?v=944GjKxwMBo&index=7&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH

https://www.youtube.com/watch?v=GJ2xMAqCZL8&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH&index=8

Zach


Viktor Sehr

26 Oct 2019

2:29 p.m.

Yes, I was thinking this should be added to Boost when I saw the library for the first time.

/Viktor

On Sat, Oct 26, 2019 at 3:12 AM Zach Laine via Boost <boost@lists.boost.org> wrote:

[snip]

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Klaim - Joël Lamotte

30 Oct 2019

12:58 p.m.

New subject: [Boost-users] Interest in a Unicode library for Boost?

On Sat, 26 Oct 2019 at 03:11, Zach Laine via Boost-users < boost-users@lists.boost.org> wrote:

[snip]

(As a power user) I would be interested in having such a library in Boost, and I had already planned to try Boost.Text in my next C++ project that involves text.

I am following the discussions happening in SG16 and understand that there are some differences from the parts that will be proposed for standardisation (as ThePHD explains in his talk). Honestly, though, both approaches seem to solve my problems, so I'm open to trying both. If Boost.Text is stable today, I'm happy to use it (at least to replace ICU and have a proper Unicode text type).

A. Joël Lamotte

Zach Laine

3:48 p.m.

New subject: [Boost-users] Interest in a Unicode library for Boost?

On Wed, Oct 30, 2019 at 7:59 AM Klaim - Joël Lamotte <mjklaim@gmail.com> wrote:

On Sat, 26 Oct 2019 at 03:11, Zach Laine via Boost-users <boost-users@lists.boost.org> wrote:

[snip]

(As a power user) I would be interested in having such a library in Boost, and I had already planned to try Boost.Text in my next C++ project that involves text.

I am following the discussions happening in SG16 and understand that there are some differences from the parts that will be proposed for standardisation (as ThePHD explains in his talk). Honestly, though, both approaches seem to solve my problems, so I'm open to trying both. If Boost.Text is stable today, I'm happy to use it (at least to replace ICU and have a proper Unicode text type).

Yes, JeanHeyd and I started with very different approaches, but we're converging somewhat.

Zach

Peter Koch Larsen

2:46 p.m.

I saw that talk (on video) and was quite impressed but never got to try your library out. Would be happy to use a library like this! Can you tell me how it relates to the proposed std?

/Peter

On Sat, Oct 26, 2019 at 3:11 AM Zach Laine via Boost <boost@lists.boost.org> wrote:

[snip]

Zach Laine

4:06 p.m.

On Wed, Oct 30, 2019 at 9:47 AM Peter Koch Larsen < peter.koch.larsen@gmail.com> wrote:

...

I saw that talk (on video) and was quite impressed but never got to try your library out. Would be happy to use a library like this! Can you tell me how it relates to the proposed std?

There is no relationship yet. Boost.Text is an experiment meant to gather feedback, which will in turn drive the standardization effort. I believe all large libraries should have Boost-equivalent visibility in the C++ community before being standardized.

That being said, some of my fellow SG-16 members (that's the committee's Unicode Study Group) have started proposing some actual APIs for standardization. The first of these is the most low-level: transcoding (see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1629r0.html). JeanHeyd's approach is more ambitious than mine -- I really only care about Unicode. His proposal covers conversions among all possible encodings, including an API that lets users add implementations for the encodings they care about that are not covered by the standard.

There are no other in-flight Unicode API proposals that I'm aware of.

Zach
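For readers unfamiliar with what "transcoding" means at this level, here is a minimal sketch of UTF-8 to UTF-16 conversion using only the standard library. It is neither Boost.Text's interface nor the API proposed in P1629; the name `utf8_to_utf16` and the policy of replacing each malformed byte with U+FFFD are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper for illustration; not part of Boost.Text or P1629.
// Converts UTF-8 to UTF-16. Each malformed byte is replaced by U+FFFD
// (an assumed error policy; real libraries offer several).
inline std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;  // sequence length implied by the lead byte
        char32_t cp = 0;      // code point being assembled
        char32_t min = 0;     // smallest value allowed for this length

        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }

        // len == 0 means the byte cannot start a sequence.
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;   // not a continuation byte
            else cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (!ok) {
            out.push_back(static_cast<char16_t>(0xFFFD));
            ++i;  // resynchronize one byte at a time
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points beyond the BMP become a surrogate pair.
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

A proposal that covers arbitrary encodings, as described above, would need to generalize both the decode step and the encode step; this sketch hard-codes one pair.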

Glen Fernandes

4:20 p.m.

New subject: [Boost] Interest in a Unicode library for Boost?

This proposed Text library has facilities that perform dynamic allocation, but do not support custom allocators?

i.e. Not even the polymorphic allocators? (even if not the C++ allocator model which would require facilities become templates)

I recall this was the case a while ago, not sure if intentions have changed since.

Glen

Zach Laine

5:09 p.m.

New subject: [Boost] Interest in a Unicode library for Boost?

On Wed, Oct 30, 2019 at 11:20 AM Glen Fernandes <glen.fernandes@gmail.com> wrote:

...

This proposed Text library has facilities that perform dynamic allocation, but do not support custom allocators?

i.e. Not even the polymorphic allocators? (even if not the C++ allocator model which would require facilities become templates)

I recall this was the case a while ago, not sure if intentions have changed since.

Yes, that's right. There are no allocators.

Zach

Mathias Gaunard

1 Nov 2019

11:35 a.m.

New subject: [Boost-users] Interest in a Unicode library for Boost?

On Sat, 26 Oct 2019 at 02:11, Zach Laine via Boost-users <boost-users@lists.boost.org> wrote:

[snip]

Github: https://github.com/tzlaine/text Online docs: https://tzlaine.github.io/text

I would start by removing the superlative statements about Unicode being "hard" or "crazy". It's not that complicated compared to the actual hard problems that software engineers solve every day. The only thing is that people misunderstand what the scope of Unicode is: it's not just an encoding, it's a database and a set of algorithms (relying on said database) to facilitate natural text processing of arbitrary scripts, and it makes compromises to integrate with industry practices that existed before all those scripts were brought together under the same umbrella.

Now the string/container/memory management, this is quite irrelevant. That sort of stuff has nothing to do with Unicode, and I certainly do not want some Unicode library to mess with the way I am organizing how my data is stored in memory. Your rope etc. containers belong in a completely independent library.

What's important is providing an efficient Unicode character database, and implementing the algorithms in a way that is generic, working for arbitrary ranges and being able to be lazily evaluated (i.e. range adaptors).

I already did all that work more than 10 years ago as a two-month GSoC project, though there are some limitations since at that time ranges and range adaptors were still fairly new ideas for C++. It does however provide a generic framework to define arbitrary algorithms that can be evaluated either lazily or eagerly.

To be honest I can't say I find your library to be much of an improvement, at least in terms of usability, since the programming interface seems more constrained (why don't things work with arbitrary ranges rather than these "text" containers) and verbose (just look at the code to do transcoding with iterators), the set of features is quite small, the database itself is not even accessible, and last I remember your implementation was ridiculously bloated in size.

It also doesn't provide the ability to do fast substring search, which you'd typically do by searching for a substring at the character encoding level and then eliminating matches that do not fall on a satisfying boundary. Instead it suggests doing the search at the grapheme level, which is much slower, and the facility to test for a boundary isn't provided anyway.

I'm pretty sure I made similar comments in the past, but I don't feel like any of them have been addressed.
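The search strategy described above (match at the code-unit level, then discard matches that do not end on an acceptable boundary) can be sketched roughly as follows. This is not code from either library. The names `combining_mark_at` and `find_on_boundary` are hypothetical, and the boundary test is a deliberately crude stand-in: it only checks whether the match is followed by a combining mark in U+0300..U+036F, whereas a real implementation would apply the grapheme cluster rules of UAX #29 using the character database. It also assumes well-formed UTF-8 input and a needle that starts with a base character, so the start of a byte-level match is already on a code point boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified stand-in for a real boundary test: reports whether the UTF-8
// bytes at offset i encode a combining mark in U+0300..U+036F
// (encoded as CC 80..CC BF and CD 80..CD AF).
inline bool combining_mark_at(const std::string& s, std::size_t i)
{
    if (i + 1 >= s.size()) return false;
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
    return (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) ||
           (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);
}

// Byte-level search followed by boundary filtering. Returns the byte offset
// of the first match that is not immediately followed by a combining mark,
// or std::string::npos if there is none.
inline std::size_t find_on_boundary(const std::string& haystack,
                                    const std::string& needle)
{
    if (needle.empty()) return std::string::npos;
    std::size_t pos = haystack.find(needle);  // fast code-unit search
    while (pos != std::string::npos) {
        if (!combining_mark_at(haystack, pos + needle.size()))
            return pos;                        // match ends on a boundary
        pos = haystack.find(needle, pos + 1);  // rejected; keep looking
    }
    return std::string::npos;
}
```

For example, searching for "e" in a string that begins with "e" followed by U+0301 (a decomposed "é") skips that first byte-level hit, because accepting it would split the accented letter.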

Mike Gresens

2 Nov 2019

12:49 p.m.

How is the debug support for your new string classes in IDEs? I assume it's a pain in the ass if you can't debug string values...

Mike

4:07 p.m.

I have been meaning to have a closer look at that library for weeks now, but unfortunately I don't seem to find the time to do so. So I'm just writing down some thoughts:

First of all, I highly support the inclusion of this library into Boost (assuming there aren't any major design problems). C++ needs Unicode support and easy text processing, the former being a prerequisite for the latter. The only way we get those things into the standard is if the different approaches get battle tested in anger, in real-life projects.

What the average programmer like me (who is not working on a text editor) imho needs is the ability to read text from one interface, do some basic input validation, normalize it, (potentially) split it up and/or parse parts of it, store other parts through a different interface, retrieve all of that later and send it on through a different text-based interface. During all that I don't want to have to worry too much about whether non-ASCII text got broken in between. And in particular I should not have to worry about the particularities of how Unicode works to do that.

For that, I think it is much more important to have encoding-aware interface types and easy-to-use transcoding functions than a full set of Unicode text processing algorithms. Not that I'm not happy when they are there, but at least for the standard, let's please focus on what standards are for: facilitate and enable components from different third-party vendors to work seamlessly together!

I'm also very happy to see that the library focuses on the 90% cases instead of trying to be flexible enough to be usable for every possible use case and under every possible circumstance.

Best
Mike
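One possible reading of "encoding-aware interface types" is a string wrapper that validates once at the boundary, so that code further in can rely on the invariant. The sketch below is an assumption about what such a type could look like, not anything taken from Boost.Text or a standards proposal; `is_valid_utf8` and `utf8_text` are hypothetical names, and normalization is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Returns true if s is well-formed UTF-8: no stray or missing continuation
// bytes, no overlong forms, no surrogates, nothing above U+10FFFF.
inline bool is_valid_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp, min;
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;                      // byte cannot start a sequence
        if (i + len > s.size()) return false;   // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}

// Hypothetical encoding-aware type: holding a utf8_text means the bytes
// were checked when the object was constructed.
class utf8_text {
public:
    explicit utf8_text(std::string bytes) : bytes_(std::move(bytes))
    {
        if (!is_valid_utf8(bytes_))
            throw std::invalid_argument("input is not well-formed UTF-8");
    }
    const std::string& bytes() const noexcept { return bytes_; }

private:
    std::string bytes_;
};
```

Functions that accept `utf8_text` rather than `std::string` then document their encoding in the signature, which is the kind of interoperability between components that the message argues a standard should provide.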

