Interest in a Unicode library for Boost?

Zach Laine

26 Oct 2019

1:11 a.m.

About 14 months ago I posted the same thing. There was significant work that needed to be done to Boost.Text (the proposed library), and I was a bit burned out.

Now I've managed to make the necessary changes, and I feel the library is ready for review, if there is interest.

This library, in part, is something I want to standardize.

It started as a better string library for namespace "std2", with minimal Unicode support. Though "std2" will almost certainly never happen now, those string types are still in there, and the library has grown to also include all the Unicode features most users will ever need.

Github: https://github.com/tzlaine/text
Online docs: https://tzlaine.github.io/text

If you care about portable Unicode support, or even addressing the embarrassment of being the only major production language with next to no Unicode support, please have a look and provide feedback.

I gave a talk about this at C++Now in May 2018, and now it's a bit out of date, as the library was not then finished. It's three hours, so, y'know, maybe skip it. For completeness' sake:

https://www.youtube.com/watch?v=944GjKxwMBo&index=7&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH
https://www.youtube.com/watch?v=GJ2xMAqCZL8&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH&index=8

Zach


Robert Ramey

26 Oct

4:08 a.m.

On 10/25/19 6:11 PM, Zach Laine via Boost-users wrote:

...

About 14 months ago I posted the same thing. There was significant work that needed to be done to Boost.Text (the proposed library), and I was a bit burned out.

Now I've managed to make the necessary changes, and I feel the library is ready for review, if there is interest.

This library, in part, is something I want to standardize.

It started as a better string library for namespace "std2", with minimal Unicode support. Though "std2" will almost certainly never happen now, those string types are still in there, and the library has grown to also include all the Unicode features most users will ever need.

Github: https://github.com/tzlaine/text Online docs: https://tzlaine.github.io/text

If you care about portable Unicode support, or even addressing the embarrassment of being the only major production language with next to no Unicode support, please have a look and provide feedback.

I gave a talk about this at C++Now in May 2018, and now it's a bit out of date, as the library was not then finished. It's three hours, so, y'know, maybe skip it. For completeness' sake:

https://www.youtube.com/watch?v=944GjKxwMBo&index=7&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH https://www.youtube.com/watch?v=GJ2xMAqCZL8&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH&index=8

Zach

How is this related to Boost.Locale? Conflict/Complement or ???

Robert Ramey

Zach Laine

6:09 a.m.

It is unrelated.

Zach

On Fri, Oct 25, 2019, 11:08 PM Robert Ramey via Boost-users <boost-users@lists.boost.org> wrote:

...

On 10/25/19 6:11 PM, Zach Laine via Boost-users wrote:

...
About 14 months ago I posted the same thing. There was significant work that needed to be done to Boost.Text (the proposed library), and I was a bit burned out.

Now I've managed to make the necessary changes, and I feel the library is ready for review, if there is interest.

This library, in part, is something I want to standardize.

It started as a better string library for namespace "std2", with minimal Unicode support. Though "std2" will almost certainly never happen now, those string types are still in there, and the library has grown to also include all the Unicode features most users will ever need.

Github: https://github.com/tzlaine/text Online docs: https://tzlaine.github.io/text

If you care about portable Unicode support, or even addressing the embarrassment of being the only major production language with next to no Unicode support, please have a look and provide feedback.

I gave a talk about this at C++Now in May 2018, and now it's a bit out of date, as the library was not then finished. It's three hours, so, y'know, maybe skip it. For completeness' sake:

https://www.youtube.com/watch?v=944GjKxwMBo&index=7&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH

...
https://www.youtube.com/watch?v=GJ2xMAqCZL8&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH&index=8

...
Zach

How is this related to Boost.Locale ? Conflict/Complement or ???

Robert Ramey


Rainer Deyke

5:40 a.m.

On 26.10.19 03:11, Zach Laine via Boost-users wrote:

...

If you care about portable Unicode support, or even addressing the embarrassment of being the only major production language with next to no Unicode support, please have a look and provide feedback.

I can't see myself using the string layer at all. My codebase is too deeply linked to std::string, as is the standard library, and a fair number of third-party libraries I am using. Also, the primary advantage of the string layer seems to be a narrower interface, which is not an advantage at all to me as a user. std::string::find may be bad design, but it doesn't hurt me, it just makes finding elements in a string slightly more convenient.

I am very much interested in the unicode layer. I am currently using ICU, and I'd really like to remove this dependency. ICU is big, it's difficult to build, and I'm stuck on an older version because of compatibility issues.

As for the text layer, the fact that it uses FCC means that I probably won't use it because I have standardized on NFD.

--
Rainer Deyke (rainerd@eldwood.com)

Zach Laine

4:41 p.m.

On Sat, Oct 26, 2019, 12:41 AM Rainer Deyke via Boost-users < boost-users@lists.boost.org> wrote:

...

On 26.10.19 03:11, Zach Laine via Boost-users wrote:

...
If you care about portable Unicode support, or even addressing the embarrassment of being the only major production language with next to no Unicode support, please have a look and provide feedback.

I can't see myself using the string layer at all. My codebase is too deeply linked to std::string, as is the standard library, and a fair number of third-party libraries I am using. Also, the primary advantage of the string layer seems to be a narrower interface, which is not an advantage at all to me as a user.

It is also a place to experiment with things like ropes and string builders. I would like to standardize both, and I need a string that actually interoperates with those to show how they might work.

std::string::find may be bad design,

...

but it doesn't hurt me, it just makes finding elements in a string slightly more convenient.

But it does hurt newcomers to the language, who must learn a slightly different API for string and string_view, and static_string and fixed_string if we get those. It also hurts the standardization effort to review all those APIs. You cannot use the std::string search algorithms on spans and other ranges or views either. Returning -1 instead of the end index is also pretty horrible.

If convenience is so paramount, why don't we add a member sort() to vector? This is not a troll, I would really like to know. I want to find something in a vector or sort a vector about as often as I want to find a character or subsequence within a string. What, to you, is the difference? If there isn't one, please explain that too.

I am very much interested in the unicode layer. I am currently using

...

ICU, and I'd really like to remove this dependency. ICU is big, it's difficult to build, and I'm stuck on an older version because of compatibility issues.

As for the text layer, the fact that it uses FCC means that I probably won't use it because I have standardized on NFD.

Completely understandable.

NFC, very close to FCC, is more popular, due to its compactness. I picked the normalization form with the most readily available time and space optimizations, and then stuck to just that one -- the alternative is many text types with different normalizations having to interoperate, which sounds like hell.

Zach

Rainer Deyke

8 p.m.

On 26.10.19 18:41, Zach Laine via Boost-users wrote:

...

On Sat, Oct 26, 2019, 12:41 AM Rainer Deyke via Boost-users < boost-users@lists.boost.org> wrote:

...
On 26.10.19 03:11, Zach Laine via Boost-users wrote:

...
If you care about portable Unicode support, or even addressing the embarrassment of being the only major production language with next to no Unicode support, please have a look and provide feedback.

I can't see myself using the string layer at all. My codebase is too deeply linked to std::string, as is the standard library, and a fair number of third-party libraries I am using. Also, the primary advantage of the string layer seems to be a narrower interface, which is not an advantage at all to me as a user.

It is also a place to experiment with things like ropes and string builders. I would like to standardize both, and I need a string that actually interoperates with those to show how they might work.

std::string::find may be bad design,

...
but it doesn't hurt me, it just makes finding elements in a string slightly more convenient.

But it does hurt newcomers to the language, who must learn a slightly different API for string and string_view, and static_string and fixed_string if we get those. It also hurts the standardization effort to review all those APIs. You cannot use the std::string search algorithms on spans and other ranges or views either.

The issue isn't whether your string is better than std::string. The issue is whether your string provides enough of an improvement to justify switching from std::string, after the time and effort of learning std::string has already been spent. If I want to not use std::string::find, I can simply not use it.

...

Returning -1 instead of the end index is also pretty horrible.

Not sure I agree.

auto pos = some_long_expression().find('.');

// Clear, simple, obvious:
if (pos == std::string::npos) { ... }

// Less clear, and I have to either evaluate the same expression twice
// or use an additional variable, possibly making an extra copy of the
// string in the process.
if (pos == some_long_expression().size()) { }

...

If convenience is so paramount, why don't we add member sort () to vector?

Because it would be inconvenient to change existing code from std::sort to std::vector::sort, but also because my entire codebase contains only 8 calls to std::sort and at least two orders of magnitude as many calls to std::string::[r]find.

For what it's worth, if I were back in the C++98 standards committee, I would vote against the inclusion of std::string::find. But that's not the current situation.

--
Rainer Deyke (rainerd@eldwood.com)

Zach Laine

8:24 p.m.

On Sat, Oct 26, 2019 at 3:01 PM Rainer Deyke via Boost-users < boost-users@lists.boost.org> wrote:

...

On 26.10.19 18:41, Zach Laine via Boost-users wrote:

...
On Sat, Oct 26, 2019, 12:41 AM Rainer Deyke via Boost-users < boost-users@lists.boost.org> wrote:

...
On 26.10.19 03:11, Zach Laine via Boost-users wrote:

...
If you care about portable Unicode support, or even addressing the embarrassment of being the only major production language with next to no Unicode support, please have a look and provide feedback.

I can't see myself using the string layer at all. My codebase is too deeply linked to std::string, as is the standard library, and a fair number of third-party libraries I am using. Also, the primary advantage of the string layer seems to be a narrower interface, which is not an advantage at all to me as a user.

It is also a place to experiment with things like ropes and string builders. I would like to standardize both, and I need a string that actually interoperates with those to show how they might work.

std::string::find may be bad design,

...
but it doesn't hurt me, it just makes finding elements in a string slightly more convenient.

But it does hurt newcomers to the language, who must learn a slightly different API for string and string_view, and static_string and fixed_string if we get those. It also hurts the standardization effort to review all those APIs. You cannot use the std::string search algorithms on spans and other ranges or views either.

The issue isn't whether your string is better than std::string. The issue is whether your string provides enough of an improvement to justify switching from std::string, after the time and effort of learning std::string has already been spent. If I want to not use std::string::find, I can simply not use it.

...
Returning -1 instead of the end index is also pretty horrible.

Not sure I agree.

auto pos = some_long_expression().find('.');

// Clear, simple, obvious:
if (pos == std::string::npos) { ... }

// Less clear, and I have to either evaluate the same expression twice
// or use an additional variable, possibly making an extra copy of the
// string in the process.
if (pos == some_long_expression().size()) { }

...
If convenience is so paramount, why don't we add member sort () to vector?

Because it would be inconvenient to change existing code from std::sort to std::vector::sort, but also because my entire codebase contains only 8 calls to std::sort and at least two orders of magnitude as many calls to std::string::[r]find.

For what it's worth, if I were back in the C++98 standards committee, I would vote against the inclusion of std::string::find. But that's not the current situation.

Fair enough. Like I said, the string stuff was originally added to explore what a std2::string might look like. As of this writing, that's not really a thing that will happen.

Zach

Rainer Deyke

30 Oct

12:59 p.m.

On 26.10.19 18:41, Zach Laine via Boost-users wrote:

...

NFC, very close to FCC, is more popular, due to its compactness. I picked the normalization form with the most readily available time and space optimizations, and then stuck to just that one -- the alternative is many text types with different normalizations having to interoperate, which sounds like hell.

I can understand that, all other things being equal, the more compact form might be preferable. I mean, if you know nothing about Unicode normalization forms other than that one is more compact than the other, then you might as well pick the more compact one, right?

But all other things are clearly /not/ equal, or you would just use NFC. And the difference in compactness between NFC and NFD is completely trivial. I challenge you to find any real-world text where the difference in size between NFC and NFD is big enough that I should care about it, both in absolute and relative terms.

I consider FCC a non-solution to a non-problem. The advantage of NFC over NFD is not compactness, but compatibility with interfaces that expect NFC. Since FCC does not provide that advantage, there is no reason to choose FCC over NFD. On the other hand, there are several good reasons for choosing NFD over FCC. Aside from the obvious one - compatibility with interfaces that expect NFD - there's also cleaner, simpler code with fewer surprises. For example, it is a completely straightforward operation to replace all acute accents in an NFD text with grave accents or to remove acute accents entirely, whereas the FCC equivalent requires effectively transcoding to NFD.

In summary, I think you should support NFD text types. Either in addition to FCC or instead of it.

--
Rainer Deyke (rainerd@eldwood.com)

Zach Laine

3:56 p.m.

On Wed, Oct 30, 2019 at 8:03 AM Rainer Deyke via Boost-users < boost-users@lists.boost.org> wrote:

...

...
On 26.10.19 18:41, Zach Laine via Boost-users wrote:

...
NFC, very close to FCC, is more popular, due to its compactness. I picked the normalization form with the most readily available time and space optimizations, and then stuck to just that one -- the alternative is many text types with different normalizations having to interoperate, which sounds like hell.

I can understand that, all other things being equal, the more compact form might be preferable. I mean, if you know nothing about Unicode normalization forms other than that one is more compact than the other, then you might as well pick the more compact one, right?

But all other things are clearly /not/ equal, or you would just use NFC. And the difference in compactness between NFC and NFD is completely trivial. I challenge you to find any real-world text where the difference in size between NFC and NFD is big enough that I should care about it, both in absolute and relative terms.

I consider FCC a non-solution to a non-problem. The advantage of NFC over NFD is not compactness, but compatibility with interfaces that expect NFC. Since FCC does not provide that advantage, there is no reason to choose FCC over NFD. On the other hand, there are several good reasons for choosing NFD over FCC. Aside from the obvious one - compatibility with interfaces that expect NFD - there's also cleaner, simpler code with fewer surprises. For example, it is a completely straightforward operation to replace all acute accents in an NFD text with grave accents or to remove acute accents entirely, whereas the FCC equivalent requires effectively transcoding to NFD.

In summary, I think you should support NFD text types. Either in addition to FCC or instead of it.

NFD is not an unreasonable choice, though I don't know why you'd want to do a search-replace that changes all the accents from acute to grave (is that a real use-case, or just a for-instance?).

Unfortunately, the fast-path of the collation algorithm implementation requires FCC, which is why ICU uses it, and one of the main reasons why I picked it. If we had NFD strings, we'd have to normalize them to FCC first, if I'm not mistaken. (Though I should verify that with a test.)

Zach

Rainer Deyke

8:02 p.m.

On 30.10.19 16:56, Zach Laine via Boost-users wrote:

...

On Wed, Oct 30, 2019 at 8:03 AM Rainer Deyke via Boost-users < boost-users@lists.boost.org> wrote:

...
In summary, I think you should support NFD text types. Either in addition to FCC or instead of it.

NFD is not an unreasonable choice, though I don't know why you'd want to do a search-replace that changes all the accents from acute to grave (is that a real use-case, or just a for-instance?).

The specific example is just hypothetical, but wanting to operate on diacritics and base characters separately is real enough. Better examples: checking that Chinese pinyin syllables have their tone markers on the correct vowel. Or collecting statistics on the use of diacritics in a text. Or testing if a font has all of the glyphs needed to render a text. Or replacing a diacritic that's on my keyboard layout for another one that's not. Or even just collation.
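To make the point concrete, here is a minimal sketch of two of those operations on text that is already in NFD. It assumes the text is held as a std::u32string of code points; the function names are hypothetical, they are not part of Boost.Text or ICU, and the normalization step itself is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: the input already holds NFD-normalized code points.
constexpr char32_t combining_grave = 0x0300;
constexpr char32_t combining_acute = 0x0301;

// Swap every combining acute accent for a combining grave accent. Both
// marks have the same canonical combining class (230), so the order of
// marks is undisturbed and the result is still in NFD.
inline std::u32string acute_to_grave(std::u32string text)
{
    std::replace(text.begin(), text.end(), combining_acute, combining_grave);
    return text;
}

// Count code points in the Combining Diacritical Marks block
// (U+0300..U+036F). Other combining blocks are ignored in this sketch.
inline std::size_t count_diacritics(std::u32string const & text)
{
    auto const n = std::count_if(text.begin(), text.end(), [](char32_t c) {
        return c >= 0x0300 && c <= 0x036F;
    });
    return static_cast<std::size_t>(n);
}
```

With precomposed characters (NFC or FCC), neither function would see the accent in U+00E9 at all, which is the extra transcoding step mentioned above.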

...

Unfortunately, the fast-path of the collation algorithm implementation requires FCC, which is why ICU uses it, and one of the main reasons why I picked it. If we had NFD strings, we'd have to normalize them to FCC first, if I'm not mistaken. (Though I should verify that with a test.)

I find that surprising, since FCC more than any other normalization form mixes precomposed and decomposed characters.

But I will say this for FCC: at least it's easy to transcode from FCC to NFD. It could even be done in a fairly straightforward iterator adapter.

--
Rainer Deyke (rainerd@eldwood.com)

Asif Lodhi

4 Nov

4:14 p.m.

Dear Sir,

On Sat, 26 Oct 2019 at 21:42, Zach Laine via Boost-users <boost-users@lists.boost.org> wrote:

...

On Sat, Oct 26, 2019, 12:41 AM Rainer Deyke via Boost-users < boost-users@lists.boost.org> wrote:

...
On 26.10.19 03:11, Zach Laine via Boost-users wrote:

If convenience is so paramount, why don't we add member sort () to vector? This is not a troll, I would really like to know. I want to find something in a vector or sort a vector about as often as I want to find a character or subsequence within a string. What, to you, is the difference? If there isn't one, please explain that too.

I am not an expert (at all) on Unicode but I'd certainly like to add my 2 bits in response to your question as to why sort() is not a vector member. IIRC, Bjarne Stroustrup (and, I guess, most C++ language designers/standard committee members?) has a clear preference for restricting the member functions of a class to only those functions that are absolutely required to maintain the class invariant. I think I read it in TC++PL's 3rd edition.

-Asif Lodhi

David Demelier

28 Oct

8:35 a.m.

On 26/10/2019 at 03:11, Zach Laine via Boost-users wrote:

...

About 14 months ago I posted the same thing. There was significant work that needed to be done to Boost.Text (the proposed library), and I was a bit burned out.

Now I've managed to make the necessary changes, and I feel the library is ready for review, if there is interest.

This library, in part, is something I want to standardize.

It started as a better string library for namespace "std2", with minimal Unicode support. Though "std2" will almost certainly never happen now, those string types are still in there, and the library has grown to also include all the Unicode features most users will ever need.

Github: https://github.com/tzlaine/text Online docs: https://tzlaine.github.io/text

I've read the intro on why std::string is so bad and I have to disagree with many points.

1. The Fat Interface

In which way is std::string bloated? Of course some functions are probably there as synonyms, but to call it bloat is rather misleading. Just look at the numerous functions of Java's String instead [0].

And I

2. The Missing Unicode Support

Yes, many newcomers may be surprised to see that a string "é" has a size of 2 bytes (assuming UTF-8). But the same is true of UTF-16 strings, which may contain surrogate pairs...

UTF-8 is the way to go and is stored efficiently. One could argue that we should have some utf8 iterators or things like that. But std::string is still a good candidate for string manipulations.

3. Miscellaneous Limitations

Not being thread-safe is an issue? Thank god it is not thread-safe. Imagine the overhead of a threadsafe version of a string. The purpose of a library is not to be threadsafe for every object. That has to be handled on the user side.

That said, I really hope for better unicode support in std:: in the near future. Your library is well designed and the API is clean, I hope it could be added to Boost :-).

[0]: https://docs.oracle.com/javase/7/docs/api/java/lang/String.html

Zach Laine

5 p.m.

On Mon, Oct 28, 2019 at 3:35 AM David Demelier via Boost-users < boost-users@lists.boost.org> wrote:

...

On 26/10/2019 at 03:11, Zach Laine via Boost-users wrote:

...
About 14 months ago I posted the same thing. There was significant work that needed to be done to Boost.Text (the proposed library), and I was a bit burned out.

Now I've managed to make the necessary changes, and I feel the library is ready for review, if there is interest.

This library, in part, is something I want to standardize.

It started as a better string library for namespace "std2", with minimal Unicode support. Though "std2" will almost certainly never happen now, those string types are still in there, and the library has grown to also include all the Unicode features most users will ever need.

Github: https://github.com/tzlaine/text Online docs: https://tzlaine.github.io/text

I've read the intro on why std::string is so bad and I have to disagree with many points.

1. The Fat Interface

In which way is std::string bloated? Of course some functions are probably there as synonyms, but to call it bloat is rather misleading. Just look at the numerous functions of Java's String instead [0].

Comparing std::string to Java's string class is not doing std::string any favors.

...

And I

2. The Missing Unicode Support

Yes, many newcomers may be surprised to see that a string "é" has a size of 2 bytes (assuming UTF-8). But the same is true of UTF-16 strings, which may contain surrogate pairs...

UTF-8 is the way to go and effectively stored. One could argue that we should have some utf8 iterators or things like that. But std::string is still a good candidate for string manipulations.

I agree that UTF-8 is the way to go (and as I think you've seen, the library reflects that). However, UTF-8 encoding is only part of the story. There is also normalization. If you use UTF-8-in-std::strings, normalization will not be enforced. (Neither will UTF-8 encoding, but that's less of a problem if you always intend to produce replacement characters for broken UTF-8.) Most users will want a type that enforces normalization as a class invariant. Those who do not want such a type still have the tools -- the algorithms and iterators in the Unicode layer -- to normalize the contents of a std::string themselves if they want.
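As a small illustration of why encoding alone is not enough, the sketch below spells "é" in UTF-8 in two canonically equivalent ways and compares them as plain std::strings. The helper names are made up for this example, and nothing here uses Boost.Text.

```cpp
#include <cassert>
#include <string>

// "é" as the single precomposed code point U+00E9, in UTF-8 (as in NFC).
inline std::string e_acute_precomposed() { return "\xC3\xA9"; }

// "é" as 'e' followed by U+0301 COMBINING ACUTE ACCENT, in UTF-8 (as in NFD).
inline std::string e_acute_decomposed() { return "e\xCC\x81"; }

// std::string compares bytes, so two spellings of the same text are
// different strings unless both were normalized to the same form first.
inline bool same_bytes(std::string const & a, std::string const & b)
{
    return a == b;
}
```

Both values are valid UTF-8 and render identically, yet they have sizes 2 and 3 and compare unequal; a text type with a normalization invariant would store only one of the two spellings.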

...

3. Miscellaneous Limitations

Not being thread-safe is an issue? Thank god it is not thread-safe. Imagine the overhead of a threadsafe version of a string. The purpose of a library is not to be threadsafe for every object. That has to be handled on the user side.

I don't think all string types should be threadsafe, but having a threadsafe option is nice. That was not an explicit goal of adding ropes, but it is a nice side-effect of the choice I made for how to implement the ropes in Boost.Text.

...

That said, I really hope for a better unicode support in std:: in the near future. Your library is well designed and API is clean, I hope it could be added in Boost :-).

Thanks, me too. :)

Zach

Leon Mlakar

29 Oct

10:11 a.m.

On 26.10.2019 03:11, Zach Laine via Boost-users wrote:

...

About 14 months ago I posted the same thing. There was significant work that needed to be done to Boost.Text (the proposed library), and I was a bit burned out.

Now I've managed to make the necessary changes, and I feel the library is ready for review, if there is interest.

This library, in part, is something I want to standardize.

It started as a better string library for namespace "std2", with minimal Unicode support. Though "std2" will almost certainly never happen now, those string types are still in there, and the library has grown to also include all the Unicode features most users will ever need.

Github: https://github.com/tzlaine/text Online docs: https://tzlaine.github.io/text

If you care about portable Unicode support, or even addressing the embarrassment of being the only major production language with next to no Unicode support, please have a look and provide feedback.

Putting the issue of standardization aside, I certainly would love to see something like that included in Boost. After a quick read of your docs (about an hour), I'm not sure I'm happy with all the choices you've made (see some remarks below) but overall I see it as something I would use in the future. As you wrote, Unicode is hard, even with a library like this; nearly mission impossible without.

A few remarks, for what they're worth:

- I've never seen std::string and thread (un)safety as an issue

- the pattern if (x == npos) is now so common that it is imho important to preserve it

- for the sake of completeness the normalization type used at the text level ought to be a policy parameter; although I do understand your arguments against it I think it should be there even at the cost of different text types not being interoperable without conversions

- at the text level I'm not sure I'm willing to cope with different fundamental text types; I just want to use boost::text::text, pretty much the same as I use std::string as an alias to a much more complex class template; heck, even at the string layer I'd probably prefer the rope/contiguous concept to be a policy parameter to the same type template.

- views should be introduced as views and not mixed with rope/contiguous fundamental types

Hats off for the excellent work, though!

Leon

Zach Laine

4:11 p.m.

On Tue, Oct 29, 2019 at 5:11 AM Leon Mlakar via Boost-users < boost-users@lists.boost.org> wrote:

...

On 26.10.2019 03:11, Zach Laine via Boost-users wrote:

...
About 14 months ago I posted the same thing. There was significant work that needed to be done to Boost.Text (the proposed library), and I was a bit burned out.

Now I've managed to make the necessary changes, and I feel the library is ready for review, if there is interest.

This library, in part, is something I want to standardize.

It started as a better string library for namespace "std2", with minimal Unicode support. Though "std2" will almost certainly never happen now, those string types are still in there, and the library has grown to also include all the Unicode features most users will ever need.

Github: https://github.com/tzlaine/text Online docs: https://tzlaine.github.io/text

If you care about portable Unicode support, or even addressing the embarrassment of being the only major production language with next to no Unicode support, please have a look and provide feedback.

Putting the issue of standardization aside, I certainly would love to see something like that included in Boost. After a quick read of your docs (about an hour), I'm not sure I'm happy with all the choices you've made (see some remarks below) but overall I see it as something I would use in the future. As you wrote, Unicode is hard, even with a library like this; nearly mission impossible without.

A few remarks, for what they're worth:

- I've never seen std::string and thread (un)safety as an issue

Fair enough. As stated previously in this thread, the threadsafety feature is a side effect that comes from the copy-on-write semantics of rope. *That* is the reason rope is designed the way it is, not the threadsafety part. It just happens that the threadsafety part comes for free when you do the copy-on-write part.
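A minimal sketch of that idea follows, under loud assumptions: cow_string is a hypothetical class written for this note, it is not Boost.Text's rope, and it shares one flat buffer rather than a tree of segments. It only shows why immutable shared storage lets distinct copies be read from different threads without locking.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical copy-on-write string; NOT the library's rope implementation.
class cow_string
{
public:
    explicit cow_string(std::string s) :
        data_(std::make_shared<std::string>(std::move(s)))
    {}

    // Copies of a cow_string share this buffer, which is never modified
    // after construction.
    std::string const & str() const { return *data_; }

    // "Mutation" builds a fresh buffer and repoints only this object, so
    // other copies never observe a change to the data they hold.
    void append(std::string const & tail)
    {
        data_ = std::make_shared<std::string>(*data_ + tail);
    }

    bool shares_buffer_with(cow_string const & other) const
    {
        return data_ == other.data_;
    }

private:
    std::shared_ptr<std::string const> data_;
};
```

Copying is a reference-count increment, and appending to one copy leaves every other copy untouched. A single cow_string object still needs external synchronization if several threads modify it; the safety applies to separate copies.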

...

- the pattern if (x == npos) is now so common that it is imho important to preserve it

The std::string/std::string_view API is the only place in the STL where the algorithms do not return the end of the half-open input range on failure. That's really wonky. I don't care about preserving it.
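The contrast can be sketched in a few lines. The helper names below are invented for this example; the only library calls are std::find and std::string::find.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Generic search: works for any range with begin()/end() and reports
// failure as the end iterator, like the other standard algorithms.
template<typename Range, typename T>
bool contains(Range const & r, T const & value)
{
    return std::find(r.begin(), r.end(), value) != r.end();
}

// Member search: only exists on string-like types, and reports failure
// with the sentinel index std::string::npos.
inline bool contains_member(std::string const & s, char c)
{
    return s.find(c) != std::string::npos;
}
```

The same contains() template accepts a std::string, a std::string_view, or a std::vector<int>, while contains_member() is tied to the string interface and its special failure value.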

...

- for the sake of completeness the normalization type used at the text level ought to be a policy parameter; although I do understand your arguments against it I think it should be there even at the cost of different text types being inoperable without conversions

I disagree. Policy parameters are bad for reasoning. If I see a text::text, as things currently stand, I know that it is stored as a contiguous array of UTF-8, and that it is normalized FCC. If I add a template parameter to control the normalization, I change the invariants of the type. Types with different invariants should have different names. To do otherwise is a violation of the single responsibility principle.

...

- at the text level I'm not sure I'm willing to cope with different fundamental text types; I just want to use boost::text::text, pretty much the same as I use std::string as an alias to much more complex class template; heck, even at the string layer I'd probably prefer rope/contiguous concept to be a policy parameter to the same type template.

That would be like adding a template parameter to std::vector that makes it act like a std::deque for certain values of that parameter. Changing the space and time complexity of a type by changing a template parameter is the wrong answer.

...

- views should be introduced as views and not mixed with rope/contiguous fundamental types

That does not sound like what I want either, but I don't know what this refers to. Could you be specific?

Zach

Gavin Lambert

11:26 p.m.

On 30/10/2019 05:11, Zach Laine wrote:

...

- the pattern if (x == npos) is now so common that it is imho important to preserve it

The std::string/std::string_view API is the only place in the STL where the algorithms do not return the end of the half-open input range on failure. That's really wonky. I don't care about preserving it.

Returning end of range on failure is incredibly inconvenient (for the consumer; granted it's usually more convenient for the algorithm implementer), and I'd be happier if STL algorithms didn't do that either. I see that as an unfortunate consequence of using generic iterators as input parameters and return types, and not an otherwise desirable design choice. (ie. the STL algorithms do it because they couldn't do anything better. string doesn't do it because it can do something better [since it knows the iterator type and class, and can consequently choose to return something other than an iterator].)

...

- for the sake of completeness the normalization type used at the text level ought to be a policy parameter; although I do understand your arguments against it I think it should be there even at the cost of different text types being inoperable without conversions

I disagree. Policy parameters are bad for reasoning. If I see a text::text, as things currently stand, I know that it is stored as a contiguous array of UTF-8, and that it is normalized FCC. If I add a template parameter to control the normalization, I change the invariants of the type. Types with different invariants should have different names. To do otherwise is a violation of the single responsibility principle.

While I too dislike policy parameters as a general rule -- especially defaulted policy parameters, since APIs have a tendency to only implement one and not all (see: how many libraries use std::string instead of being templated on std::basic_string, or use std::vector<T> instead of being templated on an allocator)... Technically speaking, a different policy parameter does form a different type name and thus "types with different invariants should have different names" is satisfied.
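A small sketch of that technical point, assuming a hypothetical basic_text template with a normalization tag parameter. None of these names exist in Boost.Text; they are invented purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical normalization tags and text template (illustration only).
struct nfc_tag {};
struct fcc_tag {};

template<typename Normalization>
struct basic_text {
    std::string storage; // UTF-8 code units, assumed normalized per the tag
};

using fcc_text = basic_text<fcc_tag>;
using nfc_text = basic_text<nfc_tag>;

// Each policy value produces a distinct, unrelated type.
static_assert(!std::is_same_v<fcc_text, nfc_text>,
              "distinct instantiations are distinct types");
static_assert(!std::is_convertible_v<nfc_text, fcc_text>,
              "no implicit conversion between them");

// A function written against one instantiation does not accept the other.
inline std::size_t size_in_code_units(fcc_text const& t) {
    return t.storage.size();
}
```

So the instantiations do get different names; whether one template should produce types with different invariants is the part still in dispute.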

Jon Kalb

30 Oct 30 Oct

2:03 a.m.

...

On Oct 29, 2019, at 4:26 PM, Gavin Lambert via Boost-users <boost-users@lists.boost.org> wrote:

...

Returning end of range on failure is incredibly inconvenient (for the consumer; granted it's usually more convenient for the algorithm implementer), and I'd be happier if STL algorithms didn't do that either.

“incredibly inconvenient”? Is it possible that you are overstating your case?

Gavin Lambert

4:06 a.m.

On 30/10/2019 15:03, Jon Kalb wrote:

...

...
On Oct 29, 2019, at 4:26 PM, Gavin Lambert wrote:

...
Returning end of range on failure is incredibly inconvenient (for the consumer; granted it's usually more convenient for the algorithm implementer), and I'd be happier if STL algorithms didn't do that either.

“incredibly inconvenient”?

Is it possible that you are overstating your case?

Granted most existing algorithms require making two calls to the collection to get .begin() and .end(), which requires assigning the collection to some lvalue -- and once you've done that, the inconvenience is small (though "== list.end()" is still a bit ugly).

But once you start working with range-based rather than iterator-based algorithms, it happens a lot more frequently that your collection is an rvalue that you don't want to have to assign to an lvalue -- but you end up having to do so just so that you can get its .end() to check for failure. Or you end up writing a helper method just so that you can have a named parameter lvalue without cluttering the original source.

(Already cited in this thread was a similar example for string rvalues, where npos was more convenient than end(). Granted strings are more often rvalues than collections are, but the principle applies to both.)

I'm sure many people have written helper methods to avoid having to write "map.find(key) == map.end()" patterns repeatedly. And for associative containers in particular, an interface based around Optional or Outcome would be a lot more convenient than one based around iterators.
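The kind of helper described above might look like the following sketch. find_value and load_ports are hypothetical names invented for this example, not part of the standard library or Boost; the helper copies the mapped value, so it only suits read-only lookups.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical helper: look up a key and return a copy of the mapped value,
// or std::nullopt if the key is absent. The caller never sees end().
template<typename Map>
std::optional<typename Map::mapped_type>
find_value(Map const& m, typename Map::key_type const& key) {
    auto const it = m.find(key);
    if (it == m.end())
        return std::nullopt;
    return it->second;
}

// Stand-in for a function that returns a container by value (an rvalue).
inline std::map<std::string, int> load_ports() {
    return {{"http", 80}, {"https", 443}};
}
```

With this, find_value(load_ports(), "https").value_or(0) works directly on the temporary, with no named lvalue needed just to compare against end().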

Zach Laine

5:03 a.m.

On Tue, Oct 29, 2019 at 6:26 PM Gavin Lambert via Boost-users < boost-users@lists.boost.org> wrote:

...

On 30/10/2019 05:11, Zach Laine wrote:

...
- pattern if (x == npos) is now so common that is imho important to preserve it

The std::string/std::string_view API is the only place in the STL where the algorithms do not return the end of the half-open input range on failure. That's really wonky. I don't care about preserving it.

Returning end of range on failure is incredibly inconvenient (for the consumer; granted it's usually more convenient for the algorithm implementer), and I'd be happier if STL algorithms didn't do that either.

I see that as an unfortunate consequence of using generic iterators as input parameters and return types, and not an otherwise desirable design choice.

(ie. the STL algorithms do it because they couldn't do anything better. string doesn't do it because it can do something better [since it knows the iterator type and class, and can consequently choose to return something other than an iterator].)

I heartily disagree, but I'm also very curious about this. As an example, could you take one of the simple std algorithms (std::find would be a very simple candidate), and show its definition in the style you have in mind?

...

...
- for the sake of completeness the normalization type used at the text level ought to be a policy parameter; although I do understand your arguments against it I think it should be there even at the cost of different text types being inoperable without conversions

I disagree. Policy parameters are bad for reasoning. If I see a text::text, as things currently stand, I know that it is stored as a contiguous array of UTF-8, and that it is normalized FCC. If I add a template parameter to control the normalization, I change the invariants of the type. Types with different invariants should have different names. To do otherwise is a violation of the single responsibility principle.

While I too dislike policy parameters as a general rule -- especially defaulted policy parameters, since APIs have a tendency to only implement one and not all (see: how many libraries use std::string instead of being templated on std::basic_string, or use std::vector<T> instead of being templated on an allocator)...

Technically speaking, a different policy parameter does form a different type name and thus "types with different invariants should have different names" is satisfied.

Yes, you got me. I was speaking loosely, and referred to a template as if it were a type. What I should have added was that a template's single responsibility should be to stamp out types that all model the same concept. A policy-based template has a hard time doing that. A policy-based template that stamps out strings with different invariants does not do that at all.

Zach

Richard Damon

11:50 a.m.

On 10/30/19 1:03 AM, Zach Laine via Boost-users wrote:

...

Yes, you got me. I was speaking loosely, and referred to a template as if it were a type. What I should have added was that a template's single responsibility should be to stamp out types that all model the same concept. A policy-based template has a hard time doing that. A policy-based template that stamps out strings with different invariants does not do that at all.

But all the various templates may well express a base concept even if some of the invariants change between different template parameters. The example of Unicode normalization, or memory allocator seem like perfect examples of this. If an operation needs a particular normalization rule, it specializes its parameter on that one case, otherwise it leaves it as a template parameter.

To me, the basic invariant of a string is that it is a sequence of code units that describe a textual object. Often the details of that encoding are unimportant, it might be ASCII, it might be in some old code page, it might be in UTF-8, it might be in UCS-4, and for the various Unicode variations, there are different normalization rules to handle that a given 'character' (aka Glyph) might be expressed in different ways, but the routine largely doesn't care. When it does care, it can force the string into that variant (or refuse some other variants), but the purpose of templates is to condense duplicate code into a single piece of code that only needs to be written once.

--
Richard Damon

Zach Laine

3:46 p.m.

On Wed, Oct 30, 2019 at 6:55 AM Richard Damon via Boost-users < boost-users@lists.boost.org> wrote:

...

On 10/30/19 1:03 AM, Zach Laine via Boost-users wrote:

...
Yes, you got me. I was speaking loosely, and referred to a template as if it were a type. What I should have added was that a template's single responsibility should be to stamp out types that all model the same concept. A policy-based template has a hard time doing that. A policy-based template that stamps out strings with different invariants does not do that at all.

But all the various templates may well express a base concept even if some of the invariants change between different template parameters. The example of Unicode normalization, or memory allocator seem like perfect examples of this. If an operation needs a particular normalization rule, it specializes its parameter on that one case, otherwise it leaves it as a template parameter.

To me, the basic invariant of a string is that it is a sequence of code units that describe a textual object. Often the details of that encoding are unimportant, it might be ASCII, it might be in some old code page, it might be in UTF-8, it might be in UCS-4, and for the various Unicode variations, there are different normalization rules to handle that a given 'character' (aka Glyph) might be expressed in different ways, but the routine largely doesn't care. When it does care, it can force the string into that variant (or refuse some other variants), but the purpose of templates is to condense duplicate code into a single piece of code that only needs to be written once.

You're mixing kinds of abstractions here. There is the genericity you find in a function that takes a generic parameter, and that's the kind of use based on concept you're talking about here. About that you're 100% correct:

template<foo_concept T>
auto foo(T const & x); // <-- feel free to pass any type here that models foo_concept

Part of why the above code works is that foo() only uses x in certain ways, and anything that meets the syntactic requirements is well-formed. If foo_concept describes a sequence container, I only care about the common interface of a sequence container. Specifically, I cannot use vector::reserve(), and don't really care that it exists.

Where that breaks down is when you have not a function template that uses certain aspects of a type, but a class template that represents a set of types. That case is different:

foo_template<T> foo; // <-- feel free to use the entire API

If the API is different for various values of T, such as it would be for a text template that instantiates as string-like or rope-like (because those have significantly different interfaces), that implies to me that I should have two names in play -- one for the string version and one for the rope version. Otherwise, the result is super confusing for someone reading or writing code using the unified name.

Zach
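A rough C++17 sketch of the distinction drawn above. The names sum_of and first_n are invented for this example, and the function template is left unconstrained rather than using a C++20 concept, which is an assumption made only to keep the sketch portable.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <numeric>
#include <vector>

// Generic function: uses only what every sequence container provides
// (begin/end), so any of them can be passed. It cannot call reserve(),
// because std::deque has no such member.
template<typename Sequence>
int sum_of(Sequence const& seq) {
    return std::accumulate(seq.begin(), seq.end(), 0);
}

// Code written against a concrete type may use its whole API; reserve()
// here is specific to std::vector.
inline std::vector<int> first_n(int n) {
    std::vector<int> v;
    v.reserve(static_cast<std::size_t>(n));
    for (int i = 1; i <= n; ++i)
        v.push_back(i);
    return v;
}
```

sum_of accepts both std::vector<int> and std::deque<int>, while first_n only makes sense for the one type whose full interface it relies on.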

Gavin Lambert

11:19 p.m.

On 31/10/2019 04:46, Zach Laine wrote:

...

Where that breaks down is when you have not a function template that uses certain aspects of a type, but a class template that represents a set of types. That case is different:

foo_template<T> foo; // <-- feel free to use the entire API

If the API is different for various values of T, such as it would be for a text template that instantiates as string-like or rope-like (because those have significantly different interfaces), that implies to me that I should have two names in play -- one for the string version and one for the rope version. Otherwise, the result is super confusing for someone reading or writing code using the unified name.

While I don't disagree with that, there is some precedent for it in the STL, namely future<void> having a different interface from other future<T>. Although most of that is due to C++'s reluctance to treat void as an actual type. (Though it does have some good features, such as explicit return void, that are lacking in some other languages.)
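The future<void> difference can be checked at compile time. This is only a sketch of one visible difference (the return type of get()); the alias names are invented for the example.

```cpp
#include <cassert>
#include <future>
#include <type_traits>
#include <utility>

// get() on future<int> yields a value; on future<void> it yields nothing,
// so generic code that stores the result of get() needs a special case.
using int_result  = decltype(std::declval<std::future<int>&>().get());
using void_result = decltype(std::declval<std::future<void>&>().get());

static_assert(std::is_same_v<int_result, int>,
              "future<int>::get() returns int");
static_assert(std::is_void_v<void_result>,
              "future<void>::get() returns void");
```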

Gavin Lambert

31 Oct 31 Oct

11:28 p.m.

On 30/10/2019 18:03, Zach Laine wrote:

...

Returning end of range on failure is incredibly inconvenient (for the consumer; granted it's usually more convenient for the algorithm implementer), and I'd be happier if STL algorithms didn't do that either.

I see that as an unfortunate consequence of using generic iterators as input parameters and return types, and not an otherwise desirable design choice.

(ie. the STL algorithms do it because they couldn't do anything better. string doesn't do it because it can do something better [since it knows the iterator type and class, and can consequently choose to return something other than an iterator].)

I heartily disagree, but I'm also very curious about this. As an example, could you take one of the simple std algorithms (std::find would be a very simple candidate), and show its definition in the style you have in mind?

Ok: wall of text incoming.

There are many options (especially when you have Outcome and/or ranges), but a very straightforward update using only existing STL concepts would be to take this:

template<typename InputIt, typename T>
InputIt find(InputIt first, InputIt last, T const& value);

And change its return type:

template<typename InputIt, typename T>
optional<InputIt> my_find(InputIt first, InputIt last, T const& value);

It now either returns an iterator in the range [first, last) or it returns nullopt. It will never return last.

This simple change means that code following this can easily detect search failure without having to refer back to the actual input parameters (ie. having to know the value of last).

This can be useful when searching a subset of the container (you don't need to save a separate copy of "list.begin() + 5" just so that you can use it as both the last parameter and for failure checks -- or worse, write it twice and risk bugs if someone edits one but not the other).

It also can be useful when the container as a whole is a temporary rvalue -- which is *rarely* useful for std::find (since after all a successful iterator return is almost useless if the container itself is about to be destroyed), but not completely so. Sometimes you only want to detect existence without needing to actually consume the iterator. Sometimes you can use the successful return iterator within the same whole-expression, so the container's lifetime hasn't ended yet.

As for "last" itself, while having to pass both begin and end iterators *usually* precludes using rvalue containers, that's also not always true. Some containers/iterators will accept a default-constructed iterator to mean "the end of the sequence" (especially when underlying storage is a linked-list, but other containers/iterators use this model too). This means that "first" can be passed "method().begin()" or some other rvalue and "last" can be passed "iterator()". And in range-based algorithms, there's only a single parameter to worry about anyway, so it's even more likely.

As a demonstration, let's imagine a simple span-based contains check (or replace span with your better range type of choice) -- it doesn't care about the iterator other than to assure that there was a successful return:

template<typename T>
bool contains(std::span<T> range, T const& value)
{
    return my_find(range.begin(), range.end(), value).has_value();
    // or you can just cast to bool if you prefer
    // alternatively, if my_find itself took a range:
    return my_find(range, value).has_value();
}

This seems more readable (and less error prone) than an explicit "== range.end()" check. And, given the range-based version of my_find, you could even have code that does this:

if (my_find(method(), 42)) { /* method included 42 */ }

Or this:

return resolve(my_find(method(), 42), 0);

Where resolve(x, y) returns "x ? **x : y" -- somewhat like value_or, but including the iterator dereference.

(That makes more sense with a map find, or predicate find, or where the element type is a class that only checks key equality (so the above would return a fully populated object if it exists or a default object if not). Obviously it's a bit silly with plain ints, but you get the idea.)

Another extension might be to define my_find_value, which returns optional<T> rather than optional<InputIt>. That doesn't allow mutating the found value in the collection (so it can't replace my_find) but it's still useful in read-only scenarios (and rvalue containers will almost always be read-only scenarios), so that the above no longer needs resolve and would become:

return my_find_value(method(), 42).value_or(0);

Of course, there are performance consequences of using rvalue containers, since you're copying and throwing away data. But sometimes it makes sense, for known-cheap containers or where the method wants to do a lock-and-return-copy or dequeue or similar, but you're only interested in part of the information.
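For readers who want to try the idea, here is one possible C++17 rendering of the declarations above. The loop body, the range overload, resolve, and the stand-in method() are all assumptions filled in for this sketch, and a generic range parameter replaces std::span so that C++20 is not required.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Signature as proposed in the message; the body is an assumed implementation.
template<typename InputIt, typename T>
std::optional<InputIt> my_find(InputIt first, InputIt last, T const& value) {
    for (; first != last; ++first) {
        if (*first == value)
            return first; // found: wrap the iterator
    }
    return std::nullopt;  // not found: never returns last
}

// Assumed range overload. The returned iterator is only valid while the
// range is alive, e.g. within the same full-expression for a temporary.
template<typename Range, typename T>
auto my_find(Range const& range, T const& value) {
    return my_find(range.begin(), range.end(), value);
}

template<typename Range, typename T>
bool contains(Range const& range, T const& value) {
    return my_find(range, value).has_value();
}

// resolve(x, y) as described: "x ? **x : y".
template<typename OptIt, typename T>
T resolve(OptIt const& found, T fallback) {
    return found ? **found : fallback;
}

// Hypothetical function returning a container by value.
inline std::vector<int> method() {
    return {7, 42, 9};
}
```

Note that resolve(my_find(method(), 42), 0) dereferences the iterator while the temporary vector is still alive, which is the same-full-expression case mentioned in the message.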

Zach Laine

1 Nov 1 Nov

3:34 p.m.

On Thu, Oct 31, 2019 at 6:28 PM Gavin Lambert via Boost-users < boost-users@lists.boost.org> wrote:

...

On 30/10/2019 18:03, Zach Laine wrote:

...
Returning end of range on failure is incredibly inconvenient (for the consumer; granted it's usually more convenient for the algorithm implementer), and I'd be happier if STL algorithms didn't do that either.

I see that as an unfortunate consequence of using generic iterators as input parameters and return types, and not an otherwise desirable design choice.

(ie. the STL algorithms do it because they couldn't do anything better. string doesn't do it because it can do something better [since it knows the iterator type and class, and can consequently choose to return something other than an iterator].)

I heartily disagree, but I'm also very curious about this. As an example, could you take one of the simple std algorithms (std::find would be a very simple candidate), and show its definition in the style you have in mind?

Ok: wall of text incoming.

There are many options (especially when you have Outcome and/or ranges), but a very straightforward update using only existing STL concepts would be to take this:

template<typename InputIt, typename T> InputIt find(InputIt first, InputIt last, T const& value);

And change its return type:

template<typename InputIt, typename T> optional<InputIt> my_find(InputIt first, InputIt last, T const& value);

It now either returns an iterator in the range [first, last) or it returns nullopt. It will never return last.

This simple change means that code following this can easily detect search failure without having to refer back to the actual input parameters (ie. having to know the value of last).

This can be useful when searching a subset of the container (you don't need to save a separate copy of "list.begin() + 5" just so that you can use it as both the last parameter and for failure checks -- or worse, write it twice and risk bugs if someone edits one but not the other).

It also can be useful when the container as a whole is a temporary rvalue -- which is *rarely* useful for std::find (since after all a successful iterator return is almost useless if the container itself is about to be destroyed), but not completely so. Sometimes you only want to detect existence without needing to actually consume the iterator. Sometimes you can use the successful return iterator within the same whole-expression, so the container's lifetime hasn't ended yet.

This is a red herring; if you only care about existence, why are you using find()? Use any_of() or C++20's includes() (for detecting subranges) instead. They each return a bool. Moreover, just knowing whether a value is found at all within a subrange via linear search is a corner case -- usually you will use something with O(log(N)) or faster access if you need to do that operation a lot.

...

As for "last" itself, while having to pass both begin and end iterators *usually* precludes using rvalue containers, that's also not always true. Some containers/iterators will accept a default-constructed iterator to mean "the end of the sequence" (especially when underlying storage is a linked-list, but other containers/iterators use this model too). This means that "first" can be passed "method().begin()" or some other rvalue and "last" can be passed "iterator()". And in range-based algorithms, there's only a single parameter to worry about anyway, so it's even more likely.

As a demonstration, let's imagine a simple span-based contains check (or replace span with your better range type of choice) -- it doesn't care about the iterator other than to assure that there was a successful return:

template<typename T>
bool contains(std::span<T> range, T const& value)
{
    return my_find(range.begin(), range.end(), value).has_value();
    // or you can just cast to bool if you prefer
    // alternatively, if my_find itself took a range:
    return my_find(range, value).has_value();
}

This seems more readable (and less error prone) than an explicit "== range.end()" check. And, given the range-based version of my_find, you could even have code that does this:

if (my_find(method(), 42)) { /* method included 42 */ }

None of those is better to me than using any_of().

...

Or this:

return resolve(my_find(method(), 42), 0);

Where resolve(x, y) returns "x ? **x : y" -- somewhat like value_or, but including the iterator dereference.

(That makes more sense with a map find, or predicate find, or where the element type is a class that only checks key equality (so the above would return a fully populated object if it exists or a default object if not). Obviously it's a bit silly with plain ints, but you get the idea.)

Now we're getting somewhere. I fully agree that what you wrote above is more easily followed than using std::find(). However, consider this use:

out = std::copy(c.begin(), std::find(c.begin(), c.end(), 42), out);

Or, in the near future:

out = std::copy(c.begin(), std::ranges::find(c, 42), out);

That says I want to copy *until* I find 42, and if there is no 42, I want to copy the rest. I find this pattern of code comes up pretty often. I find that I write something like this vs. something like the did-I-find-it style code you wrote above vaguely half the time. That is, I want to know where an element is -- *and* whether it is found -- about as often as I want to get a reference to the first one. This is true because when I frequently want to find just the first one, I tend to reach for something sorted, for efficiency reasons.

If I had to write that code using your approach, it would suffer. All I'm pointing out here is that the change you propose is not universally better. In fact, it is universally worse if what you want to do is search for a subrange:

auto lower = std::lower_bound(c.begin(), c.end(), 42);
out = std::copy(lower, std::upper_bound(lower, c.end(), 42), out);

Or:

out = std::ranges::copy(std::ranges::equal_range(c, 42), out).out;

That turns into a real mess when the iterators returned are optionals.

Zach
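The two patterns above can be exercised with a short C++17 sketch. The wrapper functions copy_until and copy_equal are invented for this example, and std::back_inserter stands in for the unspecified output iterator "out".

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Copy everything before the first occurrence of `stop`. If `stop` is
// absent, find() returns end() and the whole input is copied, with no
// separate failure branch.
inline std::vector<int> copy_until(std::vector<int> const& c, int stop) {
    std::vector<int> out;
    std::copy(c.begin(), std::find(c.begin(), c.end(), stop),
              std::back_inserter(out));
    return out;
}

// Copy the run of elements equal to `value` from a sorted input. An absent
// value yields an empty range, so again no failure check is needed.
inline std::vector<int> copy_equal(std::vector<int> const& sorted, int value) {
    std::vector<int> out;
    auto const lower = std::lower_bound(sorted.begin(), sorted.end(), value);
    std::copy(lower, std::upper_bound(lower, sorted.end(), value),
              std::back_inserter(out));
    return out;
}
```

Both functions rely on the "return last on failure" convention to make the not-found case fall out of the normal path.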

Gavin Lambert

3 Nov 3 Nov

10:24 p.m.

On 2/11/2019 04:34, Zach Laine wrote:

...

This is a red herring; if you only care about existence, why are you using find()? Use any_of() or C++20's includes() (for detecting subranges) instead. They each return a bool. Moreover, just knowing whether a value is found at all within a subrange via linear search is a corner case -- usually you will use something with O(log(N)) or faster access if you need to do that operation a lot.

I'm using find because you said to use find. :) But yes, the argument applies to map.find and friends as well -- and would probably be more useful there than for std::find itself. "Is key present in map" is a very common query. (Granted, map.contains has been added in C++20, but most people don't have access to that yet.)

...

If I had to write that code using your approach, it would suffer. All I'm pointing out here is that the change you propose is not universally better. In fact, it is universally worse if what you want to do is search for a subrange:

auto lower = std::lower_bound(c.begin(), c.end(), 42);
out = std::copy(lower, std::upper_bound(lower, c.end(), 42), out);

Or:

out = std::ranges::copy(std::ranges::equal_range(c, 42), out).out;

That turns into a real mess when the iterators returned are optionals.

I don't really like the former example anyway because you're not checking for failure of lower_bound. Granted, it will end up with an empty range in the end so the result will still be correct, but you're potentially wasting some time in upper_bound. In the second example it would return an empty range either way, so there's not really any difference.

Zach Laine

4 Nov 4 Nov

9:19 a.m.

On Sun, Nov 3, 2019 at 10:24 PM Gavin Lambert via Boost-users < boost-users@lists.boost.org> wrote:

...

On 2/11/2019 04:34, Zach Laine wrote:

...
This is a red herring; if you only care about existence, why are you using find()? Use any_of() or C++20's includes() (for detecting subranges) instead. They each return a bool. Moreover, just knowing whether a value is found at all within a subrange via linear search is a corner case -- usually you will use something with O(log(N)) or faster access if you need to do that operation a lot.

I'm using find because you said to use find. :)

Right, that's true. However, I did ask you to use it like that. :) The thing I was going for, in part, was that you show the definition, which would have included a branch, necessary for initializing/not initializing the optional, which is not necessary in the general case.

...

But yes, the argument applies to map.find and friends as well -- and would probably be more useful there than for std::find itself. "Is key present in map" is a very common query.

(Granted, map.contains has been added in C++20, but most people don't have access to that yet.)

True enough, but that is a special case, since any_of() does not work optimally with trees. If you have a flat tree, you can use std::binary_search() too. My point is that the algorithms already support the use cases you care about that are related to find(). If you have another algorithm that you find to be a better example for what you're trying to show, we can discuss that one.

...

...
If I had to write that code using your approach, it would suffer. All I'm pointing out here is that the change you propose is not universally better. In fact, it is universally worse if what you want to do is search for a subrange:

auto lower = std::lower_bound(c.begin(), c.end(), 42);
out = std::copy(lower, std::upper_bound(lower, c.end(), 42), out);

Or:

out = std::ranges::copy(std::ranges::equal_range(c, 42), out).out;

That turns into a real mess when the iterators returned are optionals.

I don't really like the former example anyway because you're not checking for failure of lower_bound.

Well, that was the point of that example -- I don't have to.

...

Granted, it will end up with an empty range in the end so the result will still be correct, but you're potentially wasting some time in upper_bound.

No. upper_bound() will go into the first iteration of its loop with a false condition (first != last will not be true). If I had checked for failure of lower_bound() I would just be pessimizing the non-failure case with an extra branch.

...

In the second example it would return an empty range either way, so there's not really any difference.

The first and second examples have the same semantics and will probably generate nearly identical object code. As importantly, they are simple to read and understand. There is no checking-for-failure noise, nor is there the opportunity for bugs if I forget to check for failure.

Zach

Gavin Lambert

10:28 p.m.

On 4/11/2019 22:19, Zach Laine wrote:

...

The thing I was going for, in part, was that you show the definition, which would have included a branch, necessary for initializing/not initializing the optional, which is not necessary in the general case.

It's not much of an overhead in the implementation. Internally find is a for loop, which returns the iterator once it finds a successful value, or returns the iterator limit if the loop terminates unsuccessfully. All this would change is that on unsuccessful return it would return an empty optional. No extra branches. On successful return it would construct an optional<iterator> around its loop iterator, but that, too, should be trivial.

There's no extra branches when consuming the result, either -- either way, there has to be some code that's checking for the failure state. And that code can be simpler when it's checking for an empty optional rather than checking for equality with an end iterator.

It probably can be optimised better as well, since an empty optional is a known state, while list.end() is an extra method call that can't be optimised away. (Granted, the user could cache that in a variable to avoid the duplicate call, but I suspect that this is only very rarely done.)

...

True enough, but that is a special case, since any_of() does not work optimally with trees. If you have a flat tree, you can use std::binary_search() too. My point is that the algorithms already support the use cases you care about that are related to find(). If you have another algorithm that you find to be a better example for what you're trying to show, we can discuss that one.

No, find (and siblings like find_if) are probably the most applicable algorithms. Correct me if I'm wrong, but most other algorithms don't have a "return input-last on failure" behaviour.

...

I don't really like the former example anyway because you're not checking for failure of lower_bound.

Well, that was the point of that example -- I don't have to.

Granted, it will end up with an empty range in the end so the result will still be correct, but you're potentially wasting some time in upper_bound.

No. upper_bound() will go into the first iteration of its loop with a false condition (first != last will not be true). If I had checked for failure of lower_bound() I would just be pessimizing the non-failure case with an extra branch.

Depends on the failure mode. lower_bound can return a non-last iterator which points at a non-equal element (indicating where the element could be inserted). This is what could cause unnecessary work for upper_bound, especially if it immediately goes for a binary search rather than first testing its start iterator. (It's actually the worst case for a binary search.)

Admittedly an optional return isn't going to help you with that case either.

Actually you can convincingly argue that lower_bound/upper_bound don't actually have a failure result -- in the case where they are currently returning last, it still means "this is where you could insert the element". So probably these should still return last, not an optional.
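A short sketch of that lower_bound behaviour, using two helper names (insertion_index, present) invented for this example:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Index at which `value` could be inserted to keep `sorted` ordered.
// lower_bound returns this position even when no element equals `value`;
// it returns end() only when every element is smaller than `value`.
inline std::size_t insertion_index(std::vector<int> const& sorted, int value) {
    auto const it = std::lower_bound(sorted.begin(), sorted.end(), value);
    return static_cast<std::size_t>(it - sorted.begin());
}

// True only if an element equal to `value` is actually present, which
// needs an explicit comparison on top of the end() check.
inline bool present(std::vector<int> const& sorted, int value) {
    auto const it = std::lower_bound(sorted.begin(), sorted.end(), value);
    return it != sorted.end() && *it == value;
}
```

For {10, 20, 30} and the value 25, lower_bound points at 30 rather than at end(), so "not equal to end()" is not the same as "found".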

Leon Mlakar

30 Oct 30 Oct

9:56 a.m.

I'm reposting this - by mistake I've used "Reply" instead of "Reply To List" button. I apologize for the inconvenience.

...

- for the sake of completeness the normalization type used at the text level ought to be a policy parameter; although I do understand your arguments against it I think it should be there even at the cost of different text types being inoperable without conversions

I disagree. Policy parameters are bad for reasoning. If I see a text::text, as things currently stand, I know that it is stored as a contiguous array of UTF-8, and that it is normalized FCC. If I add a template parameter to control the normalization, I change the invariants of the type. Types with different invariants should have different names. To do otherwise is a violation of the single responsibility principle.

Okay, the policy or not the policy was not my point ... it was to allow for different underlying normalizations. Granted, it may only be important to (a few) corner cases where input and/or output normalizations are given, and your assessment that it may not be worth the effort is reasonable ... unless you are aiming towards adding to the standard. Then the completeness imho becomes more important. Frankly, I'm not proficient enough in the meta-programming to make a strong case either for policy parameter or for explicit types/templates. I just happen to prefer the policy based approach.

...

- at the text level I'm not sure I'm willing to cope with different fundamental text types; I just want to use boost::text::text, pretty much the same as I use std::string as an alias to much more complex class template; heck, even at the string layer I'd probably prefer rope/contiguous concept to be a policy parameter to the same type template.

That would be like adding a template parameter to std::vector that makes it act like a std::deque for certain values of that parameter. Changing the space and time complexity of a type by changing a template parameter is the wrong answer.

No, that is not making std::vector act as std::deque - the text would still remain the text and act as a text, with the same interface. It's more like a FIFO implementation using either std::vector or std::deque for its store - since in both cases the FIFO has the same interface and functionally behaves the same, I really don't want two distinct types. A type template with a parameter that chooses the underlying storage seems much more natural to me.
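A minimal sketch of the FIFO analogy, assuming a hypothetical `fifo` class template whose second parameter selects the store (none of these names come from Boost.Text):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical FIFO: one interface, with the underlying store chosen by a
// template parameter. Storage must provide push_back, front, erase, begin,
// empty and size.
template<typename T, typename Storage = std::deque<T>>
class fifo
{
public:
    void push(T value) { store_.push_back(std::move(value)); }

    // Removes and returns the oldest element. Precondition: !empty().
    // Note the cost differs by store: erasing the first element is O(n)
    // for std::vector but O(1) for std::deque.
    T pop()
    {
        T front = std::move(store_.front());
        store_.erase(store_.begin());
        return front;
    }

    bool empty() const { return store_.empty(); }
    std::size_t size() const { return store_.size(); }

private:
    Storage store_;
};
```

`fifo<int>` and `fifo<int, std::vector<int>>` behave identically from the caller's side, which is the argument for one template. The comment in `pop()` shows the counter-argument made earlier in the thread: the parameter silently changes the time complexity of the same operation.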

...

- views should be introduced as views and not mixed with rope/contiguous fundamental types

That does not sound like what I want either, but I don't know what this refers to. Could you be specific?

Well, I'll have to think more about it ... it struck me that the docs often mention X and X_view in the same sentence, and you have to go elsewhere to learn that one is owning and the other isn't. I hope I'll find some time in the next few days and come back to this. Cheers, Leon

Klaim - Joël Lamotte

12:58 p.m.

On Sat, 26 Oct 2019 at 03:11, Zach Laine via Boost-users < boost-users@lists.boost.org> wrote:

...

About 14 months ago I posted the same thing. There was significant work that needed to be done to Boost.Text (the proposed library), and I was a bit burned out.

Now I've managed to make the necessary changes, and I feel the library is ready for review, if there is interest.

This library, in part, is something I want to standardize.

It started as a better string library for namespace "std2", with minimal Unicode support. Though "std2" will almost certainly never happen now, those string types are still in there, and the library has grown to also include all the Unicode features most users will ever need.

Github: https://github.com/tzlaine/text Online docs: https://tzlaine.github.io/text

If you care about portable Unicode support, or even addressing the embarrassment of being the only major production language with next to no Unicode support, please have a look and provide feedback.

I gave a talk about this at C++Now in May 2018, and now it's a bit out of date, as the library was not then finished. It's three hours, so, y'know, maybe skip it. For completeness' sake:

https://www.youtube.com/watch?v=944GjKxwMBo&index=7&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH

https://www.youtube.com/watch?v=GJ2xMAqCZL8&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH&index=8

Zach

(as a power user) I would be interested to have such a library in Boost and had already planned to try Boost.Text in my next C++ project with text. I am following the discussions happening in SG16 and understand that there are some differences with the parts that will be proposed for standardisation (as ThePHD explains in his talk). Though honestly both approaches seem to solve my problems so I'm open to trying both. If Boost.Text is stable today, I'm happy to use it (at least to replace ICU and have a proper Unicode text type). A. Joël Lamotte

Zach Laine

3:48 p.m.

On Wed, Oct 30, 2019 at 7:59 AM Klaim - Joël Lamotte <mjklaim@gmail.com> wrote:

...

On Sat, 26 Oct 2019 at 03:11, Zach Laine via Boost-users < boost-users@lists.boost.org> wrote:

...
About 14 months ago I posted the same thing. There was significant work that needed to be done to Boost.Text (the proposed library), and I was a bit burned out.

Now I've managed to make the necessary changes, and I feel the library is ready for review, if there is interest.

This library, in part, is something I want to standardize.

[snip]

(as a power user) I would be interested to have such a library in Boost and had already planned to try Boost.Text in my next C++ project with text.

I am following the discussions happening in SG16 and understand that there are some differences with the parts that will be proposed for standardisation (as ThePHD explains in his talk). Though honestly both approaches seem to solve my problems so I'm open to trying both. If Boost.Text is stable today, I'm happy to use it (at least to replace ICU and have a proper Unicode text type).

Yes, JeanHeyd and I started with very different approaches, but we're converging somewhat. Zach

Mathias Gaunard

1 Nov

11:35 a.m.

On Sat, 26 Oct 2019 at 02:11, Zach Laine via Boost-users <boost-users@lists.boost.org> wrote:

...

About 14 months ago I posted the same thing. There was significant work that needed to be done to Boost.Text (the proposed library), and I was a bit burned out.

Now I've managed to make the necessary changes, and I feel the library is ready for review, if there is interest.

This library, in part, is something I want to standardize.

It started as a better string library for namespace "std2", with minimal Unicode support. Though "std2" will almost certainly never happen now, those string types are still in there, and the library has grown to also include all the Unicode features most users will ever need.

Github: https://github.com/tzlaine/text Online docs: https://tzlaine.github.io/text

I would start by removing the superlative statements about Unicode being "hard" or "crazy". It's not that complicated compared to the actual hard problems that software engineers solve every day. The only thing is that people misunderstand what the scope of Unicode is: it's not just an encoding, it's a database and a set of algorithms (relying on said database) to facilitate natural text processing of arbitrary scripts, and it makes compromises to integrate with industry practices that existed before all those scripts were brought together under the same umbrella.

Now the string/container/memory management, this is quite irrelevant. That sort of stuff has nothing to do with Unicode and I certainly do not want some Unicode library to mess with the way I am organizing how my data is stored in memory. Your rope etc. containers belong in a completely independent library.

What's important is providing an efficient Unicode character database, and implementing the algorithms in a way that is generic, working for arbitrary ranges and able to be lazily evaluated (i.e. range adaptors). I already did all that work more than 10 years ago as a two-month GSoC project, though there are some limitations since at that time ranges and range adaptors were still fairly new ideas for C++. It does however provide a generic framework to define arbitrary algorithms that can be evaluated either lazily or eagerly.

To be honest I can't say I find your library to be much of an improvement, at least in terms of usability, since the programming interface seems more constrained (why don't things work with arbitrary ranges rather than these "text" containers) and verbose (just look at the code to do transcoding with iterators), the set of features is quite small, the database itself is not even accessible, and last I remember your implementation was ridiculously bloated in size.

It also doesn't provide the ability to do fast substring search, which you'd typically do by searching for a substring at the character encoding level and then eliminating matches that do not fall on a satisfying boundary. Instead it suggests doing the search at the grapheme level, which is much slower, and the facility to test for a boundary isn't provided anyway.

I'm pretty sure I made similar comments in the past, but I don't feel like any of them has been addressed.
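The search strategy described above (match at the encoding level, then discard matches that do not fall on a boundary) could be sketched roughly as follows. Everything here is an assumption made for illustration: `find_on_boundaries` and `is_toy_grapheme_boundary` are hypothetical names, and the boundary test is a toy that only knows about UTF-8 continuation bytes and the combining marks U+0300..U+036F. A real implementation would use the grapheme cluster rules of UAX #29 and the Unicode character database.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Toy boundary test (NOT UAX #29): offset i is rejected if it lands on a
// UTF-8 continuation byte or at the start of a combining mark in
// U+0300..U+036F (encoded as 0xCC 0x80..0xBF and 0xCD 0x80..0xAF).
bool is_toy_grapheme_boundary(std::string_view text, std::size_t i)
{
    if (i == 0 || i >= text.size())
        return true;
    auto const b0 = static_cast<unsigned char>(text[i]);
    if ((b0 & 0xC0) == 0x80)
        return false;
    if (i + 1 < text.size()) {
        auto const b1 = static_cast<unsigned char>(text[i + 1]);
        if (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF)
            return false;
        if (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF)
            return false;
    }
    return true;
}

// Byte-level search first, boundary filtering second. Returns the byte
// offsets of every match whose start and end both satisfy is_boundary.
template<typename BoundaryPred>
std::vector<std::size_t> find_on_boundaries(
    std::string_view text, std::string_view needle, BoundaryPred is_boundary)
{
    std::vector<std::size_t> hits;
    if (needle.empty())
        return hits;
    for (std::size_t pos = text.find(needle); pos != std::string_view::npos;
         pos = text.find(needle, pos + 1)) {
        if (is_boundary(text, pos) && is_boundary(text, pos + needle.size()))
            hits.push_back(pos);
    }
    return hits;
}
```

For example, searching for `"cafe"` in `"cafe\xCC\x81 cafe"` (a decomposed "café" followed by a plain "cafe") finds two byte-level matches, but the first one ends in front of a combining acute accent and is filtered out, leaving only the match at byte offset 7.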
