On Fri, Jun 12, 2020 at 4:15 PM Rainer Deyke via Boost <boost@lists.boost.org> wrote:
On 12.06.20 21:56, Zach Laine via Boost wrote:
(And no, unencoded_rope would not be a better choice. I can't memmap ropes, but I can memmap string_views.)
You can easily memmap string_views 2GB at a time, though. That is a simple workaround for this corner case you have mentioned, but it is truly a corner case compared to the much more common case of using strings for holding contiguous sequences of char. Contiguous sequences of char really should not be anywhere near 2GB, for efficiency reasons.
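On POSIX, for instance, the workaround is just mapping windows of the file and wrapping each one in a string_view. A minimal sketch (page alignment of the offset and all error handling are elided):

    #include <cstddef>
    #include <string_view>
    #include <sys/mman.h>
    #include <sys/types.h>

    // View one read-only window of a large file as chars.  The offset
    // must be page-aligned, and mmap()'s failure return is not checked.
    std::string_view map_window(int fd, off_t offset, std::size_t len)
    {
        void * p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, offset);
        return {static_cast<char const *>(p), len};
    }

Walking the whole file is then a loop over such windows, each comfortably under 2GB.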
A memmapped string_view /is/ a contiguous sequence of char. I don't see the difference.
The difference is mutability. There's no perf concern with erasing the first element of a string_view, since that's not even a supported operation.
I don't really get what you mean about the runtime cost. Could you be more explicit?
Somewhere in the implementation of operator[] and operator(), there has to be a branch on index < 0 (or >= 0) in order for that negative index trick to work, which the compiler can't always optimize away. Branches are often affordable but they're not free.
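Concretely, the accessor has to do something like this (a sketch of the technique, not the library's actual code):

    #include <cstddef>
    #include <string_view>

    // Python-style negative indexing: every access pays a sign test,
    // and the compiler cannot always hoist or eliminate it.
    inline char at(std::string_view sv, std::ptrdiff_t i)
    {
        if (i < 0)                                        // the branch
            i += static_cast<std::ptrdiff_t>(sv.size());  // in question
        return sv[static_cast<std::size_t>(i)];
    }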
Ah, I see, thanks. Would it make you feel better if negative indexing were only used when getting substrings?
One thing that appears to be missing is normalization-preserving append/insert/erase operations. These are available in the text layer, but that means being tied to the specific text classes provided by that layer.
Hm. I had not considered making these available as algorithms, and I generally like the approach. But could you be more specific? In particular, do you mean that insert() would take a container C and do C.insert(), then renormalize? This is the approach used by C++20's erase() and erase_if() free functions. Or did you mean something else?
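Something along these lines, say (a rough sketch; renormalize_around() is a hypothetical placeholder for whatever renormalization primitive the Unicode layer would expose):

    #include <cstddef>
    #include <string_view>

    // Normalization-preserving insert as a free function, in the style
    // of C++20's std::erase()/std::erase_if().  renormalize_around() is
    // hypothetical, standing in for a real renormalization algorithm.
    template<typename Container>
    void insert(Container & c, std::size_t offset, std::string_view snippet)
    {
        c.insert(c.begin() + offset, snippet.begin(), snippet.end());
        // A real implementation would widen [offset, offset + size) to
        // the nearest normalization-stable boundaries before fixing up.
        renormalize_around(c, offset, offset + snippet.size());  // hypothetical
    }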
I hadn't thought through the interface in detail. I just saw that this was a feature of the text layer, and thought it would be nice to have in the unicode layer, because I don't want to use the text layer (in its current form).
I don't need a detailed interface. Pseudocode would be fine too.
Text layer: overall, I don't like it.
On one hand, there is the gratuitous restriction to FCC. Why can't other normalization forms be supported, given that the unicode layer supports them?
Here is the philosophy: If you have a template parameter for UTF-encoding and/or normalization form, you have an interoperability problem in your code. Multiple text<...>'s may exist for which there is no convenient or efficient interop story. If you instead convert all of your text to one UTF+normalization that you use throughout your code, you can do all your work in that one scheme and transcode and/or renormalize at the program input/output boundaries.
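In code, the intended shape is something like this (a sketch only; to_utf8() and normalize_fcc() are hypothetical stand-ins, not real Boost.Text names):

    #include <string>
    #include <string_view>

    // Hypothetical stand-ins for the library's transcoding and
    // normalization algorithms:
    std::string to_utf8(std::u16string_view utf16);
    void normalize_fcc(std::string & utf8);

    // Pick one scheme (say UTF-8 + FCC), convert once at the program
    // boundary, and use only that representation internally.
    std::string read_input(std::u16string_view platform_input)
    {
        std::string utf8 = to_utf8(platform_input);  // hypothetical
        normalize_fcc(utf8);                         // hypothetical
        return utf8;  // everything downstream assumes UTF-8 + FCC
    }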
Having to renormalize at API boundaries can be prohibitively expensive.
Sure. Anything can be prohibitively expensive in some context. If that's the case in a particular program, I think it is likely to be unacceptable to use text::operator+(string_view) as well, since that also does on-the-fly normalization. Someone, somewhere, has to pay that cost if you want to use two chunks of text in encoding/normalization A and encoding/normalization B.

You might be able to keep working in A for some text and in B for other text, but I think code that works like that is going to be hard to reason about, and will be about as common as code that freely mixes wstring and string (and I mean not only at program boundaries). That is, not very common. The majority case is that texts have to be able to interop arbitrarily within your program, so you need to pay the conversion cost somewhere eventually anyway.

FWIW, I'm planning to write standardization papers for the Unicode layer stuff for C++23, and the text stuff in the C++26 timeframe. My hope is that we will adopt my text design here into Boost in plenty of time to see whether it is actually as workable as I claim. I'm open to the idea of being wrong about its design, and to changing it to a template if the non-template design turns out to be problematic.
Because, again, I want there to be trivial interop. Having text<text::string> and text<std::string> serves what purpose exactly? That is, I have never seen a compelling use case for needing both at the same time. I'm open to persuasion, of course.
The advantage of text<std::string> is API interop with functions that accept std::string arguments.
Sure. That exists now, though it does require a copy. It could also be done via a move if I replace text::string with std::string within text::text, which I expect to do as a result of this review.
I'm not sure what the advantage of text<boost::text::string> is. But if we accept that boost::text::rope (which would just be text<boost::text::unencoded_rope> in my scheme)
That does not work. Strings and ropes have different APIs.
is useful, then it logically follows that text<boost::text::string> could also be useful.
That's what I don't get. Could you explain how text<A> and text<B> are useful in a specific case? "Could also be useful" is not sufficient motivation to me. I understand the impulse, but I think that veers into over-generality in a way that I have found to be problematic over and over in my career.
.../the_unicode_layer/searching.html: the note at the end of the page is wrong, assuming you implemented the algorithms correctly. The concerns for searching NFD strings are similar to the concerns for searching FCC strings.
In both FCC and NFD:

- There is a distinction between A+grave+acute and A+acute+grave, because they are not canonically equivalent.
- A+grave is a partial grapheme match for A+grave+acute.
- A+acute is not a partial grapheme match for A+grave+acute.
- A+grave is not a partial grapheme match for A+acute+grave.
- A+acute is a partial grapheme match for A+acute+grave.

But:

- A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.
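That last point is easy to demonstrate at the code point level (a self-contained example, with std::u32string standing in for a code point sequence):

    #include <cassert>
    #include <string>

    int main()
    {
        // A + grave + acute, in each normalization form:
        std::u32string nfd = U"A\u0300\u0301";  // NFD: fully decomposed
        std::u32string fcc = U"\u00C0\u0301";   // FCC: A+grave composed
        // A code-point-level search for 'A' half-matches the grapheme
        // in NFD, but in FCC there is no bare 'A' code point to match.
        assert(nfd.find(U'A') == 0);
        assert(fcc.find(U'A') == std::u32string::npos);
    }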
Hm. That note was added because the specific case mentioned fails to work for NFD, but works for FCC and NFC. I think the grapheme matching analysis above addresses something different from what the note is talking about: the note is concerned with code-point-level searches producing (perhaps) surprising results. The grapheme-based search does not, so which code points are matched by a particular grapheme does not seem to be relevant. Perhaps I'm missing something?
I am talking about code point level matches here. ("Partial grapheme match" means "matches some code points within a grapheme, but not the whole grapheme".)
Ah, I see. Thanks. I'll update the note.

Zach