On 12.06.20 21:56, Zach Laine via Boost wrote:
(And no, unencoded_rope would not be a better choice. I can't memmap ropes, but I can memmap string_views.)
You can easily memmap string_views 2GB at a time, though. That is a simple workaround for this corner case you have mentioned, but it is truly a corner case compared to the much more common case of using strings for holding contiguous sequences of char. Contiguous sequences of char really should not be anywhere near 2GB, for efficiency reasons.
A memmapped string_view /is/ a contiguous sequence of char. I don't see the difference.
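For what it's worth, the mapping itself is just a pointer plus a length; here is a minimal POSIX sketch (map_window is my own illustrative helper, with error handling reduced to returning an empty view):

#include <sys/mman.h>
#include <sys/types.h>
#include <cstddef>
#include <string_view>

// Map a read-only window of an already-open file (POSIX) and view it as a
// contiguous sequence of char.  The offset handed to mmap must be a multiple
// of the page size.
inline std::string_view map_window(int fd, off_t page_aligned_offset,
                                   std::size_t length)
{
    void * p = ::mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd,
                      page_aligned_offset);
    if (p == MAP_FAILED)
        return {};
    return std::string_view(static_cast<char const *>(p), length);
}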
I like the use of negative indices for indexing from the end in Python, but I am ambivalent about using the same feature in C++. None of the other types I regularly use in C++ work like that, and the runtime cost involved is a lot more noticeable in C++.
I don't really get what you mean about the runtime cost. Could you be more explicit?
Somewhere in the implementation of operator[] and operator(), there has to be a branch on index < 0 (or >= 0) in order for that negative index trick to work, which the compiler can't always optimize away. Branches are often affordable but they're not free.
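To illustrate the branch I mean, here is a hypothetical string-view-like type with Python-style negative indexing (not the library's actual implementation):

#include <cstddef>

// Hypothetical string-view-like type with Python-style negative indexing.
struct neg_index_view
{
    char const * data_;
    std::ptrdiff_t size_;

    char operator[](std::ptrdiff_t i) const noexcept
    {
        // This is the branch in question: it runs on every access unless the
        // compiler can prove the sign of i at the call site.
        if (i < 0)
            i += size_;
        return data_[i];
    }
};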
Also, using the same type of counting-from-end indices and counting-from-beginning indices seems unsafe. A separate type for counting-from-end would be safer and faster, at the cost of being more syntactically heavy.
Interesting. Do you mean something like a strong typedef for "index" and "negative-index", or something else?
Yes, something like that.
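Roughly along these lines (names are mine, purely illustrative): a separate strong type for from-the-end indices turns the sign check into overload resolution, so there is no runtime branch and the two kinds of index cannot be mixed by accident:

#include <cstddef>

// Hypothetical strong type: an index counted from the end of the sequence.
struct from_end
{
    std::size_t offset;  // 1 = last element, 2 = second to last, ...
};

struct indexed_view
{
    char const * data_;
    std::size_t size_;

    // From the beginning: no branch.
    char operator[](std::size_t i) const noexcept { return data_[i]; }

    // From the end: chosen by overload resolution at compile time, so there
    // is still no runtime check on the sign of the index.
    char operator[](from_end i) const noexcept { return data_[size_ - i.offset]; }
};

With that, v[from_end{1}] is the last character, and a plain negative integer no longer has a special meaning that silently kicks in at runtime.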
One thing that appears to be missing is normalization-preserving append/insert/erase operations. These are available in the text layer, but that means being tied to the specific text classes provided by that layer.
Hm. I had not considered making these available as algorithms, and I generally like the approach. But could you be more specific? In particular, do you mean that insert() would take a container C and do C.insert(), then renormalize? This is the approach used by C++20's erase() and erase_if() free functions. Or did you mean something else?
I hadn't thought through the interface in detail. I just saw that this was a feature of the text layer, and thought it would be nice to have in the unicode layer, because I don't want to use the text layer (in its current form).
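As a rough sketch of the shape I have in mind, in the spirit of C++20's std::erase()/std::erase_if() free functions (normalize_nfc is a stand-in for whatever normalization entry point the unicode layer provides, not an actual library function):

#include <cstddef>
#include <string>

// Assumed to exist; stand-in for the unicode layer's normalization call.
void normalize_nfc(std::string & utf8);

// Free-function insert: mutate the container, then restore the
// normalization invariant.
inline void insert_normalized(std::string & s, std::size_t pos,
                              std::string const & utf8_to_insert)
{
    s.insert(pos, utf8_to_insert);
    // A smarter implementation would renormalize only the code points
    // around the insertion point instead of the whole string.
    normalize_nfc(s);
}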
Requiring unicode text to be in Stream-Safe Format is another time bomb waiting to go off, but it's also a usability issue. The library should provide an algorithm to put unicode text in Stream-Safe Format, and should automatically apply that algorithm whenever text is normalized. This would make it safe to use Boost.Text on data from an untrusted source so long as the data is normalized first, which you have to do with untrusted data anyway.
This seems like a good idea to me. I went back and forth over whether or not to supply the SSF algorithm, since it's not an official Unicode algorithm, but adding it to text's normalization step would be reason enough to do so.
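For reference, the conformance step is small. A simplified sketch (canonical_combining_class() is assumed rather than tied to any real API, and UAX #15's rule about counting the non-starters of each code point's canonical decomposition is glossed over):

#include <vector>

// Assumed to exist somewhere: the Unicode canonical combining class of cp.
int canonical_combining_class(char32_t cp);

// Simplified Stream-Safe Format enforcement (UAX #15): never allow more than
// 30 consecutive non-starters; break long runs with U+034F COMBINING GRAPHEME
// JOINER, which is itself a starter and so resets the run.
std::vector<char32_t> to_stream_safe(std::vector<char32_t> const & in)
{
    std::vector<char32_t> out;
    int run = 0;  // consecutive non-starters seen so far
    for (char32_t cp : in) {
        if (canonical_combining_class(cp) != 0) {
            if (run == 30) {
                out.push_back(0x034F);  // insert CGJ before the 31st non-starter
                run = 0;
            }
            ++run;
        } else {
            run = 0;
        }
        out.push_back(cp);
    }
    return out;
}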
Text layer: overall, I don't like it.
On one hand, there is the gratuitous restriction to FCC. Why can't other normalization forms be supported, given that the unicode layer supports them?
Here is the philosophy: If you have a template parameter for UTF-encoding and/or normalization form, you have an interoperability problem in your code. Multiple text<...>'s may exist for which there is no convenient or efficient interop story. If you instead convert all of your text to one UTF+normalization that you use throughout your code, you can do all your work in that one scheme and transcode and/or renormalize at the program input/output boundaries.
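Restated as code, with hypothetical helper names, that philosophy amounts to this: one scheme (say UTF-8 + FCC) everywhere inside the program, with conversion only where data crosses the program boundary.

#include <string>

// Hypothetical stand-ins for the unicode layer's transcoding and
// normalization entry points.
std::string transcode_to_utf8(std::u16string const & utf16);
std::string normalize_fcc(std::string utf8);

// Conversion happens once, at the input boundary; everything downstream
// assumes UTF-8 + FCC.
std::string ingest(std::u16string const & from_external_api)
{
    return normalize_fcc(transcode_to_utf8(from_external_api));
}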
Having to renormalize at API boundaries can be prohibitively expensive.
Because, again, I want there to be trivial interop. Having text<text::string> and text<std::string> serves what purpose exactly? That is, I have never seen a compelling use case for needing both at the same time. I'm open to persuasion, of course.
The advantage of text<std::string> is API interop with functions that accept std::string arguments. I'm not sure what the advantage of text<boost::text::string> is. But if we accept that boost::text::rope (which would just be text<boost::text::unencoded_rope> in my scheme) is useful, then it logically follows that text
The SSF assumption is explicitly allowed in the Unicode standard, and it's less onerous than not checking array-bounds access in operator[] in one's array-like types. Buffer overflows are really common, and SSF violations are not. That being said, I can add the SSF-conformance algorithm as mentioned above.
Unintentional SSF violations are rare. Intentional SSF violations can be used as an attack vector, if "undefined behavior" translates to "memory error".
.../the_unicode_layer/searching.html: the note at the end of the page is wrong, assuming you implemented the algorithms correctly. The concerns for searching NFD strings are similar to the concerns for searching FCC strings.
In both FCC and NFD:
- There is a distinction between A+grave+acute and A+acute+grave, because they are not canonically equivalent.
- A+grave is a partial grapheme match for A+grave+acute.
- A+acute is not a partial grapheme match for A+grave+acute.
- A+grave is not a partial grapheme match for A+acute+grave.
- A+acute is a partial grapheme match for A+acute+grave.

But:
- A is a partial grapheme match for A+grave+acute in NFD, but not in FCC (see the code point sketch below).
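To make that last bullet concrete, as raw code point sequences (U+0300 is COMBINING GRAVE ACCENT, U+0301 is COMBINING ACUTE ACCENT):

// A + grave + acute in the two normalization forms.
constexpr char32_t a_grave_acute_nfd[] = {0x0041, 0x0300, 0x0301}; // A, grave, acute
constexpr char32_t a_grave_acute_fcc[] = {0x00C0, 0x0301};         // A-with-grave, acute

// A code-point-level search for U+0041 ("A") matches part of the grapheme in
// the NFD form, but not in the FCC form, where the base letter has been
// composed into U+00C0.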
Hm. That note was added because the specific case mentioned fails to work for NFD, but works for FCC and NFC. I think the grapheme matching analysis above addresses something different from what the note is talking about -- the note is concerned with code-point-level searches producing (perhaps) surprising results. The grapheme-based search does not, so which CPs are matched by a particular grapheme does not seem to be relevant. Perhaps I'm missing something?
I am talking about code point level matches here. ("Partial grapheme match" means "matches some code points within a grapheme, but not the whole grapheme".)
I was not able to get the library to build, so I was not able to test it. But it does look like it should be a good ICU replacement for my purposes, assuming it doesn't have any serious bugs.
Huh. What was the build problem? Was it simply that it takes forever to build all the tests?
The configuration step failed because it tries to compile and run test programs in order to gather information about my environment, and I was running in a cross-compile context, which prevents CMake from running the programs that it compiles. Probably not too hard to work around on my part by simply not using a cross-compile context.

-- 
Rainer Deyke (rainerd@eldwood.com)