[boost] Parsers vs Unicode

22 Feb 2024

Since the review of Boost.Parser is currently under way, I wanted to share an insight we had at think-cell when working with Boost.Spirit. It has been our standard for all custom parsing needs for years. Most of our parsers are small, but some are larger, such as the one for Excel expressions.
Of course, our input is mostly Unicode, either UTF-8 or UTF-16. Matching Unicode is complex. Comparison by code point is usually not the right thing. Instead, we must normalize, and normalization itself offers several choices of what to accept as equal:
https://en.wikipedia.org/wiki/Unicode_equivalence
Case-insensitive matching is more complex still: it is slow and language-dependent.
Input is often not guaranteed to be valid Unicode. For example, file names on Windows are sequences of 16-bit units, which allows unmatched surrogates. The same holds for input from Win32 edit boxes and for file content.
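
To make "unmatched surrogates" concrete, here is a minimal sketch of a well-formedness check for a sequence of 16-bit units. The function name is_well_formed_utf16 is made up for this illustration and is not part of Boost.Spirit or Boost.Parser; it only checks surrogate pairing and assumes C++17.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if every surrogate code unit in `s`
// belongs to a well-formed high/low pair.
constexpr bool is_well_formed_utf16(std::u16string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t const c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {
            // High (leading) surrogate: a low surrogate must follow.
            if (i + 1 == s.size()) return false;
            char16_t const next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;  // skip the low surrogate that completes the pair
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            // Low (trailing) surrogate without a preceding high surrogate.
            return false;
        }
    }
    return true;
}
```

A Windows file name can fail this check and still be a perfectly usable file name, so a parser that insists on valid Unicode for its whole input would reject it.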
But we realized that for almost all grammars we have, all this complexity does not matter. The reserved symbols of most grammars (JSON, XML, C++, URLs, etc.) are pure ASCII. Semantically relevant strings are usually ASCII as well ("EXCEL.EXE"). ASCII can be correctly and quickly matched on a per-code-unit basis, because in both UTF-8 and UTF-16 a code unit below 0x80 never occurs as part of a multi-unit sequence. Case-insensitive matching for ASCII is simple and fast. User-defined strings, such as JSON string values, may contain Unicode, but they usually do not affect parsing decisions. The user may want Unicode validation for these strings, but this can be done by the leaf parser for these strings rather than for the whole input.
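
As an illustration of per-code-unit ASCII matching, here is a minimal sketch, assuming C++17. The names ascii_to_lower and starts_with_ascii_nocase are invented for this example and are not taken from Boost.Spirit or from our code base.

```cpp
#include <cstddef>
#include <string_view>

// Lowercases an ASCII letter; every other value is returned unchanged.
// It works on the numeric value only, so it does not depend on locale.
constexpr char32_t ascii_to_lower(char32_t c) {
    return (c >= U'A' && c <= U'Z') ? c + (U'a' - U'A') : c;
}

// Hypothetical helper: returns true if `input` begins with the ASCII string
// `literal`, ignoring ASCII case. CodeUnit may be char, char16_t, char32_t
// or wchar_t; `literal` is assumed to contain ASCII only.
template <typename CodeUnit>
constexpr bool starts_with_ascii_nocase(std::basic_string_view<CodeUnit> input,
                                        std::string_view literal) {
    if (input.size() < literal.size()) return false;
    for (std::size_t i = 0; i < literal.size(); ++i) {
        // Non-ASCII code units (UTF-8 lead/continuation bytes, surrogates)
        // map to values outside the ASCII range and therefore never match.
        auto const in = static_cast<char32_t>(input[i]);
        auto const lit =
            static_cast<char32_t>(static_cast<unsigned char>(literal[i]));
        if (ascii_to_lower(in) != ascii_to_lower(lit)) return false;
    }
    return true;
}
```

The same function accepts std::string_view("excel.exe /e") and std::u16string_view(u"Excel.Exe") with the literal "EXCEL.EXE", without any normalization or decoding step.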
Since so much matching is against ASCII, we found it useful to have compile-time known ASCII literals in the parser library. With them, the same grammar can be used for all input encodings. When parsing user-defined strings, they will have the encoding of the input, but that’s fine. Any encoding conversion can be dealt with separately from the parser.
Finally, we may want to parse more than just strings. Parsing binary files, or sequences of DNA, should be possible and efficient.
Thus I recommend separating Unicode processing from the parser library. The parser library operates on an abstract stream of symbols. For Unicode text these would be code units. It provides the composite parsers, such as sequences with and without backtracking, alternatives, and Kleene star, and leaves the interpretation of the symbols entirely to the leaf parsers, which may or may not care about Unicode.
We started modifying Boost.Spirit in this direction, and it serves us well.

--
Dr. Arno Schödl
CTO
schoedl@think-cell.com | +49 30 6664731-0

