On Thu, 28 Dec 2023 at 22:05, Zach Laine via Boost wrote:
I'm trying to gauge interest in a parsing library to replace Boost.Spirit 2/Spirit X3. I'm also looking for endorsements.
The library is intended to remedy some shortcomings of Boost.Spirit*. I think these are great libraries, but Spirit 2 was written in pre-C++11 C++ (I think; certainly its dependencies were). Most, if not all, of the downsides stem from that: long compile times, inscrutable compilation failures, etc. (Boost.Parser compile times are quite low.)
I'm calling my proposal Boost.Parser, and it follows many of the conventions of Boost.Spirit 2 and X3, such as the overloaded operators, the names of many parsers and directives, etc. It requires C++17 or later.
From the introduction in the online docs:
"""
Boost.Parser is a parser combinator library. That is, it consists of a set of low-level primitive parsers, and operations that can be used to combine those parsers into more complicated parsers.
There are primitive parsers that parse epsilon (the empty string), chars, ints, floats, etc.
There are operations which combine parsers to create new parsers. For instance, the Kleene star operation takes an existing parser p and creates a new parser that matches zero or more occurrences of whatever p matches. Both callable objects and operator overloads are used for the combining operations. For instance, operator*() is used for Kleene star, and you can also write repeat(n)[p] to create a parser for exactly n repetitions of p.
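To make the two combining operations above concrete, here is a minimal standalone C++17 sketch. It is not Boost.Parser's implementation or API: `toy_parser`, `toy_char`, `toy_star`, and `toy_repeat` are hypothetical names invented for illustration, and a "parser" here only reports how many characters it consumed.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string_view>

// Hypothetical toy type, not Boost.Parser's API. A parser looks at the
// remaining input and, on success, returns the number of chars it consumed.
using toy_parser = std::function<std::optional<std::size_t>(std::string_view)>;

// Primitive parser: matches exactly one occurrence of the char c.
toy_parser toy_char(char c) {
    return [c](std::string_view in) -> std::optional<std::size_t> {
        if (!in.empty() && in.front() == c) return std::size_t{1};
        return std::nullopt;
    };
}

// Kleene star: zero or more occurrences of whatever p matches. Never fails.
toy_parser toy_star(toy_parser p) {
    return [p](std::string_view in) -> std::optional<std::size_t> {
        std::size_t total = 0;
        while (auto n = p(in.substr(total))) {
            if (*n == 0) break;  // stop if p matched without consuming input
            total += *n;
        }
        return total;
    };
}

// Exactly n repetitions of p; fails if p fails before the n-th match.
toy_parser toy_repeat(int n, toy_parser p) {
    assert(n >= 0);
    return [n, p](std::string_view in) -> std::optional<std::size_t> {
        std::size_t total = 0;
        for (int i = 0; i < n; ++i) {
            auto m = p(in.substr(total));
            if (!m) return std::nullopt;
            total += *m;
        }
        return total;
    };
}
```

In this sketch, `toy_star(toy_char('a'))` applied to "aaab" consumes three chars, while `toy_repeat(4, toy_char('a'))` fails on the same input because only three repetitions are available.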
Boost.Parser also tries to accommodate the multiple ways that people often want to get a parse result out of their parsing code. Some parsing may best be done by returning an object that represents the result of the parse. Other parsing may best be done by filling in a preexisting data structure. Yet other parsing may best be done by parsing small sections of a large document, and reporting the results of subparsers as they are finished, via callbacks. Boost.Parser accommodates all these ways of working, and even makes it possible to do callback-based or non-callback-based parsing without rewriting any code (except by changing the top-level call from parse() to callback_parse()).
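The three ways of getting results out can be sketched in plain C++17 as follows. This is only an illustration of the idea, not Boost.Parser's `parse()` or `callback_parse()`: the functions `toy_callback_parse`, `toy_parse_into`, and `toy_parse` are hypothetical, and the "grammar" is hard-coded as comma-separated non-negative integers.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <optional>
#include <string_view>
#include <vector>

// Callback style (hypothetical): each int is reported through on_int as soon
// as it is complete. Accepts input such as "1,22,3". Overflow is not checked.
bool toy_callback_parse(std::string_view in, const std::function<void(int)>& on_int) {
    std::size_t i = 0;
    while (true) {
        if (i == in.size() || !std::isdigit(static_cast<unsigned char>(in[i])))
            return false;  // a digit is required at the start of every item
        int value = 0;
        while (i < in.size() && std::isdigit(static_cast<unsigned char>(in[i])))
            value = value * 10 + (in[i++] - '0');
        on_int(value);
        if (i == in.size()) return true;   // consumed everything
        if (in[i++] != ',') return false;  // items must be separated by ','
    }
}

// Fill-in style (hypothetical): appends to a preexisting container.
// On failure, out may already hold the items parsed before the error.
bool toy_parse_into(std::string_view in, std::vector<int>& out) {
    return toy_callback_parse(in, [&out](int v) { out.push_back(v); });
}

// Return-value style (hypothetical): the result object is the return value.
std::optional<std::vector<int>> toy_parse(std::string_view in) {
    std::vector<int> result;
    if (!toy_parse_into(in, result)) return std::nullopt;
    return result;
}
```

All three entry points share one parsing routine here, which is the point of the paragraph above: only the top-level call changes, not the parser.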
All of Boost.Parser's public interfaces are sentinel- and range-friendly, just like the interfaces in std::ranges.
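As a rough illustration of what "sentinel-friendly" means, the C++17 sketch below accepts an iterator and an end marker of a different type. It does not use Boost.Parser or std::ranges; `toy_null_sentinel` and `toy_count_digits` are hypothetical names made up for this example.

```cpp
#include <cstddef>

// Hypothetical sentinel: compares equal to a pointer that points at '\0',
// so a null-terminated string can be consumed without computing its length.
struct toy_null_sentinel {};
inline bool operator==(const char* p, toy_null_sentinel) { return *p == '\0'; }
inline bool operator!=(const char* p, toy_null_sentinel s) { return !(p == s); }

// Counts the leading decimal digits in [first, last). The end marker may be
// an iterator of the same type as first, or a sentinel of another type.
template <class Iter, class Sentinel>
std::size_t toy_count_digits(Iter first, Sentinel last) {
    std::size_t n = 0;
    for (; first != last && *first >= '0' && *first <= '9'; ++first) ++n;
    return n;
}
```

The same function template works with a classic iterator pair, for example `toy_count_digits(p, p + 3)`, and with a sentinel, for example `toy_count_digits("123abc", toy_null_sentinel{})`.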
Boost.Parser is Unicode-aware through and through. When you parse ranges of char, Boost.Parser does not assume any particular encoding — not Unicode or any other encoding. Parsing of inputs other than plain chars assumes that the input is Unicode. In the Unicode-aware code paths, all parsing is done by matching code points. This means that you can feed UTF-8 strings into Boost.Parser, both as input and within your parser, and the right sort of matching occurs. For instance, if your parser is trying to match repetitions of the char '\xcc' (which is a lead byte from a UTF-8 sequence, and so is malformed UTF-8 if not followed by an appropriate UTF-8 code unit), it will not match the start of "\xcc\x80" (UTF-8 for the code point U+0300). Boost.Parser knows that the matching must be whole-code-point, and so it interprets the char '\xcc' as the code point U+00CC.
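The '\xcc' example can be checked with a small standalone C++17 sketch of whole-code-point matching. This is not how Boost.Parser is implemented: `toy_decode_utf8` and `toy_count_leading` are hypothetical helpers, and the decoder assumes well-formed UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: decodes UTF-8 into code points. Minimal sketch that
// assumes well-formed input (no checks for overlong or stray bytes).
std::vector<char32_t> toy_decode_utf8(std::string_view s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i++]);
        // Number of continuation bytes implied by the lead byte.
        int extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
        // Keep only the payload bits of the lead byte.
        char32_t cp = extra == 0 ? b : b & (0x3F >> extra);
        for (int k = 0; k < extra && i < s.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Counts how many times the code point `wanted` repeats at the start of the
// UTF-8 input, comparing whole code points rather than individual bytes.
std::size_t toy_count_leading(std::string_view utf8, char32_t wanted) {
    std::size_t n = 0;
    for (char32_t cp : toy_decode_utf8(utf8)) {
        if (cp != wanted) break;
        ++n;
    }
    return n;
}
```

With this sketch, looking for U+00CC in "\xcc\x80" finds nothing, because those two bytes decode to the single code point U+0300. The UTF-8 encoding of U+00CC itself is "\xc3\x8c", and that does match.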
Error reporting is important to get right, and it is important to make errors easy to understand, especially for end-users. Boost.Parser produces runtime parse error messages that are very similar to the diagnostics that you get when compiling with GCC and Clang (it even supports warnings that don't fail the parse). The exact token associated with a diagnostic can be reported to the user, with the containing line quoted, and with a marker pointing right at the token. Boost.Parser takes care of this for you; your parser does not need to include any special code to make this happen. Of course, you can also replace the error handler entirely, if it doesn't fit your needs.
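For readers unfamiliar with that diagnostic style, the sketch below shows how such a message could be assembled from an input buffer and a failure offset. It is a standalone C++17 illustration, not Boost.Parser's error handler: `toy_format_error` and its parameters are hypothetical, and the exact message layout is an assumption modelled loosely on GCC and Clang output.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical formatter: builds "file:line:column: error: message", then the
// offending line quoted, then a caret under the failing position.
// Tabs in the quoted line are not expanded, so the caret may drift on them.
std::string toy_format_error(std::string_view file, std::string_view input,
                             std::size_t offset, std::string_view message) {
    offset = std::min(offset, input.size());

    // Find the start of the line containing offset and its 1-based number.
    std::size_t line_start = 0;
    std::size_t line_no = 1;
    for (std::size_t i = 0; i < offset; ++i) {
        if (input[i] == '\n') {
            line_start = i + 1;
            ++line_no;
        }
    }
    std::size_t line_end = input.find('\n', line_start);
    if (line_end == std::string_view::npos) line_end = input.size();
    std::size_t column = offset - line_start;  // 0-based column

    std::string out = std::string(file) + ":" + std::to_string(line_no) + ":" +
                      std::to_string(column + 1) + ": error: " +
                      std::string(message) + "\n";
    out += std::string(input.substr(line_start, line_end - line_start)) + "\n";
    out += std::string(column, ' ') + "^\n";
    return out;
}
```

For the input "a = 1\nb = ?\n" and offset 10 (the '?'), this produces a header line reading "in.txt:2:5: error: expected number" when called with those file and message strings, followed by the quoted line "b = ?" and a caret under the '?'.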
Debugging complex parsers can be a real nightmare. Boost.Parser makes it trivial to get a trace of your entire parse, with easy-to-read (and very verbose) indications of where each part of the trace is within the parse, the state of values produced by the parse, etc. Again, you don't need to write any code to make this happen — you just pass a parameter to parse().
Dependencies are still a nightmare in C++, so Boost.Parser can be used as a purely standalone library, independent of Boost.
"""
Boost.Parser aims to be a superset of Boost.Spirit* in most ways. Major things missing from the set of features in Spirit 2 + Spirit X3 are:
- A separate lexer.
- Binary parsers (meaning for parsing bits, not binary numbers written as text; the latter is fully supported).
I've been in touch with Joel de Guzman, Hartmut Kaiser, and Michael Caisse, to make sure I was not toe-stomping, for those who are concerned about that. They gave this new library their blessing. One feature comes entirely from them: Boost.Parser is usable in a Boost-free environment -- as a standalone library -- at the user's option. They said that was the #1 request from users, which surprised me a bit.
The GitHub page is here: https://github.com/tzlaine/parser
The online docs are here: https://tzlaine.github.io/parser
To see an extended example, a JSON parser that passes all the published JSON tests, including most of the optional ones, in only about 300 lines of code, go here:
https://tzlaine.github.io/parser/doc/html/boost_parser__proposed_/extended_e...
Finally, for those wanting to know how this lib differs from Boost.Spirit* without digging through the docs, here is the doc page that explains Boost.Parser's relationship to Boost.Spirit*:
"""
Boost.Spirit is a library that is already in Boost, and it has been around for a long time.
However, it does not suit user needs in some ways.
- Spirit 2 suffers from very long compile times.
- Spirit 2 has error reporting that requires a lot of user intervention to work.
- Spirit 2 requires user intervention, including a (long) recompile, to enable parse tracing.
- Spirit X3 has rules that do not compose well: the attributes produced by a rule can change depending on the context in which you use the rule.
- Spirit X3 is missing many of the convenient interfaces to parsers that Spirit 2 had. For instance, you cannot add parameters to a parser.
- All versions of Spirit have Unicode support, but it is quite difficult to get working.
I wanted a library that does not suffer from any of the above limitations. It should be noted that while Spirit X3 only has a couple of flaws in the list above, the one related to rules is a deal-breaker. The ability to write rules, test them in isolation, and then re-use them throughout a complex parser is essential.
Though no version of Boost.Spirit (Spirit 2 or Spirit X3) suffers from all those limitations, there also does not exist any one version that avoids all of them. Boost.Parser does so. However, there are a lot of great ideas in Boost.Spirit that have been retained in Boost.Parser. Both libraries:
- use the same operator overloads to combine parsers;
- use approximately the same set of directives to influence the parse (e.g. lexeme[]);
- provide loosely-coupled rules that are separately compilable (at least for Spirit X3); and
- are built around a flexible parse context object that has state added to and removed from it during the parse (again, comparing to Spirit X3).
"""
Hi Zach,
Thank you for writing and sharing this library. I intend to test it on my mini-language early next year. For now, let me dig a bit into the high-level differences between Boost.Parser and Boost.Spirit X3.
Your introduction mentions "a separate lexer" as a feature that Boost.Spirit is missing. How does that square with the entire section for Spirit.Lex in the Boost.Spirit docs?
"Boost.Parser aims to be a superset of Boost.Spirit". But Boost.Spirit is also a generator.
You mention that "Spirit X3 has rules that do not compose well". I personally never experienced this. Is there an example somewhere that would illustrate this problem?
What is the recommendation of the Boost.Spirit authors for programmers who need to do parsing? Is Boost.Parser simply the newer and improved version, or do they have disjoint sets of use cases?
Personally, skimming through the docs, I find the feature of producing custom error and warning messages very attractive. This is what I was always missing from the parsing libraries.
Thanks again for your effort.
Regards,
&rzej;