[boost] Interest in new parsing library?

28 Dec 2023

I'm trying to gauge interest in a parsing library to replace
Boost.Spirit 2/Spirit X3.  I'm also looking for endorsements.

The library is intended to remedy some shortcomings of Boost.Spirit*.
I think these are great libraries, but Spirit 2 was written in
pre-C++11 C++ (I think; certainly its dependencies were).  Most, if
not all, of the downsides stem from that -- long compile times,
inscrutable compilation failures, etc.  (Boost.Parser compile times
are quite low.)

I'm calling my proposal Boost.Parser, and it follows many of the
conventions of Boost.Spirit 2 and X3, such as the operator overloads
used to combine parsers, the names of many parsers and directives,
etc.  It requires C++17 or later.
...
From the introduction in the online docs:
"""
Boost.Parser is a parser combinator library. That is, it consists of a
set of low-level primitive parsers, and operations that can be used to
combine those parsers into more complicated parsers.
There are primitive parsers that parse epsilon (the empty string),
chars, ints, floats, etc.

There are operations which combine parsers to create new parsers. For
instance, the Kleene star operation takes an existing parser p and
creates a new parser that matches zero or more occurrences of whatever
p matches. Both callable objects and operator overloads are used for
the combining operations. For instance, operator*() is used for Kleene
star, and you can also write repeat(n)[p] to create a parser for
exactly n repetitions of p.
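
To make the combinator idea concrete, here is a toy sketch in plain
C++17.  It is not Boost.Parser code and does not use its API: the names
Parser, char_p, star and repeat_n are made up for illustration, and the
free functions stand in for what the library spells as operator*() and
repeat(n)[p].  A parser here only reports how many characters it
consumed; it produces no attribute.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string_view>

// Toy model (hypothetical, not Boost.Parser): a parser takes the remaining
// input and, on success, returns the number of characters it consumed.
using Parser = std::function<std::optional<std::size_t>(std::string_view)>;

// Primitive parser: matches exactly one occurrence of the character c.
Parser char_p(char c) {
    return [c](std::string_view in) -> std::optional<std::size_t> {
        if (!in.empty() && in.front() == c) return std::size_t{1};
        return std::nullopt;
    };
}

// Kleene star: zero or more occurrences of p.  It never fails.
Parser star(Parser p) {
    return [p](std::string_view in) -> std::optional<std::size_t> {
        std::size_t total = 0;
        while (auto n = p(in.substr(total))) {
            if (*n == 0) break;  // avoid looping forever on empty matches
            total += *n;
        }
        return total;
    };
}

// Exactly `count` repetitions of p; fails if fewer are available.
Parser repeat_n(std::size_t count, Parser p) {
    return [count, p](std::string_view in) -> std::optional<std::size_t> {
        std::size_t total = 0;
        for (std::size_t i = 0; i < count; ++i) {
            auto n = p(in.substr(total));
            if (!n) return std::nullopt;
            total += *n;
        }
        return total;
    };
}

// Usage, with the expected results written as assertions.
void combinator_example() {
    Parser a_star = star(char_p('a'));
    assert(a_star("aaab").value() == 3);  // consumes "aaa", stops at 'b'
    assert(a_star("b").value() == 0);     // zero occurrences still succeeds
    assert(repeat_n(2, char_p('a'))("aaab").value() == 2);
    assert(!repeat_n(4, char_p('a'))("aaab").has_value());
}
```

The point of the sketch is only that star and repeat_n take a parser and
return a new parser, so they can be nested freely.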

Boost.Parser also tries to accommodate the multiple ways that people
often want to get a parse result out of their parsing code. Some
parsing may best be done by returning an object that represents the
result of the parse. Other parsing may best be done by filling in a
preexisting data structure. Yet other parsing may best be done by
parsing small sections of a large document, and reporting the results
of subparsers as they are finished, via callbacks. Boost.Parser
accommodates all these ways of working, and even makes it possible to
do callback-based or non-callback-based parsing without rewriting any
code (except by changing the top-level call from parse() to
callback_parse()).
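
The three ways of getting results out can be sketched with one shared
parsing core and three thin entry points.  This is a hand-written
illustration in plain C++17 for a comma-separated list of unsigned
integers; it does not use Boost.Parser, and every function name in it
is hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <optional>
#include <string_view>
#include <vector>

// Shared core (hypothetical): parses "int,int,..." and reports each value
// through on_int as soon as it is complete.  Returns false on bad input;
// values seen before the error have already been reported.
bool parse_int_list_core(std::string_view in,
                         const std::function<void(int)>& on_int) {
    std::size_t i = 0;
    while (true) {
        if (i >= in.size() || !std::isdigit(static_cast<unsigned char>(in[i])))
            return false;  // a number must start here
        int value = 0;
        while (i < in.size() && std::isdigit(static_cast<unsigned char>(in[i])))
            value = value * 10 + (in[i++] - '0');  // no overflow check
        on_int(value);
        if (i == in.size()) return true;   // clean end of input
        if (in[i++] != ',') return false;  // anything else must be a comma
    }
}

// Style 1: return an object that represents the result of the parse.
std::optional<std::vector<int>> parse_int_list(std::string_view in) {
    std::vector<int> result;
    if (!parse_int_list_core(in, [&](int v) { result.push_back(v); }))
        return std::nullopt;
    return result;
}

// Style 2: fill in a preexisting data structure (appends to `out`).
bool parse_int_list(std::string_view in, std::vector<int>& out) {
    return parse_int_list_core(in, [&](int v) { out.push_back(v); });
}

// Style 3: report each sub-result via a callback as it is finished.
bool callback_parse_int_list(std::string_view in,
                             const std::function<void(int)>& callback) {
    return parse_int_list_core(in, callback);
}

// Usage, with the expected results written as assertions.
void result_styles_example() {
    auto r = parse_int_list("1,22,3");
    assert(r && *r == (std::vector<int>{1, 22, 3}));

    std::vector<int> existing{7};
    assert(parse_int_list("8,9", existing));
    assert(existing == (std::vector<int>{7, 8, 9}));

    int sum = 0;
    assert(callback_parse_int_list("1,22,3", [&](int v) { sum += v; }));
    assert(sum == 26);
}
```

Only the top-level call changes between the three styles; the grammar
itself is written once.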

All of Boost.Parser's public interfaces are sentinel- and
range-friendly, just like the interfaces in std::ranges.

Boost.Parser is Unicode-aware through and through. When you parse
ranges of char, Boost.Parser does not assume any particular encoding —
not Unicode or any other encoding. Parsing of inputs other than plain
chars assumes that the input is Unicode. In the Unicode-aware code
paths, all parsing is done by matching code points. This means that
you can feed UTF-8 strings into Boost.Parser, both as input and within
your parser, and the right sort of matching occurs. For instance, if
your parser is trying to match repetitions of the char '\xcc' (which
is a lead byte from a UTF-8 sequence, and so is malformed UTF-8 if not
followed by an appropriate UTF-8 code unit), it will not match the
start of "\xcc\x80" (UTF-8 for the code point U+0300). Boost.Parser
knows that the matching must be whole-code-point, and so it interprets
the char '\xcc' as the code point U+00CC.
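
The '\xcc' example can be checked by hand with a small sketch in plain
C++17.  The decoder below is a simplified stand-in written for this
illustration (it assumes well-formed UTF-8 and does no validation); it
is not how Boost.Parser does its transcoding, and the function names
are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Decode UTF-8 into code points.  Simplified: assumes well-formed input and
// does not reject overlong forms, surrogates, or truncated sequences.
std::vector<char32_t> decode_utf8(std::string_view s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        for (std::size_t k = 0; k < extra && i < s.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// A plain char used as a parser stands for the code point with that value,
// so '\xcc' means U+00CC.
char32_t char_as_code_point(char c) {
    return static_cast<unsigned char>(c);
}

// Whole-code-point match: compare decoded code points, never raw bytes.
bool first_code_point_matches(std::string_view utf8_input, char c) {
    const auto cps = decode_utf8(utf8_input);
    return !cps.empty() && cps.front() == char_as_code_point(c);
}

// Usage, with the expected results written as assertions.
void code_point_example() {
    // "\xcc\x80" is one code point, U+0300, not the byte 0xCC plus something.
    assert(decode_utf8("\xcc\x80") == (std::vector<char32_t>{0x0300}));
    assert(!first_code_point_matches("\xcc\x80", '\xcc'));
    // "\xc3\x8c" is the UTF-8 encoding of U+00CC, so this one does match.
    assert(first_code_point_matches("\xc3\x8c", '\xcc'));
}
```

A byte-wise comparison would have matched the first byte of
"\xcc\x80"; comparing code points does not.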

Error reporting is important to get right, and it is important to make
errors easy to understand, especially for end-users. Boost.Parser
produces runtime parse error messages that are very similar to the
diagnostics that you get when compiling with GCC and Clang (it even
supports warnings that don't fail the parse). The exact token
associated with a diagnostic can be reported to the user, with the
containing line quoted, and with a marker pointing right at the token.
Boost.Parser takes care of this for you; your parser does not need to
include any special code to make this happen. Of course, you can also
replace the error handler entirely, if it doesn't fit your needs.
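
For readers who have not seen this style of message, the sketch below
builds one by hand in plain C++17: a file:line:column header, the
offending line quoted, and a caret under the failing position.  The
function name and the exact message layout are assumptions made for
this illustration; they are not Boost.Parser's error handler or its
actual output format.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: format a GCC/Clang-style diagnostic for an error at
// byte offset `pos` in `input`.  Tabs and multi-byte characters are not
// accounted for when placing the caret.
std::string format_diagnostic(std::string_view filename, std::string_view input,
                              std::size_t pos, std::string_view message) {
    if (pos > input.size()) pos = input.size();

    // Find the 1-based line number and the start of the line containing pos.
    std::size_t line_no = 1, line_start = 0;
    for (std::size_t i = 0; i < pos; ++i)
        if (input[i] == '\n') { ++line_no; line_start = i + 1; }

    std::size_t line_end = input.find('\n', pos);
    if (line_end == std::string_view::npos) line_end = input.size();
    const std::size_t column = pos - line_start;  // 0-based within the line

    std::string out(filename);
    out += ":";
    out += std::to_string(line_no);
    out += ":";
    out += std::to_string(column + 1);  // columns are reported 1-based
    out += ": error: ";
    out.append(message);
    out += "\n";
    out.append(input.substr(line_start, line_end - line_start));  // quoted line
    out += "\n";
    out += std::string(column, ' ');  // pad so the caret sits under `pos`
    out += "^\n";
    return out;
}

// Usage, with the expected result written as an assertion.
void diagnostic_example() {
    const std::string_view input = "x = 1\ny = ?\n";
    const std::string msg =
        format_diagnostic("cfg", input, 10, "expected a number");
    assert(msg == "cfg:2:5: error: expected a number\ny = ?\n    ^\n");
}
```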

Debugging complex parsers can be a real nightmare. Boost.Parser makes
it trivial to get a trace of your entire parse, with easy-to-read (and
very verbose) indications of where each part of the trace is within
the parse, the state of values produced by the parse, etc. Again, you
don't need to write any code to make this happen — you just pass a
parameter to parse().

Dependencies are still a nightmare in C++, so Boost.Parser can be used
as a purely standalone library, independent of Boost.
"""

Boost.Parser aims to be a superset of Boost.Spirit* in most ways.
Major things missing from the set of features in Spirit 2 + Spirit X3
are:

- A separate lexer.
- Binary parsers (meaning for parsing bits, not binary numbers written
as text; the latter is fully supported).

I've been in touch with Joel de Guzman, Hartmut Kaiser, and Michael
Caisse, to make sure I was not toe-stomping, for those who are
concerned about that.  They gave this new library their blessing.  One
feature comes entirely from them: Boost.Parser is usable in a
Boost-free environment -- as a standalone library -- at the user's
option.  They said that was the #1 request from users, which surprised
me a bit.

The GitHub page is here: https://github.com/tzlaine/parser
The online docs are here: https://tzlaine.github.io/parser

For an extended example, there is a JSON parser that passes all the
published JSON tests, including most of the optional ones, in only
about 300 lines of code.  It is here:

https://tzlaine.github.io/parser/doc/html/boost_parser__proposed_/extended_e...

Finally, for those wanting to know how this lib differs from
Boost.Spirit* without digging through the docs, here is the doc page
that explains Boost.Parser's relationship to Boost.Spirit*:
"""
Boost.Spirit is a library that is already in Boost, and it has been
around for a long time.

However, it does not suit user needs in some ways.

- Spirit 2 suffers from very long compile times.
- Spirit 2 has error reporting that requires a lot of user
intervention to work.
- Spirit 2 requires user intervention, including a (long) recompile,
to enable parse tracing.
- Spirit X3 has rules that do not compose well -- the attributes
produced by a rule can change depending on the context in which you
use the rule.
- Spirit X3 is missing many of the convenient interfaces to parsers
that Spirit 2 had. For instance, you cannot add parameters to a parser.
- All versions of Spirit have Unicode support, but it is quite
difficult to get working.

I wanted a library that does not suffer from any of the above
limitations. It should be noted that while Spirit X3 only has a couple
of flaws in the list above, the one related to rules is a
deal-breaker. The ability to write rules, test them in isolation, and
then re-use them throughout a complex parser is essential.

Though no version of Boost.Spirit (Spirit 2 or Spirit X3) suffers from
all those limitations, there also does not exist any one version that
avoids all of them. Boost.Parser does so. However, there are a lot of
great ideas in Boost.Spirit that have been retained in Boost.Parser.
Both libraries:

- use the same operator overloads to combine parsers;
- use approximately the same set of directives to influence the parse
(e.g. lexeme[]);
- provide loosely-coupled rules that are separately compilable (at
least for Spirit X3); and
- are built around a flexible parse context object that has state
added to and removed from it during the parse (again, comparing to
Spirit X3).
"""

Zach

Zach Laine