Re: [boost] Interest in new parsing library?

29 Dec 2023

On Fri, Dec 29, 2023 at 10:35 AM Peter Dimov via Boost
<boost@lists.boost.org> wrote:
...
Zach Laine wrote:
...
...
I'm calling my proposal Boost.Parser, and it follows many of the conventions
of Boost.Spirit 2 and X3, such as the operators used for overloading, the
names of many parsers and directives, etc.  It requires C++17 or later.
...
...
The Github page is here: https://github.com/tzlaine/parser
The online docs are here: https://tzlaine.github.io/parser
Some observations:
I understand, in principle, the motivation behind asserting at runtime
instead of failing compilation, but I don't think the same argument applies
to rejecting *eps parsers. It seems to me that a static assert for any *p or
+p where p can match epsilon (can succeed while consuming no input)
would be clear enough. (E.g. +-p, *(p | q | eps), *attr(...), +&p, etc.)
Why?  It may be better to static_assert, but it's not clear to me why
...
Interestingly, this would reject **p and +*p, because these parsers can
go into an infinite loop. The current behavior is to collapse them into *p,
which is useful, but technically wrong. This raises the possibility of, instead
of rejecting *p or +p when p can match epsilon, just 'fixing' its behavior so
that when p matches epsilon, the outer parser just exits the loop. This will
make the current collapsing behavior equivalent to the non-collapsed one.
At first, I thought this was a great idea.  Now I'm ambivalent.  The
way I might implement this is in repeat_parser (that's the only
looping parser, modulo its subclasses).  I could then do a couple of
things:

1) detect that we have not eaten any of the input, but have matched
repeat_parser's subparser, and terminate the repetition; or
2) detect that we have matched repeat_parser's subparser, *and* that
the subparser is an unconditional match, and terminate the repetition.
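
As a rough illustration of option 1, here is a toy sketch of a
repetition loop that stops as soon as an iteration succeeds without
consuming input.  Everything in it (toy_parser, toy_char, toy_eps,
toy_alt, toy_star) is a hypothetical stand-in invented for this
example; it is not Boost.Parser's repeat_parser or any part of its API.

```cpp
#include <cstddef>
#include <functional>
#include <string_view>

// Hypothetical toy parser: tries to match at `pos` and advances `pos`
// only on success.  Not a Boost.Parser type.
using toy_parser = std::function<bool(std::string_view, std::size_t&)>;

// Matches exactly one given character.
toy_parser toy_char(char c) {
    return [c](std::string_view in, std::size_t& pos) {
        if (pos < in.size() && in[pos] == c) { ++pos; return true; }
        return false;
    };
}

// Always succeeds and consumes nothing (eps-like).
toy_parser toy_eps() {
    return [](std::string_view, std::size_t&) { return true; };
}

// Ordered alternative: try `a`, and only if it fails try `b`.
toy_parser toy_alt(toy_parser a, toy_parser b) {
    return [a, b](std::string_view in, std::size_t& pos) {
        return a(in, pos) || b(in, pos);
    };
}

// Kleene star using option 1: if the subparser matched but the position
// did not move, terminate instead of looping forever.
// Returns the number of successful iterations.
std::size_t toy_star(const toy_parser& p, std::string_view in, std::size_t& pos) {
    std::size_t count = 0;
    for (;;) {
        const std::size_t before = pos;
        if (!p(in, pos)) { pos = before; break; }  // ordinary end of repetition
        ++count;
        if (pos == before) break;                  // matched epsilon: stop here
    }
    return count;
}
```

With this sketch, toy_star(toy_alt(toy_char('a'), toy_eps()), "aab", pos)
consumes the two 'a's, takes the eps branch once, and then exits.  Note
that the check looks only at the input position, so no parser type needs
to be tagged as epsilon-like.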

#1 is nice, because you don't need any way of tagging parser types as
being epsilon-like.  Without this or some similar approach you could
end up with a closed set of types that trigger this short-circuiting.
This seems like a maintenance problem for me, but moreover an
extensibility problem for users.  #2 suffers from this closed-set
problem.

To fix #2, I could add a template param (or constexpr static member,
same diff) that acts as a tag marking a parser type as epsilon-like.
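
A minimal sketch of what such a tag could look like, using the constexpr
static member variant.  The member name always_matches_empty, the trait
is_epsilon_like, and all three parser types are assumptions made up for
this example; none of them are taken from Boost.Parser.

```cpp
#include <type_traits>

// Hypothetical library parser that opts in via a static member tag.
struct toy_eps_parser { static constexpr bool always_matches_empty = true; };

// Hypothetical parser with no tag; treated as not epsilon-like.
struct toy_char_parser { char value; };

// Trait: false by default ...
template <typename P, typename = void>
struct is_epsilon_like : std::false_type {};

// ... and the value of the tag when the type declares one.
template <typename P>
struct is_epsilon_like<P, std::void_t<decltype(P::always_matches_empty)>>
    : std::bool_constant<P::always_matches_empty> {};

template <typename P>
inline constexpr bool is_epsilon_like_v = is_epsilon_like<P>::value;

// A user-defined parser can opt in too, so the set of types is open.
struct user_marker_parser { static constexpr bool always_matches_empty = true; };

static_assert(is_epsilon_like_v<toy_eps_parser>);
static_assert(!is_epsilon_like_v<toy_char_parser>);
static_assert(is_epsilon_like_v<user_marker_parser>);
```

The point of the detection idiom is that the looping parser would query
the trait rather than compare against a fixed list of types, which is
what keeps the set open to user-written parsers.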

#1 is problematic though, and anything where the no-input-consuming
match is conditional is equally problematic.  Each parser could have
arbitrary side effects, via semantic actions.  So this parser:

*(if_(c)[p] | eps[a])

could match the eps first, if 'c' evaluated to false, and later match
'p', depending on what 'a' does.  If 'a' flips the value of 'c', then
the parse will always match 'p'.  If 'a' increments a counter, then
the parse might eventually match 'p', but just take a long time to do
it; this case might also result in an infinite loop.  In the counter
case that does end in a match, maybe 'a' also performs some other
important side effect along the way.  This may be a useful pattern to
someone, somewhere.
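
To make the counter case concrete, here is a toy simulation of that
parser.  It does not use Boost.Parser at all: it assumes 'p' matches a
single 'x', 'c' is "counter >= threshold", and 'a' increments the
counter.  The names run_toy and toy_result, and the max_iterations cap
(there only so the demo terminates), are all hypothetical.

```cpp
#include <cstddef>
#include <string_view>

struct toy_result {
    std::size_t eps_matches;  // times the eps[a] branch ran
    std::size_t p_matches;    // times the if_(c)[p] branch matched
};

// Hand-rolled simulation of *(if_(c)[p] | eps[a]) under the assumptions
// above.  With stop_on_empty_match set, the loop exits after the first
// iteration that consumes no input.
toy_result run_toy(std::string_view in, int threshold,
                   bool stop_on_empty_match, int max_iterations) {
    toy_result r{0, 0};
    int counter = 0;
    std::size_t pos = 0;
    for (int i = 0; i < max_iterations; ++i) {
        const bool c = counter >= threshold;
        // if_(c)[p]: only attempted when c is true.
        if (c && pos < in.size() && in[pos] == 'x') {
            ++pos;
            ++r.p_matches;
            continue;
        }
        // eps[a]: always matches, consumes nothing, runs the side effect.
        ++counter;
        ++r.eps_matches;
        if (stop_on_empty_match) break;
    }
    return r;
}
```

Calling run_toy("xx", 3, false, 10) reaches 'p' after three eps
iterations and matches both 'x's, while run_toy("xx", 3, true, 10) exits
after the first eps match and never matches 'p' at all.  Without the
iteration cap, the first call would spin on eps forever once the input
is exhausted, which is the infinite-loop case.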

This is obviously contrived, but the point is that there are currently
some things that you can express that would become non-expressible.

tl;dr I like the idea, but I'm struggling with how to do it so that we
don't limit expressivity.
...
Also, errors should definitely go to std::cerr by default, not std::cout. Errors
aren't program output, and routing them to stdout is script-hostile.
Ach!  Yeah, that's just an oversight.  I've opened a ticket, thanks.

Zach