On Thu, Feb 22, 2024 at 4:36 PM Christian Mazakas via Boost
plaintext dump as requested:
Thanks! I was going to have to do it to respond here. :) [snip]
### The Feedback
The differences between Spirit.X3 and Parser need to be made much more clear by the documentation, by which I mean:

* Spirit X3 has rules that do not compose well — the attributes produced by a rule can change depending on the context in which you use the rule.
* Spirit X3 is missing many of the convenient interfaces to parsers that Spirit 2 had. For instance, you cannot add parameters to a parser.
* All versions of Spirit have Unicode support, but it is quite difficult to get working.
These need code samples showing the limitations of X3 and how Parser actually solves them. I think many, myself included, aren't overwhelmingly clear on what these differences are and how Parser solves them and _why_ this is a good thing.
Fair enough. The first one I'll need to try to write a good example for, but the basic idea is that the loose matching of a parser's default attribute to a user-provided attribute is not something you can turn off. So when you define a Spirit X3 rule with the attribute type vector<int>, it can still fill in a deque<int>. This means that some errors you write don't get diagnosed until much later than they could be, and you can't really write a rule and test it in isolation, because the context in which you use it can change its behavior -- the loose matching rules are very loose. That's my memory of the problem anyway... I'll have to investigate a bit.

The second one refers to parameterized rules, like you find in the YAML spec:

    [70] l-empty(n,c) ::= ( s-line-prefix(n,c) | s-indent-less-than(n) ) b-as-line-feed

When you write the l_empty parser, you want to be able to pass two arguments to it, n and c, and then pass those along to the subparsers like s_line_prefix. You could do that with Spirit 2 parsers, but not with Spirit X3 parsers. I wrote a YAML parser, and so felt the loss of this pretty strongly. I'm not sure how many people write parsers like this, though.

The last one you have dealt with yourself -- I think it might be enough to say that Unicode support requires you to repeat a lot of your parser logic.
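Here's roughly the shape of the example I have in mind, from memory (an untested sketch, so the details of X3's transformation rules may be off):

    #include <boost/spirit/home/x3.hpp>
    #include <deque>
    #include <string>
    #include <vector>

    namespace x3 = boost::spirit::x3;

    // A rule whose declared attribute type is std::vector<int>.
    x3::rule<class ints_tag, std::vector<int>> const ints = "ints";
    auto const ints_def = x3::int_ % ',';
    BOOST_SPIRIT_DEFINE(ints);

    int main()
    {
        std::string const input = "1,2,3";

        // Despite the declared vector<int> attribute, loose matching
        // lets the rule fill a deque<int> without any diagnostic.
        std::deque<int> out;
        x3::phrase_parse(input.begin(), input.end(), ints, x3::space, out);
    }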
Though I have attempted X3's Unicode before and I will say, Parser's unified interface is much better. For example, I wrote an entire RFC-compliant URL parser in X3, including Unicode support, for my library Foxy https://github.com/cmazakas/foxy/blob/7f4ac0495ad2ed9cd0eca5994743d677ac1d26...
Yep. I had written Unicode-aware Spirit parsers before, and wanted to get rid of that kind of repetition.
Not a criticism, but Parser's directives are a dramatic improvement over X3's. By these, I'm thinking of `boost::parser::string_view[]`, `boost::parser::separate[]`, and `boost::parser::merge[]`. These should be highlighted in the comparison vs. X3, as I think they add a good bit of substantive value.
Glad you liked these. These came out of discussions with Andrzej about his frustration at having to write rules for some of his parsers. Rather than embrace the complexity of the attribute rules from Spirit, I thought lexically-visible control via directives would be easier to reason about.
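For the comparison page, even something tiny would help; e.g. `string_view[]` (a sketch I just typed out, not lifted from the docs):

    #include <boost/parser/parser.hpp>
    #include <iostream>
    #include <string>

    namespace bp = boost::parser;

    int main()
    {
        // string_view[] discards the std::string attribute that +char_
        // would produce and yields a view of the matched input instead,
        // so nothing is copied.
        auto const word = bp::string_view[+bp::char_('a', 'z')];

        std::string const input = "hello";
        auto result = bp::parse(input, word);  // optional<std::string_view>
        if (result)
            std::cout << *result << "\n";      // prints "hello"
    }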
More praise for the library: its parse API is a dramatic improvement over X3 in terms of ergonomics. Personally, I can see myself mainly using `prefix_parse`, similarly to X3's original parse(), but I think the value of `parse()` itself is obvious and doesn't warrant much explanation.
Thanks! It took me about 6 different iterations to settle on a minimal overload set.
---
The library generates a dramatic number of warnings when compiled with `-Wall -Wextra -pedantic`. I had originally forgotten to `include_directories()` Parser as a `SYSTEM` directory and was met with 172 warnings.
I think being clean under `-Wall -Wextra -Werror` is a reasonable criterion for acceptance.
I basically never turn on -Wextra, but I have a ticket to make sure Parser is -Wall-clean: https://github.com/tzlaine/parser/issues/107
When writing the parser for the HTTP version line, I ran into a lot of issues separating the major version from the minor version when using:
    parser::digit >> "." >> parser::digit
It was only when I was lazily browsing the docs that I found the `separate[]` directive which I think indicates the library's documentation has poor discoverability.
This is described in the page about attribute generation: https://tzlaine.github.io/parser/doc/html/boost_parser__proposed_/tutorial/a... . It's definitely difficult to know where to put things where people will see them. [snip]
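For reference, the thing you eventually found looks like this in context (a sketch; I'm using Parser's `0_c` literal and assuming the default, non-Hana tuple):

    #include <boost/parser/parser.hpp>
    #include <iostream>
    #include <string>

    namespace bp = boost::parser;

    int main()
    {
        using namespace bp::literals;

        std::string const input = "1.1";

        // By default, the two chars merge into a single std::string "11".
        auto merged = bp::parse(input, bp::digit >> "." >> bp::digit);

        // separate[] keeps them apart, as a tuple of two chars.
        auto split =
            bp::parse(input, bp::separate[bp::digit >> "." >> bp::digit]);
        if (split)
            std::cout << (*split)[0_c] << "." << (*split)[1_c] << "\n";
    }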
The documentation doesn't exactly give users any guidelines about when they should formally define rules with attributes vs semantic actions.
This is one of those things where if users don't already have the intuition built up, using the library is an exercise in frustration.
Namely, attributes don't compose well when you start nesting rules and want to have different out-params at different levels of the parser. I suppose this should be obvious, because the rules themselves form a hierarchy, so the attributes _must_ mirror that.
I found that the best thing to do for my use-case was to abuse `_globals(ctx)` and semantic actions.
I think a good rule of thumb for users can be expressed as something like: use rule attributes for simple grammars where all you really want is just the `std::vector<double>`. For something complex with domain-defined validation rules, prefer `_globals(ctx)` and semantic actions for validating everything.
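To make that concrete, the shape of what I ended up with was roughly this (reconstructed from memory, and I'm assuming `with_globals` is the right spelling for attaching the globals object):

    #include <boost/parser/parser.hpp>
    #include <string>

    namespace parser = boost::parser;

    struct my_globals
    {
        int field_count_ = 0;
    };

    // Do the domain validation in the action; stash results in globals.
    auto const count_field = [](auto &ctx) { ++_globals(ctx).field_count_; };

    int main()
    {
        auto const field = +parser::char_('a', 'z');
        auto const fields = field[count_field] % ',';

        my_globals g;
        std::string const input = "abc,def";
        parser::parse(input, parser::with_globals(fields, g));
        // g.field_count_ == 2, if my memory of the API serves.
    }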
I can add something to the Best Practices page about this. Ticket: https://github.com/tzlaine/parser/issues/110
I felt like I came up with working but unsatisfactory solutions to parsing characters in a specific range.
Namely, I just wrote a simple semantic action around `+parser::char_[f]` but this made me feel like I was writing unidiomatic code. I'd appreciate some feedback here:
    // reason-phrase = 1*( HTAB / SP / VCHAR / obs-text )
    // VCHAR = %x21-7E
    auto const on_reason_phrase = [](auto &ctx) {
        auto ch = _attr(ctx);
        if (ch != '\t' && ch != ' ' && (ch < '\x21' || ch > '\x7e')) {
            _pass(ctx) = false;
            return;
        }
This part could just have been: `char_('\x21', '\x7e') | char_('\t') | char_(' ')`. You might want to reorder it so the most common stuff comes first. Oh, and since you put it in an omit below, it's even simpler: `char_('\x21', '\x7e') | '\t' | ' '`.
        response::metadata &md = _globals(ctx).md_;
        std::ranges::subrange sub = _where(ctx);
        if (!md.reason_phrase_begin_) {
            md.reason_phrase_begin_ = sub.begin();
        }
        md.reason_phrase_end_ = sub.end();
        md.status_line_end_ = sub.end();
    };

    auto const reason_phrase = parser::omit[+parser::char_[on_reason_phrase]];
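Putting those two changes together, the whole thing shrinks to something like this (an untested sketch, reusing your names):

    // With the range check moved into the parser itself, the action
    // only records positions; no manual _pass(ctx) = false is needed.
    auto const on_reason_phrase = [](auto &ctx) {
        response::metadata &md = _globals(ctx).md_;
        auto sub = _where(ctx);
        if (!md.reason_phrase_begin_) {
            md.reason_phrase_begin_ = sub.begin();
        }
        md.reason_phrase_end_ = sub.end();
        md.status_line_end_ = sub.end();
    };

    auto const reason_phrase = parser::omit[
        +(parser::char_('\x21', '\x7e') | '\t' | ' ')[on_reason_phrase]];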
---
There should be a debugging utility for discovering what the synthesized attribute is for a rule. Perhaps this could exist as a semantic action that attempts to instantiate an incomplete template.
    template<class> struct diagnostic;

    auto const debug = [](auto &ctx) { diagnostic<decltype(_attr(ctx))> d; };

It took me a long time to realize what the attribute type for `digit >> "." >> digit` was (it was `std::string`).
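E.g. (hypothetical usage; the point is just that the resulting compile error names the attribute type, modulo references):

    std::string const input = "1.1";
    auto const p = (parser::digit >> "." >> parser::digit)[debug];
    parser::parse(input, p);
    // error: implicit instantiation of undefined template
    // 'diagnostic<std::string &>' -- and there's the attribute type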
Since you are not the only person asking for this: https://github.com/tzlaine/parser/issues/111
The library seems to really emphasize Boost.Hana's `tuple` type, but I think in practice this winds up being a mistake unless there are substantive compile-time savings from using Hana here.
Namely, Hana's `tuple` doesn't support structured bindings. Because Parser is a C++17 library, I consider this a somewhat large defect of the interface. What's worse, Hana's literals aren't compatible with Parser's.
A huge boon for Hana and its `tuple` is that it supports indexing via `tup[0_c]`, but this doesn't work when Parser's literals are involved, as Parser also defines a `_c`, so an ambiguity is introduced and compilation fails.
Between this and the lack of structured binding support, I think relying on Hana here is a strict downgrade unless there's compelling reasons I'm simply unaware of.
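A minimal repro of the ambiguity (from memory):

    #include <boost/hana.hpp>
    #include <boost/parser/parser.hpp>

    using namespace boost::hana::literals;    // defines operator""_c
    using namespace boost::parser::literals;  // also defines operator""_c

    int main()
    {
        auto tup = boost::hana::make_tuple(1, 'a');
        auto first = tup[0_c];  // error: call to 'operator""_c' is ambiguous
    }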
When using Hana, why would you need to use parser::literals? I can only think of uses in the non-Hana compatibility stuff; when using Hana you don't need that stuff, I think.
Stuff like `_globals` in the docs would 404 for me. I also think the docs are kind of just incomplete in general, and because there's not much cross-referencing, discoverability is hurt, so a user essentially has to read the entire docs front-to-back to understand the library, which is a large ask for a parsing library.
I'll audit the links. I did that at some point, but I guess there's been breakage since. Ticket: https://github.com/tzlaine/parser/issues/112
### Conclusion
Overall, I think that even with all of the problems I found in the library, it's still quite good and worthy of being brought into Boost as the successor for Spirit. I don't think anything I found was a deal-breaker and I think in a few iterations of Boost, this library can be something quite great.
Thanks for the review, Christian!

Zach