On Wed, Feb 21, 2024 at 12:27 PM Phil Endecott via Boost wrote:
Boost.Parser Review
===================
[snip]
- There's an ugly MACRO to define rules! That doesn't look very 2024 to me.
Addressed below.
Test Case
---------
[snip] Thanks for sharing the code, and marking it up as you have. It was really helpful to have this.
//const auto degrees_symbol = bp::no_case[ bp::lit("degrees") | bp::lit("deg") | bp::lit('d') ];
//const auto minutes_symbol = bp::no_case[ bp::lit('\'') | bp::lit("minutes") | bp::lit("min") | bp::lit('m') ];
//const auto seconds_symbol = bp::no_case[ bp::lit('"') | bp::lit("seconds") | bp::lit("sec") | bp::lit('s') ];
// FIXME those rules don't work and I don't know why. They work if the symbol is
// the first alternative; if it's any of the other alternatives the double returned
// by e.g. decimal_degrees below is always zero.
// Use these instead:
const auto degrees_symbol = bp::lit('d');
const auto minutes_symbol = bp::lit('m');
const auto seconds_symbol = bp::lit('s');  // Beware, 's' can also mean south.
Yes, this is indeed a bug! The sequence parser was wiping out the value of a previously-successful parser's attribute, when the next parser had an internal failure (like not matching "degrees", but then matching 'd'). This is now fixed on the boost_review_changes branch. This is code I was messing about with right before the review; sorry about that.
const auto uint_0_60_def = bp::uint_
    [( [](auto ctx) { _pass(ctx) = _attr(ctx) < 60U; _val(ctx) = _attr(ctx); } )];
const auto double_0_60_def = bp::double_
    [( [](auto ctx) { _pass(ctx) = _attr(ctx) < 60; _val(ctx) = _attr(ctx); } )];
const auto decimal_degrees = bp::double_ >> -degrees_symbol;
const auto degrees_decimal_minutes_def =
    (bp::uint_ >> -degrees_symbol >> double_0_60 >> -minutes_symbol)
    [( [](auto ctx) {
        auto d = _attr(ctx)[0_c];
        auto m = _attr(ctx)[1_c];
        _val(ctx) = d + m/60.0;
    } )];
// FIXME this rule matches e.g. 150.5 as 150 .5; we want that input to
// be matched by the decimal degrees rule as 150.5. In the qi code, which
// has the same fundamental problem, this is avoided because the double_
// parser is configured to reject input with a leading decimal point.
// It would be more correct to require a degrees symbol, whitespace,
// or both between the degrees and minutes numbers. That's a bit
// complicated because the whitespace is handled by the skipper.
// In a bottom-up parser I think the ambiguity is resolved by putting
// decimal_degrees first in degrees_def below. That doesn't work for a
// top-down parser.
// FIXME for similar reasons, "150d 0.5m n" fails to parse, because
// the degrees_minutes_seconds rule matches 150,0,.5 and then the next
// rule up fails to parse the 'm'.
This is really interesting! I did not know that Spirit X3 had made this change, but the real/float parser I use is just the one from Spirit X3. And sure enough, its policies class starts with this:

template <typename T>
struct ureal_policies
{
    // trailing dot policy suggested by Gustavo Guerra
    static bool const allow_leading_dot = true;
    static bool const allow_trailing_dot = true;
    static bool const expect_dot = false;

So, not only does it allow a leading dot, but it's a constant; you'd have to inherit from this policies type and hide the value with a new one in the derived class. Again, I don't know why this was done, but I can understand the desire to match .6, 0.6, and 6. with the same parser. I also see how in your case the 0.6-only match has more natural semantics.
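The hiding trick would look something like the sketch below. This is just to show the X3 mechanism; Boost.Parser does not currently expose a policies hook, so the derived type has nowhere to plug in yet:

template <typename T>
struct no_leading_dot_policies : ureal_policies<T>
{
    // Hide the base class's constant with a stricter one.
    static bool const allow_leading_dot = false;
};

I was able to achieve the same goal here by changing your parser not to match on double if the input starts with '.'. So this: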
const auto degrees_minutes_seconds_def =
    (bp::uint_ >> -degrees_symbol >> uint_0_60 >> -minutes_symbol
               >> double_0_60 >> -seconds_symbol)
    [( [](auto ctx) {
        auto d = _attr(ctx)[0_c];
        auto m = _attr(ctx)[1_c];
        auto s = _attr(ctx)[2_c];
        _val(ctx) = d + m/60.0 + s/3600.0;
    } )];
Becomes this:
const auto degrees_minutes_seconds_def =
    (bp::uint_ >> -degrees_symbol >> uint_0_60 >> -minutes_symbol
               >> (double_0_60 - '.') >> -seconds_symbol)
    [( [](auto ctx) {
        auto d = _attr(ctx)[0_c];
        auto m = _attr(ctx)[1_c];
        auto s = _attr(ctx)[2_c];
        _val(ctx) = d + m/60.0 + s/3600.0;
    } )];
That fixed the parse. [snip]
const auto signed_latitude_def = signed_degrees
    [( [](auto ctx) {
        auto d = _attr(ctx);
        _pass(ctx) = -90 <= d && d <= 90;
    } )];
const auto signed_longitude_def = signed_degrees
    [( [](auto ctx) {
        auto d = _attr(ctx);
        _pass(ctx) = -180 <= d && d <= 180;
    } )];
One thing I noticed here -- you forgot the "_val(ctx) = _attr(ctx);" statement, and so this path of the parse would match, but produce 0. With those three changes (the fix, the " - '.'", and this missing assignment), everything seems to be working as expected. I've turned your example into a test: https://github.com/tzlaine/parser/blob/boost_review_changes/test/parse_coord...
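With the assignment added back, the latitude rule reads like this (the longitude one is analogous):

const auto signed_latitude_def = signed_degrees
    [( [](auto ctx) {
        auto d = _attr(ctx);
        _pass(ctx) = -90 <= d && d <= 90;
        _val(ctx) = _attr(ctx);  // previously missing; the rule matched but produced 0
    } )];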
BOOST_PARSER_DEFINE_RULES(uint_0_60);
BOOST_PARSER_DEFINE_RULES(double_0_60);
BOOST_PARSER_DEFINE_RULES(degrees_decimal_minutes);
BOOST_PARSER_DEFINE_RULES(degrees_minutes_seconds);
BOOST_PARSER_DEFINE_RULES(degrees);
BOOST_PARSER_DEFINE_RULES(latitude);
BOOST_PARSER_DEFINE_RULES(longitude);
BOOST_PARSER_DEFINE_RULES(signed_latitude);
BOOST_PARSER_DEFINE_RULES(signed_longitude);
BOOST_PARSER_DEFINE_RULES(latlon);
This could be replaced with a single use of BOOST_PARSER_DEFINE_RULES, FWIW.
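The macro is variadic, so that might look something like:

BOOST_PARSER_DEFINE_RULES(
    uint_0_60, double_0_60, degrees_decimal_minutes, degrees_minutes_seconds,
    degrees, latitude, longitude, signed_latitude, signed_longitude, latlon);

[snip]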
That Ugly Macro
---------------
bp::rule<class latitude, double> latitude = "latitude";
const auto latitude_def = (degrees >> northsouth) .... ;
BOOST_PARSER_DEFINE_RULES(latitude);

Is a macro really the best way to do this? Spirit seems to manage without.
About that. This revision to the way that Spirit 2 did rules is not actually my invention. It comes from Spirit X3: https://live.boost.org/doc/libs/1_68_0/libs/spirit/doc/x3/html/spirit_x3/tut... I liked the way it was done there better than the Spirit 2 way, so I opted for the Spirit X3 way.

You can't get recursive rules, or even mutually recursive rules, without a way to forward declare parsers somehow. Rules are what does that. In Spirit 2, rules did this by including their attribute type as part of their own type, and then using type erasure to avoid requiring the type of the parser to be part of the rule's type as well. In Spirit X3, a rule is instead a function template. I like this better because we don't need to build memory allocation into the rule to get decoupling. I dislike it because the function template is only subtly different from rule to rule, and that leads naturally to the macro we have. I think the macro is the lesser of these evils.
Despite having a macro this is still very verbose. I've had to write the name of the rule FIVE times in three lines. Once there is a macro you might as well put everything inside it:
BOOST_PARSER_DEFINE_RULE(latitude, double, (degrees >> northsouth) ....);
Of course if you need to declare the rules before defining them due to recursion or other forward references you will need separate macros for declaration and definition.
Yeah, that's what Spirit X3 does. There are DECLARE and DEFINE macros. I was trying to simplify. You're not the first one to ask for a macro that reduces the repetition needed for a single rule/parser pair. If enough others agree, I'll add this. I personally tend not to use macros like that much, or when I do, to make ad hoc ones that I locally #undef after use. So the lack of the DECLARE/DEFINE pair is just my personal taste.
These rule declarations seem to need to be at global scope (maybe namespace scope?). Is that fundamental? Can't I put them in a class or a function?
The macro defines a pair of function templates for each rule you pass to it; this is mentioned in the docs. The implication is that the macro should be used at namespace scope. As long as the tag type you use is in the same namespace where you write the macro, ADL will do the rest.
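For illustration, a minimal sketch of the pattern (the namespace and the parser body here are made up; only the shape matters):

namespace lat_lon {
    namespace bp = boost::parser;

    // The tag type ("class latitude" below) lives in lat_lon, and so do the
    // function templates the macro generates, so ADL finds them at the use site.
    bp::rule<class latitude, double> latitude = "latitude";
    const auto latitude_def = bp::double_;  // stand-in for the real parser
    BOOST_PARSER_DEFINE_RULES(latitude);
}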
Semantic Actions
----------------
In Bison I can write semantic actions like this:
{ $$ = $1 + $2/60.0; }
After expanding the $ variables that gets copied verbatim into the bison output, so it can include any code.
Spirit clearly wanted to replicate that sort of functionality, but it pre-dates lambdas, so it includes a lot of amazing functionality that allows me to write semantic actions like:
[ _val = _1 + _2/60.0 ]
This is nice and concise but its limitation is of course that I cannot write arbitrary code in there.
(BTW, are we strictly allowed to use identifiers starting with _ like _val?)
Are you asking about the Spirit 2 use of Phoenix? There is a naming convention there and in Spirit X3, and in Boost.Parser. In the latter case though, it's just a lambda, so you can name it whatever you like.
What should this look like now that we have lambdas in the language? Of course it's going to be a bit more verbose because of the syntax overhead, but I was hoping that it would be something like this:
[ [](auto deg, auto min) { return deg + min/60.0; } ]
This is the biggest thing I miss from Spirit 2 code vs. Spirit X3 code. However, I don't really miss dealing with the verbosity of Phoenix in the non-trivial cases. Lambdas are a win on balance I find.
(Aside: That doesn't actually work, because two consecutive [ tokens may only introduce an attribute, even when separated by whitespace; so it needs to be written [( [] ... )]. I did wonder if changing from operator[] to operator() might make sense?)
Sure. C++. I don't like the idea of changing the operator to operator(), because 1) everyone used to Spirit is used to operator[], and 2) operator() is already valid on many parsers and directives. Having to deal with a new, anything-goes operator() would be fraught at best.
That's longer than before, but I have given meaningful names to the parameters (which I can also do in Bison BTW) and I can include any code.
But that's not the syntax that the current proposal needs. It needs this:
[( [](auto ctx) { _val(ctx) = _attr(ctx)[0_c] + _attr(ctx)[1_c]; } )]
Yep. You could also write:

auto [a, b] = _attr(ctx);
_val(ctx) = a + b;

... if you were using std::tuple. Neither of those is as pithy as _val = _1 + _2. However, as soon as you do anything significantly more complicated than that, lambdas are better.
Frankly I think that combines the worst of all the other options and adds some extra new verbosity of its own.
One other advantage of having the lambda return a result could be improving how the semantic type (aka attribute type) of a rule is set. For example, if I have the rule
degrees >> minutes
where degrees and minutes return doubles, the semantic type of the combined rule is something like tuple<double, double>. But I'm going to add a semantic action that adds the two values as shown above, returning one double. If the return value of the semantic action's lambda were used as the semantic type of the rule then we could write:

auto rule = (degrees >> minutes)
    [( [](auto d, auto m) { return d + m/60.0; } )];
and rule would return a double to its parent rule. As it is, the semantic action doesn't change the type and we have to explicitly and verbosely declare the rule as returning a double.
I had not thought of this before, but I really like it. I'll have to think pretty deeply about whether there is some reason this might not work, but I'm definitely going to try to implement this. I'm thinking that the rule would be: if std::apply(f, tup) is well-formed (where f is your invocable, and tup is your tuple of values returned by _attr()), and decltype(std::apply(f, tup)) is assignable to _val(), then do the semantic action as _val(ctx) = std::apply(f, _attr(ctx)); otherwise, the semantic action gets called with the current context. There would be a special case when _attr() is not a tuple, but you get the idea. Ticket to track it: https://github.com/tzlaine/parser/issues/106
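In C++20 terms, a rough sketch of that dispatch might look like the following. This is a thought experiment, not the library's implementation; invoke_action and the exact constraint are made up:

#include <tuple>  // std::apply

template <typename Action, typename Context>
void invoke_action(Action & f, Context & ctx)
{
    if constexpr (requires { _val(ctx) = std::apply(f, _attr(ctx)); }) {
        // New behavior: unpack the attribute tuple into the lambda's
        // parameters, and make its return value the rule's value.
        _val(ctx) = std::apply(f, _attr(ctx));
    } else {
        // Today's behavior: the lambda takes the whole context.
        f(ctx);
    }
}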
Of course the context includes more than just the child rule semantic values and a slot for the result; for example it includes the _pass boolean that I use to clamp the range of values in my example:
[( [](auto ctx) { _pass(ctx) = _attr(ctx)[0_c] <= 90; } )];
I suggest that either a special lambda signature is used to indicate that it is a predicate, or an additional "directive" e.g. predicate[ ... ] is added.
This idea I don't like as much. I think the previous idea makes things simpler in simpler cases, and something like _pass(ctx) should just use the ctx version of the semantic action.
Compilation Times and Errors
----------------------------
Compilation times for my example were:
Boost.Parser: 28.6 s
Boost.Spirit: 30.1 s
I get something pretty similar, except that it's about 5 seconds for each.
but the grammar is not quite identical, so the small difference is not significant. Code size is somewhat larger for Parser than for Spirit.
(g++ 12.2.0, -O3, Linux, aarch64.)
I did not particularly notice improved error messages compared to Spirit, though they weren't worse. Consider for example this code from my example:
return parse(s, input, bp::ws|bp::lit(','));
Originally I had this:
auto skipper = bp::ws | bp::lit(',');
return parse(s, input, skipper);
Oof. Yeah, that's a constraint problem. It's pretty easily fixed; see https://github.com/tzlaine/parser/issues/105 . If you make skipper const, the problem goes away. The problem is that the overload getting selected is the one that takes an out-param, so it's trying to assign your result to skipper.
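Concretely, this version compiles (the same code as above, with const added):

const auto skipper = bp::ws | bp::lit(',');
return parse(s, input, skipper);

A const skipper can no longer bind to the overload of parse() that takes the result as a mutable out-param, so overload resolution picks the skipper overload instead.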
But that failed to compile, and I got 15 screenfuls of impenetrable error messages. I fixed it more-or-less by randomly changing things until it worked. If I now change it back I get only two screenfuls that accurately say that I can't return a bool from a function that's supposed to return an optional, but I still don't know why the return type of parse() changes when I move the declaration of the skipper.
So I fear that the idea that compile times and error verbosity are improved compared to Spirit is not borne out.
Well, it's a mixed bag. The kind of error you ran into is the worst kind. When you try to use an out-param for your parser's result that does not work, the errors are very messy. It sort of has to be this way, though, because the loose matching of parser attribute to out-param works because we defer any kind of checking until the last possible moment -- which is always dozens of instantiation stack frames in. Where the library improves over Spirit is when you write many kinds of errors in your own semantic action lambdas, especially when the error is some kind of call involving ctx, like _attr(ctx). That can often be turned into a runtime error that is very quickly diagnosed.
Bugs
----
I found a couple of bugs while writing this review which Zach has resolved. I suspect that this code has not had enough varied use to flush out all the issues.
This may well be. There are a lot of tests, but the combinatorics make it hard to catch all the corners.
Other Issues
------------
The double_ parser is not configurable in the way that the Spirit double_ parser is. This is problematic; in my example 3e6n means "3 east 6 north", not "3000000 north", so I need a double_ parser that doesn't recognise exponents. I'd also like to ignore values with leading .s, i.e. 10.1 should not match int_ >> double_ as "10 0.1", and I really don't want a hacker to sneak a nan or inf into my code if it recognises those.
Addressed above.
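For other readers, the Qi-side configurability being asked for looks roughly like the sketch below; the policy overrides are from memory of qi::real_policies, so treat this as a sketch rather than gospel:

template <typename T>
struct plain_real_policies : boost::spirit::qi::real_policies<T>
{
    // Reject leading dots, so 10.1 cannot match int_ >> double_ as "10 0.1".
    static bool const allow_leading_dot = false;
    // Reject exponents, so "3e6n" reads as "3 east 6 north", not "3000000 north".
    template <typename Iterator>
    static bool parse_exp(Iterator &, Iterator const &) { return false; }
    // Reject nan and inf entirely.
    template <typename Iterator, typename Attribute>
    static bool parse_nan(Iterator &, Iterator const &, Attribute &) { return false; }
    template <typename Iterator, typename Attribute>
    static bool parse_inf(Iterator &, Iterator const &, Attribute &) { return false; }
};

boost::spirit::qi::real_parser<double, plain_real_policies<double>> const plain_double;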
The docs are not bad and are reminiscent of the Spirit docs. Personally I would prefer to see an example early on that takes a chunk of real BNF and translates it into a parser.
I had issues with Apple Clang (15.0.0); it would not compile "#include <boost/parser/parser.hpp>" until I turned off concepts. Has anyone tried with other Clangs? If this is not fixable maybe the library should detect this and use non-concepts mode automatically.
This appears to be a Clang bug, or very nearly related to one; see https://github.com/tzlaine/parser/issues/89 . Andrzej has since said that some change I made seems to have fixed/side-stepped the problem. I don't have great access to Clang, so I did not verify this myself.
Compiling with -Wall resulted in lots of unused parameter warnings. These can be fixed by commenting out the parameter name, leaving just the type. Boost libraries should compile without warnings; some people use -Werror.
I usually do compile with -Wall. Perhaps it got turned off at some point. Will fix: https://github.com/tzlaine/parser/issues/107
Conclusion
----------
I was hoping to find a library that was an improvement on Spirit, but I regret to say that that's not what I've found here.
The only notable improvement compared to Spirit.Qi that I've found is that no_case[] is propagated to nested rules, which isn't the case with Qi.
Rules themselves are essentially the same as Spirit. The method for defining rules is more verbose and now needs an ugly macro. Semantic actions are less readable.
It is possible that there are significant advantages compared to Spirit that I have missed, as a result of having used working Spirit code as my starting point, but I have read the docs' comparison-with-Spirit page.
My conclusion is that this library should not be accepted.
If others believe it should be accepted then I feel that should be conditional on many of the issues I've mentioned above being resolved, e.g. unused parameter warnings, configuration of the double_ parser, etc.
[snip]

Thanks for spending the time, and for your thoughtful review, Phil.

Zach