[boost] Boost.Parser Review

21 Feb 2024

      Boost.Parser Review
===================

I've been using Spirit for almost 20 years. It has always been
an amazing example of pushing the limits of what is possible in
C++, even by the standards of Boost. So it has always left me
thinking that we just need to wait a few more years until the
language has the few more features needed to smooth over its
remaining awkwardness.

Now a few years have passed and we have, for example, lambdas,
auto, constexpr, concepts. So I was excited to see what
improvements Zach's proposed library has over Spirit.

For example, when Metaparse was reviewed in 2015, compile-time
parsing of BNF was discussed. Perhaps we should be able to write
actual BNF in compile-time strings now, rather than re-use C++
operators?

And now that we have constexpr, should we be able to convert the
production rules to tables at compile time and do proper
linear-complexity LALR parsing, rather than top-down? We do now
have compile-time regular expression libraries than can do
something like this.

Initial Reaction
----------------

My first reaction when I looked at the docs was:

- I'd like to congratulate Zach on choosing a sensible
  descriptive name for the library.

- The rule syntax looks very much like Spirit. The docs look
  like the Spirit docs. It's more of an evolution than a
  revolution.

- There's an ugly MACRO to define rules! That doesn't look very
  2024 to me.

Test Case
---------

I decided to try to translate an exiting parser from Spirit v2
(qi) to the new library. It's a parser for latitude-longitude
coordinates entered by a human, i.e. it is supposed to try to
recognise a range of possible formats. Examples:

  12.34 N, 56.78 E
  56.78,-12.34
  12d 30' n 45d 15' 7" w
  12 30 45 N, 45 15 7 W
  12d 30m 15s S 50d 59m 59s W

There are some subtle issues with this grammar. Note for example
that 's' could mean seconds or South. One advantage of Bison is
that it gives compile-time diagnostics reporting some of these.
Neither Spirit nor the new library can do this.
The core of the Spirit code is as follows:

(I will link to a .zip with the complete code at the end; I bet
the mailing list messes up the long lines.)

/////////////////////////////////////

    degrees_symbol = lit(UTF8_DEGREES)      | no_case[             lit("degrees") | lit("deg") | lit('d') ];
    minutes_symbol = lit(UTF8_PRIME)        | no_case[ lit('\'') | lit("minutes") | lit("min") | lit('m') ];
    seconds_symbol = lit(UTF8_DOUBLE_PRIME) | no_case[ lit('"')  | lit("seconds") | lit("sec") | lit('s') ];

    uint_0_60   %= uint_         [ _pass = _1 < 60U ];
    double_0_60 %= basic_udouble [ _pass = _1 < 60 ];

    decimal_degrees        %=  basic_udouble >> -degrees_symbol;
    degrees_decimal_minutes = (uint_         >> -degrees_symbol >> double_0_60 >> -minutes_symbol)                                    [ _val = _1 + _2/60.0 ];
    degrees_minutes_seconds = (uint_         >> -degrees_symbol >> uint_0_60   >> -minutes_symbol >> double_0_60 >> -seconds_symbol)  [ _val = _1 + _2/60.0 + _3/3600.0 ];

    degrees %= degrees_minutes_seconds  // Longest first.
             | degrees_decimal_minutes
             | decimal_degrees;

    northsouth %= no_case[ char_("ns") ];
    eastwest   %= no_case[ char_("ew") ];

    latitude  = (degrees >> northsouth) [( _pass = _1 <=  90, _val = if_else(_2=='S' || _2=='s', -_1, _1) )];
    longitude = (degrees >> eastwest)   [( _pass = _1 <= 180, _val = if_else(_2=='W' || _2=='w', -_1, _1) )];

    signed_degrees %= basic_double >> -degrees_symbol;

    signed_latitude  %= signed_degrees  [ _pass =  -90 <= _1 && _1 <=  90 ];
    signed_longitude %= signed_degrees  [ _pass = -180 <= _1 && _1 <= 180 ];

    latlon = ((latitude >> longitude)               [ _val = construct<g2d::Vector>(_2,_1) ])
           | ((longitude >> latitude)               [ _val = construct<g2d::Vector>(_1,_2) ])
           | ((signed_longitude >> signed_latitude) [ _val = construct<g2d::Vector>(_1,_2) ]);

/////////////////////////////////////

The complete Boost.Parser code is as follows:

/////////////////////////////////////

#include "boost/parser/parser.hpp"

#include <cassert>
#include <iostream>
#include <optional>
#include <string>

namespace bp = boost::parser;

using namespace boost::hana::literals;

namespace g2d {
  struct Vector { double x, y; };
};

bp::rule<struct uint_0_60,               unsigned int> uint_0_60               = "uint_0_60";
bp::rule<struct double_0_60,             double>       double_0_60             = "double_0_60";
bp::rule<struct degrees_decimal_minutes, double>       degrees_decimal_minutes = "degrees_decimal_minutes";
bp::rule<struct degrees_minutes_seconds, double>       degrees_minutes_seconds = "degrees_minutes_seconds";
bp::rule<struct degrees,                 double>       degrees                 = "degrees";
bp::rule<struct latitude,                double>       latitude                = "latitude";
bp::rule<struct longitude,               double>       longitude               = "longitude";
bp::rule<struct signed_latitude,         double>       signed_latitude         = "signed_latitude";
bp::rule<struct signed_longitude,        double>       signed_longitude        = "signed_longitude";
bp::rule<struct latlon,                  g2d::Vector>  latlon                  = "latlon";

//const auto degrees_symbol = bp::no_case[                 bp::lit("degrees") | bp::lit("deg") | bp::lit('d') ];
//const auto minutes_symbol = bp::no_case[ bp::lit('\'') | bp::lit("minutes") | bp::lit("min") | bp::lit('m') ];
//const auto seconds_symbol = bp::no_case[ bp::lit('"')  | bp::lit("seconds") | bp::lit("sec") | bp::lit('s') ];
// FIXME those rules don't work and I don't know why. They work if the symbol is
// the first alternative; it it's any of the other alternatives the double returned
// by e.g. decimal_degrees below is always zero.
// Use these instead:
const auto degrees_symbol = bp::lit('d');
const auto minutes_symbol = bp::lit('m');
const auto seconds_symbol = bp::lit('s');  // Beware, 's' can also mean south.

const auto uint_0_60_def   = bp::uint_   [( [](auto ctx) { _pass(ctx) = _attr(ctx) < 60U; _val(ctx) = _attr(ctx); } )];
const auto double_0_60_def = bp::double_ [( [](auto ctx) { _pass(ctx) = _attr(ctx) < 60;  _val(ctx) = _attr(ctx); } )];

const auto decimal_degrees = bp::double_ >> -degrees_symbol;

const auto degrees_decimal_minutes_def = (bp::uint_ >> -degrees_symbol
                                       >> double_0_60 >> -minutes_symbol) [( [](auto ctx) {
  auto d = _attr(ctx)[0_c];
  auto m = _attr(ctx)[1_c];
  _val(ctx) = d + m/60.0;
} )];
// FIXME this rule matches e.g. 150.5 as 150 .5; we want that input to
// be matched by the decimal degrees rule as 150.5. In the qi code, which
// has the same fundamental problem, it is avoided because the double_
// parser is configured to reject input with a leading decimal point.
// It would be more correct to require that either a degrees symbol or
// whitepace or both between the degrees and minutes numbers. That's
// a bit complicated because the whitespace is handled by the skipper.
// In a bottom-up parser I think the ambiguity is resolved by putting
// decimal_degrees first in degrees_def below. That doesn't work for a
// top-down parser.
// FIXME for similar reasons, "150d 0.5m n" fails to parse, because
// the degrees_minutes_seconds rule matches 150,0,.5 and then the next
// rule up fails to parse the 'm'.

const auto degrees_minutes_seconds_def = (bp::uint_ >> -degrees_symbol
                                       >> uint_0_60 >> -minutes_symbol
                                       >> double_0_60 >> -seconds_symbol) [( [](auto ctx) {
  auto d = _attr(ctx)[0_c];
  auto m = _attr(ctx)[1_c];
  auto s = _attr(ctx)[2_c];
  _val(ctx) = d + m/60.0 + s/3600.0;
} )];

const auto degrees_def = degrees_minutes_seconds
                       | degrees_decimal_minutes
                       | decimal_degrees;

const auto northsouth = bp::no_case[ bp::char_("ns") ];
const auto eastwest   = bp::no_case[ bp::char_("ew") ];

const auto latitude_def = (degrees >> northsouth) [( [](auto ctx) {
  auto d  = _attr(ctx)[0_c];
  auto ns = _attr(ctx)[1_c];
  _pass(ctx) = d <= 90;
  _val(ctx) = ns=='S' || ns=='s' ? -d : d;
} )];

const auto longitude_def = (degrees >> eastwest) [( [](auto ctx) {
  auto d  = _attr(ctx)[0_c];
  auto ew = _attr(ctx)[1_c];
  _pass(ctx) = d <= 180;
  _val(ctx) = ew=='W' || ew=='w' ? -d : d;
} )];

const auto signed_degrees = bp::double_ >> -degrees_symbol;

const auto signed_latitude_def  = signed_degrees [( [](auto ctx) { auto d = _attr(ctx); _pass(ctx) =  -90 <= d && d <=  90; } )];
const auto signed_longitude_def = signed_degrees [( [](auto ctx) { auto d = _attr(ctx); _pass(ctx) = -180 <= d && d <= 180; } )];

const auto latlon_def = ((latitude >> longitude) [( [](auto ctx) { _val(ctx) = g2d::Vector{_attr(ctx)[1_c], _attr(ctx)[0_c]}; } )] )
                      | ((longitude >> latitude) [( [](auto ctx) { _val(ctx) = g2d::Vector{_attr(ctx)[0_c], _attr(ctx)[1_c]}; } )] )
                      | ((signed_longitude >> signed_latitude)
                                                 [( [](auto ctx) { _val(ctx) = g2d::Vector{_attr(ctx)[0_c], _attr(ctx)[1_c]}; } )] );

BOOST_PARSER_DEFINE_RULES(uint_0_60);
BOOST_PARSER_DEFINE_RULES(double_0_60);
BOOST_PARSER_DEFINE_RULES(degrees_decimal_minutes);
BOOST_PARSER_DEFINE_RULES(degrees_minutes_seconds);
BOOST_PARSER_DEFINE_RULES(degrees);
BOOST_PARSER_DEFINE_RULES(latitude);
BOOST_PARSER_DEFINE_RULES(longitude);
BOOST_PARSER_DEFINE_RULES(signed_latitude);
BOOST_PARSER_DEFINE_RULES(signed_longitude);
BOOST_PARSER_DEFINE_RULES(latlon);

static std::optional<g2d::Vector> try_parse_latlon(std::string_view s)
{
  auto input = latlon >> bp::eoi;
  return parse(s,input, bp::ws|bp::lit(','));
}

int main(int argc, const char* argv[])
{
  assert(argc==2);

  auto opt_coords = try_parse_latlon(argv[1]);

  if (!opt_coords) {
    std::cerr << "parse error\n";
    return 1;
  } else {
    std::cout << opt_coords->x << " " << opt_coords->y << "\n";
    return 0;
  }
}

/////////////////////////////////////

That Ugly Macro
---------------

  bp::rule<struct latitude, double> latitude = "latitude";
  const auto latitude_def = (degrees >> northsouth) .... ;
  BOOST_PARSER_DEFINE_RULES(latitude);

Is a macro really the best way to do this? Spirit seems to
manage without.

Despite having a macro this is still very verbose. I've had to
write the name of the rule FIVE times in three lines. Once there
is a macro you might as well put everything inside it:

  BOOST_PARSER_DEFINE_RULE(latitude, double, (degrees >> northsouth) ....);

Of course if you need to declare the rules before defining them
due to recursion or other forward references you will need separate 
macros for declaration and definition.

These rule declarations seem to need to be at global scope
(maybe namespace scope?). Is that fundamental? Can't I put them
in a class or a function?

Semantic Actions
----------------

In Bison I can write semantic actions like this:

  { $$ = $1 + $2/60.0; }

After expanding the $ variables that gets copied verbatim into
the bison output, so it can include any code.

Spirit clearly wanted to replicate that sort of functionality,
but it pre-dates lambdas, so it includes a lot of amazing
functionality that allows me to write semantic actions like:

  [ _val = _1 + _2/60.0 ]

This is nice and concise but its limitation is of course that I
cannot write arbitrary code in there.

(BTW, are we strictly allowed to use identifiers starting with _
like _val?)

What should this look like now that we have lambdas in the
language? Of course it's going to be a bit more verbose because
of the syntax overhead, but I was hoping that it would be
something like this:

  [ [](auto deg, auto min) { return deg + min/60.0; } ]

(Aside: That doesn't actually work because [[ can only appear in
attributes even if there is a space, so it needs to be written
[( [] ... )]. I did wonder if changing from operator[] to
operator() might make sense?)

That's longer than before, but I have given meaningful names to
the parameters (which I can also do in Bison BTW) and I can
include any code.

But that's not the syntax that the current proposal needs. It
needs this:

  [( [](auto ctx) { _val(ctx) = _attr(ctx)[0_c] + _attr(ctx)[1_c]; } )]

Frankly I think that combines the worst of all the other options
and adds some extra new verbosity of its own.

One other advantage of having the lambda return a result could
be improving how the semantic type (aka attribute type) of a
rule is set. For example, if I have the rule

  degrees >> minutes

where degrees and minutes return doubles, the semantic type of
the combined rule is something like tuple<double,double>. But
I'm going to add a semantic action that adds the two values as
shown above, returning one double. If the return value of the
semantic action's lambda were used as the semantic type of the
rule then we could write:

  auto rule = (degrees >> minutes)
                [( [](auto d, auto m) { return d + m/60.0; } )];

and rule would return a double to its parent rule. As it is, the
semantic action doesn't change the type and we have to
explicitly and verbosely declare the rule as returning a double.

Of course the context includes more than just the child rule
semantic values and a slot for the result; for example it
includes the _pass boolean that I use to clamp the range of
values in my example:

  [( [](auto ctx) { _pass(ctx) = _attr(ctx)[0_c] <= 90; } )];

I suggest that either a special lambda signature is used to
indicate that it is a predicate, or an additional "directive"
e.g. predicate[ ... ] is added.

Compilation Times and Errors
----------------------------

Compilation times for my example were:

  Boost.Parser: 28.6 s
  Boost.Spirit: 30.1 s

but the grammar is not quite identical, so the small difference
is not significant. Code size is somewhat larger for Parser than
for Spirit.

(g++ 12.2.0, -O3, Linux, aarch64.)

I did not particularly notice improved error messages compared
to Spirit, though they weren't worse. Consider for example this
code from my example:

  return parse(s, input, bp::ws|bp::lit(','));

Originally I had this:

  auto skipper = bp::ws | bp::lit(',');
  return parse(s, input, skipper);

But that failed to compile, and I got 15 screenfuls of
inpenetrable error messages. I fixed it more-or-less by randomly
changing things until it worked. If I now change it back I get
only two screenfuls that accurately say that I can't return a
bool from a function that's supposed to return an optional, but
I still don't know why the return type of parse() changes when I
move the declaration of the skipper.

So I fear that the idea that compile times and error verbosity
are improved compared to Spirit are not borne out.

Bugs
----

I found a couple of bugs while writing this review which Zach
has resolved. I suspect that this code has not had enough varied
use to flush out all the issues.

Other Issues
------------

The double_ parser is not configurable in the way that the
Spirit double_ parser is. This is problematic; in my example
3e6n means "3 east 6 north", not "3000000 north", so I need a
double_ parser that doesn't recognise exponents. I'd also like
to ignore values with leading .s, i.e. 10.1 should not match
int_ >> double_ as "10 0.1", and I really don't want a hacker to
sneak a nan or inf into my code if it recognises those.

The docs are not bad and are reminiscent of the Spirit docs.
Personally I would prefer to see an example early on that takes
a chunk of real BNF and translates it into a parser.

I had issues with Apple Clang (15.0.0); it would not compile
"#include <boost/parser.hpp>" until I turned off concepts. Has
anyone tried with other Clangs? If this is not fixable maybe the
library should detect this and use non-concepts mode
automatically.

Compiling with -Wall resulted in lots of unused parameter
warnings. These can be fixed by commenting-out the parameter
name, leaving just the type. Boost libraries should compile
without warnings, some people use -Werror.

Conclusion
----------

I was hoping to find a library that was an improvement on
Spirit, but I regret to say that that's not what I've found
here.

The only notable improvement compared to Spirit.Qi that I've
found is that no_case[] is propogated to nested rules, which
isn't the case with Qi.

Rules themselves are essentially the same as Spirit. The method
for defining rules is more verbose and now needs an ugly macro.
Semantic actions are less readable.

It is possible that there are significant advantages compared to
Spirit that I have missed, as a result of having used working
Spirit code as my starting point, but I have read the doc's
comparison with Spirit page.

My conclusion is that this library should not be accepted.

If others believe it should be accepted then I feel that should
be conditional on many of the issues I've mentioned above being
resolved, e.g. unused parameter warnings, configuration of the
double_ parser, etc.

File containing example:
https://chezphil.org/tmp/parser_review.zip

Regards, Phil.

[boost] Boost.Parser Review

Phil Endecott