Reminder: Review for Zach's parsing library starts on Feb 19th
That’s a week from now.
Github: https://github.com/tzlaine/parser
Docs: https://tzlaine.github.io/parser
— Marshall
Marshall,
I should have mentioned my affiliation (technical writer) with C++
Alliance!
Apologies for the omission.
- Peter
On Mon, Feb 19, 2024 at 7:42 PM Peter Turcan wrote:
Marshall,
My 2c: Review of the Boost.Parser documentation. I can see that considerable effort has been put into it so far; here are some ideas on how to improve it.
Text - perhaps the text is a little small - consider a larger font size for both text and code examples. Refer to Microsoft MSDN for comparison.
Introduction
=========
- Would be great to see a table with two columns: the first holding the Name of each parser included in the lib, the second a Description (a sentence or two) of the purpose of the parser. It is OK to have a table like this in an Introduction. The names can be linked into the Reference. Such a table encapsulates the full scope of the library in one spot, and tables are popular.
- Would be nice to see some examples of scenarios or applications that would benefit from this library, or perhaps that this library is most targeted towards. For example, is Boost.Parser suitable for writing a language compiler or interpreter, data serialization (and deserialization), parsing config files, writing scripting engines, writing query languages, parsing high/low level protocols, natural language, mathematical expressions and formulae, or even text-based games or components of games (dialog?). This kind of introduction helps answer the "why" question - "Why should I be interested in this library?".
- Perhaps give some ideas of "Combinatory Parsing" - some examples of parser combinations that work for any of the scenarios mentioned above.
- Is some or all of the functionality available in the Standard Library, and if so, how does it compare?
- Mention performance - what priority has this been given and are there any pertinent examples when compared with other options.
Style points:
----------------
- Avoid using words or phrases that add little or no meaning - such as "very" or "simply" - this sharpens the reading experience.
- Avoid anthropomorphism - don't treat software as a human - software does not "know" about anything. It can record and reproduce - it never "knows", "likes", "dislikes" or similar.
- The introduction currently is over-technical, and could use a higher-level/conceptual treatment.
Getting Started
============
- Suggest having a "Getting Started with Boost.Parser" section, following the Introduction, even if the instructions are simple.
- Include the B2 and CMake instructions to install the parser.
- Include a section on Dependencies - even if the answer is "None".
- Move the section on supported compilers/OS to this topic.
Configuration and Optional Features
============================
- Could do with some sub-headings for each Config option or Feature available.
- Should follow the Getting Started section.
Relationship to Boost.Spirit
=====================
- Consider moving this to or following the Introduction.
Terminology
==========
Good to have a terminology section; perhaps move this to follow the "Configuration and Optional Features" topic as it doesn't seem to be specific to the Tutorials.
Tutorial
======
- Rename this to "Tutorials".
- Many of the headings in this section are inactive labels (such as "Symbol Tables"). It is more compelling if these are active headings - for example: "Parse name/value pairs into Symbol Tables" - or whatever would be the active heading for this topic. And the same for all the other entries in the Tutorials section.
- "Writing Your Own Parsers" is not as effective a heading as "Write Your Own Parsers".
- Great to have this good range of tutorials.
Concepts
=======
- Currently this is just a code block - add an introduction explaining what is being shown, and any nuances or potential sticky spots.
Compiler Support
==============
- Move to the Getting Started section, and add OS support (even if this is trivial - put it in writing).
Reference
========
- Great to have a full Reference, though each entry needs textual explanation. For example, the macro BOOST_PARSER_USE_CONCEPTS needs an explanation as to its purpose.
- Add a "Description" heading and text to all Reference entries without one. Make sure any nuances, unexpected behavior, conflicts with other settings/macros/etc, special cases and the like, are mentioned in the Description.
- Consider adding an "Errors and Exceptions" section to each Reference entry that might return an error, listing the errors/exceptions that might be returned or fired, and ideally what might have caused this and how to respond.
- Remember tables are popular - tables of errors, parameters, fields, methods, etc.
Navigation
========
- Great to have all the code functions and structures mentioned in the Tutorials link to an entry in the Reference. It can work the other way too - adding an Example section to each reference, linking to one or more of the Tutorials (if an example exists).
- Similarly for the Introduction and other front matter, if you are mentioning the name of a library component, make that mention a link.
On Mon, Feb 19, 2024 at 9:42 PM Peter Turcan via Boost wrote:
Marshall,
My 2c: Review of the Boost.Parser documentation. I can see that considerable effort has been put into it so far; here are some ideas on how to improve it.
[snip] Thanks for your review, Peter! Zach
Review for Zach's parsing library starts on Feb 19th
Here is my review of the proposed
Boost.Parser
## What is your evaluation of the design?
The design is clear with good structure.
This makes the proposed Boost.Parser
library easy to use. It is simultaneously
quite powerful. Naming choices are
intuitive, which will help beginning,
intermediate and advanced users
alike when using this library.
## What is your evaluation of the implementation?
Excellent.
A few suggestions for potential
evolution of the library follow.
I would like to see some evidence of
Code Coverage in testing. Some newer
Boost libraries are showing improved
quality with Code Coverage reports
and statistics. Although this adds
no real functionality, I think it
helps prove overall quality.
When compiling, I really enjoy using
advanced compiler warnings. Two of
my favorites on GCC are -Wconversion
and -Wsign-conversion. When using these
I found several easy-to-fix warnings
present in the proposed Boost.Parser.
Providing evidence of running modern
syntax checker(s) might improve
confidence in the library and/or
reduce the library's vulnerability.
## What is your evaluation of the documentation?
It's great. The library is deep in scope
and functionality. But I was able to
(with help from the docs) get a small
example up and running quickly.
It is easy to study the docs, and they read crisply and clearly.
## What is your evaluation of the potential usefulness of the library?
Exceptionally high. Parsing tasks
are ubiquitous in almost all
programming domains. A strong, intuitive
parser library fills a real gap
in the collection. Spirit is great and
hopefully will continue to be maintained.
But I hope to migrate new designs to
the proposed Boost.Parser and I will
recommend to others to do the same.
## Did you try to use the library? With what compiler? Did you have any problems?
Yes, absolutely. At first I tried a simple
command-line interpreter. It worked
immediately and was up and running
within minutes. I then went on to define
a simple grammar with upper-case words (and underscores) and tried to parse it.
This also worked perfectly.
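A sketch of the kind of grammar described, for concreteness (illustrative only; not the reviewer's actual code, and the names are assumed):

namespace bp = boost::parser;

// Words of upper-case letters and underscores. lexeme[] keeps the
// whitespace skipper from firing between the characters of a word.
auto const word = bp::lexeme[+(bp::char_('A', 'Z') | bp::char_('_'))];

// Parses "FOO_BAR BAZ" into an optional std::vector<std::string>.
auto const words = bp::parse("FOO_BAR BAZ", +word, bp::ws);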
I did not have time for it yet,
but I would like to attempt to
make a primitive for parsing
big-integers and big-floats
(in the sense of Boost.Multiprecision).
I tried with MSVC 2019 and 2022, and GCC 11-13.
No issues.
## How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
I spent about 2 days, using the code,
then did quite a bit of looking at code
and reading the docs.
## Are you knowledgeable about the problem domain?
Yes. But only in the empirical sense.
I have written many programs and
applications based on and using parsers,
usually parsing self-invented grammars,
protocols, binary streams, lexical
fragments and the like, really a lot
of parser-things. Yet at the same time,
I consider myself to be vastly uninformed
about the formal depths of grammars,
standard parser formats and the like.
So making progress there has
sometimes been a struggle.
This is what makes the proposed
Boost.Parser so great. I can use
the proposed Boost.Parser effectively
to solve problems in my domain(s)
even though my formal parser
knowledge is weak.
This makes the proposed Boost.Parser
a valuable library.
## Do you think the library should be accepted as a Boost library?
Yes!
Christopher Kormanyos
On Monday, February 12, 2024 at 05:04:11 PM GMT+1, Marshall Clow via Boost wrote:
I'm going to boldly bring this ML into the future and post a GH gist of my review: https://gist.github.com/cmazakas/f363187ef45a03c55661fc2324a6df36
This way I have things like code formatting and paragraphs and visual separators.
The tl;dr is that we should overwhelmingly ACCEPT Parser.
If anyone wants, I can just do a plaintext dump of the gist but I think the gist should render fine for everyone.
- Christian
I realize I forgot to officially disclose that I'm a C++ Alliance staff engineer. But I've been a huge fan of Spirit for much longer than I've been working at the Alliance.
- Christian
On Feb 22, 2024, at 1:12 PM, Christian Mazakas via Boost wrote:
I'm going to boldly bring this ML into the future and post a GH gist of my review: https://gist.github.com/cmazakas/f363187ef45a03c55661fc2324a6df36
This way I have things like code formatting and paragraphs and visual separators.
The tl;dr is that we should overwhelmingly ACCEPT Parser.
If anyone wants, I can just do a plaintext dump of the gist but I think the gist should render fine for everyone.
Please do that, so that the review ends up in the ML archives. (And thanks for the review) — Marshall
plaintext dump as requested:

I've been a huge fan of Spirit for many years now. One of my favorite Spirit memories was during two separate job interviews.

The first time, I was interviewing with a game company and on my resume, I noted that I had written a simple HTTP server that performed simple geospatial querying, creating a list of cities within a radius of a provided zip code. Here's the link: https://github.com/cmazakas/location-service. I remember that I had used Spirit for parsing a CSV and the interviewer was genuinely concerned about me using such a library for such a seemingly small task. It was the main focus of the interview and I think represented a cultural divide. I did not get the job.

The second time, I used it as a litmus test for me writing C++ at the company. They saw me use Spirit and still hired me anyway, which I took to be a great sign. This time, I got the job.

## Boost.Parser

Cutting to the chase, I vote we ACCEPT Parser.

To test the library, I've been using a small test project here: https://github.com/cmazakas/parser-review

My criterion for acceptance was that I'd be able to write an HTTP/1 parser with feature parity to what is currently being done in the proposed Boost.Http.Proto library. I feel that I was able to achieve this goal and I stopped proceeding because I no longer felt like the library itself would be a blocker.

Now, even though I'm voting ACCEPT, I'm going to spend most of the review criticizing the library. To me, the value-add of a type-safe parser combinator library is obvious so I won't spend too much time here. Basically, testable, reusable parsers are good, and even better when they elegantly compose as well.

### The Feedback

The differences between Spirit.X3 and Parser need to be made much more clear by the documentation, by which I mean:

* Spirit X3 has rules that do not compose well — the attributes produced by a rule can change depending on the context in which you use the rule.
* Spirit X3 is missing many of the convenient interfaces to parsers that Spirit 2 had. For instance, you cannot add parameters to a parser.
* All versions of Spirit have Unicode support, but it is quite difficult to get working.
These need code samples showing the limitations of X3 and how Parser
actually solves them. I think many,
myself included, aren't overwhelmingly clear on what these differences are
and how Parser solves them
and _why_ this is a good thing.
Though I have attempted X3's Unicode before and I will say, Parser's
unified interface is much better.
For example, I wrote an entire RFC-compliant URL parser in X3, including
Unicode support, for my library
Foxy
https://github.com/cmazakas/foxy/blob/7f4ac0495ad2ed9cd0eca5994743d677ac1d26...
.
---
Not a criticism but Parser's directives are a dramatic improvement over X3.
By these, I'm thinking of
`boost::parser::string_view[]`, `boost::parser::separate[]` and
`boost::parser::merge[]`. These should be
highlighted in the comparison vs X3 as I think they add a good bit of
substantive value.
More praise for the library includes its parse API which is a dramatic
improvement over X3 in terms of
ergonomics. Personally, I can see myself mainly using `prefix_parse`,
similarly to X3's original parse() but
I think the value of `parse()` itself is obvious and doesn't warrant much
explanation.
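For readers who have not seen the two entry points, the distinction is roughly this (a sketch based on the documented signatures; details illustrative):

namespace bp = boost::parser;

// parse() succeeds only if it consumes the whole input:
auto whole = bp::parse("123abc", bp::int_);  // empty optional; "abc" remains

// prefix_parse() matches a prefix and advances the iterator past it:
char const *str = "123abc";
auto first = str;
auto prefix = bp::prefix_parse(first, str + 6, bp::int_);
// prefix holds 123; first now points at "abc"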
---
The library generates a dramatic amount of warnings when compiled with
`-Wall -Wextra -pedantic`. I had
originally forgotten to `include_directorie()` Parser as a `SYSTEM`
directory and was met with 172 warnings.
I think being clean under `-Wall -Wextra -Werror` is reasonable criteria
for acceptance.
---
When writing the parser for the HTTP version line, I ran into a lot of
issues separating the major version
from the minor version when using:
parser::digit >> "." >> parser::digit
It was only when I was lazily browsing the docs that I found the
`separate[]` directive which I think indicates
the library's documentation has poor discoverability.
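For concreteness, the difference `separate[]` makes looks roughly like this (attribute shapes per my reading of the docs):

namespace bp = boost::parser;

// Adjacent character attributes merge into a single std::string:
auto const merged = bp::digit >> "." >> bp::digit;

// separate[] suppresses the merge, yielding the two digits individually
// (a tuple of two characters) instead:
auto const split = bp::separate[bp::digit >> "." >> bp::digit];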
Ultimately, it was cleaner for me to just do:
// HTTP-version = HTTP-name "/" DIGIT "." DIGIT
// HTTP-name = %x48.54.54.50 ; HTTP
auto const on_version = [](auto &ctx) {
auto const &tup = _attr(ctx);
auto const major = tup[0_c];
auto const minor = tup[1_c];
if (major > 9 || minor > 9) {
_pass(ctx) = false;
return;
}
response::metadata &md = _globals(ctx).md_;
md.version_.major_ = major;
md.version_.minor_ = minor;
};
auto const http_name = parser::lit("HTTP");
auto const http_version =
(http_name >> "/" >> parser::uint_ >> "." >>
parser::uint_)[on_version];
---
The documentation doesn't exactly give users any guidelines about when they
should formally
define rules with attributes vs semantic actions.
This is one of those things where if users don't already have the intuition
built up, using the
library is an exercise in frustration.
Namely, attributes don't compose well when you start nesting rules and want
to have different out
params at different levels of the parsers. I suppose this should be obvious
because the rules themselves
form a hierarchy so the attributes _must_ mirror that.
I found that the best thing to do for my use-case was to abuse
`_globals(ctx)` and semantic actions.
I think a good rule of thumb for users can be expressed as something like:
use rule attributes for simple
grammars where all you really want is just the `std::vector<double>`. For
something complex with domain-defined
validation rules, prefer `_globals(ctx)` and semantic actions for
validating everything.
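A compact illustration of that rule of thumb (a sketch of the thread's advice, not library-blessed guidance):

namespace bp = boost::parser;

// Simple grammar: let the attribute machinery hand you the result.
std::vector<double> xs;
bool ok = bp::parse("1.5 2.5 3.5", +bp::double_, bp::ws, xs);

// Complex, validated grammar: carry state in _globals(ctx) and do the
// bookkeeping in semantic actions, as in the HTTP code above.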
In general, Parser's semantic actions are a huge improvement over Spirit's.
---
I felt like I came up with working but unsatisfactory solutions to parsing
characters in a specific range.
Namely, I just wrote a simple semantic action around `+parser::char_[f]`
but this made me feel like I was
writing unidiomatic code. I'd appreciate some feedback here:
// reason-phrase = 1*( HTAB / SP / VCHAR / obs-text )
// VCHAR = %x21-7E
auto const on_reason_phrase = [](auto &ctx) {
auto ch = _attr(ctx);
if (ch != '\t' && ch != ' ' && (ch < '\x21' || ch > '\x7e')) {
_pass(ctx) = false;
return;
}
response::metadata &md = _globals(ctx).md_;
std::ranges::subrange sub = _where(ctx);
if (!md.reason_phrase_begin_) {
md.reason_phrase_begin_ = sub.begin();
}
md.reason_phrase_end_ = sub.end();
md.status_line_end_ = sub.end();
};
auto const reason_phrase = parser::omit[+parser::char_[on_reason_phrase]];

---

There should be a debugging utility for discovering what the synthesized attribute is for a rule. Perhaps this could exist as a semantic action that attempts to instantiate an incomplete template.

template<class> struct diagnostic;
auto const debug = [](auto &ctx) { diagnostic<decltype(_attr(ctx))> d; };

It took me a long time to realize what the attribute type for `digit >> "." >> digit` was (it was `std::string`).
---
Dependencies are still a nightmare in C++, so Boost.Parser can be used as a purely standalone library, independent of Boost.
This was funny for me to read because I had a vcpkg port working for my test project in like 2 minutes.

---

The library seems to really emphasize Boost.Hana's `tuple` type but I think in practice, this winds up being a mistake unless there's substantive compile-time savings by using Hana here.

Namely, Hana's `tuple` doesn't support structured bindings. Because Parser is a C++17 library, I consider this a somewhat large defect of the interface. What's worse, Hana's literals aren't compatible with Parser's.

A huge boon for Hana and its `tuple` is that it supports indexing via `tup[0_c]` but this doesn't work when Parser's literals are involved as it also defines a `_c` so an ambiguity is introduced and compilation fails.

Between this and the lack of structured binding support, I think relying on Hana here is a strict downgrade unless there's compelling reasons I'm simply unaware of.

https://github.com/boostorg/hana/milestone/4

---

Stuff like `_globals` in the docs would 404 for me. I also think the docs are kind of just incomplete in general and because there's not much cross-referencing, discoverability is hurt so a user has to essentially read the entire docs front-to-back to understand the library which is a large ask for a parsing library.

### Conclusion

Overall, I think that even with all of the problems I found in the library, it's still quite good and worthy of being brought into Boost as the successor for Spirit. I don't think anything I found was a deal-breaker and I think in a few iterations of Boost, this library can be something quite great.

- Christian
On Thu, Feb 22, 2024 at 4:36 PM Christian Mazakas via Boost
plaintext dump as requested:
Thanks! I was going to have to do it to respond here. :) [snip]
### The Feedback
The differences between Spirit.X3 and Parser need to be made much more clear by the documentation, by which I mean:
* Spirit X3 has rules that do not compose well — the attributes produced by a rule can change depending on the context in which you use the rule.
* Spirit X3 is missing many of the convenient interfaces to parsers that Spirit 2 had. For instance, you cannot add parameters to a parser.
* All versions of Spirit have Unicode support, but it is quite difficult to get working.
These need code samples showing the limitations of X3 and how Parser actually solves them. I think many, myself included, aren't overwhelmingly clear on what these differences are and how Parser solves them and _why_ this is a good thing.
Fair enough. The first one I'll need to try to write a good example for, but the basic idea is that the loose matching of a parser's default attribute to a user-provided attribute is not something you can turn off. So when you define a Spirit X3 rule with the attribute type vector<int>, it can still fill in a deque<int>. This means that some errors you write don't get diagnosed until way later than they could, and you can't really write a rule and test it in isolation, because the context in which you use it can change its behavior -- the loose matching rules are very loose. That's my memory of the problem anyway... I'll have to investigate a bit.

The second one refers to parameterized rules, like you find in the YAML spec:

[70] l-empty(n,c) ::= ( s-line-prefix(n,c) | s-indent-less-than(n) ) b-as-line-feed

When you write the l_empty parser, you want to be able to take two arguments to it, n and c, and then pass those to the subparsers like s_line_prefix. You could do that with Spirit 2 parsers, but not with Spirit X3 parsers. I wrote a YAML parser, and so felt the loss of this pretty strongly. I'm not sure how many people write parsers like this though.

The last one you have dealt with yourself -- I think it might be enough to say that Unicode support requires you to repeat a lot of your parser logic.
Though I have attempted X3's Unicode before and I will say, Parser's unified interface is much better. For example, I wrote an entire RFC-compliant URL parser in X3, including Unicode support, for my library Foxy https://github.com/cmazakas/foxy/blob/7f4ac0495ad2ed9cd0eca5994743d677ac1d26...
Yep. I had written Unicode-aware Spirit parsers before, and want to get rid of that kind of repetition.
Not a criticism but Parser's directives are a dramatic improvement over X3. By these, I'm thinking of `boost::parser::string_view[]`, `boost::parser::separate[]` and `boost::parser::merge[]`. These should be highlighted in the comparison vs X3 as I think they add a good bit of substantive value.
Glad you liked these. These came out of discussions with Andrzej about his frustration at having to write rules for some of his parsers. Rather than embrace the complexity of the attribute rules from Spirit, I thought lexically-visible control via directives would be easier to reason about.
More praise for the library includes its parse API which is a dramatic improvement over X3 in terms of ergonomics. Personally, I can see myself main using `prefix_parse`, similarly to X3's original parse() but I think the value of `parse()` itself is obvious and doesn't warrant much explanation.
Thanks! It took me about 6 different iterations to settle on a minimal overload set.
---
The library generates a dramatic amount of warnings when compiled with `-Wall -Wextra -pedantic`. I had originally forgotten to `include_directorie()` Parser as a `SYSTEM` directory and was met with 172 warnings.
I think being clean under `-Wall -Wextra -Werror` is reasonable criteria for acceptance.
I basically never turn on -Wextra, but I have a ticket to make sure Parser is -Wall-clean: https://github.com/tzlaine/parser/issues/107
When writing the parser for the HTTP version line, I ran into a lot of issues separating the major version from the minor version when using:
parser::digit >> "." >> parser::digit
It was only when I was lazily browsing the docs that I found the `separate[]` directive which I think indicates the library's documentation has poor discoverability.
This is described in the page about attribute generation: https://tzlaine.github.io/parser/doc/html/boost_parser__proposed_/tutorial/a... . It's definitely difficult to know where to put things where people will see them. [snip]
The documentation doesn't exactly give users any guidelines about when they should formally define rules with attributes vs semantic actions.
This is one of those things where if users don't already have the intuition built up, using the library is an exercise in frustration.
Namely, attributes don't compose well when you start nesting rules and want to have different out params at different levels of the parsers. I suppose this should be obvious because the rules themselves form a hierarchy so the attributes _must_ mirror that.
I found that the best thing to do for my use-case was to abuse `_globals(ctx)` and semantic actions.
I think a good rule of thumb for users can be expressed as something like: use rule attributes for simple grammars where all you really want is just the `std::vector<double>`. For something complex with domain-defined validation rules, prefer `_globals(ctx)` and semantic actions for validating everything.
I can add something to the Best Practices page about this. Ticket: https://github.com/tzlaine/parser/issues/110
I felt like I came up with working but unsatisfactory solutions to parsing characters in a specific range.
Namely, I just wrote a simple semantic action around `+parser::char_[f]` but this made me feel like I was writing unidiomatic code. I'd appreciate some feedback here:
// reason-phrase = 1*( HTAB / SP / VCHAR / obs-text )
// VCHAR = %x21-7E
auto const on_reason_phrase = [](auto &ctx) {
auto ch = _attr(ctx);
if (ch != '\t' && ch != ' ' && (ch < '\x21' || ch > '\x7e')) {
_pass(ctx) = false;
return;
}
This part could just have been: char_('\x21', '\x7e') | char_('\t') | char_(' '). You might want to reorder it for the most common stuff to come first. Oh, since you put it in an omit below, it's even simpler: char_('\x21', '\x7e') | '\t' | ' '.
response::metadata &md = _globals(ctx).md_;
std::ranges::subrange sub = _where(ctx);
if (!md.reason_phrase_begin_) {
md.reason_phrase_begin_ = sub.begin();
}
md.reason_phrase_end_ = sub.end();
md.status_line_end_ = sub.end();
};
auto const reason_phrase = parser::omit[+parser::char_[on_reason_phrase]];
---
There should be a debugging utility for discovering what the synthesized attribute is for a rule. Perhaps this could exist as a semantic action that attempts to instantiate an incomplete template.
template<class> struct diagnostic;
auto const debug = [](auto &ctx) { diagnostic<decltype(_attr(ctx))> d; };

It took me a long time to realize what the attribute type for `digit >> "." >> digit` was (it was `std::string`).
Since you are not the only person asking for this: https://github.com/tzlaine/parser/issues/111
The library seems to really emphasize Boost.Hana's `tuple` type but I think in practice, this winds up being a mistake unless there's substantive compile-time savings by using Hana here.
Namely, Hana's `tuple` doesn't support structured bindings. Because Parser is a C++17 library, I consider this a somewhat large defect of the interface. What's worse, Hana's literals aren't compatible with Parser's.
A huge boon for Hana and its `tuple` is that it supports indexing via `tup[0_c]` but this doesn't work when Parser's literals are involved as it also defines a `_c` so an ambiguity is introduced and compilation fails.
Between this and the lack of structured binding support, I think relying on Hana here is a strict downgrade unless there's compelling reasons I'm simply unaware of.
When using Hana, why would you need to use parser::literals? I can only think of uses in the non-Hana compatibility stuff; when using Hana you don't need that stuff, I think.
Stuff like `_globals` in the docs would 404 for me. I also think the docs are kind of just incomplete in general and because there's not much cross-referencing, discoverability is hurt so a user has to essentially read the entire docs front-to-back to understand the library which is a large ask for a parsing library.
I'll audit the links. I did that at some point, but I guess there's been breakage since. Ticket: https://github.com/tzlaine/parser/issues/112
### Conclusion
Overall, I think that even with all of the problems I found in the library, it's still quite good and worthy of being brought into Boost as the successor for Spirit. I don't think anything I found was a deal-breaker and I think in a few iterations of Boost, this library can be something quite great.
Thanks for the review, Christian! Zach
On Sat, Feb 24, 2024 at 2:21 AM Zach Laine via Boost wrote:
On Thu, Feb 22, 2024 at 4:36 PM Christian Mazakas via Boost
wrote: [snip]
### The Feedback
The differences between Spirit.X3 and Parser need to be made much more clear by the documentation, by which I mean:
* Spirit X3 has rules that do not compose well — the attributes produced by a rule can change depending on the context in which you use the rule.
I will add this to the docs as well, but I thought I'd post here an example of what I was referring to above. Here is a complete program using X3:
#include <boost/spirit/home/x3.hpp>
#include <iostream>
#include <set>
#include <string>
#include <vector>

namespace x3 = boost::spirit::x3;

// The template arguments to x3::rule were eaten by the mail archive; they
// are reconstructed here from the surrounding text, which says the rule's
// declared attribute type is a vector of int.
using ints_type = x3::rule<class ints_r, std::vector<int>>;
BOOST_SPIRIT_DECLARE(ints_type);
x3::rule<class ints_r, std::vector<int>> ints = "ints";
constexpr auto ints_def = x3::int_ % ',';
BOOST_SPIRIT_DEFINE(ints);

#define FIXED_ATTRIBUTE 0

int main() {
    std::string input = "43, 42";
    auto first = input.begin();
    auto const last = input.end();
#if FIXED_ATTRIBUTE
    std::vector<int> result;
#else
    std::set<int> result;
#endif
    bool success = x3::phrase_parse(first, last, ints, x3::space, result);
    if (success) {
        // We want this to print "43 42\n".
        for (auto x : result) {
            std::cout << x << ' ';
        }
        std::cout << "\n";
    }
    return 0;
}
IMO, the problem with this is that, even though I specified a vector as the rule's attribute type, I can feed the rule a set (#define FIXED_ATTRIBUTE 1), and it happily populates it with the parsed values. Note that this sorts, and in this case reorders, the values. I think this looseness is important in many cases, but there's no way to turn it off in X3. In Parser the way to turn it off is to use a rule instead of a parser. My inability to disable this looseness resulted in wrong results in an X3 YAML parser I was working on. I really loved so much of the design of X3, that my frustration with this problem was a pretty big part of the motivation for writing Parser.
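For concreteness, "use a rule instead of a parser" in Boost.Parser looks roughly like this (a sketch following the library's documented rule-definition pattern; the names are invented):

namespace bp = boost::parser;

// The rule pins the attribute type; passing a std::set<int> out-param to
// parse() here would be rejected at compile time instead of silently
// reordering the values.
bp::rule<class ints_rule, std::vector<int>> const bp_ints = "ints";
auto const bp_ints_def = bp::int_ % ',';
BOOST_PARSER_DEFINE_RULES(bp_ints);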
I would recommend putting something like the above in the docs, where the motivation for using rule parsers is listed. Currently it states "fix the attribute type produced by a parser to something other than the default" but what you are saying above is more: "prevent the parsers from doing things that they would otherwise do".

I note that parsing a homogenous sequence to a std::set does work, at least without rules: https://godbolt.org/z/x57j1dq5T

I was surprised to find that, in the context of other precautions, feeding parsed data to a collection is performed via `insert` rather than `push_back`.

Regards,
&rzej;
This part could just have been: char_('\x21', '\x7e') | char_('\t') | char_(' '). You might want to reorder it for the most common stuff to come first. Oh, since you put it in an omit below, it's even simpler: char_('\x21', '\x7e') | '\t' | ' '.
I just wanted to chime in and say that this feedback actually worked and the code is a good bit cleaner now!

// reason-phrase = 1*( HTAB / SP / VCHAR / obs-text )
// VCHAR = %x21-7E
auto const on_reason_phrase = [](auto &ctx) {
auto sv = _attr(ctx);
response::metadata &md = _globals(ctx).md_;
md.reason_phrase_begin_ = sv.begin();
md.reason_phrase_end_ = sv.end();
md.status_line_end_ = sv.end();
};
auto const reason_phrase =
parser::string_view[+(parser::char_('\x21', '\x7e') | '\t' | ' ')]
[on_reason_phrase];

The `string_view[]` directive was very sorely missing from X3 the last time I used it in any serious capacity.

Thanks for the feedback, Zach! This code actually looks really nice now.

- Christian
On Fri, Feb 23, 2024 at 2:59 AM Matt Borland via Boost wrote:
Below is my review of the proposed Boost.Parser.
## What is your evaluation of the design?
I thought it was easier to pick-up and use than Spirit. How to use the library is clear, and I didn't see any real gotchas floating around. Like others I appreciate that the library has a descriptive name.
## What is your evaluation of the implementation?
It's generally pretty clean. The numeric parsing in
seems to be a reduced version of the parser from Spirit X3. To test if it was similar I ran this library through the Boost.Charconv benchmarks for performance evaluation with parsing floats, doubles, uint32, and uint64. I ran this using Homebrew GCC 13.2.0 on an M1 Mac. The test set consists of 2'000'000 random numbers of each type, and that test set is run 5 times. The timing in the left column is the average time to parse 1 number (i.e. total time / 10'000'000).

----from_chars float----
 29.7 ns | std::strtof float scientific
 36.7 ns | std::strtod double scientific
519.9 ns | Boost.lexical_cast float scientific
574.0 ns | Boost.lexical_cast double scientific
 28.3 ns | Boost.Spirit.Qi float scientific
 34.7 ns | Boost.Spirit.Qi double scientific
 72.1 ns | Boost.Parser float scientific
 78.9 ns | Boost.Parser double scientific
 22.0 ns | std::from_chars float scientific
 22.2 ns | std::from_chars double scientific
 23.0 ns | Boost.Charconv::from_chars float scientific
 22.7 ns | Boost.Charconv::from_chars double scientific
 65.6 ns | Google double-conversion float scientific
115.2 ns | Google double-conversion double scientific
-----from_chars int-----
23.1 ns | std::strtoul uint32_t
26.8 ns | std::strtoull uint64_t
40.4 ns | Boost.lexical_cast uint32_t
51.4 ns | Boost.lexical_cast uint64_t
29.3 ns | Boost.Spirit.Qi uint32_t
29.4 ns | Boost.Spirit.Qi uint64_t
53.7 ns | Boost.Parser uint32_t
66.3 ns | Boost.Parser uint64_t
11.1 ns | std::from_chars uint32_t
16.4 ns | std::from_chars uint64_t
10.0 ns | Boost.Charconv::from_chars uint32_t
16.3 ns | Boost.Charconv::from_chars uint64_t
The library requires C++17 so it may be worth offloading some of the numeric cases to <charconv> if possible.
Interesting. I would have expected the perf to be basically the same as Qi, since you're right, I took it from Spirit. However, I took it from Spirit X3, not Spirit 2's Qi. Maybe that's the difference? In any case, I think that using one of the from/to_chars implementations would be best. Ticket for that: https://github.com/tzlaine/parser/issues/113
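The general shape of such an offload, for readers following along (a sketch of the idea only, not the library's actual internals):

#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Delegate the numeric "leaf" parsing to std::from_chars and keep the
// combinator machinery around it.
std::optional<double> parse_double(std::string_view sv) {
    double value{};
    auto const result = std::from_chars(sv.data(), sv.data() + sv.size(), value);
    if (result.ec != std::errc())
        return std::nullopt;
    return value;
}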
The only thing I didn't like seems to be a carry-over from Spirit. Most numeric types have a parser of the form type_ (e.g. float_, int_, etc). ulong_long and long_long are both missing the trailing underscore which is a small enough difference to be surprising and a bit annoying.
Well, in all cases, there's an extra '_' after the first token of the original type name. See? Consistent. :) Basically, I saw no need to change it -- you strictly need the trailing '_' for float, and you don't need it for long_long.
I think the CI could be more comprehensive. All of the current testing is run on x86_64. The boost drone has ARM Mac, S390X, and aarch64 all running natively. Boost.CI also has some docker images of other architectures. I concur with Chris about adding CodeCov to the CI. I have added it to a number of boost libraries recently, and it has helped us to find, and fix dark corners of the codebase.
I'm all for this -- patches welcome. I don't understand how Drone works at all, or even where the builds happen, or I would have tried setting this up myself. [snip] Thanks for the review, Matt! And the PR. Zach
I would recommend putting something like the above in the docs, where the motivation for using rule parsers is listed. Currently it states "fix the attribute type produced by a parser to something other than the default" but what you are saying above is more: "prevent the parsers from doing things that they would otherwise do".
Ha, came online to say exactly this. I think that description of the problem being solved is more than satisfactory and illuminates the issue quite well.
I note that parsing a homogenous sequence to a std::set does work, at least without rules: https://godbolt.org/z/x57j1dq5T
I was surprised to find that, in the context of other precautions, feeding parsed data to a collection is performed via `insert` rather than `push_back`.
This is probably a holdover from the X3 days, as this is how Spirit treats collections generically and, ironically, is done for the exact reason that `std::set` supports `insert(pos, val)`.
In principle, I don't think that a free-form grammar like the above parsing into a `set` is problematic. Instead, I agree more with Zach that when you're formally defining rules, you're really telling the library, "Hey, this is exactly what I want to be parsing into." - Christian
On Mon, Feb 26, 2024 at 5:05 PM Christian Mazakas via Boost wrote:
I would recommend putting something like the above in the docs, where the motivation for using rule parsers is listed. Currently it states "fix the attribute type produced by a parser to something other than the default" but what you are saying above is more: "prevent the parsers from doing things that they would otherwise do".
Ha, came online to say exactly this. I think that description of the problem being solved is more than satisfactory and illuminates the issue quite well.
I note that parsing a homogenous sequence to a std::set does work, at least without rules: https://godbolt.org/z/x57j1dq5T
I was surprised to find that, in the context of other precautions, feeding parsed data to a collection is performed via `insert` rather than `push_back`.
This is probably a holdover from the X3 days, as this is how Spirit treats collections generically and, ironically, is done for the exact reason that `std::set` supports `insert(pos, val)`.
In principle, I don't think that a free-form grammar like the above parsing into a `set` is problematic. Instead, I agree more with Zach that when you're formally defining rules, you're really telling the library, "Hey, this is exactly what I want to be parsing into."
I would be fine with this approach, except that I found the behavior in Boost.Parser is more complicated than that. I had a parser in Boost.Spirit V2 that does not use a single rule: https://github.com/akrzemi1/wdungeon/blob/master/parser.cpp#L47

This no longer works in Boost.Parser, because it imposes stricter rules on attribute conversions *even when I do not use rules*. So, I cannot build a simple conceptual model here.

Regards,
&rzej;
- Christian
Boost.Parser Review
I'd like to thank Zach for his work. Here's my review.
Does this library bring real benefit to C++ developers for real-world use-case?
Yes. Boost.Spirit is already a great library that exemplifies what's possible in C++. But it's pre-C++11 and no longer in active development. So even if this were only a modern and actively maintained version of Spirit, that's already good enough considering how important the problem is. It seems like many people have issues with Spirit. For instance, we avoided using it in Boost.URL and know of other people who have also been avoiding it in their projects.

Some things I like about Boost.Parser:
- Boost.Parser uses the same conventions as Spirit to combine parsers.
- Being able to use C++17 vocabulary types in the API
- Unicode support
- Better error reporting
- The Boost.Spirit authors gave it their blessing
- The library name is very direct

The issue with Spirit people seem to mention most frequently for avoiding it is compile times. Unfortunately, there's not a lot that can be done by Boost.Parser here, but users can still mitigate that by only including the library in their source files if they want.
Do you have an application for this library?
We have lots of libraries where we need to parse things. I would consider Boost.Parse for new libraries once it's more stable. Even if we intend to use custom parsers in the end, Boost.Parser could provide a good starting point.
Does the API match with current best practices?
Using the same conventions as Spirit to combine parsers seems like a good
design here.
I was a little confused about the convention to assume unicode for anything
other than char.
I'm not sure that's the convention that minimizes surprise.
I would expect the same convention (unicode or not) to apply to any data
type because:
- It's one more thing to remember.
- I often don't want to interpret things as code points regardless of
whether it's char or unsigned char.
- I noticed many people also don't care about unicode most of the time in
their use cases.
- It also seems to be the case that the relevant keywords in most grammars
are ASCII.
- The documentation says "the Unicode parsing API will not actually cost
you anything" but two sentences later it says "one caveat is that there may
be an extra branch on each char, if the input is UTF-8"
At least, the library documentation could make it very clear right at the
start how the user can explicitly specify if they want to enable/disable
unicode support for whatever data type they wish.
I find the macros to define rules a little unsettling.
- Isn't it possible to at least document what things would look like
without a macro so that readers can have a good idea of what the macro is
doing whenever they use it?
- A single macro to define the rules could also be useful.
- It would also be useful if the cases where macros are not as useful
(non-recursive rules?) could be specified in the documentation.
About Hana:
- I don't like optional dependencies in general, so I also don't like the
idea of supporting std::tuple and hana tuples separately.
- Since the library is boost/standalone and tuple/hana, then that's already
four combinations that are unlikely to be tested for all relevant compilers
and environments.
- That's more possibilities of conflicts for users compiling different
versions of the library.
- I especially don't like the idea of using another type for tuples when we
already have a vocabulary type from the standard.
- I feel like whatever Hana provides can be achieved with std::tuple and
maybe some other library like mp11 in extreme cases.
- It's probably possible to support Hana as a peer-dependency where the
rule would fill any tuple-like object according to some concept (the rules
can already do that for containers, variants and optionals if I'm not
wrong).
- Including all of <boost/hana.hpp> can increase compile times
significantly. Maybe I'm missing something and Hana enables something
great but I don't see it. If it were that great and important, it
wouldn't be an _optional_ dependency.
Maybe the parsers for numbers could use Boost.Charconv under the hood. I
don't know if that's feasible for all cases.
In our applications, we often have charsets defined by functions that
return whether a char is reserved in a grammar. I couldn't find a good
way to express that with Boost.Parser (although operator|() combining
charsets are often enough). If that's correct, it would be good to have
some alternatives here. It would also be good to have rules for very
common charsets if that doesn't already exist.
Is the documentation helpful and clear?
Yes. The documentation is clear. It's driven by goals and examples, which is great. Knowing Boost.Spirit helped me, so it might be useful to give more weight to feedback from people who don't know Boost.Spirit yet.

I found the section on "Writing Your Own Parsers" very limited considering that we know "some people are obsessed with writing everything for themselves". Very often, we have ad hoc rules that are hard to express as a combination of other rules. At other times, as a library evolves, custom rules can make things less generic and more efficient even if they could be expressed with the more generic rules. For instance, many examples in the documentation would be more easily represented as functions in my opinion. So I find this insufficient: "It is a basic orientation to get you familiar enough with all the moving parts of writing a parser that you can then learn by reading the Boost.Parser code" because the documentation should be exposing all the concepts and documenting everything. Telling such a user to read the code to reverse engineer how it should be done is no good. Especially because the number of concepts and requirements involved is probably not much larger than what's already documented.

Are there small examples of binary parsers somewhere?

The reference is just kind of ugly and it's difficult to find individual symbols.

The difference between this library and Spirit versions should be more explicit and justified in the documentation. Some pros and cons only apply to certain versions of Spirit, so that should also be explicit instead of just mentioning Spirit 2 _or_ Spirit X3. I think some other people were also confused about that.

The section "The Parse Context" is a little confusing. Unlike other sections, it's not goal-oriented with lots of examples.

The section "Alternative Parsers" is a little weird. I don't see a reason to discuss operator|() individually in a single section and not discuss other operators in the previous or next sections. I would expect a section listing all operators (as exists later) and operator|() to just be part of it.

I find the TOC overwhelming (reminds me of the Boost.Asio documentation a little). There are too many sections under "Tutorial". Almost the whole thing is under "Tutorial", which makes categorization somewhat pointless. It would be good to use Miller's Law at each level here so that the reader can always know where he is. This whole "Tutorial" section doesn't even look much like a tutorial.

One thing I'm also missing is some cheat sheets with all the symbols provided by the library related to what a given section is discussing.
Did you try to use it? What problems or surprises did you encounter?
Yes. It worked fine with MSVC over here but CI needs to improve because people have reported failures with other compiler versions. It's not a reason to reject the library but it's something that needs to be fixed and causes a bad first impression. CI should also include coverage, more warnings, sanitizers, and probably fuzz tests. I believe Sam can also help with Drone if that's needed.
What is your evaluation of the implementation?
The code is very well organized and properly documented. I found some issues in the CMake scripts but I'll just open issues in the repository for these.
Are you knowledgeable about the problem domain?
We have been using custom parsers in almost all of our libraries. I've also used Boost.Spirit before. In most cases, we just write custom logic for the parsers now. But something like Boost.Parser could save us a lot of time in new libraries even if we intend to use custom parsers later.
Please explicitly state that you either *accept* or *reject* the inclusion of this library into boost. Also please indicate the time & effort spent on the evaluation and give the reasons for your decision.
I recommend accepting the library. I spent time on and off over the last weeks and this week. In total, I spent something like 2 or 3 days evaluating the library.
Disclaimer
I'm affiliated with the C++ Alliance. -- Alan Freitas
On Wed, Feb 28, 2024 at 9:18 PM Alan de Freitas via Boost wrote:
Boost.Parser Review
[snip]
Does the API match with current best practices?
Using the same conventions as Spirit to combine parsers seems like a good design here.
I was a little confused about the convention to assume unicode for anything other than char. I'm not sure that's the convention that minimizes surprise.
The rationale is this: All the other character types in C++ -- every one -- advertises that it is a certain Unicode encoding (char{8,16,32}_t) or is specifically designed to hold Unicode code points (wchar_t). char is the only exception to this.
I would expect the same convention (unicode or not) to apply to any data type because:
- It's one more thing to remember.
- I often don't want to interpret things as code points regardless of whether it's char or unsigned char.
- I noticed many people also don't care about unicode most of the time in their use cases.
- It also seems to be the case that the relevant keywords in most grammars are ASCII.
- The documentation says "the Unicode parsing API will not actually cost you anything" but two sentences later it says "one caveat is that there may be an extra branch on each char, if the input is UTF-8"
At least, the library documentation could make it very clear right at the start how the user can explicitly specify if they want to enable/disable unicode support for whatever data type they wish.
For the performance part, I would be very surprised to see a measurable difference between chars and chars + as_utf32. The reason is that parsing is *super* branchy, and the branch on each char here is "if (c < 128)". The branch predictor will guess right after seeing that a few times. As for turning off Unicode for charN_t and wchar_t, do you have a compelling use case for that? It seems very unlikely to me.
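To make the opt-in concrete, adapting char input for Unicode-aware parsing looks roughly like this (using the as_utfN adaptors described in the docs; details illustrative):

namespace bp = boost::parser;

std::string utf8_input = "caf\xc3\xa9";  // "café" encoded as UTF-8

// Plain char input: matched char-by-char, no transcoding.
auto bytes = bp::parse(utf8_input, +bp::char_);

// Adapted input: the chars are decoded and matched as UTF-32 code points.
auto code_points = bp::parse(utf8_input | bp::as_utf32, +bp::char_);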
I find the macros to define rules a little unsettling.
- Isn't it possible to at least document what things would look like without a macro so that readers can have a good idea of what the macro is doing whenever they use it?
I've already added this to the mid-review branch, but each rule you
give to the macro expands to this:
#define BOOST_PARSER_DEFINE_IMPL(_, rule_name_) \
template< \
bool UseCallbacks, \
typename Iter, \
typename Sentinel, \
typename Context, \
typename SkipParser> \
auto parse_rule( \
decltype(rule_name_)::parser_type::tag_type *, \
std::bool_constant<UseCallbacks> use_cbs, \
Iter & first, \
Sentinel last, \
Context const & context, \
SkipParser const & skip, \
boost::parser::detail::flags flags, \
bool & success) \
{ \
auto const & parser = BOOST_PARSER_PP_CAT(rule_name_, _def); \
return parser(use_cbs, first, last, context, skip, flags, success); \
} \
\
template< \
bool UseCallbacks, \
typename Iter, \
typename Sentinel, \
typename Context, \
typename SkipParser, \
typename Attribute> \
void parse_rule( \
decltype(rule_name_)::parser_type::tag_type *, \
std::bool_constant<UseCallbacks> use_cbs, \
Iter & first, \
Sentinel last, \
Context const & context, \
SkipParser const & skip, \
boost::parser::detail::flags flags, \
bool & success, \
Attribute & retval) \
{ \
auto const & parser = BOOST_PARSER_PP_CAT(rule_name_, _def); \
using attr_t = decltype(parser( \
use_cbs, first, last, context, skip, flags, success)); \
if constexpr (boost::parser::detail::is_nope_v
- A single macro to define the rules could also be useful.
- It would also be useful if the cases where macros are not as useful (non-recursive rules?) could be specified in the documentation.
About Hana:
- I don't like optional dependencies in general, so I also don't like the idea of supporting std::tuple and hana tuples separately.
- Since the library is boost/standalone and tuple/hana, then that's already four combinations that are unlikely to be tested for all relevant compilers and environments.
- That's more possibilities of conflicts for users compiling different versions of the library.
- I especially don't like the idea of using another type for tuples when we already have a vocabulary type from the standard.
- I feel like whatever Hana provides can be achieved with std::tuple and maybe some other library like mp11 in extreme cases.
- It's probably possible to support Hana as a peer-dependency where the rule would fill any tuple-like object according to some concept (the rules can already do that for containers, variants and optionals if I'm not wrong).
- Including all of <boost/hana.hpp> can increase compile times significantly. Maybe I'm missing something and Hana enables something great but I don't see it. If it were that great and important, it wouldn't be an _optional_ dependency.
hana::tuple has one great feature that std::tuple does not: operator[]. It is *so* much nicer to use that than std::get<>(), that it makes it worth all the trouble. I have been using Parser to write parsers for a few years now, and I'm telling you, don't sleep on op[].
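For anyone who has not used the two side by side, the difference reads like this (minimal sketch):

#include <boost/hana.hpp>
#include <tuple>

namespace hana = boost::hana;
using namespace hana::literals;

auto ht = hana::make_tuple(42, 2.0);
auto st = std::make_tuple(42, 2.0);

auto a = ht[0_c];          // hana::tuple: index with operator[]
auto b = std::get<0>(st);  // std::tuple: spell out std::get<>()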
Maybe the parsers for numbers could use Boost.Charconv under the hood. I don't know if that's feasible for all cases.
It is feasible. It's already done on my branch.
In our applications, we often have charsets defined by functions that return whether a char is reserved in a grammar. I couldn't find a good way to express that with Boost.Parser (although operator|() combining charsets are often enough). If that's correct, it would be good to have some alternatives here. It would also be good to have rules for very common charsets if that doesn't already exist.
I don't really know what you mean by "reserved" here, but the parsers are all pretty compact, and are all copyable and assignable. If you want to return one from a function, you can do so. You just need to make sure that either: 1) you construct it with one or two characters, like char_('a')/char_('a', 'g'), or 2) if you construct with a range, like char_("abcdef"), that the range you use outlives the parser.
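A minimal sketch of that advice, with a hypothetical "reserved" set built from individual characters so that nothing dangles when the parser is returned by value:
```cpp
#include <boost/parser/parser.hpp>

namespace bp = boost::parser;

// Hypothetical reserved set for some grammar; each char_ is constructed
// from a single character, so the returned parser owns everything it needs.
auto reserved()
{
    return bp::char_(':') | bp::char_('/') | bp::char_('?') | bp::char_('#');
}

// Usage sketch: one or more non-reserved characters.
// auto token = +(bp::char_ - reserved());
```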
Is the documentation helpful and clear?
Yes. The documentation is clear. It's driven by goals and examples, which is great. Knowing Boost.Spirit helped me, so it might be useful to give more weight to feedback from people who don't know Boost.Spirit yet.
I found the section on "Writing Your Own Parsers" very limited, considering that we know "some people are obsessed with writing everything for themselves". Very often, we have ad hoc rules that are hard to express as a combination of other rules. At other times, as a library evolves, custom rules can make things less generic and more efficient, even if they could be expressed with the more generic rules. For instance, many examples in the documentation would be more easily represented as functions, in my opinion. So I find this insufficient: "It is a basic orientation to get you familiar enough with all the moving parts of writing a parser that you can then learn by reading the Boost.Parser code", because the documentation should be exposing all the concepts and documenting everything. Telling such a user to read the code to reverse-engineer how it should be done is no good, especially because the number of concepts and requirements involved is probably not much larger than what's already documented.
I don't really expect too many users to do this, and it was never a design goal to make this easy to do. I sort of begrudgingly added that section at all because Andrzej was hounding me about it. :) If I were to do it properly, as a first-order design element, I'd probably redesign it a bit. Right now, I don't intend to. Maybe this would be future work, if people start writing a lot of them.
Are there small examples of binary parsers somewhere?
No, there are no binary parsers.
The section "The Parse Context" is a little confusing. Unlike other sections, it's not goal-oriented with lots of examples.
This is an excellent point; I'll revise that section. Ticket: https://github.com/tzlaine/parser/issues/145
One thing I'm also missing is some cheat sheets with all the symbols provided by the library related to what a given section is discussing.
This is also available on my branch.
Did you try to use it? What problems or surprises did you encounter?
Yes. It worked fine with MSVC over here but CI needs to improve because people have reported failures with other compiler versions. It's not a reason to reject the library but it's something that needs to be fixed and causes a bad first impression. CI should also include coverage, more warnings, sanitizers, and probably fuzz tests.
I intend to add coverage, do not intend to turn on warnings past -Wall (which I always use), and always build with ASan enabled. I'm intrigued by the idea of fuzzing, but I haven't figured out what to fuzz exactly. I guess I could fuzz input to the parse() overloads, but I'd really like to also have fuzzed combinations of parsers. I'm all ears if anyone can figure out how to do that. In the earliest versions of the library, I actually randomly generated parsers to catch problems.
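For the first idea (fuzzing input to the parse() overloads), a minimal libFuzzer harness sketch might look like this; the parser chosen here is arbitrary:
```cpp
#include <boost/parser/parser.hpp>
#include <cstddef>
#include <cstdint>
#include <string_view>

namespace bp = boost::parser;

extern "C" int LLVMFuzzerTestOneInput(uint8_t const * data, size_t size)
{
    std::string_view input(reinterpret_cast<char const *>(data), size);
    // We only care about crashes and sanitizer reports, not parse success.
    auto result = bp::parse(input, bp::double_ % ',', bp::ws);
    (void)result;
    return 0;
}
```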
I believe Sam can also help with Drone if that's needed.
I'm very interested in covering more platforms. I'll reach out to Sam about setting up Drone. Thanks for your review! Zach
As for turning off Unicode for charN_t and wchar_t, do you have a compelling use case for that? It seems very unlikely to me.
As I wrote earlier, Windows path names or the input to Win32 edit boxes are 16 bit but may be invalid Unicode. Arno
On Wed, Feb 28, 2024 at 11:26 PM Arno Schoedl
As for turning off Unicode for charN_t and wchar_t, do you have a compelling use case for that? It seems very unlikely to me.
As I wrote earlier, Windows path names or the input to Win32 edit boxes are 16 bit but may be invalid Unicode.
Ah, thanks. My understanding from talking to the MSVC library people is that this is no longer a problem. That is, if you see this problem now, it's because you're pulling filenames off of a file system from an old version of Windows. There hasn't been a Windows OS in the last few years that suffers from this. Zach
is that this is no longer a problem. That is, if you see this problem now, it's because you're pulling filenames off of a file system from an old version of Windows. There hasn't been a Windows OS in the last few years that suffers from this.
Linux seems to suffer from it as well: https://unix.stackexchange.com/questions/667652/can-a-file-path-be-invalid-u...
And what's the benefit of insisting on being able to convert everything to UTF-32? It takes time, makes the parser less flexible, most matching is ASCII, and if you want to match Unicode, you have the extra step of normalization anyway. But I guess I won't convince you... Arno
The rationale is this: All the other character types in C++ -- every one -- advertises that it is a certain Unicode encoding (char{8,16,32}_t) or is specifically designed to hold Unicode code points (wchar_t). char is the only exception to this.
What about unsigned char?
As for turning off Unicode for charN_t and wchar_t, do you have a compelling use case for that? It seems very unlikely to me.
From what I saw in other reviews, I think it's more about not needing it (and not wanting to pay the price) than about needing it not to exist.
hana::tuple has one great feature that std::tuple does not: operator[]. It is *so* much nicer to use that than std::get<>(), that it makes it worth all the trouble. I have been using Parser to write parsers for a few years now, and I'm telling you, don't sleep on op[].
Is there a case where this makes it much more convenient than std::get? Is there a reason why it being a peer dependency wouldn't work? (Just letting it generically populate a tuple type, like it does for other types.) Also the other rationale I included.
Telling such a usual user to read the code to reverse engineer how it should be done is no good. Especially because the number of concepts and requirements involved is probably not much larger than what's already documented.
I don't really expect too many users to do this, and it was never a design goal to make this easy to do. I sort of begrudgingly added that section at all because Andrzej was hounding me about it.
I definitely would do this if using the library. I don't know if it's the same rationale as Andrzej's, but I would avoid using the library if I knew I would be stuck if I reached a corner where I couldn't find a way to represent a rule by composing existing rules. Also the other rationale I included.
if people start writing a lot of them.
Yes. I would start writing them from day 0. So you can count on me.
On 2/28/24 11:37 PM, Zach Laine via Boost wrote:
On Wed, Feb 28, 2024 at 9:18 PM Alan de Freitas via Boost
wrote: Boost.Parser Review [snip]
Does the API match with current best practices? Using the same conventions as Spirit to combine parsers seems like a good design here.
I was a little confused about the convention to assume Unicode for anything other than char. I'm not sure that's the convention that minimizes surprise.
The rationale is this: All the other character types in C++ -- every one -- advertises that it is a certain Unicode encoding (char{8,16,32}_t) or is specifically designed to hold Unicode code points (wchar_t). char is the only exception to this.
Now, Zach, as a long time SG16 participant and contributor, you know full well that wchar_t doesn't imply Unicode :) (On AIX, wchar_t is associated with IBM-eucTW for the zh_TW locale: https://www.ibm.com/docs/vi/aix/7.3?topic=representation-wide-character-data. On z/OS, wchar_t is associated with SBCS EBCDIC, DBCS EBCDIC, Unicode, or a locale-dependent encoding: https://www.ibm.com/docs/en/zos/3.1.0?topic=options-convlit-noconvlit.) Tom.
On Thu, Feb 29, 2024 at 10:17 AM Tom Honermann via Boost
Now, Zach, as a long time SG16 participant and contributor, you know full well that wchar_t doesn't imply Unicode :)
I despise wchar_t and avoid it completely in public interfaces. I'll drop down to using it inside an implementation, but my functions which accept Unicode strings use char, and I perform the conversion in the implementation even when doing so has a performance cost (such as on Windows which wants a "wide string"). I try to offer only utf-8 as the choice of encoding for Unicode (again even on Windows). It's a thankless job but someone has to protect the world from the ugliness of wide characters. Thanks
On 2/29/24 1:25 PM, Vinnie Falco wrote:
On Thu, Feb 29, 2024 at 10:17 AM Tom Honermann via Boost
wrote:
Now, Zach, as a long time SG16 participant and contributor, you know full well that wchar_t doesn't imply Unicode :)
I despise wchar_t and avoid it completely in public interfaces. I'll drop down to using it inside an implementation, but my functions which accept Unicode strings use char, and I perform the conversion in the implementation even when doing so has a performance cost (such as on Windows, which wants a "wide string"). I try to offer only utf-8 as the choice of encoding for Unicode (again, even on Windows).
This approach is not wrong ;)
It's a thankless job but someone has to protect the world from the ugliness of wide characters.
It is too late to save ourselves, but the choices we make today may save our great great great grandchildren! :) Tom.
I'm not entirely sold on the value of Hana's `operator[]` for tuple accesses. Especially because in later C++, we have placeholders, so we can always just do:
auto const& [first,_,_] = my_tuple;
https://godbolt.org/z/vvnKxjGos
I largely agree with Alan; Hana's a heavyweight dependency for what I think is dubious benefit. Maybe others feel differently here, but I'd prefer to just always work with the standard tuple and structured bindings than Hana's `operator[]`. - Christian
It is up to you. I think Zach would appreciate knowing if there was a significant performance issue with the library.
Fwiw, there is a dramatic performance penalty to using this library, and I almost couldn't believe it. I copy-pasted the JSON example from the Parser repo and then did an informal comparison to Boost.JSON. For one, Boost.JSON should just export a proper CMake target like all the other compiled Boost libraries. And then two, these were my findings:
❯ ./__build__/test/json_bench /home/exbigboss/cpp/boost-root/libs/json/bench/data/twitter.json
Boost.Parser took: 1037ms
parse successful! 1
Boost.JSON took: 1ms
parse successful! 1
Source: https://github.com/cmazakas/parser-review/blob/main/test/json.cpp
I'm running all of this on a 3.6 GHz 12 core with 32 GB of RAM. I'm honestly shocked that it takes Boost.JSON roughly 1ms whereas it takes roughly 1000x that with Parser. I will note this: I did have to increase the recursion limit, so I'm thinking this is actually more or less a fault in the implementation of the recursive parsing rules in Parser. Either way, this is 112% worth investigating and I'd almost recommend halting acceptance until we sort this out. - Christian
On Thu, Feb 29, 2024 at 3:14 PM Christian Mazakas via Boost
I copy-pasted the JSON example from the Parser repo and then did an informal comparison to Boost.JSON.
As I alluded to before the review started, that is grossly unfair. Boost.Parser is not optimized for JSON, and your Boost.Parser benchmark is also not optimized. I know this because optimizing the benchmark would take weeks and that amount of time hasn't yet elapsed. JSON uses custom memory allocators. I did not look too closely at Parser, but I am not sure if it has the API needed to customize how allocations are performed (I see Parser's implementation calls some "push_back" function in hot code paths). Zach could weigh in on this. For a fair comparison, you would need this: 1. Boost.Parser offers sufficient customization so that an allocator optimized for JSON can be used 2. An optimized container for storing JSON is used (such as boost::json::value) 3. The benchmark for measuring Boost.Parser's JSON parsing performance is optimized
For one, Boost.JSON should just export a proper CMake target like all the other compiled Boost libraries.
If that does not negatively impact certain workflows (well, mine) then the maintainer should do this. Thanks
On Thu, Feb 29, 2024 at 3:39 PM Vinnie Falco
As I alluded to before the review started, that is grossly unfair...
As usual, the pressing of SEND causes new valuable insights to manifest. What is important is not Boost.Parser's performance implementing a stock JSON parser, but rather that it is POSSIBLE for Boost.Parser to achieve comparable performance given enough work. Thus the question is, does Boost.Parser offer a sufficiently rich public API to allow for controlling how things are allocated and which containers are used, so that the optimizations in Boost.JSON can be replicated. Thanks
On Thu, Feb 29, 2024 at 5:42 PM Vinnie Falco via Boost
On Thu, Feb 29, 2024 at 3:39 PM Vinnie Falco
wrote: As I alluded to before the review started, that is grossly unfair...
As usual, the pressing of SEND causes new valuable insights to manifest.
What is important is not Boost.Parser's performance implementing a stock JSON parser, but rather that it is POSSIBLE for Boost.Parser to achieve comparable performance given enough work. Thus the question is, does Boost.Parser offer a sufficiently rich public API to allow for controlling how things are allocated and which containers are used, so that the optimizations in Boost.JSON can be replicated.
Agreed, and I think it does. There is no allocator support per se, but you can parse into your own types, which may use your custom allocator(s). Parser rarely allocates, and when it does, it does so: 1) in a parse-exception path (failed expectation point); 2) in the optional trace functionality; and 3) when you tell it to, via the attributes you use (explicitly or implicitly). Zach
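A sketch of what that looks like in practice, assuming the attribute out-parameter overload of parse(); whether std::pmr::vector satisfies the container requirements for this attribute type is an assumption to verify:
```cpp
#include <boost/parser/parser.hpp>
#include <memory_resource>
#include <string_view>

namespace bp = boost::parser;

// The caller supplies the container, and with it the allocation strategy.
bool parse_ints(std::string_view input, std::pmr::vector<int> & out)
{
    return bp::parse(input, bp::int_ % ',', bp::ws, out);
}

// std::pmr::monotonic_buffer_resource arena;
// std::pmr::vector<int> v{&arena};
// parse_ints("1, 2, 3", v);
```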
On Thu, Feb 29, 2024 at 4:54 PM Zach Laine via Boost
...
I'm not sure if it is possible, but a good comparison could be achieved by writing a Boost.Parser front end to parse JSON and having it call the corresponding member functions of boost::json::detail::handler on an instance of that object here: https://github.com/boostorg/json/blob/8f5b63510b07caa96f413777a412d930781fac... This will allow an apples-to-apples comparison, as it removes the question of optimizing the destination container. To retrieve the result after parsing, call handler->st.release(). Thanks
On Thu, Feb 29, 2024 at 5:40 PM Vinnie Falco via Boost
On Thu, Feb 29, 2024 at 3:14 PM Christian Mazakas via Boost
wrote: I copy-pasted the JSON example from the Parser repo and then did an informal comparison to Boost.JSON.
As I alluded to before the review started, that is grossly unfair.
I definitely think it's unfair (not apples-to-apples), but 1000x? Something's not right. So, it's good it's come up even if it's not an apples/apples comparison.
Boost.Parser is not optimized for JSON, and your Boost.Parser benchmark is also not optimized. I know this because optimizing the benchmark would take weeks and that amount of time hasn't yet elapsed. JSON uses custom memory allocators. I did not look too closely at Parser, but I am not sure if it has the API needed to customize how allocations are performed (I see Parser's implementation calls some "push_back" function in hot code paths). Zach could weigh in on this.
For a fair comparison, you would need this:
1. Boost.Parser offers sufficient customization so that an allocator optimized for JSON can be used 2. An optimized container for storing JSON is used (such as boost::json::value) 3. The benchmark for measuring Boost.Parser's JSON parsing performance is optimized
For one, Boost.JSON should just export a proper CMake target like all the other compiled Boost libraries.
If that does not negatively impact certain workflows (well, mine) then the maintainer should do this.
I think the allocator might have something to do with it, but it might also be that Boost.JSON is just a more robust/efficient implementation than the toy I made for that example. Boost.JSON doesn't even allocate a lot of the time, IIRC. The JSON example uses a hand-rolled, dog-simple, allocate-all-the-time implementation for the JSON value type itself. Old-fashioned type erasure, no SBO, no COW, no nothin'. That's why I'll be looking at the other example for comparison. The other one is a callback parser, and never actually makes any values. Testing *that* is much closer to testing the parsing speed. Zach
On Thu, Feb 29, 2024 at 5:14 PM Christian Mazakas via Boost
It is up to you. I think Zach would appreciate knowing if there was a significant performance issue with the library.
Fwiw, there is a dramatic performance penalty to using this library and I almost couldn't believe it.
I copy-pasted the JSON example from the Parser repo and then did an informal comparison to Boost.JSON.
For one, Boost.JSON should just export a proper CMake target like all the other compiled Boost libraries.
And then two, these were my findings:
❯ ./__build__/test/json_bench /home/exbigboss/cpp/boost-root/libs/json/bench/data/twitter.json
Boost.Parser took: 1037ms
parse successful! 1
Boost.JSON took: 1ms
parse successful! 1
Source: https://github.com/cmazakas/parser-review/blob/main/test/json.cpp
I'm running all of this on a 3.6 GHz 12 core with 32 GB of RAM. I'm honestly shocked that it takes Boost.JSON roughly 1ms whereas it takes roughly 1000x that with Parser.
I will note this, I did have to increase the recursion limit so I'm thinking this is actually more or less a fault in the implementation of the recursive parsing rules in Parser. Either way, this is 112% worth investigating and I'd almost recommend halting acceptance until we sort this out.
I'm shocked too. That's really crazy. Believe me, I'm not interested in merging something with orders of magnitude of perf overhead into Boost. I will be getting to the bottom of this long before a merge could take place. Zach
Zach Laine wrote:
I'm shocked too. That's really crazy. Believe me, I'm not interested in merging something with orders of magnitude of perf overhead into Boost. I will be getting to the bottom of this long before a merge could take place.
My profiler (VS2022) says that the top performance problem is the construction of a stringstream here:
https://github.com/tzlaine/parser/blob/f99ae3b94ad0acef0cc92166d5108aade41da...
This constructs a std::locale, which is apparently very slow.
When I fix it
```
diff --git a/include/boost/parser/detail/printing.hpp b/include/boost/parser/detail/printing.hpp
index 1e204796..6cbec059 100644
--- a/include/boost/parser/detail/printing.hpp
+++ b/include/boost/parser/detail/printing.hpp
@@ -621,11 +621,19 @@ namespace boost { namespace parser { namespace detail {
         flags f,
         Attribute const & attr)
     {
-        std::stringstream oss;
         if (detail::do_trace(f))
+        {
+            std::stringstream oss;
             detail::print_parser(context, parser, oss);
-        return scoped_trace_t(
-            first, last, context, f, attr, oss.str());
+
+            return scoped_trace_t(
+                first, last, context, f, attr, oss.str());
+        }
+        else
+        {
+            return scoped_trace_t(
+                first, last, context, f, attr, {});
+        }
     }
 
     template
```
the top function becomes the constructor of scoped_trace_t. (It's probably not getting inlined.)
On Thu, Feb 29, 2024 at 8:44 PM Peter Dimov via Boost
Zach Laine wrote:
I'm shocked too. That's really crazy. Believe me, I'm not interested in merging something with orders of magnitude of perf overhead into Boost. I will be getting to the bottom of this long before a merge could take place.
My profiler (VS2022) says that the top performance problem is the construction of a stringstream here
https://github.com/tzlaine/parser/blob/f99ae3b94ad0acef0cc92166d5108aade41da...
This constructs a std::locale, which is apparently very slow.
When I fix it
[snip]
the top function becomes the constructor of scoped_trace_t. (It's probably not getting inlined.)
Thanks. I fixed it as well, and now I get about 1.5-2X worse performance than Boost.JSON. I'm just building the "pretty" example from Boost.JSON, and "json" and "callback_json" from my examples. When I feed these the file at https://world.openfoodfacts.org/api/v0/product/5060292302201.json , and prepend "time", I get these results:
pretty (Boost.JSON): real 0m0.014s user 0m0.004s sys 0m0.000s
json (Parser): real 0m0.022s user 0m0.020s sys 0m0.000s
callback_json (Parser, no JSON object creation): real 0m0.023s user 0m0.018s sys 0m0.004s
I'm not sure what the tests reported earlier were doing, but I definitely don't see orders of magnitude difference.
Looks like the tracing functionality costs a lot even when off.
If I can verify this (my fix might be a little different), I'll turn the runtime flag into a template parameter. That should do it. Zach
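The idea, in generic form (names illustrative, not the library's actual ones): hoisting a runtime flag into a template parameter lets the "off" path compile away entirely.
```cpp
template<bool Trace>
void parse_step(/* ... */)
{
    if constexpr (Trace) {
        // Build the trace string, construct scoped_trace_t, etc.
        // None of this code exists in the Trace == false instantiation.
    }
    // ... the actual parsing work ...
}
```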
On Thu, Feb 29, 2024 at 9:07 PM Zach Laine
On Thu, Feb 29, 2024 at 8:44 PM Peter Dimov via Boost
wrote: Zach Laine wrote:
I'm shocked too. That's really crazy. Believe me, I'm not interested in merging something with orders of magnitude of perf overhead into Boost. I will be getting to the bottom of this long before a merge could take place.
My profiler (VS2022) says that the top performance problem is the construction of a stringstream here
https://github.com/tzlaine/parser/blob/f99ae3b94ad0acef0cc92166d5108aade41da...
This constructs a std::locale, which is apparently very slow.
When I fix it
[snip]
the top function becomes the constructor of scoped_trace_t. (It's probably not getting inlined.)
Thanks. I fixed it as well, and now I get about 1.5-2X worse performance than Boost.JSON. I'm just building the "pretty" example from Boost.JSON, and "json" and "callback_json" from my examples. When I feed these the file at https://world.openfoodfacts.org/api/v0/product/5060292302201.json , and prepend "time", I get these results:
pretty (Boost.JSON): real 0m0.014s user 0m0.004s sys 0m0.000s
json (Parser): real 0m0.022s user 0m0.020s sys 0m0.000s
callback_json (Parser, no JSON object creation): real 0m0.023s user 0m0.018s sys 0m0.004s
I'm not sure what the tests reported earlier were doing, but I definitely don't see orders of magnitude difference.
Looks like the tracing functionality costs a lot even when off.
If I can verify this (my fix might be a little different), I'll turn the runtime flag into a template parameter. That should do it.
Should have mentioned: https://github.com/tzlaine/parser/issues/152 Zach
Zach Laine wrote:
Thanks. I fixed it as well, and now I get about 1.5-2X worse performance than Boost.JSON. I'm just building the "pretty" example from Boost.JSON, and "json" and "callback_json" from my examples. When I feed these the file at https://world.openfoodfacts.org/api/v0/product/5060292302201.json , and prepend "time", I get these results:
pretty (Boost.JSON): real 0m0.014s user 0m0.004s sys 0m0.000s
json (Parser): real 0m0.022s user 0m0.020s sys 0m0.000s
callback_json (Parser, no JSON object creation): real 0m0.023s user 0m0.018s sys 0m0.004s
I'm not sure what the tests reported earlier were doing, but I definitely don't see orders of magnitude difference.
I used this file: https://github.com/boostorg/json/blob/develop/bench/data/twitter.json
On Thu, Feb 29, 2024 at 9:26 PM Peter Dimov
Zach Laine wrote:
Thanks. I fixed it as well, and now I get about 1.5-2X worse performance than Boost.JSON. I'm just building the "pretty" example from Boost.JSON, and "json" and "callback_json" from my examples. When I feed these the file at https://world.openfoodfacts.org/api/v0/product/5060292302201.json , and prepend "time", I get these results:
pretty (Boost.JSON): real 0m0.014s user 0m0.004s sys 0m0.000s
json (Parser): real 0m0.022s user 0m0.020s sys 0m0.000s
callback_json (Parser, no JSON object creation): real 0m0.023s user 0m0.018s sys 0m0.004s
I'm not sure what the tests reported earlier were doing, but I definitely don't see orders of magnitude difference.
I used this file:
https://github.com/boostorg/json/blob/develop/bench/data/twitter.json
Using that file, I get similar results to what I already posted (the numbers are different, but the ratios between them are the same). I managed to make a change that was much smaller -- no template parameter required. I consistently see a 1.5x slowdown for Parser vs. Boost.JSON for files around this size, and 2x or so for much larger files (~25MB). I went ahead and committed the template-parameter approach on a branch, https://github.com/tzlaine/parser/tree/tmpl_param_for_scoped_trace . It made no difference vs. the mid-review branch I've been working on, or at least not for GCC+Linux. Zach
Zach Laine wrote:
I'm not sure what the tests reported earlier were doing, but I definitely don't see orders of magnitude difference.
I used this file:
https://github.com/boostorg/json/blob/develop/bench/data/twitter.json
Using that file, I get similar results to what I already posted (the numbers are different, but the ratios between them are the same).
I managed to make a change that was much smaller -- no template parameter required. I consistently see a 1.5x slowdown for Parser vs. Boost.JSON for files around this size, and 2x or so for much larger files (~25MB).
That's probably because you are measuring I/O in addition to parsing. https://github.com/cmazakas/parser-review/blob/main/test/json.cpp only measures the parsing part.
Zach Laine wrote:
I'm shocked too. That's really crazy. Believe me, I'm not interested in merging something with orders of magnitude of perf overhead into Boost. I will be getting to the bottom of this long before a merge could take place.
My profiler (VS2022) says that the top performance problem is the construction of a stringstream here
https://github.com/tzlaine/parser/blob/f99ae3b94ad0acef0cc92166d5108aade41da4ea/include/boost/parser/detail/printing.hpp#L624
This constructs a std::locale, which is apparently very slow.
When I fix it
[snip]
This change takes me from ~6130ms to ~520ms.
the top function becomes the constructor of scoped_trace_t. (It's probably not getting inlined.)
Looks like the tracing functionality costs a lot even when off.
Commenting out the bodies of scoped_trace_t and ~scoped_trace_t takes me to ~360ms. The top two functions are now `skip`
https://github.com/tzlaine/parser/blob/f99ae3b94ad0acef0cc92166d5108aade41da...
and `omit_parser::call`
https://github.com/tzlaine/parser/blob/f99ae3b94ad0acef0cc92166d5108aade41da...
both with 5.5%. The `SkipParser` of `skip` is `boost::parser::rule<json::ws, boost::parser::detail::nope, boost::parser::detail::nope, boost::pars...`.
On Thu, Feb 29, 2024 at 9:21 PM Peter Dimov via Boost
Zach Laine wrote:
I'm shocked too. That's really crazy. Believe me, I'm not interested in merging something with orders of magnitude of perf overhead into Boost. I will be getting to the bottom of this long before a merge could take place.
My profiler (VS2022) says that the top performance problem is the construction of a stringstream here
https://github.com/tzlaine/parser/blob/f99ae3b94ad0acef0cc92166d5108aade41da4ea/include/boost/parser/detail/printing.hpp#L624
This constructs a std::locale, which is apparently very slow.
When I fix it
[snip]
This change takes me from ~6130ms to ~520ms.
the top function becomes the constructor of scoped_trace_t. (It's probably not getting inlined.)
Looks like the tracing functionality costs a lot even when off.
Commenting out the bodies of scoped_trace_t and ~scoped_trace_t takes me to ~360ms.
The top two functions are now `skip`
https://github.com/tzlaine/parser/blob/f99ae3b94ad0acef0cc92166d5108aade41da...
and `omit_parser::call`
https://github.com/tzlaine/parser/blob/f99ae3b94ad0acef0cc92166d5108aade41da...
both with 5.5%.
The `SkipParser` of `skip` is
`boost::parser::rule<json::ws, boost::parser::detail::nope, boost::parser::detail::nope, boost::pars...`.
Thanks! Template parameter it is then. Zach
Thu, Feb 29, 2024 at 11:30 PM, Christian Mazakas via Boost
I'm not entirely sold on the value of Hana's `operator[]` for tuple accesses.
Especially because in later C++, we have placeholders so we can always just do:
auto const& [first,_,_] = my_tuple;
https://godbolt.org/z/vvnKxjGos
I largely agree with Alan, Hana's a heavyweight dependency for what I think is dubious benefit. Maybe others feel differently here but I'd prefer to just always work with the standard tuple and structured bindings than Hana's `operator[]`.
Once/if https://github.com/tzlaine/parser/issues/106 is implemented, the motivation for using Hana over std::tuple will be reduced practically to zero. That issue handles the simple cases, and for more complicated logic, you do not mind calling std::get<>. Regards, &rzej;
On Thu, Feb 29, 2024 at 12:25 PM Vinnie Falco via Boost
On Thu, Feb 29, 2024 at 10:17 AM Tom Honermann via Boost
wrote: Now, Zach, as a long time SG16 participant and contributor, you know full well that wchar_t doesn't imply Unicode :)
I despise wchar_t and avoid it completely in public interfaces. I'll drop down to using it inside an implementation, but my functions which accept Unicode strings use char, and I perform the conversion in the implementation even when doing so has a performance cost (such as on Windows which wants a "wide string"). I try to offer only utf-8 as the choice of encoding for Unicode (again even on Windows).
It's a thankless job but someone has to protect the world from the ugliness of wide characters.
Not going to argue with that at all. I also never use it. I do think that in this case, to have that be the only character type the API doesn't accept would be too weird. Zach
This is my review of Boost.Parser. First of all, a disclaimer: I am employed by the C++ Alliance.

# Are you knowledgeable about the problem domain?

I've had a college course on formal languages and parsing, so I still remember some fundamentals. I've also played with Spirit.Qi years ago.

# How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

I've spent about 4 hours reading the documentation, and 10 more hours experimenting with it.

# What is your evaluation of the potential usefulness of the library?

Parser combinators are useful for quite a lot of applications, and parser DSELs have proven to be a popular approach. I agree with a previous reviewer that not many people would prefer using an external tool to a parser library. I definitely wouldn't.

# What is your evaluation of the documentation?

Currently the documentation looks a bit undercooked. It needs some restructuring. For example, on one page there is a table of attributes for library-provided parsers. On another, there is a table of attributes for parser operators. On yet another, those two pages are repeated, and also there's a bunch of extra rules for how sequences and alternatives affect attributes, followed by a table with examples. This creates some confusion at first about where to look up information on attributes.

Another complaint I have is that the fact that the attribute for parsers with semantic actions is none is not advertised well enough. This is a very important piece of information, because most parsers use actions. And yet, this is only mentioned in operator tables, and never mentioned in the section on semantic actions.

Also, I had a lot of success using _locals, but the docs only mention it in passing. I feel like overall the documentation would have benefited from more examples which solve some specific tasks using various library components.

# What is your evaluation of the design?

As far as I can tell, this library is largely a new take on Spirit's design. As such, it won't blow anyone's mind. On the other hand, Spirit's design is battle-tested; it is known which things work well and which don't. One of the more significant changes is the handling of Unicode. I can attest to the simplicity of using the library with Unicode inputs. More on this later. Overall, I felt that using this library was easier than using Spirit.

One thing I want to note is that the library seems to have copied several naming conventions from Spirit. I'm not convinced there's much value in naming context accessors with a leading underscore. I'm also not convinced the dependency on Boost.Hana for the sole purpose of using its tuple's operator[] is warranted.

# What is your evaluation of the implementation?

I didn't look at the implementation.

# Did you try to use the library? With what compiler? Did you have any problems?

I've written several test programs with it. I mostly used GCC 12, but also tried building with Clang 17. I haven't encountered any issues when using GCC. The Clang build seemed to work originally, but later the build started to fail. I suspect this is related to the inclusion of transcode_view.hpp.

The first thing I tried to do was implementing a parser for arithmetic expressions. Converting a BNF to C++ was very straightforward, which is a plus. After that I thought of a task that requires Unicode support. So, I've made a parser for Russian numerals.
What makes this task interesting is that various parts of a numeral have to be in the same gender, and some parts affect other parts' plurality category (one, few, or many). After finishing the second program I suddenly realised that the two parsers could be combined to create a parser for arithmetic expressions written in Russian. I managed to do this with only trivial changes in a couple of minutes. This is a testament to the exceptional composability of the library's parsers. Finally, I was looking for a way to avoid building and then accumulating containers. This is where _locals came in handy. The final program can be seen here: https://gist.github.com/grisumbras/90d2e99b8eb8b6c82147188c8a6287f6.

While I was writing those parsers, using trace::on was of great help for debugging errors. I still had a bunch of several-pages-long template errors, particularly when parser attributes were not of the type I expected them to be.

A related issue is that I have been using the branch that allows returning values from semantic actions, and the library often couldn't correctly figure out whether I wanted to pass a context to the action, or just the attribute. In the end, I had to add an explicit return type of void to all of the former. With the latter, the problem sometimes manifested when a lambda had several return statements. I predict that this will be a significant usability issue.

Several people have asked for a way to define rules without a macro. I have a different request: I think that there should be a macro which, rather than relying on a naming convention, allows you to explicitly name the parser that implements the rule. This would allow people to use their own conventions.

I tried using nocase and discovered that the symbols parser ignores it. Is this by design? If so, there should be a separate parser that allows case-insensitive symbol matching.

Finally, several people asked for benchmarks. Conveniently, the library includes a JSON parsing example, and conveniently I am maintaining Boost.JSON and I know how to add new parser implementations to its benchmark runner. The results aren't very good: Boost.JSON with the default allocator is approximately 500 times faster than Parser on apache_builds.json. Niels Lohmann's JSON library is 100 times faster.

# Do you think the library should be accepted as a Boost library?

I recommend to ACCEPT the library on several CONDITIONS:

* The main requirement is to fix build failures on popular C++ implementations. Related to this would be a well-functioning CI setup.
* Either symbols should support case-insensitive mode, or an alternative parser should be added to the library.
* There should be a rule definition macro that allows explicitly specifying the parser that implements the rule.

Overall, I enjoyed my experience with the library. Given that Spirit is being sunsetted, Boost needs a replacement.
On Thu, Feb 29, 2024 at 5:17 AM Дмитрий Архипов via Boost
Boost.JSON with default allocator is approximately 500 times faster than Parser on apache_builds.json. Niels Lohmann's JSON library is 100 times faster.
Really? This is very difficult to believe. nlohmann's JSON is already pretty slow and to be a hundred times slower than that would require intent. Can you please paste a link to the benchmark program? Thanks
Thu, Feb 29, 2024 at 16:35, Vinnie Falco
On Thu, Feb 29, 2024 at 5:17 AM Дмитрий Архипов via Boost
wrote: Boost.JSON with default allocator is approximately 500 times faster than Parser on apache_builds.json. Niels Lohmann's JSON library is 100 times faster.
Really? This is very difficult to believe. nlohmann's JSON is already pretty slow and to be a hundred times slower than that would require intent. Can you please paste a link to the benchmark program?
I didn't save it, unfortunately, as putting it all into a usable branch would have required more work (I simply pointed the build to a local copy of Parser). I can recreate it properly if you want.
On Thu, Feb 29, 2024 at 7:16 AM Дмитрий Архипов via Boost
This is my review of Boost.Parser.
[snip]
# What is your evaluation of the documentation?
Currently documentation looks a bit undercooked. It needs some restructuring. For example, on one page there is a table of attributes for library-provided parsers. On another, there is a table of attributes for parser operators. On yet another those two pages are repeated, and also there's a bunch of extra rules for how sequences and alternatives affect attributes, followed by a table with examples. This creates some confusion at first on where to look up information on attributes.
Hopefully, this is addressed by the Cheat Sheet I made. It has all the tables copied into one place.
Another complaint I have is that the fact that the attribute for parsers with semantic actions is none is not advertised well enough. This is a very important piece of information, because most parsers use actions. And yet, this is only mentioned in operator tables, and never mentioned in the section on semantic actions.
Right, good point. Ticket: https://github.com/tzlaine/parser/issues/148
Also, I had a lot of success using _locals. But the docs only mention it in passing. I feel like overall the documentation would have benefited from more examples which solve some specific tasks using various library components.
I agree; there's already a ticket open for this one. [snip]
One thing I want to note is that the library seems to have copied several naming conventions from Spirit. I'm not convinced there's much value in naming context accessors with a leading underscore. I'm also not convinced dependency on Boost.Hana for the sole purpose of using its tuple's operator[] is warranted.
I feel strongly that it is, having used op[] a lot. That being said, I'm becoming aware that this is a minority opinion. Perhaps I should make std::tuple the default, instead of the other way around. I don't want to remove hana::tuple entirely, because I want to use it in my own parsers. [snip]
While I was writing those parsers using trace::on was of great help for debugging errors. I still had a bunch of several pages-long template errors, particularly when parser attributes were not of the type I expected it to be.
A related issue is that I have been using the branch that allowed returning values from semantic actions, and the library often couldn't correctly figure out whether I wanted to pass a context to the action, or just the attribute. In the end, I had to add explicit return type of void to all of the former. With the latter the problem sometimes manifested when a lambda had several return statements. I predict that this will be a significant usability issue.
Yeah, I implemented it to see how it works, and it only kind of does. I might experiment with alternatives that cannot be ambiguous, or might strike it. It was an interesting idea, but being forced to rigorously constrain your lambdas is not fun.
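For reference, the workaround Dmitry describes looks roughly like this (a sketch; it assumes the semantic-actions-with-return-values branch discussed above):
```cpp
namespace bp = boost::parser;

// An explicit void return type disambiguates a context-taking action that
// produces no value from one whose return value becomes the attribute.
auto action = [](auto & ctx) -> void {
    auto const & attr = bp::_attr(ctx);  // inspect the attribute
    (void)attr;
};
```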
Several people have asked for a way to define rules without a macro. I have a different request: I think that there should be a macro which rather than relying on a naming convention allows you to explicitly name the parser that implements the rule. This would allow people to use their own conventions.
This is easy for me to add, but it's also really easy for you to write. I find such low-effort macros unappealing as a library feature, since they're so easy for the user to use, if desired.
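Something like this minimal user-side sketch would do it, assuming the library's `name`/`name_def` convention and the public BOOST_PARSER_DEFINE_RULES macro:
```cpp
// Hypothetical user-side macro: bind an explicitly named parser expression
// to a rule by generating the <rule>_def alias the library macro expects.
// Variadic so that commas inside the parser expression are not a problem.
#define MY_DEFINE_RULE(rule_name, ...)                              \
    auto const BOOST_PARSER_PP_CAT(rule_name, _def) = __VA_ARGS__;  \
    BOOST_PARSER_DEFINE_RULES(rule_name)

// bp::rule<struct ident_tag, std::string> const ident = "identifier";
// MY_DEFINE_RULE(ident, bp::lexeme[+bp::char_('a', 'z')]);
```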
I tried using nocase and discovered that the symbols parser ignores it. Is this by design? If so, there should be a separate parser that allows case-insensitive symbol matching.
Nope, that's just an oversight. I went through all the parsers one by one, but the symbol table is its own thing, and got missed. Ticket: https://github.com/tzlaine/parser/issues/149
Finally, several people asked for benchmarks. Conveniently, the library includes a JSON parsing example, and conveniently I am maintaining Boost.JSON and I know how to add new parser implementations to its benchmark runner. The results aren't very good: Boost.JSON with default allocator is approximately 500 times faster than Parser on apache_builds.json. Niels Lohmann's JSON library is 100 times faster.
Ha! That's hilariously bad. I'll have to take a look at what might be going on there. Did you happen to take a look at the callback JSON parser for comparison?
* The main requirement is to fix build failures on popular C++ implementations. Related to this would be a well-functioning CI setup. * Either symbols should support case-insensitive mode, or an alternative parser should be added to the library. * There should be a rule definition macro that allows explicitly specifying parser that implements the rule.
Overall, I enjoyed my experience with the library. Given that Spirit is being sunsetted, Boost needs a replacement.
I fully agree with the first two. I'll probably only do the last one if Marshall approves the library, and makes me. Zach
Hi everyone,
This is my review of the proposed Boost.Parser. I'm posting my review
without an Accept/Reject vote.
# Are you knowledgeable about the problem domain?
I've had no prior experience using parser combinators. Therefore, I'm
writing this review as someone who attempted to read the documentation
and utilize the library to create a parser for parsing chunk-encoded HTTP bodies.
# How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?
I've spent approximately 1 day reading the documentation and 1 day
experimenting with the code and comparing it with Boost.Spirit.
# Does this library bring real benefit to C++ developers for
real-world use-case?
Yes, especially given that we frequently encounter applications or
libraries that necessitate parsing strings or plain text network
protocols. Hand-crafted parsers are prone to bugs and require thorough
reviews and fuzz tests to ensure accuracy. Having a library that
enables the construction of complex parsers by combining simple and
verifiable rules can significantly reduce development time.
# Do you have an application for this library?
I find it useful for testing serializers and parsers in network
libraries that implement a plain-text network protocol like HTTP. We
can load test the serializer or parts of it against simple parsers
made using this library, in addition to the more complex parser that
is built for efficiency but is more bug-prone.
# What is your evaluation of the documentation?
I noticed a few areas where the documentation could be improved:
1. The documentation assumes that the reader is already familiar with earlier Boost parsing libraries like Spirit.Qi, making it nearly unusable for those who lack prior knowledge of Spirit.
2. There are very few examples demonstrating how to use directives,
such as 'repeat', for instance.
3. There is a lack of a step-by-step guide that explains fundamental
concepts through examples.
4. The table of contents is either overly verbose or not categorized
properly, making it difficult to locate related sections.
# What is your evaluation of the design?
I found the necessity to go through multiple steps and use macros for
defining rules a bit cumbersome. It would be much more ergonomic to
enable the definition of rules in a single line, if possible.
# Did you try to use the library? With what compiler? Did you have any problems?
Parsing chunk-encoded HTTP bodies requires using attributes from a preceding parser to extract each chunk. I found the placeholders in Boost.Spirit very handy for accessing attributes and local variables. Here is how the parser appears in Boost.Spirit:
```
chunk_rule %= qi::hex[_a = _1] >> qi::eps(_a) >> "\r\n" >>
qi::repeat(_a)[qi::char_] >> "\r\n";
```
In comparison to using lambda expressions and external state variables
in the proposed library:
```
unsigned int size = 0;
auto chunk_parser =
bp::hex[([&size](auto & ctx) { size = bp::_attr(ctx); })] >>
bp::eps([&size](auto &) { return size != 0; }) >> "\r\n" >>
bp::repeat(std::ref(size))[bp::char_] >> "\r\n";
```
I attempted to define a rule with local variables as well, but there
was no way to access the variable within the repeat directive. I'm
unsure if it's technically feasible, but it would be very ergonomic to
write concise code like using placeholders in Boost.Spirit.
Another issue was that writing:
```
bp::repeat(std::ref(size))[bp::char_]
```
is not an efficient method for extracting a chunk of the string. This approach loops and executes the `bp::char_` parser on each character, and there is no parser available that can take a portion of the input string in one step.
In summary, my experience wasn't great. I anticipated a simple parser
that could be written in one line and was easy to reason about.
However, it ended up requiring manual state management and was quite
convoluted.
I'd like to thank Zach for his work, and I hope this review might
prove useful for improving the library.
- Affiliation disclosure
Please note that I'm affiliated with the C++ Alliance.
Regards,
Mohammad Nejati
On Mar 1, 2024, at 9:37 AM, Mohammad Nejati [ashtum] via Boost
Hi everyone,
This is my review of the proposed Boost.Parser. I'm posting my review without an Accept/Reject vote.
I was traveling on the 1st, and missed this review. Thanks for the review! — Marshall
[snip]
participants (14)
- Alan de Freitas
- Andrzej Krzemienski
- Arno Schoedl
- Christian Mazakas
- Christopher Kormanyos
- Marshall Clow
- Matt Borland
- Mohammad Nejati [ashtum]
- Peter Dimov
- Peter Turcan
- Tom Honermann
- Vinnie Falco
- Zach Laine
- Дмитрий Архипов