Reviewing Boost.JSON
I'm gonna use a perhaps unusual text structure for this review. Consuming
text is a tiring activity that demands energy. Having that in mind, I
ignored the order in which the library is presented through the docs and
moved all discussion of trivial concepts to the end. That's my attempt to make
sure the message gets through while the reader still has it -- the energy.
## A short introduction to pull parsers vs push parsers
The act of parsing an incoming stream will generate events from a
predefined set. Users match the generated events against rules to trigger
actions. For the following JSON sample:
```json
{ "foo" : 42 , "bar" : 33 }
```
One such event stream could be:
```
begin_object -> string_key -> number -> string_key -> number -> end_object
```
The essential difference between pull parsers and push parsers is how this
event stream is delivered. A push parser (a.k.a. SAX parser) uses the
callback handler model. A pull parser (a.k.a. StAX parser) stores the
current parsing state and allows the user to query this state.
Essentially every push parser has an implicit pull parser that doesn't get
exposed to the user. This is to say that a push parser would be implemented
along the lines of:
```cpp
// parsing loop
while (it != end) {
    match_token();
    dispatch_token(user_handlers);
}
```
A pull parser, on the other hand, doesn't have the loop construct above, just
its body. The loop construct is under user control. That's why pull parsers
are the more fundamental model. A push parser can be built on top of a pull
parser, but not the inverse (a sketch follows below).
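To make that claim concrete, here is a minimal sketch of a push parser built on
top of a pull parser. It assumes a hypothetical pull API shaped like the one
used later in this review (`next()`/`symbol()`); `Handler` stands for whatever
callback interface the push parser wants to expose:

```cpp
// Sketch only: a SAX-style (push) parser implemented as user code over a pull reader.
template<class Handler>
void push_parse(json::reader& reader, Handler& handler)
{
    // This is the parsing loop that a push parser keeps hidden behind its interface.
    while (reader.next()) {
        switch (reader.symbol()) {
        case json::token::symbol::begin_object: handler.on_object_begin(); break;
        case json::token::symbol::end_object:   handler.on_object_end();   break;
        case json::token::symbol::begin_array:  handler.on_array_begin();  break;
        case json::token::symbol::end_array:    handler.on_array_end();    break;
        // ... remaining token kinds forwarded the same way ...
        default: break;
        }
    }
}
```

The inverse adapter would require inverting control flow (coroutines, fibers or
an extra thread), which is exactly why the pull model is the more fundamental
one.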
Concrete examples will be shown later in this text.
## Performance: a common misconception
Some push parsing proponents believe pull parsers are slower. This belief
is unsubstantiated. As we have seen in the previous section, a push parser
is always built on top of an implicit pull parser. The only thing that
differentiates the two is that a small part of the parser logic under the
pulling model moves out of the interface to become user code. That's the
only difference.
One of the arguments brought by push parser proponents is the ability to
merge matching and decoding steps to perform less work. We can use the
previous sample to illustrate this idea:
```
{ "foo" : 42 , "bar" : 33 }
  ^^^^^
```
If we are to decode the `"foo"` token above, the parser first needs to
discover where the token begins and where the token ends. This is the
required matching step that every parser needs to perform. JSON strings can
be escaped and one such example would be:
```json
"na\u00EFve"
```
The decoding step takes care of converting escaped string slices to
unescaped ones. So user code would receive 3 string runs while this token
is being decoded:
- "na"
- "ï"
- "ve"
The parser does not allocate any memory to decode any string. The slices
are fed to the user, who in turn is responsible for putting them together
(possibly allocating heap memory).
The idea here is... having separate match and decode steps forces the
parser to walk over the tokens twice. For Boost.JSON, the merged match and
decode steps materialize in the `on_string_part()` (and `on_string()`)
callback:
```cpp
bool on_string_part(string_view v, std::size_t, error_code& ec)
{
    current_str.append(v.data(), v.size());
    return true;
}
```
For a pull parser with _separate_ match and decode steps the interface
would be along the lines of:
```cpp
reader.next();
if (reader.kind() == token::string)
    reader.string(current_str);
```
But as we've seen before, the only difference between pull and push parsers
is something else. Any features lateral to this classification originate
from different taxonomies. Just as a push parser can merge the matching and
decoding steps so can a pull parser:
```cpp
// A new overload.
//
// `my_match_options` here would provide
// the string collector upfront.
reader.next(my_match_options);
```
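To illustrate, here is one plausible shape for such an options overload. The
names are made up for this sketch; the point is only that the caller hands the
destination buffer to the parser upfront, so decoding happens while the token
is being matched:

```cpp
// Sketch only: hypothetical options object for a merged match+decode step.
struct match_options
{
    std::string* string_collector = nullptr; // decoded string runs are appended here
};

std::string current_str;
match_options my_match_options;
my_match_options.string_collector = &current_str;
reader.next(my_match_options); // matches the token *and* decodes it in one pass
```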
Another small note: just as merging the matching and decoding steps can be
beneficial performance-wise, so can skipping the decoding step entirely. I'll
be touching on this very idea again in the section below.
Appendix A will present how another feature from Boost.JSON can be
implemented in pull parsers: strings split among different buffers.
## The case for pull parsers
So far we've only been exposed to the difference between push parsers and
pull parsers, but no case has been made for which choice is ideal (aside from
the naturally understood concept that pull parsers are more fundamental and
push parsers can be built on top when desired).
The defining trait present in pull parsers that is impossible to replicate
in push parsers is their ability to compose algorithms. Composition is
enabled by temporarily handing over the token stream to a different
consumer. Not every consumer needs to know how to process every part of the
stream. That's a trait impossible to find in push parsers.
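A minimal sketch of what such a hand-over looks like, using the pull API style
shown later in this review (the consumer names are made up for illustration):

```cpp
// Sketch only: two independent consumers sharing one token stream.
void consume_metadata(json::reader& reader); // understands only the "metadata" subtree
void consume_payload(json::reader& reader);  // understands only the "payload" subtree

void consume_document(json::reader& reader)
{
    // ... advance the reader to the relevant subtree, then simply hand it over;
    // each consumer reads the tokens it understands and returns with the stream
    // positioned right after its subtree.
    consume_metadata(reader);
    consume_payload(reader);
}
```

Each consumer can be written, tested and reused in isolation.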
I have created a [gawk](https://www.gnu.org/software/gawk/) plug-in to
illustrate this property. The same plug-in was implemented several times
using different libraries/approaches. I incrementally added more features
under different branches (there are 8 branches in total). You can find a
rationale for the AWK choice in Appendix B at the end of this review.
The plugin's code can be found at
https://gitlab.com/vinipsmaker/gawk-jsonstream-archive. You may forgive my
beginner code; I had no experience writing gawk plug-ins before this one.
AWK is a C-like language (in syntax) where you define pattern-action
pairs. You can find a quick'n'dirty introduction to the language from
Wikipedia. This plug-in adds JSON capabilities to the tool. Given the
following data-set:
```
{"name": "Beth", "pay_rate": 4.00, "hours_worked": 0}
{"name": "Dan", "pay_rate": 3.75, "hours_worked": 0}
{"name": "Kathy", "pay_rate": 4.00, "hours_worked": 10}
{"name": "Mark", "pay_rate": 5.00, "hours_worked": 20}
{"name": "Mary", "pay_rate": 5.50, "hours_worked": 22}
{"name": "Susie", "pay_rate": 4.25, "hours_worked": 18}
```
And the following AWK program (with this plug-in loaded):
```awk
BEGIN {
JPAT[1] = "/name"
JPAT[2] = "/pay_rate"
JPAT[3] = "/hours_worked"
}
J[3] > 0 { print J[1], J[2] * J[3] }
```
The output will be:
```
Kathy 40
Mark 100
Mary 121
Susie 76.5
```
The paths for the JSON tree are given in [standard JSON
Pointer](https://tools.ietf.org/html/rfc6901) syntax. That's the plugin's
core and it's the code that doesn't change among branches, so it's the
appropriate place to begin the explanation.
In the file `jpat.ypp` you'll find a `read_jpat()` function that gets
called for each record. There's some boilerplate in the beginning to bail
out early if there were no changes to `JPAT`. Then there is this snippet
that gets processed by [re2c](https://re2c.org/) to generate valid C++
code:
```cpp
* {
    if (*begin != '\0')
        return bad_path();

    jpat[i].components.emplace_back(std::move(current_component));
    goto end;
}

{prefix} {
    jpat[i].components.emplace_back(std::move(current_component));
    current_component = std::string{};
    goto loop;
}

{unescaped} {
    current_component.append(begin, YYCURSOR - begin);
    goto loop;
}

{escaped_tilde} {
    current_component.push_back('~');
    goto loop;
}

{escaped_slash} {
    current_component.push_back('/');
    goto loop;
}
```
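To make the escape handling above concrete: per RFC 6901, `~1` decodes to `/`
and `~0` decodes to `~`, so a JPAT entry such as `/a~1b/m~0n` ends up as the
two components `a/b` and `m~n` in `jpat[i].components`.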
The definitions for the data structures follow:
```cpp
struct Path
{
    std::vector<std::string> components;
    awk_value_cookie_t cookie = nullptr;
    awk_string_t orig;
};
struct Tree: std::map<std::string, std::variant<Tree, size_t>>
{};
```
Let's dive into the code to see how the branches differ.
### `basic`
Every branch reads records one JSON document per line. That's the
`get_record()` function. For the `boostjson-dom` branch, the code goes along
the lines of:
```cpp
try {
    parsed_value.emplace_null();
    json_parser.reset(parsed_value.storage());
    std::error_code ec;
    json_parser.write(line.data(), line.size(), ec);
    if (ec)
        throw std::system_error{ec};
    json_parser.finish(ec);
    if (ec)
        throw std::system_error{ec};
    parsed_value = json_parser.release(ec);
    assert(!ec);
    switch (parsed_value.kind()) {
    case json::kind::array:
        fill_j(parsed_value.get_array(), tree);
        break;
    case json::kind::object:
        fill_j(parsed_value.get_object(), tree);
        break;
    // ...
    }
} catch (const std::exception& e) {
    warning(ext_id, "Error while processing record: %s", e.what());
}
```
The code parses the whole JSON document at once and then proceeds to consume
it by calling the `fill_j()` function:
```cpp
// Several overloads:
static void fill_j(const json::array& arr, Tree& node);
static void fill_j(const json::object& obj, Tree& node);
static void fill_j(const json::value& v, Tree& node);
// Here is the object overload. You can find the rest in the repository.
static void fill_j(const json::object& obj, Tree& node)
{
for (auto& [key, value]: node) {
auto it = obj.find(key);
if (it == obj.end())
continue;
std::visit(
hana::overload(
[&](Tree& tree) { fill_j(it->value(), tree); },
[&](size_t idx) { consume_value(it->value(), idx); }),
value);
}
}
```
The code is pretty straightforward. We check if the fields the user asked
for are present in the document and recurse on them. When we reach a leaf
node, the `J` AWK variable is set (that'd be the `consume_value()`
function).
Do notice that there is no uncontrolled recursion here. Recursion is always
JPAT-delimited. And JPAT is part of the source code, not user-input. There
is no risk of stack overflow here.
The pull parser approach (branch master) somewhat follows the same trend,
but the code is bigger. Here's a selected highlight:
```cpp
void fill_j(json::reader& reader, Tree& node)
{
    // ...
    switch (reader.symbol()) {
    case json::token::symbol::error:
        throw json::error{reader.error()};
    case json::token::symbol::begin_object: {
        if (!reader.next()) throw json::error{reader.error()};
        std::string current_key;
        for (;;) {
            if (reader.symbol() == json::token::symbol::end_object) {
                reader.next();
                break;
            }
            // Key
            assert(reader.symbol() == json::token::symbol::string);
            current_key.clear();
            auto ec = reader.string(current_key);
            assert(!ec); (void)ec;
            if (!reader.next()) throw json::error{reader.error()};
            // Value
            auto it = node.find(current_key);
            if (it == node.end()) {
                json::partial::skip(reader);
                continue;
            }
            std::visit(
                hana::overload(
                    [&](Tree& tree) {
                        switch (reader.symbol()) {
                        case json::token::symbol::begin_array:
                        case json::token::symbol::begin_object:
                            fill_j(reader, tree);
                            break;
                        default:
                            json::partial::skip(reader);
                        }
                    },
                    consume_value),
                it->second);
        }
        if (reader.symbol() == json::token::symbol::error)
            throw json::error{reader.error()};
    }
    // ...
    }
}
```
And here is the first sample of what I meant by composability. Our
"callback" only gets called to handle trees that are part of the decision
tree one way or another (and the root node always belongs here). When it
reaches a subtree that doesn't belong to the JPAT-defined decision tree,
`json::partial::skip()` is called. `json::partial::skip()` is defined
elsewhere and has no knowledge of the internals of our algorithm. This is a
different "callback" being called to handle a different part of the
tree. The two of them know how to handle different groups of tokens.
A nice side effect is that these "phantom" trees will be fully skipped
over. We don't allocate strings to collect values over them. This is an
optimization we get for free. [Designing JSON parsers to exploit this idea
is not unknown](https://github.com/zserge/jsmn).
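For reference, here is a minimal sketch of how such a skip algorithm can be
written on top of the pull API used above. This is not the actual
`json::partial::skip()` implementation, just an illustration of why it needs no
knowledge of the calling algorithm -- it only tracks nesting depth:

```cpp
// Sketch only: skip the value the reader is currently positioned on.
void skip_value(json::reader& reader)
{
    int depth = 0;
    do {
        switch (reader.symbol()) {
        case json::token::symbol::begin_array:
        case json::token::symbol::begin_object:
            ++depth;
            break;
        case json::token::symbol::end_array:
        case json::token::symbol::end_object:
            --depth;
            break;
        default:
            break; // atom values don't change the nesting level
        }
        if (!reader.next() && depth > 0)
            throw json::error{reader.error()};
    } while (depth > 0);
}
```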
The last one will be the code for the `boostjson` branch, but I feel like
it'll be easier to understand if I first introduce the tasks it has to
perform in the token stream.
```
{ "foo" : 42 , "bar" : { "foo" : 33 } }
                                 ^^
```
When the field's value is parsed, we must match the key against our
rule-set to possibly trigger an action. But it is not enough to store the
current key. The whole path must be stored to support JSON's nestable
structure. We start with a structure that saves the current context information:
```cpp
struct State
{
    bool is_object; //< false = ARRAY
    std::string current_key;
    size_t current_idx;
    Tree* node;
};
std::vector<State> state;
```
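As an illustration (not code from the plug-in), this is roughly what the stack
holds while the parser sits on the inner `33` of the sample above:

```cpp
// Parsing { "foo" : 42 , "bar" : { "foo" : 33 } }, positioned on the inner 33:
//
//   state[0] ~ { is_object: true, current_key: "bar", node: &tree }
//   state[1] ~ { is_object: true, current_key: "foo",
//                node: the JPAT subtree under "bar", or nullptr if absent }
//
// The full path /bar/foo is recoverable from the stack at any moment.
```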
So the beginning of the code for `on_object_begin()` must be:
```cpp
Tree* node = nullptr;
if (state.size() == 0) {
node = &tree;
} else if (state.back().node) {
auto it = state.back().node->find(current_key());
if (it != state.back().node->end())
node = std::get_if<Tree>(&it->second);
}
state.emplace_back();
state.back().is_object = true;
state.back().node = node;
```
And a matching action in the code for `on_object_end()`:
```cpp
state.pop_back();
```
But it's not really that simple. Paths such as
`/features/10000/properties/LOT_NUM` won't work because `current_key` won't
be provided by any `on_key()`-like callback when current node is
`10000`. `current_key` must be computed for arrays in the value handler
itself. That's why the following snippet must be present in the beginning
of every value-callback (and `on_{object,array}_begin()` too):
```cpp
if (state.size() > 0 && /*is_array=*/!state.back().is_object)
    update_current_array_key();
```
And the matching helper functions:
```cpp
std::string_view current_key()
{
    assert(state.size() > 0);
    return state.back().current_key;
}
void update_current_array_key()
{
    assert(state.size() > 0);
    assert(!state.back().is_object);
    state.back().current_key.resize(max_digits);
    auto s_size = std::to_chars(
        state.back().current_key.data(),
        state.back().current_key.data() + state.back().current_key.size(),
        state.back().current_idx++
    ).ptr - state.back().current_key.data();
    state.back().current_key.resize(s_size);
}
```
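The helpers above also reference a `max_digits` constant that isn't shown in
the excerpt. One plausible definition (an assumption for this sketch, not
necessarily the plug-in's exact code) is the number of characters needed to
print any `size_t` index in base 10:

```cpp
#include <charconv> // std::to_chars, used above
#include <limits>

// Enough room for any size_t rendered as decimal digits (assumption).
constexpr std::size_t max_digits =
    std::numeric_limits<std::size_t>::digits10 + 1;
```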
The final touch is to realize that the event stream will be something like:
```
string_key -> value -> string_key -> value -> string_key -> value -> ...
```
And we need to clear the `string_key` by the end of every value consumed:
```cpp
if (state.size() > 0) state.back().current_key.clear();
```
So the callback implementation for a simple atom value such as a string is:
```cpp
bool on_string_part(std::string_view v, std::error_code&)
{
    if (state.size() > 0) {
        if (state.back().node == nullptr)
            return true;
        auto it = state.back().node->find(current_key());
        if (it == state.back().node->end())
            return true;
        if (!std::holds_alternative<size_t>(it->second))
            return true;
    }
    // ... collect the string piece ...
    return true;
}
```
Every callback depends on implementation knowledge from every other
callback. Spaghetti code, as they say. I also implemented another change to
both branches -- `master` and `boostjson` -- to further demonstrate this
property. What if we want to store the string representation of array indexes
in static arrays instead of `std::string` objects?
There's this changeset for the `master` branch:
```
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -184,20 +184,19 @@ void fill_j(json::reader& reader, Tree& node)
case json::token::symbol::begin_array: {
if (!reader.next()) throw json::error{reader.error()};
-    std::string current_key;
+    std::array<char, max_digits> current_key;
```
## Appendix A: Strings split among different buffers

But as we've seen before, the only difference between pull and push parsers is
something else. Any features lateral to this classification originate from
different taxonomies.

The enabling feature here is not the push interface. Rather, it is the chosen
set of generated events for the JSON token stream. If you change the events
notified for the stream, you enable the same feature. For a pull parser, these
are the events it'd have to provide (among other valid choices), as sketched
below:

- `string` (if non-split)
- `string_run_begin`
- `string_run`
- `string_run_end`
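To make this concrete, here is a sketch of what consuming those events could
look like on the pull side. The symbol names and accessors are assumptions for
illustration, mirroring the API style used earlier, not an existing library
interface:

```cpp
// Sketch only: handling strings that may be split across input buffers.
void consume_strings(json::reader& reader, std::string& buffer)
{
    while (reader.next()) {
        switch (reader.symbol()) {
        case json::token::symbol::string:           // the whole string fit in one buffer
            handle_string(reader.string_view());    // hypothetical user callback
            break;
        case json::token::symbol::string_run_begin: // the string crosses a buffer boundary
            buffer.clear();
            break;
        case json::token::symbol::string_run:       // one decoded slice of the string
            buffer.append(reader.string_view());
            break;
        case json::token::symbol::string_run_end:   // all slices delivered
            handle_string(buffer);
            break;
        default:
            break;
        }
    }
}
```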
## Appendix B: Why gawk?

There are a few reasons why I've chosen a (GNU) AWK plug-in to illustrate the
concepts presented in this review.

First, I don't want to invent a new tool. Artificial examples could always be
created to fabricate an argument. AWK is one of the classical UNIX tools. It's
established and old. From the [gawk
manual](https://www.gnu.org/software/gawk/manual/html_node/History.html):
> The original version of `awk` was written in 1977 at AT&T Bell
> Laboratories. In 1985, a new version made the programming language more
> powerful [...] This new version became widely available with Unix System V
> Release 3.1 (1987).
We're talking 40 years of history here. If the tool is established, that's one
reason.

Another reason why I've chosen AWK is that AWK has no JSON processing
facilities, so it fills a gap in an existing tool.

The third reason why I've chosen GNU AWK to create this plug-in is that gawk
already has an extension mechanism. Therefore I don't have the freedom to
modify the AWK processor to accommodate the JSON library's idiosyncrasies. On
the contrary, it's the JSON library that has to prove its flexibility.

Now there's another question to answer here: why take MongoDB as inspiration to
design the gap that would be filled in AWK (such as the `$unwind` operator)?
And the answer will be the same. MongoDB is one of the JSON-oriented DBMSs that
got popular back in the NoSQL boom days. The end result tries to fit in AWK's
mindset and not the other way around.

AWK is a tool designed for tabular data processing. JSON is in no way tabular,
but its structure allows us to build a tabular view of the underlying data.
There are two main variables in gawk that control how fields are extracted:

- `FS`: defines how to find the field _separators_.
- `FPAT`: defines how to find the field _contents_.

The plug-in just follows this trend and introduces a `JPAT` variable that also
instructs how to find the field _contents_. `JPAT` is an array and each element
holds a [standard JSON Pointer](https://tools.ietf.org/html/rfc6901)
expression, which was designed exactly for this purpose.

There's one very interesting fact about this outcome. The usage pattern here
removes from the library the step that processes the stream (the step that'd
happen with the DOM approach) and moves much of the processing logic to be
embedded in the parsing phase itself. That's a continuation of the trend we saw
earlier about merging matching and decoding steps. It's also the same approach
we'd use to build serious JSON processing tools such as
[JMESPath](https://jmespath.org/) (some JMESPath expressions require a partial
DOM, but it's doable to use a hybrid approach anyway, as has already been
demonstrated in the `JUNWIND` code).

## Appendix C: Static decision trees with Boost.Hana

The reader may be interested to know that the decision tree auxiliary to the
parsing process developed for our (GNU) AWK plug-in is not only useful for AWK
but for C++ programs as well. The decision tree strategy may even use
Boost.Hana CT structures to fully unroll the code. It can be used to implement
facilities such as:

```cpp
using hana::literals::operator""_s;

int foo;
std::string bar;
json::scanf(input, "{ foo: { bar: ? }, bar: ? }"_s, foo, bar);
```

That's also possible with Boost.JSON's current parser. But the following is
not:

```cpp
using hana::literals::operator""_s;

Foo foo;
std::string bar;
json::scanf(
    input,
    "{ foo: { bar: ? }, bar: ? }"_s,
    [&foo](json::reader& reader) {
        // ...
    },
    bar);
```

As has been explained before, push parsers don't compose. And you aren't
limited to root-level scanning. You should have `json::partial::scanf()` to act
on subtrees too. A prototype for this idea can be found at
https://github.com/breese/trial.protocol/pull/43.

## Review questions
Please be explicit about your decision (ACCEPT or REJECT).
REJECT.
How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
I've stalled a project of mine for a few weeks just to participate in this review process. I think I've spent close to 50 hours. I wrote a JSON tool in Boost.JSON and the same tool under a different parser model to highlight important traits that were being dismissed in previous debates. I've gone through the documentation and source code to better understand some Boost.JSON details. I've also been involved in previous Boost.JSON debates in this very mailing list.
Are you knowledgeable about the problem domain?
I've tested many JSON parser libraries in C++ to this point. In my last job I
leveraged a JSON (pull) parser to merge multiple validation layers into a
one-pass operation and later leveraged Boost.Hana to automate the parser
generation. Our API is JSON-based and we apply plenty of transformations
(including DOM-based ones when unavoidable), some similar to the GNU AWK
plug-in demonstrated here (e.g. synthesizing a new field on routed messages
while DOM-tree related costs are avoided). I've contributed a few algorithms
and subtree parsing ideas back to the leveraged JSON (pull) parser. I offered
the same composability ideas to Boost.JSON a few months back when the topic
appeared here in this very mailing list.

In the past, I've designed parsers in C++ -- pull and push parsers. I've played
with some tools for parser generation such as re2c, gperf (gperf might not
really be parsing, but it is related), and Boost.Xpressive. I've also
experienced the pain of parsers that try to be too lightweight but ultimately
backfire by moving not only performance costs back to the consumer but also
greatly increasing complexity.

--
Vinícius dos Santos Oliveira
https://vinipsmaker.github.io/
On Thu, Sep 17, 2020 at 12:31 PM Vinícius dos Santos Oliveira via Boost wrote:

> ... The idea here is... ... Appendix A will present how another feature from
> Boost.JSON can be implemented in pull parsers: strings split among different
> buffers. ...
This is all very nice and I appreciate the time that you put into this thorough and detailed analysis. What is missing from this screed, however, is an example of a complete working library which implements these wonderful ideas of yours.
> Boost.JSON -- will prove unfit to build anything higher-level and a new core will have to be developed from the ground-up because current bad design choices.
On the other hand, we have a couple of examples of libraries in the style of Boost.JSON: RapidJSON, and JSON For Modern C++ (nlohmann's JSON). These libraries also offer a DOM and a monolithic parser (although, in nlohmann's case it cannot be instantiated directly). They have tens of thousands of users and can be seen as dependencies in other projects, which provides a stark contradiction to your claim above. Based on the popularity of these two libraries alone, it should be uncontroversial that a JSON library which offers the same functionality (a DOM, plus parsing, plus serialization) with great performance and the level of interface quality you expect from a Boost component, would be immensely popular and quite the opposite of "current bad design choices." Quite frankly I find your position to be completely indefensible.
> ## The case for pull parsers ... The defining trait present in pull parsers that is impossible to replicate in push parsers is their ability to compose algorithms. ...
This is a feature no one is asking for, and not a goal of the library. Boost.JSON is about bringing to C++ the same level of ease of use and flexibility for working with JSON that you find in other languages like say, Javascript. It is not a competitor to Boost.Serialization, nor is it meant for the parser and serializer to satisfy every possible use-case. It is meant for a particular use-case and designed to do it very well. Besides, I have a theory that there is no single parser design which will be ideal for every use-case. That is to say that optimizing for one feature will by necessity disadvantage other features. I would of course love to be proven wrong, if someone could show me a complete working example with benchmarks. But nlohmann's JSON took this route and well, we know what the results are (not good).
> Composition is enabled by temporarily handing over the token stream to a different consumer. Not every consumer needs to know how to process every part of the stream. That's a trait impossible to find in push parsers.
Practical experience with a 5-year old Boost library and a 2-year old Not-In-Boost library, plus feedback from actual users, has shown that these use-cases are not what people want.
> Can the handler pause the parsing loop by returning `false` w/o setting any error code (and how'd the stream be resumed)?
No, this is explained in the docs. Upon error, the only valid operations are destruction and reset.

Regards
On 17/09/2020 20:30, Vinícius dos Santos Oliveira via Boost wrote:
> As it has been explained before, push parsers don't compose. And you aren't limited to root-level scanning. You should have `json::partial::scanf()` to act on subtrees too. A prototype for this idea can be found at https://github.com/breese/trial.protocol/pull/43.
Firstly, that was a great essay on the backing theory Vinicius. I only wish more people who write parsers would read that first. I would urge you to convert it into a blog post or something similar, post it online so people can find it, and all that great explanation of theory doesn't get lost forever.
> ## Review questions
>
> Please be explicit about your decision (ACCEPT or REJECT).
>
> REJECT.
I understand your motivation here, and given nobody else on this list will say this here, you are absolutely right that in the general case, pull based parsers are the right choice. I chose a pull design for pcpp (the pure Python C preprocessor), despite that strictly speaking it's really not necessary for parsing contents whose length is always fully known in advance. However, pull based designs are just better in general, more flexible, more extensible.

In the very specific case of parsing JSON however, I'm not sure if the standard rules of evaluation apply. The author of sajson claims that most of his speed comes from not being a pull parser. What you do is zero copy DMA the incoming socket data into a memory mapped buffer, execute sajson's AST parse upon that known sized memory mapped buffer which encodes the AST directly into the source by modifying the buffer in place to avoid dynamic memory allocations completely, and voila bam there's your JSON parsed with a strict minimum of memory copied or cache lines modified. He claims, and I have no reason to doubt him, that because he can make these hard coded assumptions about the input buffer, he was able to make a very fast JSON parser (amongst the fastest non-SIMD parsers). By inference, a pull parser couldn't be as fast.

I find that explanation by sajson's author compelling. The fact he completely avoids dynamic memory allocation altogether, and builds the AST inline into the original buffer of JSON, is particularly compelling.

I haven't looked at Boost.JSON. But it seems to target a more C++ idiomatic API, be pluggable for other formats like Boost.Serialisation, but retain most of the performance of JSON parsers such as sajson or simdjson. As Boost reviews primarily review API design, Boost.JSON's choice of approach fits well for the process here. Boost prefers purity over performance.

Personally speaking, for JSON I care solely and exclusively about maximum possible parse speed. I have no use for customisation or extensibility. If Boost.JSON beats sajson in in-place AST building and it also beats simdjson, I'll use it. If it doesn't, I won't.

I suspect most users of JSON by far would have the exact same attitude as I do. For users like us, we really don't care what the parser does, or how it is designed, or whatever crappy API it might have, all we care about is maximum possible data extraction performance. Never ever calling malloc is an excellent sign of the right kind of JSON parser design, at least in my book.

Niall
On Fri, Sep 18, 2020 at 05:57, Niall Douglas via Boost wrote:
> Firstly, that was a great essay on the backing theory Vinicius. I only wish more people who write parsers would read that first. I would urge you to convert it into a blog post or something similar, post it online so people can find it, and all that great explanation of theory doesn't get lost forever.
Thanks, Niall. You can share this link if you want: https://gitlab.com/-/snippets/2016550
> In the very specific case of parsing JSON however, I'm not sure if the standard rules of evaluation apply. The author of sajson claims that most of his speed comes from not being a pull parser. What you do is zero copy DMA the incoming socket data into a memory mapped buffer, execute sajson's AST parse upon that known sized memory mapped buffer which encodes the AST directly into the source by modifying the buffer in place to avoid dynamic memory allocations completely, and voila bam there's your JSON parsed with a strict minimum of memory copied or cache lines modified. He claims, and I have no reason to doubt him, that because he can make these hard coded assumptions about the input buffer, he was able to make a very fast JSON parser (amongst the fastest non-SIMD parsers). By inference, a pull parser couldn't be as fast.
>
> I find that explanation by sajson's author compelling. The fact he completely avoids dynamic memory allocation altogether, and builds the AST inline into the original buffer of JSON, is particularly compelling.
Design-wise, sajson has at least 2 tricks worth discussing:

- It doesn't expose stream events to the user. So the pull/push taxonomy
  doesn't really apply here.
- It modifies the input stream. That's a destructive parsing technique. That's
  the section "a faster DOM tree" from my review.

Thanks for bringing this project to our attention. As for Boost.JSON, none of
the above matters. Point 1 could matter if the parser were an implementation
detail, but that's not the case. Leaving the Boost.JSON review topic aside for
a sentence, I share your assessment.
> I haven't looked at Boost.JSON. But it seems to target a more C++ idiomatic API, be pluggable for other formats like Boost.Serialisation, but retain most of the performance of JSON parsers such as sajson or simdjson. As Boost reviews primarily review API design, Boost.JSON's choice of approach fits well for the process here. Boost prefers purity over performance.
`json::value` can have integration to Boost.Serialization, but its parser (this would be the archive concept) can't. I could write a detailed explanation here like I've done for the pull/push taxonomy... maybe another day.
> I suspect most users of JSON by far would have the exact same attitude as I do. For users like us, we really don't care what the parser does, or how it is designed, or whatever crappy API it might have, all we care about is maximum possible data extraction performance. Never ever calling malloc is an excellent sign of the right kind of JSON parser design, at least in my book.
A coworker of mine has thoughts similar to yours. Great guy.

Anyway, I feel like resuming the project that I've put on hold, so that's my
goodbye. I'll still keep an eye on the discussions, but I'll try to stay mostly
silent. Have fun, you all. And nice talking to you again, Niall. How many years
since we worked together (even if only for a brief period)? :)

--
Vinícius dos Santos Oliveira
https://vinipsmaker.github.io/
On Thu, Sep 17, 2020 at 3:31 PM Vinícius dos Santos Oliveira wrote:
> ## A short introduction to pull parsers vs push parsers
>
> The act of parsing an incoming stream will generate events from a predefined set. Users match the generated events against rules to trigger actions. For the following JSON sample:
>
> { "foo" : 42 , "bar" : 33 }
>
> One of such event streams could be:
>
> begin_object -> string_key -> number -> string_key -> number -> end_object
>
> The essential difference between pull parsers and push parsers is how this event stream is delivered. A push parser (a.k.a. SAX parsers) uses the callback handler model. A pull parser (a.k.a. StAX parsers) stores the current parsing state and allows the user to query this state.
>
> Essentially every push parser has an implicit pull parser that doesn't get exposed to the user. This is to say that a push parser would be implemented in the lines of:
>
> [snip]
>
> ## Review questions
>
> Please be explicit about your decision (ACCEPT or REJECT).
>
> REJECT.
Hi Vinícius,

It took me a while to come to terms with the terminology ("pull parser"), but eventually I reached the same page. I am interested in a JSON pull parser as you've described it. But I would want a library that is part of Boost - i.e. one that has been through a formal review, and whose presence in Boost would result in future maintenance and contributions etc. Are you writing such a library with the intent of contributing it to Boost?

I understand that what Boost.JSON offers is a higher level abstraction. i.e. Boost.JSON could be implemented on top of such a JSON pull parser library. But that higher level abstraction is what many users want, correct?

I said above I'd be in the market for a JSON pull parser. I prefer to have my own types, not reuse a DOM type provided by a library. I would think Boost should have both:

* A library that offers boost::json::value for the people that want that
* A library that has a pull parser for people like me

I know that one of your complaints was that Boost.JSON "does too little". But aside from these other JSON-related use cases that it does not cover, for what it does, it wasn't clear what the objection was. i.e. It sounded like your major objection to Boost.JSON was just that it did not ship with the lower level pull parser.

Glen
On Fri, Sep 18, 2020 at 15:25, Glen Fernandes wrote:
> Hi Vinícius,
Hi Glen,
> I know that one of your complaints was that Boost.JSON "does too little". But aside from these other JSON-related use cases that it does not cover, for what it does, it wasn't clear what the objection was.
>
> i.e. It sounded like your major objection to Boost.JSON was just that it did not ship with the lower level pull parser.
That's not how I work. I keep an eye on the discussions. For instance, we have
Peter Dimov who will be my stakeholder on the array of scalars for improved
CBOR case. I trust Peter Dimov here. I have no urge to repeat somebody else's
observation. Doing so would only contribute to noise.

And when I say "it does too little", I mean what I say. Boost.JSON's design is
solely concerned with 3 pieces:

- A read() function.
- A write() function.
- The variant class used in these 2 functions.

Anything else will receive no consideration to design at all. Can we go in the
opposite direction from parsers and review that? No, we can't. Boost.JSON has
no serialization framework. Compare these two pretty-printer examples and
you'll understand what I mean:

- https://master.json.cpp.al/json/examples.html#json.examples.pretty
- https://github.com/breese/trial.protocol/blob/develop/example/json/pretty_pr...

If I didn't happen to spend a few extra hours reviewing Boost.JSON's
serialization framework, the reason is... Boost.JSON has nothing that I can
review/work-with here.
> I am interested in a JSON pull parser [...] Are you writing such a library with the intent of contributing it to Boost?
I do intend to write libraries for Boost, but not in the JSON domain. The pull parser library that I use daily at work is the same one I used to showcase pull parsers in the gawk plug-in. You can try to persuade Bjørn -- he authored the library that I use -- if you want. You can try to persuade Vinnie as well, but good luck there (I've been trying for a long time).
> I understand that what Boost.JSON offers is a higher level abstraction. i.e. Boost.JSON could be implemented on top of such a JSON pull parser library. But that higher level abstraction is what many users want, correct?
>
> I said above I'd be in the market for a JSON pull parser. I prefer to have my own types, not reuse a DOM type provided by a library. I would think Boost should have both:
>
> * A library that offers boost::json::value for the people that want that
> * A library that has a pull parser for people like me
Boost is not a mere ceremony to award this and that library. Boost's value
relies on peer reviews, and that's my contribution here. I know some proponents
were very persistent on "DOM-tree is valuable", but I won't be baited. If the
container is all that matters, this "library" should instead be a PR to
Boost.Container.

And you aren't the only one who would prefer to use your own types. The whole
"my users never requested [...]" narrative fails to observe that on the very
first page of the Boost.JSON documentation itself, "I want to parse directly
into my own data structures" is one of the few FAQ items:
https://master.json.cpp.al/json/overview.html#json.overview.frequently_asked...

Meanwhile the pull parser library that I use -- and it is older than
Boost.JSON -- has Boost.Serialization integration that only happens to work
thanks to the pull parser choice.

I'll answer Niall next, but I need a break right now, so it'll take a few hours
at least.

--
Vinícius dos Santos Oliveira
https://vinipsmaker.github.io/