- Do you think the library should be accepted as a Boost library?

Yes, my vote is to ACCEPT this library.

- What is your evaluation of the design?

Much has been debated recently on the merits of push parsers vs. pull parsers. Boost.JSON provides both a push parser (SAX-like interface) and a DOM parser (which is somewhat pull-like but I don't think is technically either kind). It does not provide a "real" pull parser, which is sometimes more efficient for deserialisation into custom types; the SAX parser can somewhat do that but is not composable, which makes it harder to use in practice, unless I'm missing something (this is usually why I end up never using SAX parsers). If so, I'd like to see a more complete example where the handler delegates part of the parsing to a class that in turn delegates part to another class (as with reading typed sub-values). I don't think this is currently possible.

So I think I can understand some of the debate around this point; while I don't think it's strictly necessary that this particular library implement that, if it doesn't then another library will end up having potentially competing types and the natural Boost.JSON name will already be taken. Personally, however, I don't think this is sufficient reason to block acceptance of this library.

Direct pull parsing doesn't mesh well with JSON, due to the requirement to tolerate reordered keys, so any non-defective parser would have to have something DOM-like under the hood anyway. (Some kind of cache, if not necessarily a traditional DOM.) As such it should be sufficient to use json::value itself as part of a pull-like parsing interface.

Regarding the SAX parser specifically, I dislike the *_part methods in the callback interface. I feel that these should be removed and only the final form retained, called only when the complete text of the value has been determined. I don't feel it is ever sensible to call back with a partially-parsed value (unless I am misunderstanding what these are for, but there is little explanation in the docs, and I wasn't able to get them called at all in my test).

I also don't see the point of supplying a size argument to on_key* and on_string, as this can be read from the string_view argument instead. The various number methods don't pass this separately. Similarly, having a bool return as well as an error_code output argument seems peculiar; one or the other should be sufficient; as is, it just invites inconsistency.

parser.write() is incremental but basic_parser.write() is not, which seems odd. I assume this means that parser is storing an incrementally populated buffer of the full document rather than _really_ incrementally parsing.

parse/parser do not appear to support consuming only part of the supplied input as a complete JSON value (returning how much was consumed), which is useful for streams where multiple objects are concatenated without explicit separation.

Regarding other aspects of design, I haven't drilled down in great detail, but the basic interface seems reasonably sound, other than the concerns raised above. I have mixed feelings on the initializer-list construction method -- I like that there is some method of doing this, and it seems like idiomatic C++ (particularly regarding how key-value pairs appear), but also very non-JSON -- an array of objects ends up having three nested braces. But I don't have any specific suggestion for improvement.

- What is your evaluation of the implementation?
It seems unfortunate (though understandable, due to heap allocation) that you cannot declare a json::value / json::object as constexpr, for cases where you do want some kind of constant reference object (e.g. a schema description). You can somewhat work around this by declaring the JSON as a constexpr string_view and later parsing it to a non-constexpr json::value, but that seems a bit awkward (and potentially bad-perf-prone if someone parses it multiple times). Otherwise I didn't really look into the implementation at all.

- What is your evaluation of the documentation?

Overall, good, if perhaps a bit sparse on some aspects. One pet peeve however is that most of the examples appear to be written as if "using namespace boost::json" is in effect, which I feel both encourages bad practice and makes it harder to distinguish which types are provided by the library vs. exist only for the example. This is especially confusing in the examples for tag_invoke, but not only there -- another example of confusion is the aliases for types present in both boost and std namespaces. I strongly recommend going through and changing these to explicitly use a json:: namespace prefix on all library types (including the boost/std aliases). Brevity is useful, but not at the expense of clarity.

The "Using Numbers" section of the docs appears to be entirely blank.

"serializer can be used to incrementally serialize a value incrementally" (redundant words are redundant)

(On a related note, an example use case for serializer is listed as "ensure a fixed amount of work is performed in each cycle", but the only means of control is the length of the provided buffer. I guess this is equating work to number of characters output, which seems dubious. But I have no specific objections.)

- What is your evaluation of the potential usefulness of the library?

The library seems well suited for JSON as inter-process/network data interchange, other than perhaps the omission of partial input parsing. It seems less suited for JSON as configuration file -- while it can parse such files (via common extensions such as ignoring comments and trailing commas) it lacks means to pretty-print and to round-trip the comments, which would be essential if writing back a config file intended for human use. However while supporting this use case would be nice, I don't consider it mandatory.

- Did you try to use the library? With what compiler? Did you have any problems?

I tried it out with MSVC 2017 in standalone header-only mode, though only with a few small examples. The basic_parser examples do not compile out of the box -- with the suggested json.hpp and src.hpp includes there are link errors with basic_parser symbols. Additionally including basic_parser.hpp resolves these but starts complaining about missing max_{key,string,array,object}_size members of the handler type, which are not documented. After these are supplied then it works as expected. I don't see why these should be needed, however.

- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

Roughly four hours over a few days reviewing docs and trying some small experiments.

- Are you knowledgeable about the problem domain?

I have some apps that use JSON as both interchange and configuration storage formats, although most of these do not use C++. So, "somewhat".
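As a minimal sketch of the constexpr string_view workaround mentioned above under the implementation question, parsing the text once on first use so repeated lookups don't re-parse it (json::parse and json::value are the documented APIs; the schema text and the schema() accessor are made up for the example):

    #include <boost/json.hpp>
    #include <string_view>

    namespace json = boost::json;

    // The "constant reference object" kept as compile-time text.
    constexpr std::string_view schema_text =
        R"({"type":"object","required":["name"],"properties":{"name":{"type":"string"}}})";

    // Parsed exactly once, on first call, into a non-constexpr json::value.
    const json::value& schema()
    {
        static const json::value v =
            json::parse(json::string_view(schema_text.data(), schema_text.size()));
        return v;
    }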
On Tue, 22 Sep 2020 at 05:12, Gavin Lambert via Boost
Direct pull parsing doesn't mesh well with JSON, due to the requirement to tolerate reordered keys, so any non-defective parser would have to have something DOM-like under the hood anyway. (Some kind of cache, if not necessarily a traditional DOM.) As such it should be sufficient to use json::value itself as part of a pull-like parsing interface.
Not really. We can debate this topic when the 10-day review period pressure is over. Check my previous emails and you'll find hints on where this is going. And it's not hypothetical as some suggest. There's actual code lying around these concepts already. -- Vinícius dos Santos Oliveira https://vinipsmaker.github.io/
On Tue, Sep 22, 2020 at 1:13 AM Gavin Lambert via Boost
Regarding the SAX parser specifically, I dislike the *_part methods in the callback interface. I feel that these should be removed and only the final form retained, called only when the complete text of the value has been determined.
The parser (and serializer) support incremental / streaming operation. To provide the "complete" text would require allocating memory and saving the partial result until enough input has been supplied to finish the text. basic_parser does not currently allocate memory, which was a design goal, and this would change that.
I don't feel it is ever sensible to call back with a partially-parsed value (unless I am misunderstanding what these are for, but there is little explanation in the docs, and I wasn't able to get them called at all in my test).
In order to see partial results, you have to invoke the parser with partial JSON. In other words, break up the input JSON into two or more consecutive buffers of characters, and supply them in order.
I also don't see the point of supplying a size argument to on_key* and on_string, as this can be read from the string_view argument instead.
Right, so this size argument is actually the total number of characters, while the string_view::size is the number of characters in the current piece.
having a bool return as well as an error_code output argument seems peculiar; one or the other should be sufficient; as is, it just invites inconsistency.
Peculiar, but fast :) This was profiled and optimized. The bool return plus ec is superior to having only ec.
parser.write() is incremental but basic_parser.write() is not, which seems odd. I assume this means that parser is storing an incrementally populated buffer of the full document rather than _really_ incrementally parsing.
Everything is incremental. parser::write() has the condition that if there are any leftover characters after receiving a complete JSON, an error is returned. Contrast this with parser::write_some which does not produce an error in this case. basic_parser::write is more similar to parser::write_some. We will be renaming it to basic_parser::write_some, as another reviewer suggested, to make this more clear. There is also a synergy with asio's naming with that change.
parse/parser do not appear to support consuming only part of the supplied input as a complete JSON value (returning how much was consumed), which is useful for streams where multiple objects are concatenated without explicit separation.
The free function parse() does not, but parser does -- use parser::write_some.
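A hedged sketch of that approach for a buffer holding several concatenated documents, assuming the reviewed parser exposes what later shipped as stream_parser (write_some returning the number of characters consumed, plus done(), release() and reset()); the input string is made up:

    #include <boost/json.hpp>
    #include <iostream>
    #include <string>

    namespace json = boost::json;

    int main()
    {
        // Two documents back to back, with no separator between them.
        std::string buf = R"({"a":1}{"b":2})";

        json::parser p;
        json::error_code ec;
        std::size_t offset = 0;

        while (offset < buf.size() && !ec)
        {
            // Consumes characters up to the end of one complete document.
            offset += p.write_some(buf.data() + offset, buf.size() - offset, ec);
            if (!ec && p.done())
            {
                std::cout << p.release() << "\n";  // hand off the finished value
                p.reset();                         // prepare for the next document
            }
        }
    }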
The "Using Numbers" section of the docs appears to be entirely blank.
This is fixed in develop.
(On a related note, an example use case for serializer is listed as "ensure a fixed amount of work is performed in each cycle", but the only means of control is the length of the provided buffer. I guess this is equating work to number of characters output, which seems dubious.
Well, yeah, that's exactly what it is. Consider a network server that delivers JSON to clients. We don't want a completion handler to be stuck serializing 100MB of JSON while another client is waiting who only needs 1MB serialized. The way to ensure fairness here is to serialize incrementally with some reasonable buffer size, say 32kb. The light client will finish in ~33 i/o cycles while the heavy client still makes progress.
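A short sketch of that fixed-buffer loop using the documented serializer members (reset, done, read); the 32 KB buffer and the send() stand-in are illustrative:

    #include <boost/json.hpp>
    #include <iostream>

    namespace json = boost::json;

    // Stand-in for the per-client network write in the scenario above.
    void send(json::string_view chunk)
    {
        std::cout << chunk;
    }

    void serialize_in_chunks(const json::value& jv)
    {
        json::serializer sr;
        sr.reset(&jv);          // serialize this value incrementally
        char buf[32768];        // "reasonable buffer size, say 32kb"
        while (!sr.done())
        {
            // Each call emits at most sizeof(buf) characters -- the unit of
            // "work" that is bounded per cycle.
            send(sr.read(buf, sizeof(buf)));
        }
    }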
It seems less suited for JSON as configuration file -- while it can parse such files (via common extensions such as ignoring comments and trailing commas) it lacks means to pretty-print and to round-trip the comments, which would be essential if writing back a config file intended for human use.
To store the comments would mean storing them in the DOM, which would have a significant impact on performance as the space for everything is tightly controlled.
The basic_parser examples do not compile out of the box -- with the suggested json.hpp and src.hpp includes there are link errors with basic_parser symbols. Additionally including basic_parser.hpp resolves these but starts complaining about missing max_{key,string,array,object}_size members of the handler type, which are not documented. After these are supplied then it works as expected. I don't see why these should be needed, however.
Those max sizes should be documented in the handler exemplar: https://master.json.cpp.al/json/ref/boost__json__basic_parser.html#json.ref.... The reason they are required is for optimization. Handling the limit in the basic_parser leads to performance gains. Thanks
On 23/09/2020 05:55, Vinnie Falco wrote:
On Tue, Sep 22, 2020 at 1:13 AM Gavin Lambert wrote:
Regarding the SAX parser specifically, I dislike the *_part methods in the callback interface. I feel that these should be removed and only the final form retained, called only when the complete text of the value has been determined.
The parser (and serializer) support incremental / streaming operation. To provide the "complete" text would require allocating memory and saving the partial result until enough input has been supplied to finish the text. basic_parser does not currently allocate memory, which was a design goal, and this would change that.
Not necessarily. A small buffer can be statically allocated for a single value-in-progress. It's trivial for all numeric values (except perhaps pathological precision, but that can be ignored). It's harder for string values; that might have to support dynamic allocation, it's true. I just don't see the point in having an interface that can return half of a value -- this just moves the problem of allocation and keeping track of the full value to the parser handler, which seems like an SRP violation.
I don't feel it is ever sensible to call back with a partially-parsed value (unless I am misunderstanding what these are for, but there is little explanation in the docs, and I wasn't able to get them called at all in my test).
In order to see partial results, you have to invoke the parser with partial JSON. In other words, break up the input JSON into two or more consecutive buffers of characters, and supply them in order.
I tried that, but the parse simply failed at the end of the first partial buffer and it completely ignored the second buffer. Hence my statement that basic_parser did not operate incrementally. Though see below; this appears to be a failure of example rather than of implementation.
I also don't see the point of supplying a size argument to on_key* and on_string, as this can be read from the string_view argument instead.
Right, so this size argument is actually the total number of characters, while the string_view::size is the number of characters in the current piece.
Which is a poor design choice, as I said.
parser.write() is incremental but basic_parser.write() is not, which seems odd. I assume this means that parser is storing an incrementally populated buffer of the full document rather than _really_ incrementally parsing.
Everything is incremental. parser::write() has the condition that if there are any leftover characters after receiving a complete JSON, an error is returned. Contrast this with parser::write_some which does not produce an error in this case. basic_parser::write is more similar to parser::write_some. We will be renaming it to basic_parser::write_some, as another reviewer suggested, to make this more clear. There is also a synergy with asio's naming with that change.
I wrote a print_parser (which is just like the null_parser example but prints the method calls and parameters) and invoked it like so:

    json::error_code ec;
    print_parser parser;
    std::string data1 = R"j({"foo":14,"bar":1)j";
    std::string data2 = R"j(8,"baz":"bo)j";
    std::string data3 = R"j(b"})j";
    parser.write(&data1[0], data1.size(), ec);
    parser.write(&data2[0], data2.size(), ec);
    parser.write(&data3[0], data3.size(), ec);

It stopped after calling on_int64(1, "1") and did not continue parsing, completely ignoring the second buffer, never calling any _part methods and reporting the wrong integer value. It parses correctly only if both are supplied. Hence, as I said, it is not behaving incrementally.

However I have since noticed that the missing link is that it must pass 'true' as the 'more' parameter to basic_parser.write, which is not demonstrated in any example; it may be helpful to show that. (In fact this resolves my other concern about multi-document parsing as well, which also could use an explicit example.)

Once this is done, then (as expected by the docs) it reports these calls (in part):

    on_number_part("1")
    on_int64(18, "8")
    on_string_part("bo", 2)
    on_string("b", 3)

So I reiterate that the number and string handling is inconsistent and these part methods are more confusing than helpful -- numbers are parsed into the complete value but strings are not, and neither give you the full text of the value (which may be important if the goal is to parse into an arbitrary-precision value type). I do regard this as a very poor design choice, but not sufficiently so to reject the library.

(I still also regard SAX-style parsing itself as a poor design choice [vs pull parsing], due to poor composability, but that is not a failure of the library itself, except perhaps in offering it at all. But I can understand why.)
Well, yeah, that's exactly what it is. Consider a network server that delivers JSON to clients. We don't want a completion handler to be stuck serializing 100MB of JSON while another client is waiting who only needs 1MB serialized. The way to ensure fairness here is to serialize incrementally with some reasonable buffer size, say 32kb. The light client will finish in ~33 i/o cycles while the heavy client still makes progress.
I was assuming that a "deep" JSON (many nested objects) may have a different cost per output character than a wide but shallow one. But it's not a big deal.
To store the comments would mean storing them in the DOM, which would have a significant impact on performance as the space for everything is tightly controlled.
I don't see why. It's just an extra string associated with any value.

FWIW, I don't think it needs to *precisely* round-trip the comments -- it's ok to convert comments to /**/ style (that's mandatory if the output isn't pretty-printed) and to consolidate all comments associated with one value to a single comment that's always output after that value (or perhaps have 'before' and 'after' slots). This should be much more manageable.

As it stands, the library is completely useless for the case of writing a JSON config file intended for human use. It's ok if you don't want that to be a goal, but this should be made clear up front so that people don't waste time considering the library for use cases it's not intended for. (Though it would be preferable if you did support it.)
The basic_parser examples do not compile out of the box -- with the suggested json.hpp and src.hpp includes there are link errors with basic_parser symbols. Additionally including basic_parser.hpp resolves these but starts complaining about missing max_{key,string,array,object}_size members of the handler type, which are not documented. After these are supplied then it works as expected. I don't see why these should be needed, however.
Those max sizes should be documented in the handler exemplar:
https://master.json.cpp.al/json/ref/boost__json__basic_parser.html#json.ref....
The docs I am referring to are the ones referenced in the original pre-announcement post (although now I notice that the main announcement post has a different link): http://vinniefalco.github.io/doc/json/json/ref/boost__json__basic_parser.htm... This is also where the "using numbers" section is entirely blank.
The reason they are required is for optimization. Handling the limit in the basic_parser leads to performance gains.
I don't really understand why. basic_parser doesn't actually store any of the keys/values (especially as it doesn't merge value parts), so it should have no reason to impose any kind of limit. This just seems like it's doing an extra count/comparison that could be removed entirely, and removing it would naturally improve performance. Unless you're saying that moving the check to basic_parser improves the performance of parser, in which case: you're doing it wrong. Imposing restrictions on all users of basic_parser for something that only parser cares about is the wrong design choice.
On Tue, Sep 22, 2020 at 4:47 PM Gavin Lambert via Boost
I just don't see the point in having an interface that can return half of a value -- this just moves the problem of allocation and keeping track of the full value to the parser handler, which seems like an SRP violation.
It is the opposite of SRP violation. basic_parser is concerned only with parsing and not keeping track of the full value. The responsibility of tracking full values is moved to the caller. That way, basic_parser does not impose its opinions on how the full value should be stored. This was litigated in earlier posts if I recall (Peter?).
I don't feel it is ever sensible to call back with a partially-parsed value (unless I am misunderstanding what these are for, but there is little explanation in the docs, and I wasn't able to get them called at all in my test).
The interfaces are built up in layers. At the lowest layer you have basic_parser which concerns itself only with processing the input buffer and turning it into semantic components. It does _not_ have the responsibility of buffering input to present it as a larger component, that's the job of the handler. basic_parser is exposed to users so they can use their own schemes for dealing with the semantic components.

Above basic_parser, you have parser which uses value_stack. The value_stack does take responsibility for buffering input, and producing DOM values. As with basic_parser, both of these components are exposed to users so they can write their own layers on top.

On top of parser we have the parse() free functions, which address the most popular use-case of parsing. These functions will satisfy most needs. And for the needs that are not satisfied, users can easily implement their own parse functions with different behaviors, as the library exposes all the intermediate lower levels.
Right, so this size argument is actually the total number of characters, while the string_view::size is the number of characters in the current piece.
Which is a poor design choice, as I said.
No, the information is already there so we are just passing it to the handler which can use it, or not. If the handler doesn't use it, then it is optimized away.
parser.write(&data1[0], data1.size(), ec);
parser.write(&data2[0], data2.size(), ec);
parser.write(&data3[0], data3.size(), ec);
You have to call basic_parser::reset in order to start parsing a new "document."
So I reiterate that the number and string handling is inconsistent and these part methods are more confusing than helpful -- numbers are parsed into the complete value but strings are not, and neither give you the full text of the value (which may be important if the goal is to parse into an arbitrary-precision value type).
I do regard this as a very poor design choice, but not sufficiently so to reject the library.
No it is not a poor design choice, it is the correct design choice. If basic_parser were to buffer the input, then users who have no need of buffered input (such as value_stack) would be paying for it for nothing. This interface is driven first and foremost for performance and the incremental/streaming functionality. If callers want to see full text, they can append the partial strings into a data member until the final on_string or on_int64 / on_uint64 / on_double and then interact with it that way. But they are not obliged to do so.

Again I repeat, this is not a poor design choice. json::string for example is perfectly capable of being constructed from two separate buffers of characters (an implementation detail):

https://github.com/CPPAlliance/json/blob/6ddddfb16f9b54407d153abdda11a29f746...

If we were to follow your "correct design" to make basic_parser buffer the input, the implementation would be needlessly allocating memory and copying characters.
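A minimal sketch of the buffering approach described here, assuming the callback signatures from the handler exemplar in the documentation; the remaining callbacks and limit constants a real handler needs are elided:

    #include <boost/json.hpp>
    #include <string>

    namespace json = boost::json;

    struct collecting_handler
    {
        std::string buffer_;  // accumulates string fragments across input buffers

        bool on_string_part(json::string_view s, std::size_t, json::error_code&)
        {
            buffer_.append(s.data(), s.size());
            return true;
        }

        bool on_string(json::string_view s, std::size_t, json::error_code&)
        {
            buffer_.append(s.data(), s.size());
            // buffer_ now holds the complete string value; consume it here...
            buffer_.clear();
            return true;
        }

        // ... on_key*, on_number*, on_document_*, etc. and the max_*_size
        //     constants required by basic_parser are elided ...
    };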
I was assuming that a "deep" JSON (many nested objects) may have a different cost per output character than a wide but shallow one. But it's not a big deal.
Yes, it isn't perfect of course but over time and with a reasonable distribution of possible JSON inputs it averages out to be good enough. Certainly better than not having streaming at all (which would be terrible).
To store the comments would mean storing them in the DOM, which would have a significant impact on performance as the space for everything is tightly controlled.
I don't see why. It's just an extra string associated with any value.
Yes exactly. sizeof(value)==16 on 32-bit platforms, and sizeof(value)==24 on 64-bit platforms. When arrays and objects are constructed during parsing, the implementation performs a memcpy after allocation. Every additional byte in sizeof(value) slows down the creation of arrays and objects. Adding a pointer to value would cost 25% on 32-bit and 33% on 64-bit, and make Boost.JSON unable to ever beat RapidJSON in performance.
This just seems like it's doing an extra count/comparison that could be removed entirely, and removing it would naturally improve performance.
If you don't want a limit then just set the max to `size_t(-1)` and the compiler will elide the check completely.
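For concreteness, a sketch of a handler that opts out of the limits this way, using the constant names from the discussion above (the parsing callbacks are elided):

    #include <cstddef>

    struct unlimited_handler
    {
        static constexpr std::size_t max_object_size = std::size_t(-1);
        static constexpr std::size_t max_array_size  = std::size_t(-1);
        static constexpr std::size_t max_key_size    = std::size_t(-1);
        static constexpr std::size_t max_string_size = std::size_t(-1);

        // ... parsing callbacks as in the basic_parser handler exemplar ...
    };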
Unless you're saying that moving the check to basic_parser improves the performance of parser, in which case: you're doing it wrong. Imposing restrictions on all users of basic_parser for something that only parser cares about is the wrong design choice.
The design of these components balances usability with performance. If you don't care about limits you can just set the max to size_t(-1).

And I think this is the right choice. The library makes common things easy, e.g. call parse(). If you need more complexity, you instantiate a `parser` whose API requires a few more steps. If you really want to go low level, you use `basic_parser` which being low level can seem esoteric.

And I think this is okay, because the percentage of users who will use basic_parser is incredibly tiny. The percentage of `parser` users is much greater. While the percentage of users of the parse() free function will be closer to 100% than anything else. In other words the library favors performance and utility at the lowest levels over theoretical interface correctness, where the number of users is naturally smaller (and their domain-expertise naturally higher).

Regards
On 23/09/2020 12:52, Vinnie Falco wrote:
It is the opposite of SRP violation. basic_parser is concerned only with parsing and not keeping track of the full value. The responsibility of tracking full values is moved to the caller. That way, basic_parser does not impose its opinions on how the full value should be stored. This was litigated in earlier posts if I recall (Peter?).
Except it does, because it parses into int/uint/double, which is inherently a storage decision. You might have a valid argument if it left that step to the handler to determine as well. (i.e. only providing on_number_part and on_number, just the text without any attempt to parse into a numeric type.) Making that change would probably make the arbitrary-precision-parsing people happier as well.
The interfaces are built up in layers. At the lowest layer you have basic_parser which concerns itself only with processing the input buffer and turning it into semantic components. It does _not_ have the responsibility of buffering input to present it as a larger component, that's the job of the handler. basic_parser is exposed to users so they can use their own schemes for dealing with the semantic components.
Above basic_parser, you have parser which uses value_stack. The value_stack does take responsibility for buffering input, and producing DOM values. As with basic_parser, both of these components are exposed to users so they can write their own layers on top.
My point is that this seems like the wrong choice of layers. As it stands, consumers should never care about the basic_parser layer, and instead will actually want a different layer (not currently exposed) that is between the two, which exposes complete values that are not in any DOM. The current form of basic_parser could still exist, if you like, but it should be made an implementation detail and the complete-value basic_parser actually exposed to consumers.
parser.write(&data1[0], data1.size(), ec);
parser.write(&data2[0], data2.size(), ec);
parser.write(&data3[0], data3.size(), ec);
You have to call basic_parser::reset in order to start parsing a new "document."
This is not parsing new documents, but new document fragments. As such it must not reset.
No it is not a poor design choice, it is the correct design choice. If basic_parser were to buffer the input, then users who have no need of buffered input (such as value_stack) would be paying for it for nothing. This interface is driven first and foremost for performance and the incremental/streaming functionality. If callers want to see full text, they can append the partial strings into a data member until the final on_string or on_int64 / on_uint64 / on_double and then interact with it that way. But they are not obliged to do so.
There should never be any such consumers. Partial values are not useful to anyone. If they exist in the library, that is because things are structured improperly.
If we were to follow your "correct design" to make basic_parser buffer the input, the implementation would be needlessly allocating memory and copying characters.
It has to happen somewhere, regardless. Why are you pushing that responsibility onto the consumer?
And I think this is the right choice. The library makes common things easy, e.g. call parse(). If you need more complexity, you instantiate a `parser` whose API requires a few more steps. If you really want to go low level, you use `basic_parser` which being low level can seem esoteric.
And I think this is okay, because the percentage of users who will use basic_parser is incredibly tiny. The percentage of `parser` users is much greater. While the percentage of users of the parse() free function will be closer to 100% than anything else. In other words the library favors performance and utility at the lowest levels over theoretical interface correctness, where the number of users is naturally smaller (and their domain-expertise naturally higher).
So why provide basic_parser at all, if its interface is deliberately incorrect?
Mere moments ago, quoth I:
On 23/09/2020 12:52, Vinnie Falco wrote:
It is the opposite of SRP violation. basic_parser is concerned only with parsing and not keeping track of the full value. The responsibility of tracking full values is moved to the caller. That way, basic_parser does not impose its opinions on how the full value should be stored. This was litigated in earlier posts if I recall (Peter?).
Except it does, because it parses into int/uint/double, which is inherently a storage decision. You might have a valid argument if it left that step to the handler to determine as well. (i.e. only providing on_number_part and on_number, just the text without any attempt to parse into a numeric type.)
It's not just a storage *decision*; in order to implement this interface with the observed behaviour it also means that basic_parser must currently actually have *storage* for partial numbers updated across multiple fragments. This is inconsistent with how it treats keys and strings. Hence I reiterate: it would be more consistent to delete those methods and only have on_number_part + on_number that provide just the text. (Make it parser's responsibility to do the actual text-to-numeric-type parsing, ideally via a helper class that user code can call too. Since it's the layer that's dictating the storage format via json::value.)
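To illustrate the kind of helper being suggested, a purely hypothetical accumulator that collects the raw number text and converts it in one place with std::from_chars; this is not part of Boost.JSON:

    #include <charconv>
    #include <cstdint>
    #include <string>
    #include <string_view>
    #include <system_error>

    // Hypothetical helper: user code (or parser) feeds it the text from
    // on_number_part/on_number and converts the complete token afterwards.
    struct number_text
    {
        std::string text;  // accumulated across on_number_part callbacks

        void append(std::string_view part) { text.append(part.data(), part.size()); }

        bool to_int64(std::int64_t& out) const
        {
            auto [ptr, ec] = std::from_chars(text.data(), text.data() + text.size(), out);
            return ec == std::errc() && ptr == text.data() + text.size();
        }

        bool to_double(double& out) const
        {
            // std::from_chars for double requires a recent standard library;
            // strtod would be the portable fallback.
            auto [ptr, ec] = std::from_chars(text.data(), text.data() + text.size(), out);
            return ec == std::errc() && ptr == text.data() + text.size();
        }
    };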
The current form of basic_parser could still exist, if you like, but it should be made an implementation detail and the complete-value basic_parser actually exposed to consumers.
I still think that it would be better to instead provide an interface that precomposes the text fragments, but that is a separate argument from the prior one, that can be treated separately. Perhaps you could provide both -- basic_parser as-is (but with the number parsing/storage removed), and another layer on top (custom_parser, perhaps) that is nearly identical but omits the _part methods to only provide precomposed keys/values. This is the one that most consumers of a SAX-style interface would be expected to actually use.
On Wednesday, 23 September 2020 at 14:08 +1200, Gavin Lambert via Boost wrote:
Mere moments ago, quoth I:
On 23/09/2020 12:52, Vinnie Falco wrote:
It is the opposite of SRP violation. basic_parser is concerned only with parsing and not keeping track of the full value. The responsibility of tracking full values is moved to the caller. That way, basic_parser does not impose its opinions on how the full value should be stored. This was litigated in earlier posts if I recall (Peter?).
Except it does, because it parses into int/uint/double, which is inherently a storage decision. You might have a valid argument if it left that step to the handler to determine as well. (i.e. only providing on_number_part and on_number, just the text without any attempt to parse into a numeric type.)
It's not just a storage *decision*; in order to implement this interface with the observed behaviour it also means that basic_parser must currently actually have *storage* for partial numbers updated across multiple fragments. This is inconsistent with how it treats keys and strings.
The base inconsistency is the fact that storing a number, however large it is, takes a fixed amount of storage (talking about doubles or longs, which are the only types handled by Boost.JSON), whereas for a string that is not the case. A non-allocating parser can safely handle integers and doubles itself (and it can do it in a more efficient way than if it's done in an upper layer), but has no other choice than splitting strings (unless there is an upper limit on string size).
I still think that it would be better to instead provide an interface that precomposes the text fragments, but that is a separate argument from the prior one, that can be treated separately.
Perhaps you could provide both -- basic_parser as-is (but with the number parsing/storage removed), and another layer on top (custom_parser, perhaps) that is nearly identical but omits the _part methods to only provide precomposed keys/values. This is the one that most consumers of a SAX-style interface would be expected to actually use.
From what I've seen, such an interface would be quite easy to implement. That would defeat the purpose of the non-allocating parser, though, so I'm not sure it would be that useful.
Regards, Julien
On 23/09/2020 18:31, Julien Blanc wrote:
The base inconsistency is the fact that storing a number, however large it is, takes a fixed amount of storage (talking about doubles or longs, which are the only types handled by Boost.JSON), whereas for a string that is not the case. A non-allocating parser can safely handle integers and doubles itself (and it can do it in a more efficient way than if it's done in an upper layer), but has no other choice than splitting strings (unless there is an upper limit on string size).
I'm aware of that (although it's not entirely true). My point is that it's the wrong thing to do, especially given that JSON technically allows for arbitrarily large numbers (even though it is typically more limited in practice). This could be avoided if basic_parser didn't try to dictate the number storage format and treated these the same as strings. That would allow both for arbitrary length/precision numbers to be handled as well as embedded clients that want to use 16-bit integer storage rather than 64-bit, for example. And it'd be more consistent with how keys and strings are handled.

Let the class that actually does want to dictate the storage format (i.e. json::parser, which dictates use of json::value) do so, and keep that away from json::basic_parser. It just makes sense. And if anything, it should improve performance (by not doing it when not needed); I don't see why you think it'd be less efficient in an upper layer.
On 2020-09-23 10:17, Gavin Lambert via Boost wrote:
On 23/09/2020 18:31, Julien Blanc wrote:
And if anything, it should improve performance (by not doing it when not needed); I don't see why you think it'd be less efficient in an upper layer.
The number would need to be parsed twice: first in the JSON parser, to check that it is a valid number, and then in the handler, to produce the actual number. Given that some JSON parsing libraries implement their own number parsing to avoid this double parsing, constructing the value on the fly, I'm not alone in thinking it makes a difference (from what I've seen, the Boost.JSON authors are also in that boat, which makes their interface choice a logical one). Regards, Julien
On 23/09/2020 13:36, I wrote:
On 23/09/2020 12:52, Vinnie Falco wrote:
The interfaces are built up in layers. At the lowest layer you have basic_parser which concerns itself only with processing the input buffer and turning it into semantic components. It does _not_ have the responsibility of buffering input to present it as a larger component, that's the job of the handler. basic_parser is exposed to users so they can use their own schemes for dealing with the semantic components.
Above basic_parser, you have parser which uses value_stack. The value_stack does take responsibility for buffering input, and producing DOM values. As with basic_parser, both of these components are exposed to users so they can write their own layers on top.
My point is that this seems like the wrong choice of layers. As it stands, consumers should never care about the basic_parser layer, and instead will actually want a different layer (not currently exposed) that is between the two, which exposes complete values that are not in any DOM.
The current form of basic_parser could still exist, if you like, but it should be made an implementation detail and the complete-value basic_parser actually exposed to consumers.
To restate this in a clearer way: it is still my belief that aspects of json::parser performance considerations have "polluted" the design of json::basic_parser to the point where it is not actually a meaningful interface.

As such I strongly recommend that json::basic_parser be turned into an implementation detail and not part of the public-facing part of the library. The library should only be used via its DOM interface (json::parser/json::value), as that is the only one that is actually properly baked.

I don't regard this as a reason to change my vote to accept the library (and it's too late anyway), as the DOM interface by itself is highly useful. I just think that the basic_parser interface should be treated as out of scope.
Vinnie Falco wrote:
On Tue, Sep 29, 2020 at 6:27 PM Gavin Lambert via Boost wrote:
As such I strongly recommend that json::basic_parser be turned into an implementation detail and not part of the public-facing part of the library.
Does anyone else agree?
I don't. If you don't like basic_parser, don't use it. It's not clear to me what anyone would gain from preventing others from using it.
On Wed, Sep 30, 2020 at 8:30 AM Peter Dimov
I don't. If you don't like basic_parser, don't use it. It's not clear to me what anyone would gain from preventing others from using it.
If Boost.JSON is accepted and we are targeting December ship date, I am weakly in favor of making basic_parser private for one release so we can work out more details. We might need to tune the API to achieve the space savings we want. Thanks
On Wed, 30 Sep 2020 at 17:39, Vinnie Falco via Boost
On Wed, Sep 30, 2020 at 8:30 AM Peter Dimov wrote:
I don't. If you don't like basic_parser, don't use it. It's not clear to me what anyone would gain from preventing others from using it.
If Boost.JSON is accepted and we are targeting December ship date, I am weakly in favor of making basic_parser private for one release so we can work out more details. We might need to tune the API to achieve the space savings we want.
I support this decision. Good interfaces are an important part of Boost, so it may be better to wait before exposing it. Also, you can call it an implementation detail but nonetheless put it in a header accessible to users. The eager ones can still use it, even if it is not part of the official interface. Regards, &rzej;
On Wed, 30 Sep 2020 at 17:15, Vinnie Falco via Boost
On Tue, Sep 29, 2020 at 6:27 PM Gavin Lambert via Boost wrote:
As such I strongly recommend that json::basic_parser be turned into an implementation detail and not part of the public-facing part of the library.
Does anyone else agree?
Emphatically not. Without basic_parser it would not be possible to parse directly to struct and avoid building a redundant DOM.
Thanks
--
Richard Hodges hodges.r@gmail.com office: +442032898513 home: +376841522 mobile: +376380212
On Wed, 23 Sep 2020 at 02:53, Vinnie Falco via Boost
On Tue, Sep 22, 2020 at 4:47 PM Gavin Lambert via Boost wrote:
I just don't see the point in having an interface that can return half of a value -- this just moves the problem of allocation and keeping track of the full value to the parser handler, which seems like an SRP violation.
It is the opposite of SRP violation. basic_parser is concerned only with parsing and not keeping track of the full value. The responsibility of tracking full values is moved to the caller. That way, basic_parser does not impose its opinions on how the full value should be stored. This was litigated in earlier posts if I recall (Peter?).
I don't feel it is ever sensible to call back with a partially-parsed value (unless I am misunderstanding what these are for, but there is little explanation in the docs, and I wasn't able to get them called at all in my test).
The interfaces are built up in layers. At the lowest layer you have basic_parser which concerns itself only with processing the input buffer and turning it into semantic components. It does _not_ have the responsibility of buffering input to present it as a larger component, that's the job of the handler. basic_parser is exposed to users so they can use their own schemes for dealing with the semantic components.
Above basic_parser, you have parser which uses value_stack. The value_stack does take responsibility for buffering input, and producing DOM values. As with basic_parser, both of these components are exposed to users so they can write their own layers on top.
On top of parser we have the parse() free functions, which address the most popular use-case of parsing. These functions will satisfy most needs. And for the needs that are not satisfied, users can easily implement their own parse functions with different behaviors, as the library exposes all the intermediate lower levels.
Maybe this explanation is worth putting in the documentation. Regards, &rzej;
On Tue, Sep 22, 2020 at 7:48 PM Gavin Lambert via Boost <boost@lists.boost.org> wrote:
On 23/09/2020 05:55, Vinnie Falco wrote:
I also don't see the point of supplying a size argument to on_key* and on_string, as this can be read from the string_view argument instead.
Right, so this size argument is actually the total number of characters, while the string_view::size is the number of characters in the current piece.
Which is a poor design choice, as I said.
How? The goal is to provide as much information as possible to the handler. Forcing users to track it themselves would introduce redundancy.
I don't really understand why. basic_parser doesn't actually store any of the keys/values (especially as it doesn't merge value parts), so it should have no reason to impose any kind of limit. This just seems like it's doing an extra count/comparison that could be removed entirely, and removing it would naturally improve performance.
basic_parser only checks the limit; it does not enforce it. The handler is responsible for determining what that limit is. I suppose that we could add a little TMP that elides the check statically, but I don't see too much of a point in doing so.
On 23/09/2020 15:45, Krystian Stasiowski wrote:
On 23/09/2020 05:55, Vinnie Falco wrote:
I also don't see the point of supplying a size argument to on_key* and on_string, as this can be read from the string_view argument instead.
Right, so this size argument is actually the total number of characters, while the string_view::size is the number of characters in the current piece.
Which is a poor design choice, as I said.
How? The goal is to provide as much information as possible to the handler. Forcing users to track it themselves would introduce redundancy.
That was in the context of stating that requiring users of basic_parser to self-assemble multiple fragments of text was a poor design choice, vs. handing them the complete preassembled value text.

I've walked back on that one a little bit; I now think that it's ok for basic_parser to do that (to retain a zero-allocation guarantee) but in that case it should treat numbers the same way (which it currently doesn't), and ideally there should be a standard slightly-higher-level parser that has a similar interface but does preassemble the values, eliminating the _part callbacks (for people who prefer ease of use over raw performance).
I don't really understand why. basic_parser doesn't actually store any of the keys/values (especially as it doesn't merge value parts), so it should have no reason to impose any kind of limit. This just seems like it's doing an extra count/comparison that could be removed entirely, and removing it would naturally improve performance.
basic_parser only checks the limit; it does not enforce it. The handler is responsible for determining what that limit is. I suppose that we could add a little TMP that elides the check statically, but I don't see too much of a point in doing so.
I don't see the point in checking the limit in the first place. Either the handler cares, in which case it can easily count calls to value callbacks within an on_array_begin/on_array_end block itself (or just check the count at the end), or it doesn't care, in which case it's a waste of time for basic_parser to check.

It also assumes that all parts of the object tree have the same limits, which seems odd (in the case where something does care).

I guess the main issue is that nowhere in the docs can I find any justification for their existence or why you might ever want to set them. Overall, basic_parser itself feels more like an interface that fell out of json::parser rather than one that was deliberately designed on its own.
On Wed, Sep 23, 2020 at 1:40 AM Gavin Lambert via Boost
I've walked back on that one a little bit; I now think that it's ok for basic_parser to do that (to retain a zero-allocation guarantee) but in that case it should treat numbers the same way (which it currently doesn't), and ideally there should be a standard slightly-higher-level parser that has a similar interface but does preassemble the values, eliminating the _part callbacks (for people who prefer ease of use over raw performance).
I had that originally, it was called number_parser, it was a public interface, and it was one of the first things to go during our 3-month journey of optimization. You can fish it out of the commit log if you care to see it. Thanks
Vinnie Falco wrote:
I had that originally, it was called number_parser, it was a public interface, and it was one of the first things to go during our 3-month journey of optimization.
I suspect that something similar would have happened to the hypothetical pull parser lowest level, if there were one. It seems to me that the pull parser would be very similar to the current parser, except it would need to suspend on each token, instead of on each buffer boundary. This probably has a performance cost.
On Sep 22, 2020, at 4:12 AM, Gavin Lambert via Boost wrote:
It seems less suited for JSON as configuration file -- while it can parse such files (via common extensions such as ignoring comments and trailing commas) it lacks means to pretty-print and to round-trip the comments, which would be essential if writing back a config file intended for human use. However while supporting this use case would be nice, I don't consider it mandatory.
Out of curiosity, why would you want to round-trip the comments from a config file? Wouldn’t you want to write the comments anew each time you write the file, in case your code version had newer or corrected info? (I’m assuming the comments are like “// the following setting enables self-destruct (default=false)”)

I would think you’d want such “comments” to come from a schema, not the DOM. (Of course that would require schema-based serialization, or some form of serialization control, but that’s a separate topic)

Or do you want the human to be able to change comments and/or write their own, that get saved back? We have something like that at my job, but it’s a first-class/normal citizen in the DOM, using normal json-object string field entries named “description”. That way users can set descriptions for config objects regardless whether they edit config files by hand, or use a GUI, or whatever.

-hadriel
On 23/09/2020 13:56, Hadriel Kaplan wrote:
Out of curiosity, why would you want to round-trip the comments from a config file?
Wouldn’t you want to write the comments anew each time you write the file, in case your code version had newer or corrected info? (I’m assuming the comments are like “// the following setting enables self-destruct (default=false)”)
I would think you’d want such “comments” to come from a schema, not the DOM. (Of course that would require schema-based serialization, or some form of serialization control, but that’s a separate topic)
No, I wasn't thinking of comments as software-provided documentation. But that could still work if the serialiser set the comment before writing the file. Though I guess it'd have to be more careful if there were human-specified comments as well.
Or do you want the human to be able to change comments and/or write their own, that get saved back? We have something like that at my job, but it’s a first-class/normal citizen in the DOM, using normal json-object string field entries named “description”. That way users can set descriptions for config objects regardless whether they edit config files by hand, or use a GUI, or whatever.
Yes, I was envisaging the human to be the one creating the comments, either as notes for themselves, or more commonly commenting out a particular setting because they don't want it to take effect but don't want to actually delete it. You can work around these things in strict JSON by altering the key name (or using a reserved key name, as you suggested), and that's fine, provided that it round-trips unknown keys. But it's more awkward than comments.
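As a purely illustrative sketch of that reserved-key workaround (the "//" and "#" key conventions are made up, not anything the library defines):

    #include <boost/json.hpp>

    namespace json = boost::json;

    json::object make_config()
    {
        json::object cfg;
        cfg["//"]            = "Self-destruct is disabled by default.";  // human-written note
        cfg["self_destruct"] = false;
        cfg["#timeout_ms"]   = 5000;  // "commented out": prefixed key the application ignores
        return cfg;
    }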
participants (9):
- Andrzej Krzemienski
- Gavin Lambert
- Hadriel Kaplan
- Julien Blanc
- Krystian Stasiowski
- Peter Dimov
- Richard Hodges
- Vinnie Falco
- Vinícius dos Santos Oliveira