New JSON library from the maker of Beast!
I've been working on a massively-multiplayer online blackjack casino server, called Beast Lounge [1]. The server and client communicate using JSON-RPC over WebSocket. I have developed a brand-new JSON library for this project, in accordance with the following design goals:

* Robust support for custom allocators throughout.
* Array and object interfaces closely track their corresponding C++20 container equivalents.
* Use `std::basic_string` for strings.
* Minimize use of templates for reduced compilation times.
* Parsers and serializers work incrementally ("online algorithms").
* Elements in objects may also be iterated in insertion order.

You can see the JSON library in development here: https://github.com/vinniefalco/json

Is there any interest in proposing this for Boost? I'm happy to hear feedback or answer questions about this library. Feel free to open an issue on the repository, or reply here.

Thanks

[1] https://github.com/vinniefalco/BeastLounge
Hi Boost, it looks great indeed. A few questions:
- can you provide a point-by-point comparison with this other great library: https://github.com/nlohmann/json?
- would it be able to interact with property_tree too?
Thanks, David

On Mon, Sep 23, 2019 at 4:06 AM Vinnie Falco via Boost <boost@lists.boost.org> wrote:
I've been working on a massively-multiplayer online blackjack casino server, called Beast Lounge [1]. The server and client communicate using JSON-RPC over WebSocket. I have developed a brand-new JSON library for this project, in accordance with the following design goals:
* Robust support for custom allocators throughout.
* Array and object interfaces closely track their corresponding C++20 container equivalents.
* Use `std::basic_string` for strings.
* Minimize use of templates for reduced compilation times.
* Parsers and serializers work incrementally ("online algorithms").
* Elements in objects may also be iterated in insertion order.
You can see the JSON library in development here:
https://github.com/vinniefalco/json
Is there any interest in proposing this for Boost?
I'm happy to hear feedback or answer questions about this library. Feel free to open an issue on the repository, or reply here.
Thanks
[1] https://github.com/vinniefalco/BeastLounge
On Sun, Sep 22, 2019 at 5:46 PM David Bellot via Boost
- can you provide a point-by-point comparison with this other great library: https://github.com/nlohmann/json?
Features in nlohmann but not planned for Boost.JSON:
* JSON Pointer
* JSON Patch
* Specializing enum conversion
* Binary formats (BSON, CBOR, MessagePack, and UBJSON)
* Output format control
* A bunch of syntactic sugar

Features in Boost.JSON but not in nlohmann:

* Full allocator support. To my knowledge, nlohmann JSON only supports allocators which are DefaultConstructible. Boost.JSON gives full control over allocation and uses a type-erased, reference-counted allocator almost identical to the polymorphic allocators in the standard. The library enforces a simple invariant: all elements which are part of the same JSON "document" (i.e. all share a common ancestor) are guaranteed to use the same allocator. This is enforced at runtime: when an element which uses a different allocator is moved into a container, it is instead copied to use the same allocator as the container (and thus the enclosing JSON document).

* Object elements preserve insertion order: iterating a JSON value of object type visits the elements in the order they were inserted. https://github.com/vinniefalco/json/blob/25ddea5f4d088f3e56911bf9f97549eeb93...

* Insertion order control, e.g. https://github.com/vinniefalco/json/blob/25ddea5f4d088f3e56911bf9f97549eeb93...

* C++20-conforming container APIs, including node_type: remove values from an object and reinsert them elsewhere, or let them live outside the container: https://github.com/vinniefalco/json/blob/25ddea5f4d088f3e56911bf9f97549eeb93...

* Elements are implemented using non-template classes, so most of the function definitions may be compiled and linked through a static or shared library. A header-only option is also available (via a configuration macro).

* The parser and serializer are designed to work incrementally. However, the parser interface allows the caller to inform the parser when the entirety (or remainder) of the JSON document is present. This enables the parser to use a different algorithm when it is known ahead of time that the entire document is available (currently, the implementation does not take advantage of this feature, but it is possible in the future).

* The parser uses error codes. The implementation does not allow exceptions to be raised from untrusted inputs.

* User-defined types can be exchanged to and from JSON value objects in three ways:
  - by writing a free function which is found via ADL for the type (a rough sketch of this approach follows below),
  - by adding member functions with conforming signatures for class types, or
  - by specializing a class template for the type.
  Here is an example of specialization to allow a Boost.Asio IP address to be constructed from a JSON value: https://github.com/vinniefalco/BeastLounge/blob/e4b085d3523047ffe8fa0c910592...

There is still some implementation work left, so the library should be considered early beta software. BeastLounge runs on it; you can see example code that uses it here: https://github.com/vinniefalco/BeastLounge/blob/e4b085d3523047ffe8fa0c910592...

With respect to nlohmann, the initial version 1.0 of Boost.JSON will certainly not be as feature rich. Instead, I plan to focus on the areas where I think it can offer things lacking in other JSON libraries, and do so with the high standards the community has come to expect from Boost libraries.
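As a rough sketch of the first mechanism (an ADL free function): the function names and the json::value/json::object accessors shown here are assumptions for illustration, not the library's settled interface.

    // Hypothetical user-defined type
    struct point
    {
        double x;
        double y;
    };

    // Free functions found via ADL for the type (names are illustrative).
    json::value to_json(point const& p)
    {
        json::object obj;            // json::object is assumed here
        obj["x"] = p.x;              // assumes map-style access on the object type
        obj["y"] = p.y;
        return json::value(obj);
    }

    point from_json(json::value const& jv)
    {
        // as_object(), at(), and as_double() are assumed accessor names
        auto const& obj = jv.as_object();
        return point{ obj.at("x").as_double(), obj.at("y").as_double() };
    }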
- would it be able to interact with property_tree too?
I don't plan on introducing a dependency on Boost.PropertyTree. The parser is structured the same way as the Beast HTTP parser. The algorithm is encoded into an abstract base class which is subclassed to output into the JSON container. If desired, you can subclass the parser yourself to store the results in a property tree: https://github.com/vinniefalco/json/blob/25ddea5f4d088f3e56911bf9f97549eeb93...
On Sun, Sep 22, 2019 at 12:06 PM Vinnie Falco via Boost < boost@lists.boost.org> wrote:
I've been working on a massively-multiplayer online blackjack casino server, called Beast Lounge [1]. The server and client communicate using JSON-RPC over WebSocket. I have developed a brand-new JSON library for this project, in accordance with the following design goals:
* Robust support for custom allocators throughout.
* Array and object interfaces closely track their corresponding C++20 container equivalents.
* Use `std::basic_string` for strings.
* Minimize use of templates for reduced compilation times.
* Parsers and serializers work incrementally ("online algorithms").
* Elements in objects may also be iterated in insertion order.
You can see the JSON library in development here:
https://github.com/vinniefalco/json
Is there any interest in proposing this for Boost?
I'm very interested; except...

I'm happy to hear feedback or answer questions about this library.
Feel free to open an issue on the repository, or reply here.
Why does it need to depend on Beast? That's not the kind of dependency I'd expect from a JSON library.

--
Rene Rivera
Grafik - Don't Assume Anything
Robot Dreams - http://robot-dreams.net
On Sun, Sep 22, 2019 at 6:12 PM Rene Rivera
Why does it need to depend on Beast? That's not the kind of dependency I'd expect from a JSON library.
There are four Beast includes left over from the initial development; they will eventually be removed so that the library depends only on basic Boost facilities (Align, Assert, Config, Container, Utility):
#include
On Sun, Sep 22, 2019 at 8:06 PM Vinnie Falco via Boost < boost@lists.boost.org> wrote:
[...] I have developed a brand-new JSON library for this project,
in accordance with the following design goals:
* Robust support for custom allocators throughout.
* Array and object interfaces closely track their corresponding C++20 container equivalents.
* Use `std::basic_string` for strings.
* Minimize use of templates for reduced compilation times.
* Parsers and serializers work incrementally ("online algorithms").
* Elements in objects may also be iterated in insertion order.
Hi,

What about performance? Have you heard of https://github.com/lemire/simdjson? Where would your library fall in that benchmark at the above link? The already mentioned and popular nlohmann/json has a convenient API, but fares poorly in that benchmark, for example.

Also, in client/server communications it's less often a few huge JSON documents and more often lots of small documents, so the constant "startup" time of the parser matters too. In the same vein, a PULL parser that allows building the native data structures directly, rather than the DOM-like approach of fully converting the document to a built-in JSON object and then converting that to the native structure, avoids the temporary document, which is especially useful for large documents.

Finally, there are many corner cases in JSON parsing. Is there the equivalent of Autobahn for WebSocket, but for JSON parsing? Any plans to integrate with such infrastructure, assuming there's one?

Thanks --DD
On Mon, Sep 23, 2019 at 5:19 AM Dominique Devienne via Boost
Hi,
What about performance? Have you heard of https://github.com/lemire/simdjson? Where would your library fall in that benchmark at the above link? Finally, there are many corner cases in JSON parsing. Is there the equivalent of Autobahn for WebSocket, but for JSON parsing? Any plans to integrate with such infrastructure, assuming there's one?
There's also https://github.com/miloyip/nativejson-benchmark which benchmarks both performance and conformance.
The already mentioned and popular nlohmann/json has a convenient API, but fares poorly in that benchmark, for example.
Also, in client/server communications it's less often a few huge JSON documents and more often lots of small documents, so the constant "startup" time of the parser matters too. In the same vein, a PULL parser that allows building the native data structures directly, rather than the DOM-like approach of fully converting the document to a built-in JSON object and then converting that to the native structure, avoids the temporary document, which is especially useful for large documents.
It looks like nlohmann/json now has this kind of API too: 1. https://tinyurl.com/nl-json-parse 2. https://tinyurl.com/nl-json-parse-callback Glen
On Mon, Sep 23, 2019 at 6:44 AM Glen Fernandes wrote:
On Mon, Sep 23, 2019 at 5:19 AM Dominique Devienne via Boost
Also, in client/server communications it's less often a few huge JSON documents and more often lots of small documents, so the constant "startup" time of the parser matters too. In the same vein, a PULL parser that allows building the native data structures directly, rather than the DOM-like approach of fully converting the document to a built-in JSON object and then converting that to the native structure, avoids the temporary document, which is especially useful for large documents.
It looks like nlohmann/json now has this kind of API too: 1. https://tinyurl.com/nl-json-parse 2. https://tinyurl.com/nl-json-parse-callback
That is the use case that I find more interesting: decoupling the data representation in your program from the serialization format. e.g. in memory your data structure might be something like: vector
On 23. Sep 2019, at 13:06, Glen Fernandes via Boost
wrote: On Mon, Sep 23, 2019 at 6:44 AM Glen Fernandes wrote:
On Mon, Sep 23, 2019 at 5:19 AM Dominique Devienne via Boost
Also, in client/server communications it's less often a few huge JSON documents and more often lots of small documents, so the constant "startup" time of the parser matters too. In the same vein, a PULL parser that allows building the native data structures directly, rather than the DOM-like approach of fully converting the document to a built-in JSON object and then converting that to the native structure, avoids the temporary document, which is especially useful for large documents.
It looks like nlohmann/json now has this kind of API too: 1. https://tinyurl.com/nl-json-parse 2. https://tinyurl.com/nl-json-parse-callback
That is the use case that I find more interesting: decoupling the data representation in your program from the serialization format.
e.g. in memory your data structure might be something like: vector
This could be expressed in json by something like: [ { "key": [1, 0.01], "def": [9, 0.37], ... }, { "xyz": [5, 1.25], "abc": [2, 4.68], ... }, ... ]
You load it from some JSON file. You don't want to store/use it in your program as a SomeLibrary::JsonArray. Past the point of serialization it should be your own data structures. Of course you could always convert it from SomeLibrary::JsonArray to your own structures, but that's overhead you don't need if there's hundreds of megabytes worth of content in that data.
Just throwing our library in the ring: https://github.com/taocpp/json

We have a SAX-style interface at the core; parsers only generate events, and from there you can generate a DOM object or some direct UDT structure like mentioned above (uses some macros to register types). Or you can directly pretty-print it or convert to CBOR, MsgPack, etc. You could even generate an nlohmann DOM value with our parsers. Benchmarks in the nativejson-benchmark showed that we were faster than nlohmann's built-in parser (that was some time ago, might have changed in the meantime). Currently, our parser(s) do not support incremental parsing, though.

Daniel
Glen
On Mon, Sep 23, 2019 at 7:15 AM Daniel Frey wrote:
Just throwing our library in the ring: https://github.com/taocpp/json
We have a SAX-style interface at the core, parsers only generate events, from there you can generate a DOM object or some direct UDT structure like mentioned above. (Uses some macros to register types). Or you directly pretty print it or convert to CBOR, MsgPack, etc.
You could even generate an nlohmann-DOM-value with our parsers. Benchmarks in the nativejson-benchmark showed that we were faster than nlohmann's builtin parser (that was some time ago, might have changed in the meantime)
Looks interesting. Missing a Boost license though. :) Glen
On 23. Sep 2019, at 13:27, Glen Fernandes
wrote: On Mon, Sep 23, 2019 at 7:15 AM Daniel Frey wrote:
Just throwing our library in the ring: https://github.com/taocpp/json
We have a SAX-style interface at the core, parsers only generate events, from there you can generate a DOM object or some direct UDT structure like mentioned above. (Uses some macros to register types). Or you directly pretty print it or convert to CBOR, MsgPack, etc.
You could even generate an nlohmann-DOM-value with our parsers. Benchmarks in the nativejson-benchmark showed that we were faster than nlohmann's builtin parser (that was some time ago, might have changed in the meantime)
Looks interesting. Missing a Boost license though. :)
Thanks, and making it dual-licensed is certainly not a problem if that's an actual problem. 😊
Glen
On Mon, Sep 23, 2019 at 4:06 AM Glen Fernandes via Boost
You don't want to store/use it in your program as a SomeLibrary::JsonArray. Past the point of serialization it should be your own data structures.
This is accomplished by making your own class derived from `json::basic_parser`, and implementing the abstract virtual "event" members: https://github.com/vinniefalco/json/blob/25ddea5f4d088f3e56911bf9f97549eeb93... Thanks
Hi Vinnie! I know that it is not standard JSON, but do you plan to support comments (// ..., /* ... */)? It would be useful for storing configuration with comments, for example. Thanks, Jarda

On Mon, Sep 23, 2019 at 15:36, Vinnie Falco via Boost <boost@lists.boost.org> wrote:
On Mon, Sep 23, 2019 at 4:06 AM Glen Fernandes via Boost
wrote: You don't want to store/use it in your program as a SomeLibrary::JsonArray. Past the point of serialization it should be your own data structures.
This is accomplished by making your own class derived from `json::basic_parser`, and implementing the abstract virtual "event" members:
https://github.com/vinniefalco/json/blob/25ddea5f4d088f3e56911bf9f97549eeb93...
Thanks
On Mon, Sep 23, 2019 at 7:51 AM JF via Boost
I know that it is not standard JSON, but do you plan to support comments (// ..., /* ... */)? It would be useful for storing configuration
I used the same approach with the JSON parser that I used with Beast's HTTP parser. That is, a strict parser which strives to adhere to the letter of the spec. This favors the networking use-case (since parser inputs are often from untrusted sources) and disadvantages the configuration-file case. However, there are already lots of nice libraries (including JSON ones) for processing configuration files, but no JSON libraries that are specifically tuned for networking (allocator and incremental parsing features). So comments are not supported, and I wasn't thinking about adding them, although alternate parser implementations could be something I explore in the future.
On Mon, Sep 23, 2019 at 10:39 AM Vinnie Falco via Boost
On Mon, Sep 23, 2019 at 7:51 AM JF via Boost
wrote: I know that it is not standard JSON, but do you plan to support comments (// ..., /* ... */)? It would be useful for storing configuration
I used the same approach with the JSON parser that I used with Beast's HTTP parser. That is, a strict parser which strives to adhere to the letter of the spec.
I agree with this approach. The popular nlohmann::json project made the same design decision, and I think their reasoning is quite solid. See here: https://github.com/nlohmann/json#comments-in-json -Vicram
On 23. Sep 2019, at 23:56, Vicram Rajagopalan via Boost
wrote: On Mon, Sep 23, 2019 at 10:39 AM Vinnie Falco via Boost
wrote: On Mon, Sep 23, 2019 at 7:51 AM JF via Boost
wrote: I know that it is not standard JSON, but do you plan to support comments (// ..., /* ... */)? It would be useful for storing configuration
I used the same approach with the JSON parser that I used with Beast's HTTP parser. That is, a strict parser which strives to adhere to the letter of the spec.
I agree with this approach. The popular nlohmann::json project made the same design decision, and I think their reasoning is quite solid. See here: https://github.com/nlohmann/json#comments-in-json
If you have a common interface between the different parts (e.g. the SAX-style events interface I mentioned), you can have multiple parsers and you can choose which parser to apply in which situation. In our library we have a strict JSON parser and a parser for a slightly extended JSON format that we call JAXN (https://github.com/stand-art/jaxn), which supports comments, binary data, and non-finite values (NaN, Infinity, -Infinity). All other components of the library can be reused easily and do not depend on the parser. We also have CBOR, UBJSON, and MsgPack parsers; you could add TOML or high-performance SIMD parsers, etc.
On Mon, Sep 23, 2019 at 2:19 AM Dominique Devienne via Boost
What about performance?
I have not run any benchmarks, or invested any time in optimizing the code. My main efforts thus far have been making sure that the public interfaces are correct, and that things work.
Have you heard of https://github.com/lemire/simdjson?
Yes, I have seen that library. It is more of a proof-of-concept, and it requires the entire JSON document to be presented in a single contiguous input. For many use-cases, this condition is acceptable. So I have designed the parser to be informed when the condition is met, and use a different algorithm such as the SIMD one above.
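From the caller's side, a usage sketch of the incremental interface might look like the following. The member names write(), finish(), and release() are assumptions made for illustration, not necessarily the names in the repository, and read_some is an application-defined I/O call:

    // Minimal sketch, assuming a DOM-producing parser named json::parser.
    json::value parse_incrementally()
    {
        json::parser p;
        boost::system::error_code ec;
        char buf[4096];
        for (;;)
        {
            std::size_t const n = read_some(buf, sizeof(buf)); // app-defined I/O
            if (n == 0)
                break;               // end of input reached
            p.write(buf, n, ec);     // bounded work per I/O cycle (assumed name)
            if (ec)
                throw boost::system::system_error(ec);
        }
        p.finish(ec);                // inform the parser the document is complete
        if (ec)
            throw boost::system::system_error(ec);
        return p.release();          // assumed accessor for the resulting value
    }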
a PULL-parser that allows to build the native data-structures directly, rather than the DOM-like approach of fully converting the document to a built-in JSON object
Yes, the JSON parser and the HTTP parser (in Beast) both are implemented with an event-style interface and do not make assumptions about the container used to store the data (this is provided by a subclass). See: https://github.com/vinniefalco/json/blob/25ddea5f4d088f3e56911bf9f97549eeb93... https://github.com/vinniefalco/json/blob/25ddea5f4d088f3e56911bf9f97549eeb93...
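As a rough illustration of that subclass-based event interface: the event member names and signatures below are invented for exposition; the actual abstract members are in the headers linked above.

    // Sketch: a parser subclass that counts object keys instead of
    // building a DOM.
    class key_counter : public json::basic_parser
    {
        std::size_t count_ = 0;

    protected:
        // Called once for each object key encountered (signature assumed).
        void on_key(boost::string_view key, boost::system::error_code& ec) override
        {
            (void)key; (void)ec;
            ++count_;
        }

        // The remaining events (on_object_begin, on_string, on_number, ...)
        // would be overridden similarly, most as no-ops for this use case.

    public:
        std::size_t keys() const noexcept { return count_; }
    };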
Finally, there are many corner cases in JSON parsing. Is there the equivalent of Autobahn for WebSocket, but for JSON parsing?
No idea, but if anyone knows of such a facility please let me know :)
Any plans to integrate with such infrastructure, assuming there's one?
Sure! Thanks
On Mon, Sep 23, 2019 at 2:19 AM Dominique Devienne via Boost
What about performance? Have you heard of https://github.com/lemire/simdjson? Where would your library fall in that benchmark at the above link?
simdjson gives the highest performance, but the resulting output is in a specialized format which is read-only. My library has roughly the same or better performance as RapidJSON. Thanks
On Tue, 22 Oct 2019 at 18:06, Vinnie Falco via Boost
On Mon, Sep 23, 2019 at 2:19 AM Dominique Devienne via Boost
wrote: What about performance? Have you heard of https://github.com/lemire/simdjson? Where would your library fall in that benchmark at the above link?
simdjson gives the highest performance, but the resulting output is in a specialized format which is read-only.
My library has roughly the same or better performance as RapidJSON.
Does it mean you have run it under Milo's nativejson-benchmark? Can you post the results?

FYI, I was preparing to benchmark your library myself. I prepared a bunch of workarounds for some issues/outdated bits in the current nativejson-benchmark https://github.com/miloyip/nativejson-benchmark/issues/102#issuecomment-5340... and applied them here https://github.com/mloskot/nativejson-benchmark/tree/ml/issue-102-add-workar... but I have not found enough time to pull your library in.

Best regards, -- Mateusz Loskot, http://mateusz.loskot.net
On Tue, Oct 22, 2019 at 10:18 AM Mateusz Loskot via Boost
Does it mean you have run it under Milo's nativejson-benchmark?
I tried to integrate it into that benchmark but to be honest, that project looks like a hot mess. I don't have the resources to try to get my library working with it.
Can you post the results?
I have the results from a simple benchmarking program that I wrote, which parses a bunch of random JSON. There are two data sets, a small one and a big one:

small data set:
rapidjson parse 55952763 bytes in 230ms
rapidjson parse 55952763 bytes in 229ms
rapidjson parse 55952763 bytes in 228ms
nlohmann parse 55952763 bytes in 514ms
nlohmann parse 55952763 bytes in 514ms
nlohmann parse 55952763 bytes in 533ms
Boost.JSON parse 55952763 bytes in 234ms
Boost.JSON parse 55952763 bytes in 233ms
Boost.JSON parse 55952763 bytes in 233ms

large data set:
rapidjson parse 488889121 bytes in 1793ms
rapidjson parse 488889121 bytes in 1791ms
rapidjson parse 488889121 bytes in 1789ms
nlohmann parse 488889121 bytes in 3921ms
nlohmann parse 488889121 bytes in 3942ms
nlohmann parse 488889121 bytes in 3965ms
Boost.JSON parse 488889121 bytes in 1754ms
Boost.JSON parse 488889121 bytes in 1761ms
Boost.JSON parse 488889121 bytes in 1764ms

The benchmark program is here: https://github.com/vinniefalco/json/blob/master/bench/bench.cpp

I will likely expand on it to measure more things and report the results in the documentation (a work in progress).
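For context, the shape of such a measurement is roughly the following. This is a simplified sketch, not the actual bench.cpp; parse_with_library is a hypothetical stand-in for each library's parse call:

    #include <chrono>
    #include <cstdio>
    #include <string>

    // Hypothetical stand-in for a particular library's parse entry point.
    void parse_with_library(std::string const& doc);

    void bench(char const* name, std::string const& doc, int trials = 3)
    {
        using clock = std::chrono::steady_clock;
        for (int i = 0; i < trials; ++i)
        {
            auto const t0 = clock::now();
            parse_with_library(doc);
            auto const t1 = clock::now();
            auto const ms = std::chrono::duration_cast<
                std::chrono::milliseconds>(t1 - t0).count();
            std::printf("%s parse %zu bytes in %lldms\n",
                name, doc.size(), static_cast<long long>(ms));
        }
    }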
I have not found enough time to pull your library in.
I still have a bit of work to do with the treatment of floating-point numbers, and there are a couple of bugs with the handling of UTF code points. Those will be done very soon. Thanks
On Tue, 22 Oct 2019 at 19:49, Vinnie Falco
On Tue, Oct 22, 2019 at 10:18 AM Mateusz Loskot via Boost
wrote: Does it mean you have run it under Milo's nativejson-benchmark?
I tried to integrate it into that benchmark but to be honest, that project looks like a hot mess. I don't have the resources to try to get my library working with it.
I don't disagree.
Can you post the results?
I have the results from a simple benchmarking program that I wrote, which parses a bunch of random JSON. There are two data sets, a small one and a big one: [...]
Thank you for posting this. It looks promising indeed! Best regards, -- Mateusz Loskot, http://mateusz.loskot.net
I have the results from a simple benchmarking program that I wrote, which parses a bunch of random JSON. There are two data sets, a small one and a big one:
small data set:
rapidjson parse 55952763 bytes in 230ms
rapidjson parse 55952763 bytes in 229ms
rapidjson parse 55952763 bytes in 228ms
nlohmann parse 55952763 bytes in 514ms
nlohmann parse 55952763 bytes in 514ms
nlohmann parse 55952763 bytes in 533ms
Boost.JSON parse 55952763 bytes in 234ms
Boost.JSON parse 55952763 bytes in 233ms
Boost.JSON parse 55952763 bytes in 233ms

large data set:
rapidjson parse 488889121 bytes in 1793ms
rapidjson parse 488889121 bytes in 1791ms
rapidjson parse 488889121 bytes in 1789ms
nlohmann parse 488889121 bytes in 3921ms
nlohmann parse 488889121 bytes in 3942ms
nlohmann parse 488889121 bytes in 3965ms
Boost.JSON parse 488889121 bytes in 1754ms
Boost.JSON parse 488889121 bytes in 1761ms
Boost.JSON parse 488889121 bytes in 1764ms
Impressive Vinnie. Not at all a small accomplishment to match RapidJSON. Congrats! Niall
On 10/22/19 1:45 PM, Niall Douglas via Boost wrote:
I have the results from a simple benchmarking program that I wrote, which parses a bunch of random JSON. There are two data sets, a small one and a big one:
small data set:
rapidjson parse 55952763 bytes in 230ms
rapidjson parse 55952763 bytes in 229ms
rapidjson parse 55952763 bytes in 228ms
nlohmann parse 55952763 bytes in 514ms
nlohmann parse 55952763 bytes in 514ms
nlohmann parse 55952763 bytes in 533ms
Boost.JSON parse 55952763 bytes in 234ms
Boost.JSON parse 55952763 bytes in 233ms
Boost.JSON parse 55952763 bytes in 233ms

large data set:
rapidjson parse 488889121 bytes in 1793ms
rapidjson parse 488889121 bytes in 1791ms
rapidjson parse 488889121 bytes in 1789ms
nlohmann parse 488889121 bytes in 3921ms
nlohmann parse 488889121 bytes in 3942ms
nlohmann parse 488889121 bytes in 3965ms
Boost.JSON parse 488889121 bytes in 1754ms
Boost.JSON parse 488889121 bytes in 1761ms
Boost.JSON parse 488889121 bytes in 1764ms
Impressive Vinnie. Not at all a small accomplishment to match RapidJSON. Congrats!
Niall,

I'd love to see the result you'd get by using one of a number of JSON parsers implemented with Boost.Spirit. A number of them pop up by googling "C++ JSON spirit", including one by our own Michael Caisse. I'd also be curious to know the compile time as well as the runtime for the programs above.

Robert Ramey
On 10/22/19 15:40, Robert Ramey via Boost wrote:
On 10/22/19 1:45 PM, Niall Douglas via Boost wrote:
I have the results from a simple benchmarking program that I wrote, which parses a bunch of random JSON. There are two data sets, a small one and a big one:
<snip impressive comparisons of parse times>
Impressive Vinnie. Not at all a small accomplishment to match RapidJSON. Congrats!
Niall,
I'd love to see the result you'd get by using one of a number of JSON parsers implemented with Boost.Spirit. A number of them pop up by googling "C++ JSON spirit", including one by our own Michael Caisse. I'd also be curious to know the compile time as well as the runtime for the programs above.
The Michael Caisse one isn't made for speed. It does something else. It will perform poorly in comparison I suspect. -- Michael Caisse Ciere Consulting ciere.com
On Wed, 23 Oct 2019 at 01:31, Michael Caisse via Boost
On 10/22/19 15:40, Robert Ramey via Boost wrote:
On 10/22/19 1:45 PM, Niall Douglas via Boost wrote:
I have the results from a simple benchmarking program that I wrote, which parses a bunch of random JSON. There are two data sets, a small one and a big one:
<snip impressive comparisons of parse times>
Impressive Vinnie. Not at all a small accomplishment to match RapidJSON. Congrats!
Niall,
I'd love to see the result you'd get by using one of a number of JSON parsers implemented with Boost.Spirit. A number of them pop up by googling "C++ JSON spirit", including one by our own Michael Caisse. I'd also be curious to know the compile time as well as the runtime for the programs above.
The Michael Caisse one isn't made for speed. It does something else. It will perform poorly in comparison I suspect.
Yes, I confirm. I ran it through my basic JSON benchmark in 2013 (before Milo from RapidJSON started his nativejson-benchmark): https://github.com/mloskot/json_benchmark However, it was/is quality learning material! Best regards, -- Mateusz Loskot, http://mateusz.loskot.net
Hi Vinnie, Vinnie Falco wrote:
I've been working on a massively-multiplayer online blackjack casino server, called Beast Lounge [1]. The server and client communicate using JSON-RPC over WebSocket. I have developed a brand-new JSON library for this project, in accordance with the following design goals:
I am reminded of the various discussions of alternative styles of XML parsers that have happened on this list over the years. People have a surprising variety of often-conflicting requirements or preferences. I think it's unlikely that any one solution will suit everyone - but maybe there are common bits of functionality that can be shared?

My preference has always been for parsing by memory-mapping the entire file, or equivalently reading the entire document into memory as a blob of text, and then providing iterators that advance through the text looking for the next element, attribute, character etc. I think one of the first XML parsers to work this way was RapidXML. Their aim was to get the parsing speed as close to strlen() as possible and they did pretty well.

Others have mentioned some benchmarks and I encourage you to try them - and/or at least be clear about whether performance is a design goal.

Regards, Phil.
On Mon, Sep 23, 2019 at 8:17 AM Phil Endecott via Boost
but maybe there are common bits of functionality that can be shared?
Perhaps, but my preference is for monolithic parsers with no external dependencies and little to no configurability. They are easier to audit (and cheaper), and I do plan to commission a security review for Boost.JSON as I have done with Beast: https://vinniefalco.github.io/BeastAssets/Beast%20-%20Hybrid%20Application%2... Such a parser is also easier to maintain, and less likely to require changes (which bring the risk of new vulnerabilities) if the scope of functionality is strictly limited. It is true that this results in a parser which is less flexible. A survey of parsers in existing JSON libraries shows great diversity, so there is no shortage of flexibility there. I think there is room for one more strict parser with a static set of features.
My preference has always been for parsing by memory-mapping the entire file, or equivalently reading the entire document into memory as a blob of text
While parsing is important, as with HTTP it is the least interesting aspect of JSON, since parsing happens only once, but inspection and modification of a JSON document (the `boost::json::value` type) happen continually, including across library API boundaries where JSON value types appear in function signatures or data members.
...be clear about whether performance is a design goal.
Performance is a design goal, but it is performance in the larger context of a network application. This library is less concerned about parsing a large chunk of in-memory serialized JSON over and over again inside a tight loop to hit a meaningless number in a contrived benchmark, and more concerned about ensuring that network programs have control over how and when memory allocation takes place, latency, and resource fairness when handling a large number of connections. That is why the library is built from the ground up to support allocators and to support incremental operation for parsing and serialization (bounded work in each I/O cycle reduces latency and increases fairness).

Since the parser is presented with one or more buffers of memory containing the JSON document, and there is an API to inform the parser when these memory buffers represent the complete document, it should be possible to apply most of the optimizations currently used in other libraries, including SIMD algorithms when the complete document is presented.

That said, if the experience with HTTP in Beast is representative of network applications which use JSON (a reasonable assumption), relatively little time is spent parsing a JSON-RPC command coming from a connection compared to the time required to process the command, so the gains to be had from an optimized parser may not be so impressive. I will still eventually apply optimizations to it, of course, for bragging rights. But I am in no hurry.

Regards
On 9/23/19 5:16 PM, Phil Endecott via Boost wrote:
I am reminded of the various discussions of alternative styles of XML parsers that have happened on this list over the years. People have a surprising variety of often-conflicting requirements or preferences. I think it's unlikely that any one solution will suit everyone - but maybe there are common bits of functionality that can be shared?
As a former developer of one of said XML parsers, we learned the proper abstractions the hard way. If you start with a pull parser (what Vinnie refers to as an online parser, and what you refer to as an iterating parser), such as the XmlTextReader, then all the other interfaces flow naturally from that.

Although the pull parser is mainly used as the basic building block for the other abstractions, it can also be used directly, e.g. for quick scanning of large JSON documents without memory allocation.

A push parser (SAX) can easily be created by calling the pull parser in a loop and firing off events.

Serialization is done by incrementally using a pull parser inside a serialization input archive, and likewise a similar interface for generating the layout (e.g. XmlTextWriter) can be used for output archives.

A tree parser (DOM) is simply a push parser that generates nodes as events are fired off.

That is the design principle behind this JSON parser: http://breese.github.io/trial/protocol/
My preference has always been for parsing by memory-mapping the entire file, or equivalently reading the entire document into memory as a blob of text, and then providing iterators that advance through the text looking for the next element, attribute, character etc. I think one of the first XML parsers to work this way was RapidXML.
The Microsoft XML parser came first.
On Mon, Sep 23, 2019 at 8:58 AM Bjorn Reese via Boost
...online parser... A push parser (SAX)... A tree parser (DOM)
I have no experience with these terms other than occasionally coming across them in my Google searching adventures. The parsers that I have written take as input one or more buffers of contiguous characters, and produce as "output" a series of calls to abstract member functions which are implemented in the derived class. These calls represent tokens or events, such as "key string", "object begin", "array end". So what would we call this in the taxonomy above? Thanks
On 9/23/19 9:11 AM, Vinnie Falco via Boost wrote:
On Mon, Sep 23, 2019 at 8:58 AM Bjorn Reese via Boost
wrote: ...online parser... A push parser (SAX)... A tree parser (DOM)
I have no experience with these terms other than occasionally coming across them in my Google searching adventures. The parsers that I have written take as input one or more buffers of contiguous characters, and produce as "output" a series of calls to abstract member functions which are implemented in the derived class. These calls represent tokens or events, such as "key string", "object begin", "array end". So what would we call this in the taxonomy above?
Hmmmm - sounds like a job for Boost.Spirit! This is what I used 15+ years ago to create a special-purpose parser for the limited subset of XML that the boost serialization library uses. In all that time, through innumerable variations of compilers, linkers, C++ versions, ... everything - modification has been required in only a couple of cases. And by separating syntax from actions, it has permitted other collaborators to discover and suggest fixes for weird corner cases. The code seems pretty efficient too - at least no one has complained about that aspect - or anything else either. The only complaint would be slow compile time. But since the serialization library is compiled, and the code is only compiled when the grammar changes, it's not really an issue. This software and its approach have been underappreciated - we need a good CppCon talk on this subject. Food for thought. Robert Ramey
Thanks
Vinnie Falco wrote:
On Mon, Sep 23, 2019 at 8:58 AM Bjorn Reese via Boost
wrote: ...online parser... A push parser (SAX)... A tree parser (DOM)
Real-soon-now we'll have coroutines, and the two parties (the parser and its user) won't have to choose anymore about who pushes and who pulls. From the perspective of Boost, I'd love to see an exploration of how to solve design problems like this in more of a "modern C++ style". For example, quoting from Dominique Devienne's message:

    virtual bool handle_number(int);
    virtual bool handle_number(int64_t);
    virtual bool handle_number(uint64_t);
    virtual bool handle_number(double value);
    virtual bool handle_string(const std::string& value);
    virtual bool handle_boolean(bool value);
    virtual bool handle_null();

Rather than that, how about a co-routine that yields a variant of those types? (A rough sketch of such an event variant follows below.)

Regards, Phil.
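As a rough illustration of that "variant of those types" idea, independent of coroutines and using C++17 std::variant purely for exposition; the type and member names here are invented, not part of any proposed interface:

    #include <cstdint>
    #include <string>
    #include <variant>

    // Structural markers that carry no value of their own.
    struct object_begin {};
    struct object_end {};
    struct array_begin {};
    struct array_end {};
    struct key { std::string name; };
    struct null {};

    // One parse event. A pull interface could return these one at a time,
    // or a coroutine could co_yield them.
    using event = std::variant<
        object_begin, object_end,
        array_begin, array_end,
        key, null, bool,
        std::int64_t, std::uint64_t, double,
        std::string>;

The consumer would then std::visit each event instead of overriding a family of virtual handle_* members.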
On Tue, Sep 24, 2019 at 1:50 AM Phil Endecott via Boost
Real-soon-now we'll have coroutines, and the two parties (the parser and its user) won't have to choose anymore about who pushes and who pulls.
A design criterion for this library is to require only C++11.
Rather than that, how about a co-routine that yields a variant of those types?
If you mean something like json::value jv = co_await parse(source); I think that the inversion of the flow of control resulting from this design has excessive impact on the rest of the code. If we wanted to retrieve a JSON document from a socket for example, then `source` would have to be a coroutine. In the current design, the parser does not impose any model as to how the data buffers are acquired. I could be wrong though, do you have a more complete example? Thanks
On Monday, September 23, 2019, Bjorn Reese wrote:
As a former developer of one of said XML parsers, we learned the proper abstractions the hard way. If you start with a pull parser (what Vinnie refers to as an online parser, and what you refer to as an iterating parser), such as the XmlTextReader, then all the other interfaces flows naturally from that.
Although the pull parser is mainly used as the basic building block for the other abstractions, it can also be used directly, e.g. for quick scanning of large JSON documents without memory allocation.
A push parser (SAX) can easily be created by calling the pull parser in a loop and firing off events.
Serialization is done by incrementally using a pull parser inside a serialization input archive, and likewise a similar interface for generating the layout (e.g. XmlTextWriter) can be used for output archives.
A tree parser (DOM) is simply a push parser that generates nodes as events are fired off.
Dominique explained some of the pull (stax) / push (sax) terminology to me off-list, and I agree. This does appear to be the more appealing underlying facility. Glen
On Mon, Sep 23, 2019 at 6:12 PM Glen Fernandes via Boost < boost@lists.boost.org> wrote:
Dominique explained some of the pull (stax) / push (sax) terminology to me off-list, and I agree. This does appear to be the more appealing underlying facility.
I didn't realize it was off-list, usually plain Reply goes to the list. But doesn't matter, Bjorn explained it better than me anyway. On Mon, Sep 23, 2019 at 6:11 PM Vinnie Falco via Boost < boost@lists.boost.org> wrote:
On Mon, Sep 23, 2019 at 8:58 AM Bjorn Reese via Boost
wrote: ...online parser... A push parser (SAX)... A tree parser (DOM)
I have no experience with these terms other than occasionally coming across them in my Google searching adventures. The parsers that I have written take as input one or more buffers of contiguous characters, and produce as "output" a series of calls to abstract member functions which are implemented in the derived class. These calls represent tokens or events, such as "key string", "object begin", "array end". So what would we call this in the taxonomy above?
That's a PUSH parser IMHO. The doc on Qt's XML PULL parser should make that clearer perhaps: https://doc.qt.io/qt-5/qxmlstreamreader.html#details

Many of these terms originated in the XML world, and many (like SAX) from the Java world too. To give you a feel for it, here's my PUSH parser API:

    class JSONHandler {
    public:
        ...
        virtual bool handle_object_begin();
        virtual bool handle_object_key(const std::string& key);
        virtual bool handle_object_end();
        virtual bool handle_array_begin();
        virtual bool handle_array_end();
        virtual bool handle_number(int);
        virtual bool handle_number(int64_t);
        virtual bool handle_number(uint64_t);
        virtual bool handle_number(double value);
        virtual bool handle_string(const std::string& value);
        virtual bool handle_boolean(bool value);
        virtual bool handle_null();
        ...
    };

    bool json_parse(const char* json_utf8_text, size_t len, JSONHandler& handler);

While that's my PULL parser API:

    enum JSONParsingEventType {
        //! Special end-of-document token.
        JSON_END = 0,
        // Value tokens.
        JSON_NULL,
        JSON_TRUE,
        JSON_FALSE,
        JSON_STRING,
        JSON_NUMBER,
        JSON_OBJECT_BEGIN,
        JSON_OBJECT_KEY,
        JSON_OBJECT_END,
        JSON_ARRAY_BEGIN,
        JSON_ARRAY_END,
        ...
    };

    class JSONReader {
    public:
        JSONReader(
            const char* json_utf8_text, size_t len,
            const JSONParserOptions& options = JSONParserOptions()
        );
        ~JSONReader();

        JSONParsingEventType peek() const;
        JSONParsingEventType next();
        JSONParsingEventType current() const;
        size_t skip_next();
        size_t skip_current();
        JSONToken token();
        size_t depth();
        size_t count();

        bool is_integral();
        int get_int();
        int64_t get_int64_t();
        uint64_t get_uint64_t();
        float get_float();
        double get_double();
        std::string get_string();
        std::string get_string_or_null();
        bool get_boolean();
        std::string get_key();
        bool is_key(const char* key);
        bool is_key(const char* key, size_t len);
        ...
    };

where JSONToken is basically a std::string_view-like object into the raw JSON doc bytes, with low-level info for more control, about seeing a numeric sign, fractional point, or exponent, or about strings having escaped characters, including unicode ones, i.e. it can't be used as-is and must be decoded according to JSON rules to get back UTF-8 text.

The former parser "pushes" information at you, the client code. The parser does the looping. In the latter, the client code is in the driver's seat and does the loop, and controls the parser, extracting information out of it. There's also no inheritance necessary with a PULL parser, virtual or static-CRTP.

As Bjorn wrote, a PULL parser is the lowest-level building block, and the most convenient one to use. A PULL parser is typically passed around to code decoding various data structures, to instantiate them and their "children/descendants" from the infoset in the JSON doc. To make that safe from misbehaving code, I added concepts like "scopes" and "savepoints", so that the function you pass the reader to cannot step out of the current object, and to allow the caller code to recover by "rewinding" the doc to before the misbehaving reader, skip that object, and try the next one. Which means I also basically support incremental parsing too, even though I don't have an API for it, as is obvious from the above.

Many parsers also have safeguards and "limits" in terms of depth of the stack, or maximum size allowed for strings, which are configured here via the JSONParserOptions struct.

Anyways, I'm just showing this to illustrate differences between parsers. There are much better and faster parsers than mine. I learned a lot building them though, it was fun.
Mine is comparable to nlohmann in terms of performance, i.e. not that fast :). --DD
On Mon, Sep 23, 2019 at 12:58, Bjorn Reese via Boost <boost@lists.boost.org> wrote:
As a former developer of one of said XML parsers, we learned the proper abstractions the hard way. If you start with a pull parser (what Vinnie refers to as an online parser, and what you refer to as an iterating parser), such as the XmlTextReader, then all the other interfaces flow naturally from that.
Another interesting property of pull parsers is that you can implement backtracking trivially, as I've demonstrated in a branch for an HTTP parser: https://github.com/BoostGSoC14/boost.http/blob/83b83041de3e8b04cc9475497648f... Although I don't see much use for backtracking in a JSON parser, the composition of algorithms on the other side is fantastic. I've been using your JSON parser at work for some years now, and the possibility to skip several tokens and delegate their processing to different functions works really well. -- Vinícius dos Santos Oliveira https://vinipsmaker.github.io/
On 9/23/19 6:11 PM, Vinnie Falco wrote:
On Mon, Sep 23, 2019 at 8:58 AM Bjorn Reese via Boost
wrote: ...online parser... A push parser (SAX)... A tree parser (DOM)
I have no experience with these terms other than occasionally coming across them in my Google searching adventures. The parsers that I have written take as input one or more buffers of contiguous characters, and produce as "output" a series of calls to abstract member functions which are implemented in the derived class. These calls represent tokens or events, such as "key string", "object begin", "array end". So what would we call this in the taxonomy above?
So if I understand your model correctly, it parses as many tokens as possible, and each token results in a callback. When you reach the end of the buffer, you manually suspend parsing and resume when more data is fed into it. That sounds like a variation of a push parser.

It may be easier to understand the parser models in C++ terms, although these terms do not accurately cover the parser models.

A pull parser is equivalent to a forward iterator. The user tells it when to advance to the next element. A pull parser usually has a richer interface than a C++ iterator because we are iterating over heterogeneous elements. We therefore need to query multiple attributes of the current element, such as:
* The type of the current element (e.g. integer, string, start of array)
* The converted value (e.g. text-to-integer conversion, or JSON string to UTF-8 string conversion)
* The unconverted value (a string view of the input corresponding to the current value)

A push parser is equivalent to a loop with a visitor. The loop traverses the entire input, and for each recognized element it calls the visitor. Assuming you already have a rich iterator, it is straightforward to use this for the loop and call a visitor in each iteration. That is, push parsers are easily built on top of pull parsers. You can find a simple example of this here: https://github.com/breese/trial.protocol/tree/develop/example/json/push_pars...

A more elaborate example shows an alternative implementation of the boost::property_tree JSON parser using a pull parser: https://github.com/breese/trial.protocol/blob/develop/example/json/property_...
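In outline, that layering looks something like the following; the Reader and Visitor interfaces here are hypothetical names chosen only to illustrate the idea:

    // Sketch: building a push (SAX-style) traversal on top of a pull parser.
    template <class Reader, class Visitor>
    void push_parse(Reader& reader, Visitor& visitor)
    {
        // The loop owns the traversal; the visitor only reacts to elements,
        // which is exactly the push model described above.
        while (reader.next() != Reader::end)
        {
            switch (reader.current())
            {
            case Reader::object_begin: visitor.on_object_begin(); break;
            case Reader::object_end:   visitor.on_object_end();   break;
            case Reader::key:          visitor.on_key(reader.get_string());    break;
            case Reader::string:       visitor.on_string(reader.get_string()); break;
            case Reader::number:       visitor.on_number(reader.get_double()); break;
            // ...remaining token kinds forwarded the same way
            default: break;
            }
        }
    }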
"Bjorn Reese via Boost"
That is, push parsers are easily built on top of pull parsers.
Do you have an example of an efficient implementation of a pull parser on top of async io ? I find push parsers really more straightforward in that case. Pull parsers can also be built on top of push parsers, although it is less easy. Regards, Julien
On 30/09/2019 23:14, Julien BLANC wrote:
"Bjorn Reese" – 29 septembre 2019 12:56
That is, push parsers are easily built on top of pull parsers.
Do you have an example of an efficient implementation of a pull parser on top of async io ? I find push parsers really more straightforward in that case.
Usually that requires the ReadNextToken call (whatever its name) to be async as well (whether that's via callbacks, futures, coroutines, or some other async framework). Failing that, it would have to return some special error code that means "I need more input". Usually (outside of a proper async framework) the caller pulling tokens from the parser is directly responsible for feeding it new input as well.
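One common shape for that signal, sketched with invented names (nothing here is from any particular library):

    // Sketch: one way a pull parser can tell the caller "I need more input"
    // when the caller is also the party feeding it bytes.
    enum class pull_result
    {
        token,      // a token is available to inspect
        end,        // end of the document was reached
        need_more   // all buffered input was consumed; feed more bytes
    };

    // The caller's loop then looks roughly like:
    //
    //   for (;;)
    //   {
    //       switch (reader.next_token())          // hypothetical call
    //       {
    //       case pull_result::need_more:
    //           reader.append_input(read_more()); // a sync read, or await one
    //           break;
    //       case pull_result::token:
    //           handle(reader);                   // inspect the current token
    //           break;
    //       case pull_result::end:
    //           return;
    //       }
    //   }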
On Sun, Sep 29, 2019 at 3:56 AM Bjorn Reese via Boost
So if I understand your model correctly, it parses as many tokens as possible, and each token results in a callback. When you reach the end of the buffer, you manually suspend parsing and resume when more data is fed into it. ... That sounds like a variation of a push parser.
Ah yeah, that description is accurate! Thanks
On Sun, 22 Sep 2019 at 19:05, Vinnie Falco via Boost
I've been working on a massively-multiplayer online blackjack casino server, called Beast Lounge [1]. The server and client communicate using JSON-RPC over WebSocket. I have developed a brand-new JSON library for this project, in accordance with the following design goals:
* Robust support for custom allocators throughout. * Array and object interfaces closely track their corresponding C++20 container equivalents. * Use `std::basic_string` for strings. * Minimize use of templates for reduced compilation times. * Parsers and serializers work incrementally (['online algorithms]). * Elements in objects may also be iterated in insertion order.
You can see the JSON library in development here:
Features that matter to me:
- no allocation, ever
- parsing/generating data incrementally from/into segmented buffers without copying
- full arbitrary-precision decimal support without imposing a decimal representation on me
- not imposing any string or blob type
- fast conversion between numbers and text, with correct shortest representation for floating-point

Unfortunately your library does not satisfy all of those criteria, so it is not very useful to me. Main problem appears to be that the parser is hardcoded to convert the values into a specific type.

In practice I use JSON as a self-describing serialization/deserialization system in text form, which also happens to work well as a representation to transmit data across systems and languages. I do not want to convert a stream of JSON bytes into a boost::json::value type, which might already be a lossy conversion; I want to convert it directly to my type.
On Tuesday, September 24, 2019 at 10:29 +0100, Mathias Gaunard via Boost wrote:
On Sun, 22 Sep 2019 at 19:05, Vinnie Falco via Boost wrote: [...]

Features that matter to me:
- no allocation, ever
- parsing/generating data incrementally from/into segmented buffers without copying
- full arbitrary-precision decimal support without imposing a decimal representation on me
- not imposing any string or blob type
- fast conversion between numbers and text, with correct shortest representation for floating-point
You may have a look at https://github.com/Julien-Blanc-tgcm/jbc-json. This is a library I developed with nearly the same constraints in mind. I planned to release it and eventually propose it to Boost when it was ready (the main missing feature currently being the documentation). I just put it on GitHub since there seems to be some interest, and it is in a completely working state (both parsers and writers are conformant).

Regarding speed, it is a bit slower than RapidJSON, and thus still faster than most other C++ JSON libraries. The jsonitem representation (which is completely optional) could benefit from some tuning (allocator support is missing, for example). It is also the only JSON library I'm aware of that can validate a JSON file of several gigabytes with only 32k of memory.

I plan to include a pull parser API on top of the push one, and add some additional features such as I-JSON and JSON Schema validation (everything will be built on top of the current parser, and completely optional).
I do not want to convert a stream of JSON bytes into a boost::json::value type, which might already be a lossy conversion; I want to convert it directly to my type.
This is exactly what the parser_callbacks template parameter is designed for in my library. Regards, Julien
On Tue, Sep 24, 2019 at 4:29 AM Julien Blanc via Boost
You may have a look at https://github.com/Julien-Blanc-tgcm/jbc-json .
Do I understand this correctly, that jbc-json parses one character at a time and stores function pointers in the stack, calling through the function pointer for each character? https://github.com/Julien-Blanc-tgcm/jbc-json/blob/ccd8f482fd285eb52ad4350b7... Thanks
On Tuesday, October 22, 2019 at 18:12 -0700, Vinnie Falco wrote:
On Tue, Sep 24, 2019 at 4:29 AM Julien Blanc via Boost
wrote: You may have a look at https://github.com/Julien-Blanc-tgcm/jbc-json .
Do I understand this correctly, that jbc-json parses one character at a time and stores function pointers in the stack, calling through the function pointer for each character?
You understand correctly. IIRC I measured, at the time I decided to go this way, that it was not slower than the traditional switch/case statement. When I started this library I put safety requirements over speed. This is why in the first versions there wasn't even an external API to parse more than one character at a time (a strong guarantee against buffer overflows). This has improved to allow for in-situ parsing, since memory use has been an important goal for me. There's still plenty of room for improvements, however (for example, the default structure proposed to store parsed objects is particularly inefficient and does not support allocators). Regards, Julien
On Tue, Sep 24, 2019 at 2:29 AM Mathias Gaunard
Features that matter to me:
- no allocation, ever
- parsing/generating data incrementally from/into segmented buffers without copying
- full arbitrary-precision decimal support without imposing a decimal representation on me
- not imposing any string or blob type
- fast conversion between numbers and text, with correct shortest representation for floating-point
These are all features of a JSON parser. As I said earlier, I think the parser is the least interesting part of a JSON library. There are already a bunch of parsers that do exactly what you are asking above. Since parsers are unlikely to appear in public interfaces, there is little value in standardizing on a particular one. Library interoperability is not enhanced in this case.

What we are missing is a robust *container* for storing JSON values. Having a great container for representing all or part of a JSON document does make it easier to write interoperable library components that work with JSON. A library that implements the JSON-RPC specification [1] may have a function with this signature (a sketch of how such a function might use the container follows at the end of this message):

    // Validates and extracts a conforming JSON-RPC request from the
    // specified JSON object. Throws an exception if the request is not
    // conforming.
    rpc_call get_rpc_request (boost::json::value& jv);

For this to work, boost::json::value needs to be a good general-purpose JSON container that satisfies most users. We can never satisfy ALL users; some design choices must represent tradeoffs between conflicting goals. The library I am providing, to propose to Boost eventually, places emphasis on the design of the JSON container because this is the surface which will be exposed between libraries. It is this part that hopefully will stimulate the growth of an ecosystem of libraries which use JSON and interoperate.

A final example: someone may use Boost.Beast and Boost.JSON to implement a generic RPC client or server library, taking care of the socket and connection management, making requests and receiving responses. Users of such a library will care little about the parser it uses, other than that it exists and works. But they will care very much about the types used to represent JSON data, since they will be interacting with it regularly to implement their business logic.

Regards

[1] https://www.jsonrpc.org/specification
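As referenced above, a rough sketch of what the body of such a function might look like; the accessors is_object(), as_object(), at(), and as_string(), as well as the rpc_call fields, are assumptions used purely for illustration:

    // Sketch only: validate and extract a JSON-RPC request from a value.
    rpc_call get_rpc_request(boost::json::value& jv)
    {
        if (!jv.is_object())                          // assumed accessor
            throw std::invalid_argument("request must be a JSON object");
        auto& obj = jv.as_object();                   // assumed accessor
        if (obj.at("jsonrpc").as_string() != "2.0")   // assumed accessors
            throw std::invalid_argument("unsupported JSON-RPC version");
        rpc_call call;                                // hypothetical result type
        call.method = obj.at("method").as_string();
        // "params" and "id" would be validated and extracted similarly.
        return call;
    }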
On Tue, Sep 24, 2019 at 3:11 PM Vinnie Falco via Boost < boost@lists.boost.org> wrote:
[...] What we are missing, is a robust *container* for storing JSON values. Having a great container for representing all or part of a JSON document
does make it easier to write interoperable library components that work
with JSON.
[...] For this to work, boost::json::value needs to be a good general
purpose JSON container that satisfies most users. We can never satisfy ALL users; some design choices must represent tradeoffs between conflicting goals. The library I am providing, to propose to Boost eventually, places emphasis on the design of the JSON container because this is the surface which will be exposed between libraries. It is this part that hopefully will stimulate the growth of an ecosystem of libraries which use JSON and interoperate.
Sounds good to me. A boost::json::value with by-value semantics, proper rvalue support, and a nice API, that's compact and efficient, would have value. How different from boost::variant is it, though? --DD
On Tue, Sep 24, 2019 at 6:56 AM Dominique Devienne via Boost
A boost::json::value with by-value semantic, proper R-value support, a nice API, that's compact and efficient, would have value.
Yes!
How different from boost::variant is it though?
Do you mean literally
boost::variant
On 9/24/19 3:11 PM, Vinnie Falco via Boost wrote:
What we are missing, is a robust *container* for storing JSON values. Having a great container for representing all or part of a JSON document does make it easier to write interoperable library components that work with JSON. A library that implements the JSON-RPC [...] For this to work, boost::json::value needs to be a good general purpose JSON container that satisfies most users. We can never satisfy
In which case you may want to look at trial::dynamic::variable [1], which is a general-purpose container that supports fundamental types, strings, nested containers, and maps of nested containers. Unlike a variant, the dynamic variable does not support custom types, but that means it works with most standard algorithms. dynamic::variable is independent of JSON, but can be used as the parse tree for JSON and its binary cousins. It can also be used for other purposes such as configuration data. [1] http://breese.github.io/trial/protocol/trial_protocol/dynamic_variable.html
On 9/22/19 11:05 AM, Vinnie Falco via Boost wrote:
I've been working on a massively-multiplayer online blackjack casino server, called Beast Lounge [1]. The server and client communicate using JSON-RPC over WebSocket. I have developed a brand-new JSON library for this project, in accordance with the following design goals:
* Robust support for custom allocators throughout. * Array and object interfaces closely track their corresponding C++20 container equivalents. * Use `std::basic_string` for strings. * Minimize use of templates for reduced compilation times. * Parsers and serializers work incrementally (['online algorithms]). * Elements in objects may also be iterated in insertion order.
You can see the JSON library in development here:
https://github.com/vinniefalco/json
Is there any interest in proposing this for Boost?
I'm happy to hear feedback or answer questions about this library. Feel free to open an issue on the repository, or reply here.
Thanks
Vinnie - looks like something that Boost should have. I took a cursory look at the repo just to get a feel for it. My question is, as usual, a little off topic. Did you look into using Boost.Spirit for parsing? I found it very interesting and used it for fun in parsing the XML used by the serialization library. It was a very interesting and fun job. The whole concept of isolating the grammar from the parsing makes sense to me. It was one more thing to learn, though. It started out as an experiment, but when I saw how it turned out, I left it in and have been very pleased. It's been part of the serialization library for 15 years. I don't remember having to dip back into that code to fix anything! This is a big thing for me. I'm not asking you to actually do anything or even respond. I just like to keep the pot boiling. Robert Ramey
On Tue, Oct 22, 2019 at 10:40 AM Robert Ramey via Boost
Did you look into using boost.spirit for parsing?
I didn't even consider it. All of the parsers that I work with accept untrusted inputs, so when writing a parser I prefer to have no external dependencies. Spirit is an enormous dependency and scares off potential users. Thanks
On 10/22/19 11:03 AM, Vinnie Falco via Boost wrote:
On Tue, Oct 22, 2019 at 10:40 AM Robert Ramey via Boost
wrote: Did you look into using boost.spirit for parsing?
I didn't even consider it. All of the parsers that I work with accept untrusted inputs,
I don't buy this as a reason - in fact I'd call it a reason to specify and enforce a formal grammar.
so when writing a parser I prefer to have no external dependencies. Spirit is an enormous dependency and scares off potential users.
This I can appreciate and sympathize with. It points to the problem that C++ and Boost have been wrestling with forever - dependency management. I'm pretty doubtful that anyone can write a demonstrably correct parser without using such a tool. So if security is an issue, perhaps you might want to write a test program based on Spirit. This would give one confidence that your library won't introduce security holes - at least on the inputs tested. This might be appreciated. Since users don't typically build/run tests (though I've advocated that they should!), there wouldn't be any kind of dependency issue for them. And it would give you the option of writing 1000++ test cases without having to check them all by hand. Just food for thought - feeding the beast. Robert Ramey
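To make the suggestion concrete, below is a rough sketch of what such a Spirit-based reference recognizer could look like, using the Spirit X3 API as one possible choice. It only answers "does this text match the grammar?", which is enough to cross-check accept/reject verdicts against a hand-rolled parser; all names are invented for illustration, and the number rule is deliberately looser than RFC 8259.

    // Sketch of a Spirit X3 recognizer used purely as a test oracle (illustrative).
    #include <boost/spirit/home/x3.hpp>
    #include <string>

    namespace json_oracle
    {
        namespace x3 = boost::spirit::x3;

        // Rules are declared up front so value/object/array can recurse.
        x3::rule<class value_r>  const value  = "value";
        x3::rule<class object_r> const object = "object";
        x3::rule<class array_r>  const array  = "array";

        auto const string_ = x3::lexeme[
            '"' >> *(('\\' >> x3::char_) | (x3::char_ - '"')) >> '"'];
        auto const number_ = x3::lexeme[
            -x3::lit('-') >> +x3::digit >> -('.' >> +x3::digit)
            >> -(x3::char_("eE") >> -x3::char_("+-") >> +x3::digit)];

        auto const member     = string_ >> ':' >> value;
        auto const object_def = '{' >> -(member % ',') >> '}';
        auto const array_def  = '[' >> -(value % ',') >> ']';
        auto const value_def  = object | array | string_ | number_
                              | x3::lit("true") | x3::lit("false") | x3::lit("null");

        BOOST_SPIRIT_DEFINE(value, object, array);

        // Returns true if the whole input matches the grammar.
        bool accepts(std::string const& input)
        {
            auto first = input.begin();
            bool ok = x3::phrase_parse(first, input.end(), value, x3::space);
            return ok && first == input.end();
        }
    }

With something like this as an oracle, every corpus file can be checked automatically: the hand-rolled parser and the grammar-derived recognizer must agree on whether the input is valid JSON.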
On Tue, 22 Oct 2019 at 20:30, Robert Ramey via Boost
On 10/22/19 11:03 AM, Vinnie Falco via Boost wrote:
On Tue, Oct 22, 2019 at 10:40 AM Robert Ramey via Boost
wrote: Did you look into using boost.spirit for parsing?
I didn't even consider it. All of the parsers that I work with accept untrusted inputs,
I don't buy this as a reason - in fact I'd call it reason to specify and enforce a formal grammar.
so when writing a parser I prefer to have no external dependencies. Spirit is an enormous dependency and scares off potential users.
This I can appreciate and sympathize with. It points to the problem that C++ and Boost have been wrestling with forever - dependency management.
I'm pretty doubtful that anyone can write a demonstrably correct parser without using such a tool. So if security is an issue perhaps you might want to write a test program based on spirit. This would give one confidence that your library won't introduce security holes - at least on the inputs tested. This might be appreciated. Since users don't typically build/run tests (though I've advocated that they should!) there wouldn't be any kind of dependency issue for them. And it would give you the option of writing 1000++ test cases without having to check them all by hand.
I'd consider covering the thing with https://google.github.io/oss-fuzz/ instead. Best regards, -- Mateusz Loskot, http://mateusz.loskot.net
On Tue, Oct 22, 2019 at 12:55 PM Mateusz Loskot via Boost
I'd consider covering the thing with https://google.github.io/oss-fuzz/ instead.
My strategy for ensuring correctness is two-fold. First, as with Beast, it will be reviewed by an external company (they will do the fuzzing). Second are the special tests I write so that I have confidence everything works. The methodology is as follows:

* Create a set of representative test vectors (examples of correct and invalid inputs). I have written my own inputs, and I have imported these tests: https://github.com/nst/JSONTestSuite/tree/master/test_parsing

Then, for each test vector:

* Parse the input as one string and verify the output.
* Loop over every possible location at which the input may be split into two pieces, parse the input as two individual pieces, and verify the output. Code: https://github.com/vinniefalco/json/blob/cb348218345cfe2bea09d4a8ca8ea4c0f13...

Then, for each possible split point, also perform these algorithms:

* Using a special allocator (`fail_storage`) which throws after N calls to allocate, attempt to parse the input in a loop where N starts out at 1 and is incremented on each allocation failure. The test succeeds if the loop exits after a maximum number of iterations and the output is verified correct. Code for this failing allocator is here: https://github.com/vinniefalco/json/blob/cb348218345cfe2bea09d4a8ca8ea4c0f13...
* Using a special parser (`fail_parser`) which returns an error after N calls to the parser's SAX API, attempt to parse the input in a loop where N starts out at 1 and is incremented on each failure. The test succeeds if the loop exits after a maximum number of iterations and the output is verified correct. Code for this failing parser is here: https://github.com/vinniefalco/json/blob/cb348218345cfe2bea09d4a8ca8ea4c0f13...

These tests are run under valgrind, address sanitizer, undefined behavior sanitizer, and code coverage. Then I look at the code coverage to find uncovered or partially covered lines, and devise individual tests to ensure that code is exercised. By now there are only a handful of such lines, if that. With these techniques I achieve close to 100% code coverage and very high confidence that every path through the parser is correct.

After a bunch of testing (which consists of telling users it is "ready" and seeing what they report back), I submit it to the external code auditing company to get a report. After fixing any issues they raise in the report, my strategy changes: touch the code as little as possible. If this code used an external dependency, and that upstream code changed, then transitively my code changed too - for this reason I avoid using external code like Spirit (or regex), even if it means I have to duplicate stuff. Thanks
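As an illustration of the split-point loop described above (this is not the library's test code), here is a self-contained sketch in which the real parser is replaced by a trivial stand-in that just accumulates input; the shape of the loop is the part that matters.

    // Illustrative stand-in, not the library's parser or test suite.
    #include <cassert>
    #include <cstddef>
    #include <string>

    // Stand-in for an incremental (online) parser: feed chunks, then finish.
    struct toy_parser
    {
        std::string buffer;
        void write_some(char const* data, std::size_t size) { buffer.append(data, size); }
        void finish() { /* a real parser would validate end-of-document here */ }
        std::string const& result() const { return buffer; }
    };

    // For every possible split point, feed the input as two pieces and
    // verify the outcome matches what is expected from a single-piece parse.
    void test_all_splits(std::string const& input, std::string const& expected)
    {
        for(std::size_t i = 0; i <= input.size(); ++i)
        {
            toy_parser p;
            p.write_some(input.data(), i);                    // first piece
            p.write_some(input.data() + i, input.size() - i); // remainder
            p.finish();
            assert(p.result() == expected);
        }
    }

    int main()
    {
        test_all_splits(R"({"key":[1,2,3]})", R"({"key":[1,2,3]})");
    }

The fail_storage and fail_parser tests follow the same pattern, except that the inner loop increments the failure threshold N instead of the split position.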
On 10/22/19 1:21 PM, Vinnie Falco via Boost wrote:
On Tue, Oct 22, 2019 at 12:55 PM Mateusz Loskot via Boost
wrote: I'd consider covering the thing with https://google.github.io/oss-fuzz/ instead.
My strategy for ensuring correctness is two-fold.
<snip> I read all that. I'm not convinced, but I'm sure that's just me. The basic problem is that there's a huge amount of manual labor in verifying that the test strings are parsed correctly. This issue isn't addressed by giving it to some auditing company or running more test cases. With my method, one specifies and verifies the grammar as a separate entity. It's not a huge job for something like JSON (for XML - a different kettle of fish). Then for each test string - wherever you get them - the results of the hand-generated parse MUST be identical to the Spirit-generated one. If in the future some question arises regarding a particular case, just add it to the list and re-run the tests. If someone "fixes" the hand-rolled parser, it's trivial to re-run all the tests again. My real point is that specifying one's grammar using a version of Spirit is an activity which can be done in a way that is, in a real sense, provably correct - which the other methods you cite aren't. I really believe that using this approach will save thousands of man-hours. And it's not difficult to try out - probably a couple of days learning Spirit and one day writing the JSON grammar in Spirit template code, if that. It's quite likely that someone has already written a parser for JSON in terms of Spirit. My argument is basically one about the most economical method to guarantee correct results. Robert Ramey
On 2019-10-22 23:21, Vinnie Falco via Boost wrote:
On Tue, Oct 22, 2019 at 12:55 PM Mateusz Loskot via Boost
wrote: I'd consider covering the thing with https://google.github.io/oss-fuzz/ instead.
My strategy for ensuring correctness is two-fold. First, as with Beast, it will be reviewed by an external company (they will do the fuzzing). Second, are the special tests I write so that I have confidence everything works.
[snip]
With these techniques I achieve close to 100% code coverage and very high confidence that every path through the parser is correct. After a bunch of testing (which consists of telling users it is "ready" and seeing what they report back) I submit it to the external code auditing company to get a report. After fixing any issues they raise in the report, my strategy changes: touch the code as little as possible. If this code used an external dependency, and that upstream code changed, then transitively it means my code changed - for this reason I avoid using external code like Spirit (or regex) even if it means I have to duplicate stuff.
Although I appreciate that you achieve a great degree of code stability with your approach, I find it counter-productive since it encourages constantly reinventing the wheel and monolithic design. Imagine if every library in Boost had its own copy of e.g. shared_ptr. No matter how well tested and reliable each copy is, that would be a nightmare for users. IMO, we should reuse the well designed, reviewed and tested code that we already have (and Boost.Spirit is an example of such). Otherwise all that work someone put in that code was done in vain.
On Tue, Oct 22, 2019 at 1:50 PM Andrey Semashev via Boost
it encourages constantly reinventing the wheel and monolithic design
We're talking specifically about the JSON parser, not the other components, and yes, this is intentional. "Monolithic design" is a desirable feature of algorithms which are exposed to untrusted inputs.
Imagine if every library in Boost had its own copy of e.g. shared_ptr.
That's not the same scenario. `shared_ptr` is not exposed to untrusted inputs. If JSON were to use Spirit, then the amount of code requiring audit would be much larger. The tradeoff here makes it worth eschewing Spirit even at the expense of some redundancy. Furthermore, json::basic_parser is optimized to work well with incremental inputs; Spirit, on the other hand, is not. But I'm no Spirit expert, so if someone wants to show me how a Spirit-based parser can match the hand-rolled version feature for feature at the same or better performance, that would be quite interesting. Thanks
On 10/22/19 2:03 PM, Vinnie Falco via Boost wrote:
Furthermore json::basic_parser is optimized to work well for incremental inputs, Spirit on the other hand is not. But I'm no Spirit expert so if someone wants to show me how a Spirit based parser can match the hand-rolled version feature for feature at the same or better performance, that would be quite interesting.
First of all, I'm not sure what you mean by incremental inputs. Maybe it means the same as in the serialization library's XML parser, which I think does one phrase on demand rather than setting it up and letting 'er rip. I've conceded to your decision about not using Spirit, rather than your hand-rolled version, as a component of your JSON parser. My suggestion is that you consider it for use in your test suite, where dependencies are not an issue but provable correctness is. It's much easier to verify the correctness of a grammar in BNF or PEG than it is to verify the correctness of a hand-rolled parser. So you can make a test program in Spirit whose correctness can be statically verified, use it to test the same strings you use to test your hand-rolled version, and verify you get the exact same results. This avoids having to manually verify the correct results for every test string, and makes the process of adding a new test string trivial rather than onerous. You've described in detail your plans for ensuring that your parser is correct. The way you do it, it's a huge amount of work involving many people and subject to human frailties. And the correctness of your checking can't really be verified by someone other than the person who does it. Adding more people and more work doesn't really increase confidence. Using an alternative method which is more provably correct is much more likely to smoke out a bug. And it's a lot less work too! Robert Ramey
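A sketch of the comparison harness being proposed might look like the following (assuming an accepts-style predicate for each parser, e.g. a Spirit-based recognizer like the one sketched earlier, or any other independent implementation; both predicates here are hypothetical stand-ins):

    // Illustrative harness; both predicates are hypothetical stand-ins.
    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    // "Does this parser accept the input?" predicate for each implementation.
    using accepts_fn = std::function<bool(std::string const&)>;

    // Runs every corpus entry through both parsers and reports disagreements.
    int compare_parsers(std::vector<std::string> const& corpus,
                        accepts_fn hand_rolled, accepts_fn reference)
    {
        int disagreements = 0;
        for(auto const& input : corpus)
        {
            bool a = hand_rolled(input);
            bool b = reference(input);
            if(a != b)
            {
                ++disagreements;
                std::cerr << "disagreement on: " << input
                          << " (hand-rolled=" << a << ", reference=" << b << ")\n";
            }
        }
        return disagreements;
    }

Adding a new test string then amounts to appending it to the corpus; no expected output needs to be written by hand, which is the labor saving described above.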
On Tue, Oct 22, 2019 at 2:42 PM Robert Ramey via Boost
...I'm not sure what you mean by incremental inputs.
It means that the parser is an "online algorithm": https://en.wikipedia.org/wiki/Online_algorithm Thanks
You guys are overthinking it. You know how fast this JSON parser is, you like the API (or you don't), it works correctly => you use it (or you don't). If correctness is critical for you, wait a few months until the rate of commits declines. File bugs if you find them. Who cares _how_ it works? On Tue, Oct 22, 2019 at 2:42 PM Robert Ramey via Boost < boost@lists.boost.org> wrote:
On 10/22/19 2:03 PM, Vinnie Falco via Boost wrote:
Furthermore json::basic_parser is optimized to work well for incremental inputs, Spirit on the other hand is not. But I'm no Spirit expert so if someone wants to show me how a Spirit based parser can match the hand-rolled version feature for feature at the same or better performance, that would be quite interesting.
First of all, I'm not sure what you mean by incremental inputs. Maybe it means the same as in the serialization library's XML parser, which I think does one phrase on demand rather than setting it up and letting 'er rip.
I've conceded to your decision about not using Spirit, rather than your hand-rolled version, as a component of your JSON parser. My suggestion is that you consider it for use in your test suite, where dependencies are not an issue but provable correctness is. It's much easier to verify the correctness of a grammar in BNF or PEG than it is to verify the correctness of a hand-rolled parser. So you can make a test program in Spirit whose correctness can be statically verified, use it to test the same strings you use to test your hand-rolled version, and verify you get the exact same results. This avoids having to manually verify the correct results for every test string, and makes the process of adding a new test string trivial rather than onerous.
You've described in detail your plans for ensuring that your parser is correct. The way you do it, it's a huge amount of work involving many people and subject to human frailties. And the correctness of your checking can't really be verified by someone other than the person who does it. Adding more people and more work doesn't really increase confidence. Using an alternative method which is more provably correct is much more likely to smoke out a bug. And it's a lot less work too!
On 10/22/19 2:49 PM, Emil Dotchevski via Boost wrote:
You guys are overthinking it. You know how fast this JSON parser is, you like the API (or you don't), it works correctly => you use it (or you don't). If correctness is critical for you, wait a few months until the rate of commits declines. File bugs if you find them. Who cares _how_ it works?
With all due respect, this is exactly the wrong way of going about making a non-trivial program that one knows is bug-free. It's been the plague of software development since its inception. It has been commented on extensively for 40 years. Dijkstra is famous for making the most strident case against it, closely followed by Hoare and many, many others. These "theoreticians" have never gotten any respect from "real programmers" such as ourselves. But they are right and we are wrong. The problem is that they were never really able to specify practical methods to implement their ideas of provable correctness - except for trivial, toy-like problems. So here we are, with programs of continuously increasing complexity and continuously decreasing reliability. What you're describing above is what a lot of people call "Agile" development - which effectively means pass your bugs on to your users and let them suffer the consequences. Then when they complain - tell them it's their fault for not sending a bug report. (Or in the case of C++ - write a paper.) In my humble view, we've failed as a profession. What if this approach were applied to things like a flight computer with hundreds of passengers on board? Whoops - it has been. Now this particular case - providing a provably correct parser for a simple language - IS a toy problem. Implementing a parser derived from a formal grammar CAN be done quite easily with Boost.Spirit - if only for the testing portion of the project. (Of course, if Spirit were used in the main product, much less testing would be required!) Both Vinnie's method and your method create a huge amount of extra work for the various persons involved in the process. I'm not overthinking it; you're not thinking big enough. It's not just us cranking up some code, it's sucking in all the other participants to do our work for us. Sorry, I sort of got off on a tangent. Robert Ramey
On Tue, Oct 22, 2019 at 3:13 PM Robert Ramey via Boost < boost@lists.boost.org> wrote:
Now this particular case - providing a provably correct parser for a simple language - IS a toy problem.
It does not follow that the only valid reason to do it is to play games.
Implementing a parser derived from a formal grammar CAN be done quite easily with Boost.Spirit - if only for the testing portion of the project. (Of course, if Spirit were used in the main product, much less testing would be required!)
Both Vinnie's method and your method create a huge amount of extra work for the various persons involved in the process. I'm not overthinking it; you're not thinking big enough. It's not just us cranking up some code, it's sucking in all the other participants to do our work for us.
I get your point that it's easier to reason about correctness if you can trust the individual components. This reflects a "white box testing" mentality. But if Spirit is bug-free, logically it does not follow that a library that uses it has a better chance of being bug-free, compared to a library that does not. Either way, you ought to test, ignoring any and all knowledge of its internals. Consider also that even if using tested components can get you to a bug-free implementation quicker (which isn't proven in this case), on balance it is better if you don't depend on another library. I'd be surprised if you think otherwise.
On 10/22/19 3:42 PM, Emil Dotchevski via Boost wrote:
I get your point that it's easier to reason about correctness if you can trust the individual components.
Right.
This reflects a "white box testing" mentality.
I have heard the term - but I never knew what it meant.
But if Spirit is bug-free, logically it does not follow that a library that uses it has a better chance of being bug-free, compared to a library that does not.
Either way, you ought to test, ignoring any and all knowledge of its internals.
I don't think anyone has suggested otherwise. But some approaches require much more testing than others.
Consider also even if using tested components can get you to bug-free implementation quicker (which isn't proven in this case),
Hmmm - it's proven to my satisfaction. This is the key point. Using Spirit to generate a parser is a fundamentally different task from crafting a sequential procedure which parses arbitrary text and invokes events when grammatical elements are recognized. It's very different from what one is used to, so it takes more time to learn than one would think. BUT it reduces the job to specifying the grammar and specifying the actions. There is no coding in the sense that writing a hand-rolled parser involves. It's totally static - there is no notion of changing state when specifying the parser. So it's inherently verifiable - unlike code which changes state as the instruction pointer passes over it. This is why I believe that this approach will get one to a bug-free implementation faster, and that one will be able to verify that the implementation is correct. I realize that not everyone buys this, but it's what I believe. Of course, the question is raised about compile-time/runtime efficiency. The above says nothing about those questions, but I would like them to be considered separately from correctness. In many instances, correctness cannot be compromised, and compile/run time efficiency has to be a second priority.
on balance it is better if you don't depend on another library.
Also, I disagree with this. All things being equal, I'd rather depend on someone else's work than my own. I want to focus on only those things that are unique to my situation. And I get no satisfaction from doing something that someone else has already done better. Of course, the rub is "all things being equal", which of course is an idealization. Sometimes it's mostly true, while other times it's not. We get to decide - that's why we make the big bucks. As time goes on, the problem of C++ code dependencies has become a bigger issue. Part of this is that there is a whole generation of programmers who don't need to know about linkers and libraries, and header-only works fine for them. And header-only is more convenient. But it does prejudice code reuse - which has its own downsides. I would very much like to see C++ make progress toward making better use of linkers, separate compilation, better support for control of visibility, etc. But we're stuck here. Robert Ramey
On Tue, Oct 22, 2019 at 8:39 PM Robert Ramey via Boost < boost@lists.boost.org> wrote:
On 10/22/19 3:42 PM, Emil Dotchevski via Boost wrote:
I get your point that it's easier to reason about correctness if you can trust the individual components.
Right.
This reflects a "white box testing" mentality.
I have heard the term - but I never knew what it meant.
It means you approach testing with knowledge of how the program works. In this case, you reason about the correctness of a JSON parser based on the correctness of Spirit. For example, you might skip testing certain input sets because you'd reason that their impact is limited to Spirit, and you *know* it works. In contrast, "black box" testing specifically ignores such reasoning and focuses on testing for correct behavior without knowing how the program works. Ideally, whether or not you can rely on a component to be bug-free should have no impact on the thoroughness of the tests.
But if Spirit is bug-free, logically it does not follow that a library that uses it has better chance of being bug-free, compared to a library that does not.
on balance it is better if you don't depend on another library.
Also I disagree with this. All things being equal, I'd rather depend on someone else's work than my own. I want to focus on only those things that are unique to my situation. And I get no satisfaction from doing something that someone else has already done better.
What I mean is that your users would appreciate that your library has one fewer dependency.
On 10/22/19 10:25 PM, Emil Dotchevski via Boost wrote:
What I mean is that your users would appreciate that your library has one fewer dependency.
I understand this and agree with it. But I think it's because our management of dependencies needs more insight, work, and new ideas. Huge progress has been made with source control, online code repositories, better documentation (spotty), etc. But we still have problems managing linkage, ABI compatibility, versioning, compile switches, portability, etc., which makes it seem more convenient to just copy the code right into one's project. So the users aren't wrong; we just haven't been able to give them what they need to convince them to avoid doing this. Robert Ramey
On Tue, Oct 22, 2019 at 1:50 PM Andrey Semashev via Boost < boost@lists.boost.org> wrote:
IMO, we should reuse the well designed, reviewed and tested code that we already have (and Boost.Spirit is an example of such). Otherwise all that work someone put in that code was done in vain.
Not always. The exception is when a library can have zero dependencies, which is very valuable for low level libraries. For example, in LEAF I need a piece of mp11, but I've copied that piece over in LEAF so there's no dependency on anything. If you have one dependency, you might as well have 2 or 3.
On 2019-10-23 00:30, Emil Dotchevski via Boost wrote:
On Tue, Oct 22, 2019 at 1:50 PM Andrey Semashev via Boost < boost@lists.boost.org> wrote:
IMO, we should reuse the well designed, reviewed and tested code that we already have (and Boost.Spirit is an example of such). Otherwise all that work someone put in that code was done in vain.
Not always. The exception is when a library can have zero dependencies, which is very valuable for low level libraries. For example, in LEAF I need a piece of mp11, but I've copied that piece over in LEAF so there's no dependency on anything.
If you have one dependency, you might as well have 2 or 3.
Well, it depends on the amount of code you duplicate, but in general, by duplicating code you're not removing code from the user, you're adding it and hiding it from the user. In other words, the user will have to compile your duplicated code and any other duplicates he may use directly or indirectly. In the end, you've made it worse for the user.
On 10/22/19 12:54 PM, Mateusz Loskot via Boost wrote:
I'd consider covering the thing with https://google.github.io/oss-fuzz/ instead.
That's a separate issue. That can generate test cases, but can't check that the test is parsed correctly.
Best regards,
On Tue, 22 Oct 2019 at 22:26, Robert Ramey via Boost
On 10/22/19 12:54 PM, Mateusz Loskot via Boost wrote:
I'd consider covering the thing with https://google.github.io/oss-fuzz/ instead.
That's a separate issue. That can generate test cases, but can't check that the test is parsed correctly.
Yes, correct. My interpretation of the "option of writing 1000++ test cases" was too far-fetched. Best regards, -- Mateusz Loskot, http://mateusz.loskot.net
participants (21)
- Andrey Semashev
- Bjorn Reese
- Daniel Frey
- David Bellot
- Dominique Devienne
- Emil Dotchevski
- Gavin Lambert
- Glen Fernandes
- JF
- Julien Blanc
- julien.blanc@tgcm.eu
- Mateusz Loskot
- Mathias Gaunard
- Michael Caisse
- Niall Douglas
- Phil Endecott
- Rene Rivera
- Robert Ramey
- Vicram Rajagopalan
- Vinnie Falco
- Vinícius dos Santos Oliveira