Re: [boost] Potential Boost SAX library
On 01/09/18 18:36, Oliver Adams via Boost wrote:
I was wondering if a library I'm developing would be of value to the Boost community. It is basically an event-driven parsing/serialization library for common formats using a standard internal representation or simple pass-through conversions. Would anyone be interested in something like this being added to Boost?
There are two kinds of incremental parsers: push parsers (SAX) and pull parsers (approximately StAX). Briefly put, a push parser traverses the input automatically and generates an event for each token it finds, whereas a pull parser is advanced manually, like an iterator, and the current token can be queried.
Pull parsers have some significant advantages over push parsers:
* It is straightforward to implement a push parser on top of a pull parser. This involves a loop and a switch statement (see [1] for a complete example). Going in the other direction involves the use of coroutines, most likely stateful coroutines.
* Contextual parsing can be done directly, unlike push parsers where you have to maintain contextual state in the event handler.
* Push parsers can be used directly in Boost.Serialization archives.
* Pull parsers are composable. For instance, you could insert a URL pull parser directly into an HTTP pull parser.
For a pull parser framework see:
https://github.com/breese/trial.protocol
The documentation is a bit old though.
[1] http://breese.github.io/trial/protocol/trial_protocol/json/tutorial/push_par...
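The "loop and a switch statement" point can be illustrated with a minimal sketch. It assumes a toy format of bracketed integers such as "[1, 22]". The names pull_reader, handler, recorder and push_parse are hypothetical and are not taken from trial.protocol or any other library.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical token kinds for a toy format such as "[1, 22]".
enum class token { begin_array, end_array, integer, end, error };

// Pull parser: the caller advances it with next() and queries the current token.
class pull_reader {
public:
    explicit pull_reader(std::string input) : input_(std::move(input)) { next(); }

    token current() const { return current_; }

    int value() const {
        assert(current_ == token::integer);  // only meaningful for integers
        return value_;
    }

    // Advance to the next token. Commas and whitespace are skipped leniently.
    void next() {
        while (pos_ < input_.size() && (input_[pos_] == ',' || is_space(input_[pos_])))
            ++pos_;
        if (pos_ == input_.size()) { current_ = token::end; return; }
        const char c = input_[pos_];
        if (c == '[') { ++pos_; current_ = token::begin_array; }
        else if (c == ']') { ++pos_; current_ = token::end_array; }
        else if (is_digit(c)) {
            value_ = 0;
            while (pos_ < input_.size() && is_digit(input_[pos_]))
                value_ = value_ * 10 + (input_[pos_++] - '0');
            current_ = token::integer;
        }
        else { current_ = token::error; }
    }

private:
    static bool is_space(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
    static bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }

    std::string input_;
    std::size_t pos_ = 0;
    token current_ = token::end;
    int value_ = 0;
};

// Push (SAX-style) interface: the parser calls the handler.
struct handler {
    virtual ~handler() = default;
    virtual void on_begin_array() = 0;
    virtual void on_end_array() = 0;
    virtual void on_integer(int) = 0;
};

// Example handler that records the events it receives as text.
struct recorder : handler {
    std::string log;
    void on_begin_array() override { log += "["; }
    void on_end_array() override { log += " ]"; }
    void on_integer(int v) override { log += " " + std::to_string(v); }
};

// Push parser built on top of the pull parser: a loop and a switch.
bool push_parse(const std::string& input, handler& h) {
    pull_reader reader(input);
    for (;;) {
        switch (reader.current()) {
        case token::begin_array: h.on_begin_array(); break;
        case token::end_array:   h.on_end_array(); break;
        case token::integer:     h.on_integer(reader.value()); break;
        case token::end:         return true;
        case token::error:       return false;
        }
        reader.next();
    }
}
```

Calling push_parse("[1, 22]", r) with a recorder r drives the pull parser to the end of the input and forwards each token to the handler. The adapter needs no state of its own beyond the reader.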
There are two kinds of incremental parsers: push parsers (SAX) and pull parsers (approximately StAX). Briefly put, a push parser traverses the input automatically and generates an event for each token it finds, whereas a pull parser is advanced manually, like an iterator, and the current token can be queried.
My library is kind of a push-pull framework. You can ask the parser to parse one event (an event being the smallest unit the input format can be parsed in), and the parser then pushes the result to the output handler as one or more writes. The trouble is that where the parser stops is format-dependent, which for now limits the pull side to "event-loop" style parsing.
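A rough sketch of that "parse one event, push one or more writes" arrangement, under stated assumptions: the toy format is lines of the form "key=value", one line counts as one event, and the names output_handler, step_parser and run are hypothetical. This is not cppdatalib's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical output handler; it simply records every write it receives.
struct output_handler {
    std::vector<std::string> writes;
    void write(const std::string& s) { writes.push_back(s); }
};

// Hypothetical step-wise parser for lines of the form "key=value".
class step_parser {
public:
    explicit step_parser(std::string input) : input_(std::move(input)) {}

    bool done() const { return pos_ >= input_.size(); }

    // Parse exactly one event (here: one line) and push it to the handler
    // as one or more writes. Where an event ends depends on the format.
    void parse_one(output_handler& out) {
        assert(!done());
        std::size_t eol = input_.find('\n', pos_);
        if (eol == std::string::npos) eol = input_.size();
        const std::string line = input_.substr(pos_, eol - pos_);
        pos_ = eol + 1;
        const std::size_t eq = line.find('=');
        out.write(line.substr(0, eq));        // key, or the whole line if no '='
        if (eq != std::string::npos)
            out.write(line.substr(eq + 1));   // value
    }

private:
    std::string input_;
    std::size_t pos_ = 0;
};

// "Event-loop" style driver: the caller decides when each event is parsed.
std::size_t run(step_parser& parser, output_handler& out) {
    std::size_t events = 0;
    while (!parser.done()) {
        parser.parse_one(out);
        ++events;
    }
    return events;
}
```

The caller controls when parsing advances, but the results still arrive through the handler rather than as queryable tokens, which is the limitation described above.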
Pull parsers have some significant advantages over push parsers:
* It is straightforward to implement a push parser on top of a pull parser. This involves a loop and a switch statement (see [1] for a complete example). Going in the other direction involves the use of coroutines, most likely stateful coroutines.
Most of these features are not currently available in cppdatalib, because it does not expose individual tokens the way a pull parser does. If I refactored a few things, I might be able to get a full pull parser framework.
* Contextual parsing can be done directly, unlike push parsers where you have to maintain contextual state in the event handler.
Right now, contextual parsing is implemented in a base class of the output handler, so it's still isolated from the end user. Kind of hackish, though, since the parser queries the output handler for the structure of the data it's already read.
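One way to picture keeping contextual state in a handler base class is the sketch below. The names context_tracker, container and array_sum are hypothetical, and the design is an assumption for illustration rather than cppdatalib's actual class layout. The base class maintains a nesting stack, and both derived handlers and the parser can query it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class container { array, object };

// Hypothetical base class: the parser calls begin()/end()/scalar(), which
// maintain the nesting context before forwarding to the derived handler.
class context_tracker {
public:
    virtual ~context_tracker() = default;

    void begin(container c) { stack_.push_back(c); on_begin(c); }

    void end() {
        assert(!stack_.empty());  // end() must match an earlier begin()
        const container c = stack_.back();
        stack_.pop_back();
        on_end(c);
    }

    void scalar(int v) { on_scalar(v); }

    // Queries about the structure read so far.
    std::size_t depth() const { return stack_.size(); }
    bool in_object() const { return !stack_.empty() && stack_.back() == container::object; }

protected:
    virtual void on_begin(container) {}
    virtual void on_end(container) {}
    virtual void on_scalar(int) {}

private:
    std::vector<container> stack_;
};

// Example derived handler: sums only the scalars that sit directly in an array.
struct array_sum : context_tracker {
    int total = 0;

protected:
    void on_scalar(int v) override {
        if (depth() > 0 && !in_object()) total += v;
    }
};
```

The derived handler never manages the stack itself, so the bookkeeping stays out of user code. It still lives on the handler side, whereas a pull parser would let the caller keep that context in ordinary local variables or on the call stack.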
* Push parsers can be used directly in Boost.Serialization archives.
* Pull parsers are composable. For instance, you could insert a URL pull parser directly into an HTTP pull parser.
Composability is a big issue with push parsers, so removing obstacles to
that would greatly simplify some things. For certain types of information,
though, it doesn't seem like composition is important.
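To make the composition idea concrete, here is a minimal sketch in which a request-line pull parser embeds a second pull parser for the request target and forwards its tokens. The names target_reader and request_line_reader are hypothetical. The splitting is deliberately naive (whitespace-separated fields, a single '?'), so this is an illustration of the structure, not a conforming HTTP or URL parser.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>

// Inner pull parser: splits a request target into a path and an optional query.
class target_reader {
public:
    enum class token { path, query, end };

    explicit target_reader(std::string target) : text_(std::move(target)) { next(); }

    token current() const { return current_; }
    const std::string& value() const { return value_; }

    void next() {
        switch (step_) {
        case 0:
            question_ = text_.find('?');
            value_ = text_.substr(0, question_);
            current_ = token::path;
            step_ = (question_ == std::string::npos) ? 2 : 1;
            break;
        case 1:
            value_ = text_.substr(question_ + 1);
            current_ = token::query;
            step_ = 2;
            break;
        default:
            value_.clear();
            current_ = token::end;
            break;
        }
    }

private:
    std::string text_;
    std::size_t question_ = std::string::npos;
    int step_ = 0;
    token current_ = token::end;
    std::string value_;
};

// Outer pull parser for a line such as "GET /index.html?x=1 HTTP/1.1".
// It delegates the target to target_reader and forwards the inner tokens.
class request_line_reader {
public:
    enum class token { method, path, query, version, end };

    explicit request_line_reader(const std::string& line) : target_(std::string()) {
        std::istringstream in(line);
        std::string target;
        in >> value_ >> target >> version_;   // value_ starts out as the method
        target_ = target_reader(target);
        current_ = token::method;
    }

    token current() const { return current_; }
    const std::string& value() const { return value_; }

    void next() {
        switch (current_) {
        case token::method:
            forward_inner();                   // inner parser is already on its path
            break;
        case token::path:
        case token::query:
            target_.next();
            if (target_.current() == target_reader::token::end) {
                current_ = token::version;
                value_ = version_;
            } else {
                forward_inner();
            }
            break;
        case token::version:
            current_ = token::end;
            value_.clear();
            break;
        case token::end:
            break;
        }
    }

private:
    void forward_inner() {
        assert(target_.current() != target_reader::token::end);
        current_ = (target_.current() == target_reader::token::path) ? token::path
                                                                     : token::query;
        value_ = target_.value();
    }

    target_reader target_;
    std::string version_;
    token current_ = token::end;
    std::string value_;
};
```

The outer parser only needs the inner one's current() and next(), so the inner parser can be swapped or reused on its own. With a push design the same nesting would require chaining handlers.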
On Jan 13, 2018 5:05 AM, "Bjorn Reese via Boost"
2018-01-13 10:00 GMT-03:00 Oliver Adams via Boost
* Pull parsers are composable. For instance, you could insert a URL pull parser directly into an HTTP pull parser.
Composability is a big issue with push parsers, so removing obstacles to that would greatly simplify some things. For certain types of information, though, it doesn't seem like composition is important.
Composability is not always important. I've written an HTTP pull parser[1], but you won't always use the power to compose abstractions. However, it's really impressive what you can do with pull parsers.
For instance, HTTP is the type of format where you parse incomplete messages. The idea of reparsing from the beginning is not really feasible because you usually won't keep past data around. Given these conditions, look at how powerful an HTTP pull parser is: you can copy the parser and use the copy to "look ahead" (or backtrack, if you will). I've written an example where I compose the parser so that you can more or less assume field names and field values are always present together in the stream:
https://github.com/BoostGSoC14/boost.http/commit/9908fe06d4b2364ce18ea9b4162...
I don't really care about this specific example. I just like to point out how powerful this style really is. If I were to advocate for one style or another (and we're talking about HTTP parsers), I'd emphasize other characteristics.
[1] https://vinipsmaker.github.io/asiohttpserver/
--
Vinícius dos Santos Oliveira
https://vinipsmaker.github.io/
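The copy-to-look-ahead technique can be sketched as follows. This is not the code from the linked commit. It assumes a simplified header syntax ("Name: value" terminated by '\n'), and the names header_reader and next_field are hypothetical. The wrapper advances a copy of the parser and commits only once both the field name and the field value are available.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical pull parser over a growing buffer. It yields field names and
// field values as separate tokens, or `incomplete` if the data is not there yet.
class header_reader {
public:
    enum class token { field_name, field_value, incomplete };

    // The buffer must outlive the reader; it may grow but must not shrink.
    explicit header_reader(const std::string& buffer) : buffer_(&buffer) {}

    token next() {
        assert(pos_ <= buffer_->size());
        std::size_t start = pos_;
        if (expect_value_)
            while (start < buffer_->size() && (*buffer_)[start] == ' ') ++start;
        const std::size_t end = buffer_->find(expect_value_ ? '\n' : ':', start);
        if (end == std::string::npos) return token::incomplete;  // nothing consumed
        value_ = buffer_->substr(start, end - start);
        pos_ = end + 1;
        const token result = expect_value_ ? token::field_value : token::field_name;
        expect_value_ = !expect_value_;
        return result;
    }

    const std::string& value() const { return value_; }

private:
    const std::string* buffer_;
    std::size_t pos_ = 0;
    bool expect_value_ = false;
    std::string value_;
};

// Composed on top: yields a field only when name and value are both present.
bool next_field(header_reader& reader, std::string& name, std::string& value) {
    header_reader probe = reader;  // copying the parser is the "look ahead"
    if (probe.next() != header_reader::token::field_name) return false;
    std::string n = probe.value();
    if (probe.next() != header_reader::token::field_value) return false;
    name = std::move(n);
    value = probe.value();
    reader = probe;                // commit only after both tokens were read
    return true;
}
```

If the buffer ends in the middle of a value, next_field returns false and leaves the original reader untouched, so the caller can append more data and simply try again without reparsing from the beginning.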
participants (3)
- Bjorn Reese
- Oliver Adams
- Vinícius dos Santos Oliveira