[xpressive] Using it with input iterators or how far the backtracing will go back
Greetings. There exists a fairly common need to apply regexp matching to streams, rather than to blocks of data. Streams are represented by input iterators, which have no ability to decrement their value. On the other hand, xpressive requires a bidirectional iterator, to go back and forth as required by backtracking (if I understand correctly). One approach to work around this limitation is to create an iterator adapter, which will store a copy of every character delivered by input iterator and provide expressive with bidi iterator to work with. So the question is, how can one establish the size for this iterator adapter's internal storage so as to avoid match failures due to incomplete storage and avoid storing too much unnecessary data? I suppose, the storage can be reset when the match is reported. This, however, still leaves an open problem for situations where matches are separated by, possibly, gigabytes of non-matching data. Alternatively, may there be a way to relax the iterator requirements of the xpressive library itself? If xpressive could be made to store partial match data within its data structures, bidi iterators would not be strictly necessary. Thanks.
On Fri, Nov 5, 2010 at 3:31 AM, Alex Dubov
So the question is, how can one establish the size for this iterator adapter's internal storage so as to avoid match failures due to incomplete storage and avoid storing too much unnecessary data? I suppose, the storage can be reset when the match is reported. This, however, still leaves an open problem for situations where matches are separated by, possibly, gigabytes of non-matching data.
In the case of single-line matching, backtracking need only going back to the start of the current line, so I think the buffering only needs to be line based. For arbitrarily long multi-line matching, I don't know... --DD
On 11/5/2010 10:21 AM, Dominique Devienne wrote:
On Fri, Nov 5, 2010 at 3:31 AM, Alex Dubov
wrote: So the question is, how can one establish the size for this iterator adapter's internal storage so as to avoid match failures due to incomplete storage and avoid storing too much unnecessary data? I suppose, the storage can be reset when the match is reported. This, however, still leaves an open problem for situations where matches are separated by, possibly, gigabytes of non-matching data.
In the case of single-line matching, backtracking need only going back to the start of the current line, so I think the buffering only needs to be line based. For arbitrarily long multi-line matching, I don't know... --DD
In that case, the iterator may be backed up all the way to the beginning of the match, so if the iterator is caching the input, it would need to cache it all. There is no way to relax xpressive's iterator requirements. If you're using file streams, you could probably write an iterator that manipulates the stream's buffer, adjusting it appropriately when the buffer underflows on backtracking. Just a thought. -- Eric Niebler BoostPro Computing http://www.boostpro.com
Eric Niebler
On 11/5/2010 10:21 AM, Dominique Devienne wrote:
On Fri, Nov 5, 2010 at 3:31 AM, Alex Dubov
wrote: So the question is, how can one establish the size for this iterator
adapter's
internal storage so as to avoid match failures due to incomplete storage and avoid storing too much unnecessary data? I suppose, the storage can be reset when the match is reported. This, however, still leaves an open problem for situations where matches are separated by, possibly, gigabytes of non- data.
In the case of single-line matching, backtracking need only going back to the start of the current line, so I think the buffering only needs to be line based. For arbitrarily long multi-line matching, I don't know... --DD
In that case, the iterator may be backed up all the way to the beginning of the match, so if the iterator is caching the input, it would need to cache it all.
That's exactly the approach I would like to follow. Unfortunately, xpressive currently has no way to communicate the necessary information out. What could be nice to have is a notification of some sort for three events: 1. First possibly matching character is encountered (at which point iterator adapter can start caching). 2. All possible matches abandoned (iterator cache can be truncated). 3. Match is found (well, it's already there:-). The not-so-exciting alternative to having this functionality in xpressive is going back to C-style matching engines for stream filtering.
On 11/6/2010 4:58 AM, Alex Dubov wrote:
Eric Niebler
writes: On 11/5/2010 10:21 AM, Dominique Devienne wrote:
On Fri, Nov 5, 2010 at 3:31 AM, Alex Dubov
wrote: So the question is, how can one establish the size for this iterator adapter's internal storage so as to avoid match failures due to incomplete storage and avoid storing too much unnecessary data? I suppose, the storage can be reset when the match is reported. This, however, still leaves an open problem for situations where matches are separated by, possibly, gigabytes of non- data.
In the case of single-line matching, backtracking need only going back to the start of the current line, so I think the buffering only needs to be line based. For arbitrarily long multi-line matching, I don't know... --DD
In that case, the iterator may be backed up all the way to the beginning of the match, so if the iterator is caching the input, it would need to cache it all.
That's exactly the approach I would like to follow. Unfortunately, xpressive currently has no way to communicate the necessary information out.
What could be nice to have is a notification of some sort for three events: 1. First possibly matching character is encountered (at which point iterator adapter can start caching).
I thought about this and don't think it's practical. Xpressive's state machine is built from matcher components. What you're asking for is a way to signal that you are *exactly* one character into a match. That would require changing *all* matcher components so that (a) they know if they are the first non-zero-width matcher in the state machine, and (b) they signal when they successfully match the first character. It could be done, but I'm not going to maintain that. Sorry.
2. All possible matches abandoned (iterator cache can be truncated). 3. Match is found (well, it's already there:-).
The not-so-exciting alternative to having this functionality in xpressive is going back to C-style matching engines for stream filtering.
Yep. :-( But like I said, if your stream is like a file stream, then you can manipulate the underlying buffers to give you a sliding window, paging data in and out as needed. Good luck. -- Eric Niebler BoostPro Computing http://www.boostpro.com
Eric Niebler
The not-so-exciting alternative to having this functionality in xpressive
is
going back to C-style matching engines for stream filtering.
Yep. But like I said, if your stream is like a file stream, then you can manipulate the underlying buffers to give you a sliding window, paging data in and out as needed.
Infinite streams (as in presently discussed issue) have an unhealthy tendency to come from the network and end up there as well (and the memory at the filtering node is not that abundant). Files, of course, seldom require such complications.
On Sun, Nov 7, 2010 at 4:32 PM, Eric Niebler
Eric Niebler
writes: On 11/5/2010 10:21 AM, Dominique Devienne wrote:
On Fri, Nov 5, 2010 at 3:31 AM, Alex Dubov
wrote: So the question is, how can one establish the size for this iterator adapter's internal storage so as to avoid match failures due to incomplete storage and avoid storing too much unnecessary data?
In the case of single-line matching, backtracking need only going back to the start of the current line, so I think the buffering only needs to be line based. For arbitrarily long multi-line matching, I don't know... --DD
In that case, the iterator may be backed up all the way to the beginning of the match, so if the iterator is caching the input, it would need to cache it all.
like I said, if your stream is like a file stream, then you can manipulate the underlying buffers to give you a sliding window, paging data in and out as needed.
Once upon a time, the Serialization library came with backtracking input iterators used to parse XML input files with Spirit. I briefly glanced over more recent Serialization documentation without spotting them. Do they still exist?
Nat Linden
Once upon a time, the Serialization library came with backtracking input
iterators used to parse XML input files with Spirit. I briefly glanced over more recent Serialization documentation without spotting them. Do they still exist?
Yes, in a way. Spirit only requires forward iterators, not bidi ones, like xpressive. Input iterators can be adapted to become forward ones in a fairly straightforward way, and indeed, spirit has a customizable multi_pass iterator adapter to do just that. Up until now I thought spirit to be an overkill for my stuff, but I'm considering using it now.
On Nov 8, 2010, at 12:36 PM, Alex Dubov
Nat Linden
writes: Once upon a time, the Serialization library came with backtracking input
iterators used to parse XML input files with Spirit.
indeed, spirit has a customizable multi_pass iterator adapter to do just that.
Aha, so the multi-pass iterator migrated to Spirit itself. Cool.
(cross-posting to spirit-devel) On 11/8/2010 12:36 PM, Alex Dubov wrote:
Nat Linden
writes: Once upon a time, the Serialization library came with backtracking input iterators used to parse XML input files with Spirit. I briefly glanced over more recent Serialization documentation without spotting them. Do they still exist?
Yes, in a way. Spirit only requires forward iterators, not bidi ones, like xpressive. Input iterators can be adapted to become forward ones in a fairly straightforward way, and indeed, spirit has a customizable multi_pass iterator adapter to do just that.
Up until now I thought spirit to be an overkill for my stuff, but I'm considering using it now.
Just hazarding a guess here, but the multi_pass_iterator probably works by caching all the read data so far. Otherwise, how would it support making a copy of the begin iterator and dereferencing it at a later point in time? If that solution wasn't good enough for you before, you probably don't want multi_pass_iterator+Spirit. Can the Spirit guys chime in on this? -- Eric Niebler BoostPro Computing http://www.boostpro.com
On 11/9/2010 3:03 AM, Eric Niebler wrote:
(cross-posting to spirit-devel)
On 11/8/2010 12:36 PM, Alex Dubov wrote:
Nat Linden
writes: Once upon a time, the Serialization library came with backtracking input iterators used to parse XML input files with Spirit. I briefly glanced over more recent Serialization documentation without spotting them. Do they still exist?
Yes, in a way. Spirit only requires forward iterators, not bidi ones, like xpressive. Input iterators can be adapted to become forward ones in a fairly straightforward way, and indeed, spirit has a customizable multi_pass iterator adapter to do just that.
Up until now I thought spirit to be an overkill for my stuff, but I'm considering using it now.
Just hazarding a guess here, but the multi_pass_iterator probably works by caching all the read data so far. Otherwise, how would it support making a copy of the begin iterator and dereferencing it at a later point in time? If that solution wasn't good enough for you before, you probably don't want multi_pass_iterator+Spirit.
Can the Spirit guys chime in on this?
The multi_pass iterator uses a sliding window that can be flushed when you reach a deterministic point, a point of no return where back-tracking is no longer an option. In this case, saved iterators are invalidated. Regards, -- Joel de Guzman http://www.boostpro.com http://spirit.sf.net
Joel de Guzman
On 11/9/2010 3:03 AM, Eric Niebler wrote:
(cross-posting to spirit-devel)
Just hazarding a guess here, but the multi_pass_iterator probably works by caching all the read data so far. Otherwise, how would it support making a copy of the begin iterator and dereferencing it at a later point in time? If that solution wasn't good enough for you before, you probably don't want multi_pass_iterator+Spirit.
Can the Spirit guys chime in on this?
The multi_pass iterator uses a sliding window that can be flushed when you reach a deterministic point, a point of no return where back-tracking is no longer an option. In this case, saved iterators are invalidated.
I was under impression that for most grammars explicit flush is not even necessary. multi_pass maintains (at least, this follows from documentation) reference counts for all of its copies, so the moment specific "past" iterator position is not longer referenced by the parser the storage will be truncated automatically (when employing "split_std_deque" storage policy). There's also a faster "fixed_size_queue" policy, but it requires knowing in advance how far the backtracking can possibly go for a given grammar.
On 11/9/2010 3:25 PM, Alex Dubov wrote:
Joel de Guzman
writes: On 11/9/2010 3:03 AM, Eric Niebler wrote:
(cross-posting to spirit-devel)
Just hazarding a guess here, but the multi_pass_iterator probably works by caching all the read data so far. Otherwise, how would it support making a copy of the begin iterator and dereferencing it at a later point in time? If that solution wasn't good enough for you before, you probably don't want multi_pass_iterator+Spirit.
Can the Spirit guys chime in on this?
The multi_pass iterator uses a sliding window that can be flushed when you reach a deterministic point, a point of no return where back-tracking is no longer an option. In this case, saved iterators are invalidated.
I was under impression that for most grammars explicit flush is not even necessary. multi_pass maintains (at least, this follows from documentation) reference counts for all of its copies, so the moment specific "past" iterator position is not longer referenced by the parser the storage will be truncated automatically (when employing "split_std_deque" storage policy). There's also a faster "fixed_size_queue" policy, but it requires knowing in advance how far the backtracking can possibly go for a given grammar.
Yep. That is true. You know spirit and your multi_pass well :-) Also, with Qi, we rigged it up such that we automatically flush on expectation points; e.g. *(a > b). This is very efficient and a must especially when parsing multi-gigabytes of data. Regards, -- Joel de Guzman http://www.boostpro.com http://spirit.sf.net
participants (6)
-
Alex Dubov
-
Dominique Devienne
-
Eric Niebler
-
Joel de Guzman
-
Nat Goodspeed
-
Nat Linden