[regex] Support for Perl's (*SKIP)
I'd like to see support for Perl's (*SKIP) regex verb in Boost. There are a number of verbs, but that one has an interesting and frequent use case: it enables searching for an expression, but only outside of some contexts.

There is a page called "The best regex trick" with details about the process and a number of examples. I can't link to it as this is my first message (I attempted before but it was rejected). It explains how to do it with and without (*SKIP). I've seen several questions on Stack Overflow asking how to accomplish that task, so it seems it's quite frequent to run into that need.

Let's say, for example, that we want to find the string 'foo' as an identifier in C. This is a crude example of a Perl regex that does it (a real one might need to be more elaborate; in particular, backslashes for line continuation are not considered):

    (?x-s)                   (?# free spacing, dot doesn't match newline)
    (?://.*+                 (?# eat single-line comment text)
      |/\*[\S\s]*?\*/        (?# eat multi-line comment text)
      |"(?:\\.|[^"\n])*+"    (?# eat string text)
    )(*SKIP)(?!)             (?# skip these)
    |\bfoo\b                 (?# match this)

boost::regex_search will match that expression only when foo is present outside of a string or comment.

Without (*SKIP), it can be done only by calling boost::regex_search multiple times, using an expression like this:

    (?-s)//.*+|/\*[\S\s]*?\*/|"(?:\\.|[^"\n])*+"|(\bfoo\b)

and ignoring every match where group 1 wasn't matched. That's presumed to be slower, and certainly more inconvenient for the programmer.

Support for this particular use case would be a great feature to have in the regex engine.

Sei
> Without (*SKIP), it can be done only by calling boost::regex_search multiple times, using an expression like this:
>
>     (?-s)//.*+|/\*[\S\s]*?\*/|"(?:\\.|[^"\n])*+"|(\bfoo\b)
>
> and ignoring every match where group 1 wasn't matched. That's presumed to be slower, and certainly more inconvenient for the programmer.
I've added a bug report for this: https://svn.boost.org/trac/boost/ticket/11205

However, in the meantime, there is a much simpler workaround, if you use:

    (?x-s)                   (?# free spacing, dot doesn't match newline)
    (?://.*+                 (?# eat single-line comment text)
      |/\*[\S\s]*?\*/        (?# eat multi-line comment text)
      |"(?:\\.|[^"\n])*+"    (?# eat string text)
    )                        (?# skip these)
    |\b(foo)\b               (?# match this)

And if $1 matched, then you have what you were looking for; otherwise discard the match. It's not as "neat" as the original, but it is neither less nor more efficient.

HTH, John.
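A minimal sketch of this workaround in C++ may make the "discard unless $1 matched" step concrete. It is an illustration under stated assumptions, not code from the thread: it uses std::regex with the default ECMAScript grammar instead of Boost.Regex so that it builds without extra libraries, which means the possessive `*+` quantifiers become plain `*` and the `(?x-s)` flags are dropped (in that grammar `.` already excludes newlines). The function name `find_foo_outside` is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Offsets of every `foo` identifier that lies outside comments and strings.
// Assumption: std::regex (ECMAScript grammar) stands in for Boost.Regex here.
std::vector<std::size_t> find_foo_outside(const std::string& text) {
    // Comments and strings are matched without a capture; only `foo` is
    // captured, as group 1.
    static const std::regex re(
        R"re(//.*|/\*[\s\S]*?\*/|"(?:\\.|[^"\n])*"|\b(foo)\b)re");

    std::vector<std::size_t> hits;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        // A match without group 1 is a comment or string: discard it.
        if (m[1].matched) {
            hits.push_back(static_cast<std::size_t>(m.position(1)));
        }
    }
    return hits;
}
```

The iterator resumes after each comment or string it consumes, so text inside those spans is never offered to the `foo` alternative.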
John Maddock wrote, On 2015-04-19 12:55:
> I've added a bug report for this: https://svn.boost.org/trac/boost/ticket/11205

Thank you very much.

> However, in the meantime, there is a much simpler workaround, if you use:
>
>     (?x-s)                   (?# free spacing, dot doesn't match newline)
>     (?://.*+                 (?# eat single-line comment text)
>       |/\*[\S\s]*?\*/        (?# eat multi-line comment text)
>       |"(?:\\.|[^"\n])*+"    (?# eat string text)
>     )                        (?# skip these)
>     |\b(foo)\b               (?# match this)
>
> And if $1 matched, then you have what you were looking for, otherwise discard. It's not as "neat" as the original, but is no less/more efficient.

Isn't that basically the same that I said, just changing the grouping?

>> Without (*SKIP), it can be done only by calling boost::regex_search multiple times, using an expression like this:
>>
>>     (?-s)//.*+|/\*[\S\s]*?\*/|"(?:\\.|[^"\n])*+"|(\bfoo\b)
>>
>> and ignoring every match where group 1 wasn't matched. That's presumed to be slower, and certainly more inconvenient for the programmer.
The match shouldn't be given up if a comment or string is found; the programmer needs to keep searching until either there's no match, or group 1 is matched. It's what I ended up doing.

I presumed it would be less efficient because it's creating an additional match group per call, and because there's the set-up time and function call overhead over the multiple calls that are necessary this way. And not sure but it's possible there are more cache misses.

Sei
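The "keep searching until either there's no match, or group 1 is matched" loop can be sketched as follows. This is a hedged illustration, not the code actually used in the thread: it assumes std::regex (ECMAScript grammar) in place of Boost.Regex, so the possessive quantifiers are written as plain `*`, and the name `first_foo_outside` is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Offset of the first `foo` outside comments and strings, or
// std::string::npos when there is none.
// Assumption: std::regex stands in for Boost.Regex in this sketch.
std::size_t first_foo_outside(const std::string& text) {
    static const std::regex re(
        R"re(//.*|/\*[\s\S]*?\*/|"(?:\\.|[^"\n])*"|(\bfoo\b))re");

    std::string::const_iterator pos = text.cbegin();
    std::smatch m;
    // One regex_search call per comment, string, or candidate found.
    while (std::regex_search(pos, text.cend(), m, re)) {
        if (m[1].matched) {
            return static_cast<std::size_t>(m[1].first - text.cbegin());
        }
        // Group 1 did not participate: a comment or string was matched,
        // so resume the search right after it.
        pos = m[0].second;
    }
    return std::string::npos;
}
```

Each iteration returns to user code once, which is the per-call overhead discussed in this message; every non-`foo` alternative consumes at least two characters, so the loop always advances.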
> Isn't that basically the same that I said, just changing the grouping?

Apologies, I misread what you wrote: my bad!

>> Without (*SKIP), it can be done only by calling boost::regex_search multiple times, using an expression like this:
>>
>>     (?-s)//.*+|/\*[\S\s]*?\*/|"(?:\\.|[^"\n])*+"|(\bfoo\b)
>>
>> and ignoring every match where group 1 wasn't matched. That's presumed to be slower, and certainly more inconvenient for the programmer.
>
> The match shouldn't be given up if a comment or string is found; the programmer needs to keep searching until either there's no match, or group 1 is matched. It's what I ended up doing.
>
> I presumed it would be less efficient because it's creating an additional match group per call, and because there's the set-up time and function call overhead over the multiple calls that are necessary this way. And not sure but it's possible there are more cache misses.

It should be exactly the same - the regex engine is doing *exactly* the same work either which way. It's true that you exit all the way back to user-code from the regex engine. I can see that being an issue in Perl when you're dropping back to interpreted code, but should be a non-issue in C++ compared to the complexity of doing the regex matching. The extra capturing group adds a little to the memory allocated inside match_results, but again, I'd be very surprised if that was detectable.

HTH, John.
participants (2)
- John Maddock
- Sei Lisa