Inconsistent regexp matching when using quantifiers?
Hi all, I'm using regexp from boost-1.28 and experience the following behavior. Consider the following (somewhat artificial) regexp pattern: a{1}b As I expected, that pattern is found in "ab" but not "aab". To my suprise however, the same pattern *is* found in "aaab", "aaaaab", and any other string consisting of an odd number of "a"s followed by a "b". It is not found in strings consisting of an even number of "a"s followed by a "b". This seems odd (no pun intended). I see the same sort of behavior with quantifiers other than "{1}" and where the quantified expression matches other single characters. (Oddly enough, the behavior changes when using a quantified expression that matches multiple characters. "(ab){1}c" is found in "abc", "ababc", "abababc", and any other string containing "abc".) FWIW, I first observed the behavior when trying to find social security numbers with the following pattern: \d{3}-\d{2}-\d{4} As expected, that pattern was found in "123-12-1234" but not in "1234- 12-1234". However it *was* found in "1234567-12-1234". Is this behavior by design or is it a bug? If it's a bug, has it been fixed in a subsequent boost release? Also what is the correct behavior? Should "a{1}b" be found in "aab" (albeit starting at the second character)? FWIW, it's easy enough for me to workaround the current behavior with a pattern like this: (^|[^a])a{1}b Thanks, --Dean
On Thu, Apr 24, 2003 at 11:37:34PM -0000, Dean wrote:
Hi all,
... snip ...
\d{3}-\d{2}-\d{4}
As expected, that pattern was found in "123-12-1234" but not in "1234- 12-1234". However it *was* found in "1234567-12-1234".
Is this behavior by design or is it a bug?
It is (probably) design. Intervals can specify a min and a max, for example: \d{3,3}-\d{2,2}-\d{4,4} will match "123-12-1234" but NOT "1234567-12-1234". \d{3}-\d{2}-\d{4} will match "123-12-1234" but NOT "1234567-12-1234" also. It will, however, yeild a correct search (there is a difference between search and match). You didn't mention if you were doing a regex_search or regex_match ? Begin Stolen from the boost regex docs: The algorithm regex_match determines whether a given regular expression matches a given sequence denoted by a pair of bidirectional-iterators, the algorithm is defined as follows, note that the result is true only if the expression matches the whole of the input sequence, the main use of this function is data input validation: End Stolen from the boost regex docs
If it's a bug, has it been fixed in a subsequent boost release? Also what is the correct behavior? Should "a{1}b" be found in "aab" (albeit starting at the second character)?
See above. Yes, a{1}b should be found in a search, but not a match.
FWIW, it's easy enough for me to workaround the current behavior with a pattern like this:
(^|[^a])a{1}b
You could use this, but I wouldn't recomend it (but that's just me...regex construction is deeply personal :) ). HTH. -jbs
--- In Boost-Users@yahoogroups.com, "Joshua B. Smith"
On Thu, Apr 24, 2003 at 11:37:34PM -0000, Dean wrote:
Hi all,
... snip ...
\d{3}-\d{2}-\d{4}
As expected, that pattern was found in "123-12-1234" but not in "1234- 12-1234". However it *was* found in "1234567-12-1234".
Is this behavior by design or is it a bug?
It is (probably) design. Intervals can specify a min and a max, for example:
\d{3,3}-\d{2,2}-\d{4,4} will match "123-12-1234" but NOT "1234567- 12-1234". \d{3}-\d{2}-\d{4} will match "123-12-1234" but NOT "1234567-12- 1234" also.
I'm not sure what you were trying to say above, but my understanding is that the 2 patterns you just mentioned are equivalent. The docs say "{3}" is equivalent to "{3,3}" not "{3,}".
It will, however, yeild a correct search (there is a difference
between search
and match). You didn't mention if you were doing a regex_search or regex_match ?
I'm doing a search because I don't want to know whether the whole string matches but whether the regex is found in the string. Specifically, I'm doing: m_regex.Search( sampleBody, boost::match_default | boost::match_any) While I can believe that the design intention was that "\d{3}-" should be found in "1234567-" (at the fifth character), it seems inconsistent that it is *not* also found in "123456-" and "12345678- ". I'm seeing that inconsistent behavior.
FWIW, it's easy enough for me to workaround the current behavior with a pattern like this:
(^|[^a])a{1}b
You could use this, but I wouldn't recomend it (but that's just me...regex construction is deeply personal :) ). HTH.
I realize there is more than one way to do it, and I'd be interested in what you'd recommend. FWIW, in our SSN-matching case, we'll probably just use "\b\d{3}-\d {2}-\d{4}\b". Thanks for the reply! --Dean
On Fri, Apr 25, 2003 at 05:35:06PM -0000, Dean wrote:
--- In Boost-Users@yahoogroups.com, "Joshua B. Smith"
wrote: I'm not sure what you were trying to say above, but my understanding is that the 2 patterns you just mentioned are equivalent. The docs say "{3}" is equivalent to "{3,3}" not "{3,}".
That is what I was trying to say, just not very clearly :)
I'm doing a search because I don't want to know whether the whole string matches but whether the regex is found in the string. Specifically, I'm doing:
m_regex.Search( sampleBody, boost::match_default | boost::match_any)
OK. That's kinda what I figured.
While I can believe that the design intention was that "\d{3}-" should be found in "1234567-" (at the fifth character), it seems inconsistent that it is *not* also found in "123456-" and "12345678- ". I'm seeing that inconsistent behavior.
It is not inconsistant because it fails to match then keeps going. It's all about greediness. For example: searching for a{1}b in strings 1) ab 2) aab 3) aaab searches correctly on 1 and incorrectly on 3 but not on 2 because a{1}b ab searches (correct) a{1}b aab Fails because it matched the two a's and then stopped because the string is done a{1}b aaab Fails on aa then begins to scan again and finds ab which fits the regex a{1}b Makes sense?
I realize there is more than one way to do it, and I'd be interested in what you'd recommend.
FWIW, in our SSN-matching case, we'll probably just use "\b\d{3}-\d {2}-\d{4}\b".
I too would probably use boundries. Or, you can use a regex_match on the the string returned on the regex_search. Or do both, it depends on how much I wanted to test the data for correctness. I tend do a search then match when I'm using hairy inputs. You can also use spaces like: \s*\d{3}-\d{2}-\d{4}\s* I tend to not use \b for no good reason or something like this maybe [\s,\.]*\d{3}-\d{2}-\d{4}[\s,\.]* -jbs
--- In Boost-Users@yahoogroups.com, "Joshua B. Smith"
On Fri, Apr 25, 2003 at 05:35:06PM -0000, Dean wrote: <snip>
While I can believe that the design intention was that "\d{3}-" should be found in "1234567-" (at the fifth character), it seems inconsistent that it is *not* also found in "123456-" and "12345678- ". I'm seeing that inconsistent behavior.
It is not inconsistant because it fails to match then keeps going. It's all about greediness. For example:
searching for a{1}b in strings
1) ab 2) aab 3) aaab
searches correctly on 1 and incorrectly on 3 but not on 2 because
a{1}b ab searches (correct) a{1}b aab Fails because it matched the two a's and then stopped because the string is done a{1}b aaab Fails on aa then begins to scan again and finds ab which fits the regex a{1}b
Makes sense? <snip>
That's what I (eventually) guessed was happening. Thanks for confirming my suspicion. However, it still seems possible that this was not the original design intention. I suppose only John Maddock can answer that question... It seems to me that when the code finds more than 1 "a", it should either: 1) skip past all subsequent "a"s before starting the scan again. This would cause "a{1}b" to be found in "ab" but not "aab", "aaab", "aaaab", etc. This would be very "greedy". :-) Or: 2) restart the scan 1 character after where the previous scan started. This would cause "a{1}b" to be found in "ab", "aab", "aaab", "aaaab", etc. FWIW, I'm told that the regex searcher in the .NET Framework exhibits behavior #1. I mention that only as a point of reference -- I realize that different implementations can have somewhat different correct behaviors. Anyway, it is either a bug or a "gotcha". I've been using regexs occasionaly for over 10 years and it "got" me. :-) Thanks again for the help! --Dean
On Fri, Apr 25, 2003 at 08:27:54PM -0000, Dean wrote:
That's what I (eventually) guessed was happening. Thanks for confirming my suspicion. However, it still seems possible that this was not the original design intention. I suppose only John Maddock can answer that question...
Does he read this list?
It seems to me that when the code finds more than 1 "a", it should either:
1) skip past all subsequent "a"s before starting the scan again. This would cause "a{1}b" to be found in "ab" but not "aab", "aaab", "aaaab", etc. This would be very "greedy". :-)
Or:
2) restart the scan 1 character after where the previous scan started. This would cause "a{1}b" to be found in "ab", "aab", "aaab", "aaaab", etc. FWIW, I'm told that the regex searcher in the .NET Framework exhibits behavior #1. I mention that only as a point of reference -- I realize that different implementations can have somewhat different correct behaviors.
Perl and Python both exhibit behavior #2. I think emacs does too. and it doesn't surprise me that .Net is the greediest. :P I think a lot of regex engines have been converging on a perlish implementation. In fact, I'd never taken seriously the thought of another way, but most of my regex work is done in python/perl (until recently at any rate, I like to boost regex lib a lot).
Anyway, it is either a bug or a "gotcha". I've been using regexs occasionaly for over 10 years and it "got" me. :-)
I wouldn't say it "got" you. Regex's are still 50% voodoo 50% trick and 2% butterscotch ripple. -jbs
Just a correction regarding an earlier statement I made regarding searching for a{1}b. I said:
It seems to me that when the code finds more than 1 "a", it should either:
1) skip past all subsequent "a"s before starting the scan again. This would cause "a{1}b" to be found in "ab" but not "aab", "aaab", "aaaab", etc. This would be very "greedy". :-)
Or:
2) restart the scan 1 character after where the previous scan started. This would cause "a{1}b" to be found in "ab", "aab", "aaab", "aaaab", etc.
FWIW, I'm told that the regex searcher in the .NET Framework exhibits behavior #1.
I misunderstood what I was told about the .NET Framework regexp implementation. It exhibits behavior #2 above. So, boost::regexp will behave the same as .NET Framework regexp (and probably most other regexp implementations) once it is fixed. Cheers, --Dean
I'm using regexp from boost-1.28 and experience the following behavior. Consider the following (somewhat artificial) regexp pattern:
a{1}b
As I expected, that pattern is found in "ab" but not "aab".
To my suprise however, the same pattern *is* found in "aaab", "aaaaab", and any other string consisting of an odd number of "a"s followed by a "b". It is not found in strings consisting of an even number of "a"s followed by a "b". This seems odd (no pun intended).
I see the same sort of behavior with quantifiers other than "{1}" and where the quantified expression matches other single characters. (Oddly enough, the behavior changes when using a quantified expression that matches multiple characters. "(ab){1}c" is found in "abc", "ababc", "abababc", and any other string containing "abc".)
I can verify that - I'll try and get it fixed in the next version as it is a very definite bug :-( Thanks for reporting that one, John Maddock.
Hi, Out of curiosity, will the result of your fix be: 1. "a{1}b" is *not* found in any string consisting of more than 1 "a" followed by a "b", Or, 2. "a{1}b" *is* found (albeit at the next to last position) in every string consisting of 1 or more "a"s followed by a "b". Thanks for all your hard work on boost::regex! --Dean ----- Original Message ----- From: John Maddock To: Boost-Users@yahoogroups.com Sent: Saturday, April 26, 2003 5:15 AM Subject: Re: [Boost-Users] Inconsistent regexp matching when using quantifiers?
I'm using regexp from boost-1.28 and experience the following behavior. Consider the following (somewhat artificial) regexp pattern:
a{1}b
As I expected, that pattern is found in "ab" but not "aab".
To my suprise however, the same pattern *is* found in "aaab", "aaaaab", and any other string consisting of an odd number of "a"s followed by a "b". It is not found in strings consisting of an even number of "a"s followed by a "b". This seems odd (no pun intended).
I see the same sort of behavior with quantifiers other than "{1}" and where the quantified expression matches other single characters. (Oddly enough, the behavior changes when using a quantified expression that matches multiple characters. "(ab){1}c" is found in "abc", "ababc", "abababc", and any other string containing "abc".)
I can verify that - I'll try and get it fixed in the next version as it is a very definite bug :-( Thanks for reporting that one, John Maddock. Yahoo! Groups Sponsor Info: http://www.boost.org Wiki: http://www.crystalclearsoftware.com/cgi-bin/boost_wiki/wiki.pl Unsubscribe: mailto:boost-users-unsubscribe@yahoogroups.com Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
Out of curiosity, will the result of your fix be:
1. "a{1}b" is *not* found in any string consisting of more than 1 "a" followed by a "b",
Efficient but dumb.
Or,
2. "a{1}b" *is* found (albeit at the next to last position) in every string consisting of 1 or more "a"s followed by a "b".
The right answer IMO. John.
participants (4)
-
Dean
-
Dean Brettle
-
John Maddock
-
Joshua B. Smith