Re: [Boost-Users] Re: Inconsistent regexp matching when using quantifiers?

25 Apr 2003

On Fri, Apr 25, 2003 at 05:35:06PM -0000, Dean wrote:
...
--- In Boost-Users@yahoogroups.com, "Joshua B. Smith" <josh@n...> 
wrote:
I'm not sure what you were trying to say above, but my understanding 
is that the 2 patterns you just mentioned are equivalent.  The docs 
say "{3}" is equivalent to "{3,3}" not "{3,}".
That is what I was trying to say, just not very clearly :)
...
I'm doing a search because I don't want to know whether the whole 
string matches but whether the regex is found in the string.  
Specifically, I'm doing:
m_regex.Search( sampleBody, boost::match_default | boost::match_any)
OK. That's kinda what I figured.
...
While I can believe that the design intention was that "\d{3}-" 
should be found in "1234567-" (at the fifth character), it seems 
inconsistent that it is *not* also found in "123456-" and 
"12345678-".  I'm seeing that inconsistent behavior.
It is not inconsistent: the search fails to match at one position and
then keeps going.
It's all about greediness. For example:

searching for a{1}b in strings

1) ab
2) aab
3) aaab

the search succeeds on 1 and (perhaps unexpectedly) on 3, but not on 2,
because:

a{1}b ab   Matches (correct).
a{1}b aab  Fails because it matched the two a's and then stopped because
      the string is done.
a{1}b aaab Fails on aa, then begins to scan again and finds ab, which
      fits the regex a{1}b.

Makes sense?
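
If you want to check the three cases on your own setup, a minimal sketch
like the one below reports where the first hit starts. It uses std::regex
from C++11 as a stand-in for Boost.Regex (an assumption; the older Search
call with match_any may not behave identically), and first_hit is a
hypothetical helper name, not something from the library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the offset at which `pattern` is first found
// in `text`, or -1 if an unanchored search finds nothing.
// Assumption: std::regex (ECMAScript grammar) stands in for Boost.Regex.
long first_hit(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        // position(0) is the start of the whole match within `text`.
        return static_cast<long>(m.position(0));
    }
    return -1;
}
```

Calling first_hit("a{1}b", s) for each of "ab", "aab" and "aaab" shows
where, if anywhere, that engine finds the pattern.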
...
I realize there is more than one way to do it, and I'd be interested 
in what you'd recommend.
FWIW, in our SSN-matching case, we'll probably just use 
"\b\d{3}-\d{2}-\d{4}\b".
I too would probably use boundaries.  Or you can run a regex_match on the
string returned by the regex_search. Or do both; it depends on how
much I want to test the data for correctness. I tend to do a search, then a
match, when I'm dealing with hairy inputs.  You can also use whitespace,
like:

\s*\d{3}-\d{2}-\d{4}\s*

(I tend not to use \b, for no good reason.) Or maybe something like this:

[\s,\.]*\d{3}-\d{2}-\d{4}[\s,\.]*
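
The search-then-match idea could be sketched roughly like this. It is only
an illustration under some assumptions: std::regex from C++11 stands in for
Boost.Regex, find_ssn is a hypothetical helper, and the loose "candidate"
pattern (runs of digits joined by hyphens) is my own choice, not one of the
patterns discussed above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: scans `text` for tokens made of digit runs joined by
// hyphens (the loose search), then checks each token as a whole against the
// strict SSN shape (the match). Returns the first token that passes, or ""
// if none does.
std::string find_ssn(const std::string& text) {
    static const std::regex candidate(R"(\d+(?:-\d+)*)");   // loose: search
    static const std::regex strict(R"(\d{3}-\d{2}-\d{4})"); // strict: match

    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), candidate);
         it != end; ++it) {
        const std::string token = it->str();
        // regex_match succeeds only if the entire token fits the pattern,
        // so a longer run such as 1234-56-7890 is rejected here.
        if (std::regex_match(token, strict)) {
            return token;
        }
    }
    return "";
}
```

Because the strict check sees each token in full, extra leading or trailing
digits make it fail instead of matching a piece of a longer number.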

-jbs