Re: [Boost-Users] bug in regex 3.31 (boost 1.28)?

16 Aug 2002

      ...
I sent this originally to John Maddock, but realized this is
probably a better place to post it.
Sorry it took me a while to get around to it.
...
At work we've been testing out regex (nice work BTW) in some of our
code, and appear to have found a bug.  We ran into it parsing HTML,
and I've written a test C++ app to reproduce it.
In the program below, the output should be the same for both searches
as far as I can tell, but it's not.  I don't know if it's some
interaction with the quote character or something like that.  We
attempted to use other quantifiers (after '?', we tried '*', '{0,1}',
["]?) to no avail. I'm confident this is not user error.  The extra
grouping is annoying (in "goodPatternStr"), but is an acceptable
workaround.  The strange thing is that a non-capturing group doesn't
fix it.
Ideas?

I have an answer for you, but I don't think you're going to like it: it
comes down to how the "leftmost longest" rules are applied:

What's happening here is that $1 is being matched, but it's matching the
null string just before the \" (at character 26, I think it was). The
alternative (the one you expected) would have matched starting at character
27 (just one to the right of the \"). So the match found is in some sense
"better" (further to the left) than the one you expected. I think I'm going
to have to switch to Perl matching rules so I can stop explaining this...
:-)

A simpler solution to your problem is to use a + quantifier rather than a *,
so that it can't match the null string:

  const char* badPatternStr  = "<input[^>]*name=\"?([^> \"]+)[^>]*value=\"?([^> \"]+)";

Hope this helps,

John Maddock
http://ourworld.compuserve.com/homepages/john_maddock/index.htm