I sent this originally to John Maddock, but realized this is probably a better place to post it.
Sorry it took me a while to get around to it.
At work we've been testing out regex (nice work BTW) in some of our code, and appear to have found a bug. We ran into it parsing HTML, and I've written a test C++ app to reproduce it.
In the program below, the output should be the same for both searches as far as I can tell, but it's not. I don't know if it's some interaction with the quote character or something like that. We attempted to use other quantifiers (after '?', we tried '*', '{0,1}', ["]?) to no avail. I'm confident this is not user error. The extra grouping is annoying (in "goodPatternStr"), but is an acceptable workaround. The strange thing is that a non-capturing group doesn't fix it.
Ideas?
I have an answer for you, but I don't think you're going to like it. It comes down to how the "leftmost longest" rules are applied. What's happening here is that $1 is being matched, but it's matching the null string just before the \" (at character 26, I think it was). The alternative (the one you expected) would have matched starting at character 27 (just one to the right of the \"). So the match found is in some sense "better" (further to the left) than the one you expected. I think I'm going to have to switch to perl matching rules so I can stop explaining this... :-)

A simpler solution to your problem is to use a + quantifier rather than a *, so that it can't match the null string:

const char* badPatternStr = "]*name=\"?([^> \"]+)[^>]*value=\"?([^> \"]+)";

Hope this helps,
John Maddock
http://ourworld.compuserve.com/homepages/john_maddock/index.htm
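
For anyone reading along, here is a minimal sketch of the "null string" point, with some loud caveats. It uses std::regex from the C++ standard library rather than regex++, and std::regex defaults to ECMAScript (perl-style) matching, so it does not reproduce the leftmost-longest behaviour described above. It only shows that a group quantified with * is allowed to succeed on an empty match, while + forces at least one character. The helper name first_capture and the sample input are made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of regex++ or the original test app):
// returns the text captured by group 1 of `pattern` when searched in
// `input`, or "<no match>" if the search fails.
std::string first_capture(const std::string& pattern, const std::string& input) {
    const std::regex re(pattern);  // default grammar: ECMAScript (perl-style rules)
    std::smatch m;
    if (!std::regex_search(input, m, re)) {
        return "<no match>";
    }
    return m[1].str();
}
```

Calling first_capture("=([^> \"]*)", "name=\"foo\"") gives an empty string: the group is sitting right in front of the quote, which its character class excludes, and * is happy to match nothing there. Calling first_capture("=\"?([^> \"]+)", "name=\"foo\"") gives foo, because + cannot match the null string, so the optional quote has to be consumed first.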