Boost Regex Back Reference Issue

newer
[Boost.Container] Boost.Container...

older
[gsoc] Change in Google Summer of...

Nick

13 Jul 2017 13 Jul '17

2:03 p.m.

According to the Boost documentation for Back references (http://www.boost.org/doc/libs/1_64_0/libs/regex/doc/html/boost_regex/syntax/...) which seems to be the same at least from 1.37 - 1.64. Quoting a piece of the documentation: ----- For example the expression: ^(a*).*\1$ Will match the string: aaabbaaa But not the string: aaabba ----- I'm finding two issues with the example cited. 1. It seems to me that the example of the string which should not match, should actually match. Ultimately, shouldn't the engine match the marked sub-expression with the first 'a' in order to satisfy the backreference? I tested this example with Oniguruma and with PHP's PCRE and they both matched the string noted here to not match. But also, since the marked sub-expression is (a*) then I wonder what the behavior would be if it couldn't make a match on 'a', since the '*' will allow for zero matches. In fact, it seems like everything in the pattern is effectively "optional" due to the '*' operator. I'm a novice with Perl, but unless I made a mistake, it will match unconditionally: print "It matches\n" if "aaabba" =~ /^(a*).*\1$/; It matches print "It matches\n" if "aaabbac" =~ /^(a*).*\1$/; It matches print "It matches\n" if "" =~ /^(a*).*\1$/; It matches print "It matches\n" if "x" =~ /^(a*).*\1$/; It matches print "It matches\n" if "xyz" =~ /^(a*).*\1$/; It matches print "It matches\n" if "123" =~ /^(a*).*\1$/; It matches 2. The other issue is that when I try this example with the string which is posted to not match, the Boost regex engine runs for a while and ultimately crashes with a memory error. (seems like it might be an endless loop of some sort). Is that a bug? Nick

Show replies by date

Mark Stallard

13 Jul 13 Jul

5:51 p.m.

Nick wrote:

...

According to the Boost documentation for Back references [...]

I use regex's frequently in Perl, C++ and other languages. I use Boost Regex exclusively in C++, even though I probably should be using std::regex instead. I'm much happier using Boost Regex and std::string than I was with PCRE and char*. I only read the Boost regex docs for the API, not for regex syntax.

...

For example the expression:

^(a*).*\1$

Will match the string:

aaabbaaa

But not the string:

aaabba

This example doesn't make sense to me, either, for same reasons that you mentioned: Both strings begin and end with "a", and a* can match 0 or more characters. I tested this example on Perl 5.22, adding the string "unmatchable" to the test input: $ perl -nle 'print "$_ ($1)" if /^(a*).*\1$/' <<HERE > aaabbaaa > aaabba > unmatchable > HERE The output shows both the input string and the submatch captured by $1 and indicates that the pattern will match any input string: aaabbaaa (aaa) aaabba (a) unmatchable () I haven't tested this with Boost Regex or any other C++ library, but I believe that this example should be edited or replaced. I would suggest changing (a*) to (a+) and "aaabba" to "aaabbb". Alternatively, you could change (a*) to (a{2}) for the original input strings, but (a+) makes simpler example.

...

2. The other issue is that when I try this example with the string which is posted to not match, the Boost regex engine runs for a while and ultimately crashes with a memory error. (seems like it might be an endless loop of some sort). Is that a bug?

Ouch! I'm glad nothing like that has happened to me. It's generally a bad idea to use * too frequently in a regex, especially .* . Adding a back reference to a* only aggravates the issue. I had an experience years ago where too many .*'s killed the performance of a Perl script, and probably consumed way too much memory. |+| M a r k |+| Mark Stallard Engineering & Operations Application Development Business Application Services Global Business Services Information Technology Raytheon Company Billerica, MA (US) This message contains information that may be confidential and privileged. Unless you are the addressee (or authorized to receive mail for the addressee), you should not use, copy or disclose to anyone this message or any information contained in this message. If you have received this message in error, please so advise the sender by reply e-mail and delete this message. Thank you for your cooperation. -----Original Message----- From: Boost-users [mailto:boost-users-bounces@lists.boost.org] On Behalf Of Nick via Boost-users Sent: Thursday, July 13, 2017 10:04 AM To: boost-users@lists.boost.org Cc: Nick <nospam@codesniffer.com> Subject: [Boost-users] Boost Regex Back Reference Issue According to the Boost documentation for Back references (http://www.boost.org/doc/libs/1_64_0/libs/regex/doc/html/boost_regex/syntax/...) which seems to be the same at least from 1.37 - 1.64. Quoting a piece of the documentation: ----- For example the expression: ^(a*).*\1$ Will match the string: aaabbaaa But not the string: aaabba ----- I'm finding two issues with the example cited. 1. It seems to me that the example of the string which should not match, should actually match. Ultimately, shouldn't the engine match the marked sub-expression with the first 'a' in order to satisfy the backreference? I tested this example with Oniguruma and with PHP's PCRE and they both matched the string noted here to not match. But also, since the marked sub-expression is (a*) then I wonder what the behavior would be if it couldn't make a match on 'a', since the '*' will allow for zero matches. In fact, it seems like everything in the pattern is effectively "optional" due to the '*' operator. I'm a novice with Perl, but unless I made a mistake, it will match unconditionally: print "It matches\n" if "aaabba" =~ /^(a*).*\1$/; It matches print "It matches\n" if "aaabbac" =~ /^(a*).*\1$/; It matches print "It matches\n" if "" =~ /^(a*).*\1$/; It matches print "It matches\n" if "x" =~ /^(a*).*\1$/; It matches print "It matches\n" if "xyz" =~ /^(a*).*\1$/; It matches print "It matches\n" if "123" =~ /^(a*).*\1$/; It matches 2. The other issue is that when I try this example with the string which is posted to not match, the Boost regex engine runs for a while and ultimately crashes with a memory error. (seems like it might be an endless loop of some sort). Is that a bug? Nick _______________________________________________ Boost-users mailing list Boost-users@lists.boost.org https://lists.boost.org/mailman/listinfo.cgi/boost-users

John Maddock

6:20 p.m.

On 13/07/2017 15:03, Nick via Boost-users wrote:

...

According to the Boost documentation for Back references (http://www.boost.org/doc/libs/1_64_0/libs/regex/doc/html/boost_regex/syntax/...) which seems to be the same at least from 1.37 - 1.64.

Quoting a piece of the documentation:

-----

For example the expression:

^(a*).*\1$

Will match the string:

aaabbaaa

But not the string:

aaabba

-----

I'm finding two issues with the example cited.

1. It seems to me that the example of the string which should not match, should actually match. Ultimately, shouldn't the engine match the marked sub-expression with the first 'a' in order to satisfy the backreference? I tested this example with Oniguruma and with PHP's PCRE and they both matched the string noted here to not match.

Congratulations, you've just spotted a... ahem... 20 year old bug in the documentation!

...

2. The other issue is that when I try this example with the string which is posted to not match, the Boost regex engine runs for a while and ultimately crashes with a memory error. (seems like it might be an endless loop of some sort). Is that a bug?

Works just fine for me here, if you have a complete self-contained test case (code, plus compiler version, platform and boost version please) I'll look into it. Best, John. --- This email has been checked for viruses by AVG. http://www.avg.com

Nick

14 Jul 14 Jul

2:55 p.m.

On Thu, 2017-07-13 at 19:20 +0100, John Maddock via Boost-users wrote:

...

Congratulations, you've just spotted a... ahem... 20 year old bug in the documentation!

Should I do something explicit for it to get fixed (bug ticket, etc), or has that already been set in motion from this e-mail?

...

...
2. The other issue is that when I try this example with the string which is posted to not match, the Boost regex engine runs for a while and ultimately crashes with a memory error. (seems like it might be an endless loop of some sort). Is that a bug?

Works just fine for me here, if you have a complete self-contained test case (code, plus compiler version, platform and boost version please) I'll look into it.

If it's working for you I'll take a closer look and try to get a small repro. This is happening with GCC 4.9 on Linux (Ubuntu 14 & Gentoo) with Boost 1.61. Nick

Nick

4:26 p.m.

Apologies for replying to my own email. On Fri, 2017-07-14 at 10:55 -0400, Nick wrote:

...

On Thu, 2017-07-13 at 19:20 +0100, John Maddock via Boost-users wrote:

...
...
2. The other issue is that when I try this example with the string which is posted to not match, the Boost regex engine runs for a while and ultimately crashes with a memory error. (seems like it might be an endless loop of some sort). Is that a bug?

Works just fine for me here, if you have a complete self-contained test case (code, plus compiler version, platform and boost version please) I'll look into it.

If it's working for you I'll take a closer look and try to get a small repro. This is happening with GCC 4.9 on Linux (Ubuntu 14 & Gentoo) with Boost 1.61.

So I found the cause. It seems to be caused by the regex pattern and my code's design. My pseudo code: while boost::regex_search { // Store match(es) // Move start iterator to end of matches[0] } Well, given the regex: ^(a*).*\1$ we've already agreed that this regex is effectively all optional which means it will match anything, including ^$ (empty string), and I've confirmed this behavior with Perl. So the first iteration matches as we'd expect: [0] = aaabba, [1] = a Then all subsequent iterations match on the empty string because the starting iterator was moved to the end of the first match (which is also the end of the string) so the reduced regex ^$ matches indefinitely: [0] = "", [1] = "" Since my loop is adding the matches for each iteration, it eventually hits a bad_alloc exception. So I guess this is by design for Boost Regex. Obviously my loop can be adjusted to bail if start_iterator == end_iterator. But I wonder if there's something else my loop should be accounting for? I suppose there's nothing that boost::regex_search can do for this situation? Nick

2921

Age (days ago)

2922

Last active (days ago)

List overview

Download

2 comments

2 participants

participants (2)

John Maddock
Mark Stallard
Nick