Regexp performance guarantees
Dear Experts,

I recently wanted to add some input-sanitising code to a C++ web application, and as I had some existing Javascript code I decided to port its regex-based input checking to the C++ code. This didn't go as well as I had expected!

Here's an example of the sort of thing I have:

if (!regex_match(s, regex("[A-Za-z0-9-_/]{1,8192}"))) throw MalformedInput();

The Javascript equivalent seems to be fine. In C++, with libstdc++'s regex implementation, it seems to take about 2 seconds to run. Boost.Regex didn't like the '-' after the '9', but when I fixed that the execution time became negligible. Boost.Xpressive also seems to be fast.

On investigation it seems that this is due to there being two different algorithms: one (DFA) is linear in the string size but exponential in the pattern size (which in this case seems to mean exponential in the max repeat value of 8192), and the other (backtracking) is linear in the pattern size but exponential in the string size.

It seems that the standard deliberately doesn't specify complexity guarantees for std::regex, leaving the choice of algorithm to the implementation. I find this a bit surprising. It's true that both algorithms have their strengths, but not knowing which you are going to get is the worst of both worlds - in portable code you would have to assume that it will be exponential in both pattern and string sizes! (Imagine a container that doesn't disclose whether it has vector or list complexity!)

So I'm wondering if it would be useful for regex libraries to somehow be explicit about which algorithm they use and what the complexity is. Compare, for example, with the text search algorithms (Boyer-Moore etc.) which indicate the algorithm in their names. Could we have regex_dfa / regex_backtracking, or something?

Having said that, I think that what I really need for my application is something that is worst-case linear in both the pattern size and the input string size. I'm not sure whether the apparently-fast Boost.Regex algorithm has a worst-case input that is slow ("Regular Expression Denial of Service"). It's clear that my example can be written without a regex as a simple linear loop, and maybe some regex libraries have heuristics that do that.

Thoughts anyone?

Regards, Phil.
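[Editorial note: the "simple linear loop" Phil mentions might look something like the following. This is a hypothetical sketch; the function name is invented, and the 8192 limit and character set come from the example pattern.]

```cpp
#include <string>

// Hand-rolled equivalent of matching [A-Za-z0-9-_/]{1,8192}:
// one pass over the input, worst-case linear in the string length,
// with no regex machinery involved.
bool is_valid_token(const std::string& s) {
    if (s.empty() || s.size() > 8192) return false;
    for (char c : s) {
        bool allowed = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                       (c >= '0' && c <= '9') ||
                       c == '-' || c == '_' || c == '/';
        if (!allowed) return false;
    }
    return true;
}
```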
On 7/21/21 6:19 PM, Phil Endecott via Boost wrote:
Dear Experts,
I recently wanted to add some input-sanitising code to a C++ web application, and as I had some existing Javascript code I decided to port its regex-based input checking to the C++ code. This didn't go as well as I had expected!
Here's an example of the sort of thing I have:
if (!regex_match(s, regex("[A-Za-z0-9-_/]{1,8192}"))) throw MalformedInput();
The Javascript equivalent seems to be fine. In C++, with libstdc++'s regex implementation, it seems to take about 2 seconds to run. Boost.Regex didn't like the '-' after the '9' but when I fixed that the execution time became negligible. Boost.Xpressive seems also to be fast.
I haven't benchmarked regex implementations for a few years, but my past experience was that libstdc++ std::regex indeed was very slow. I don't remember the numbers, but my general impression was that all std::regex implementations were slow compared to Boost.Regex, which seems to be one of the fastest implementations. Boost.Xpressive is slower than Boost.Regex, but faster than std::regex; its main advantage is that it is header-only. Basically, use Boost.Regex unless you really can't.
On investigation it seems that this is due to there being two different algorithms, one of which (DFA) is linear in the string size but exponential in the pattern size (which in this case seems to mean exponential in the max repeat value of 8192), and the other (backtracking) is linear in the pattern size but exponential in the string size.
It seems that the standard deliberately doesn't specify complexity guarantees for std::regex, leaving the choice of algorithm to the implementation. I find this a bit surprising. It's true that both algorithms have their strengths, but not knowing which you are going to get is the worst of both worlds - you would have to assume that it will be exponential in both pattern and string sizes in portable code! (Imagine a container that doesn't disclose whether it has vector or list complexity!)
So I'm wondering if it would be useful for regex libraries to somehow be explicit about which algorithm they are and what the complexity is.
Compare, for example, with the text search algorithms (Boyer-Moore etc) which indicate the algorithm in their names. Could we have regex_dfa / regex_backtracking, or something?
Now having said that, I think that what I really need for my application is something that is worst-case linear in both the pattern size and the input string. I'm not sure if the apparently-fast Boost.Regex algorithm has a worst-case input that is slow. ("Regular Expression Denial Of Service"). It's clear that my example can be written without a regex as a simple linear loop, and maybe some regex libraries have heuristics that do that.
Thoughts anyone?
Have you tried caching the regex object? I suspect parsing the regex might have a non-negligible cost.

As for the complexity specification, it obviously depends on the regex. I don't think it is possible to specify an abstract complexity estimate for a regex library. The best you could have is to specify the complexity of the various elements of the expression - both when constructing a regex and when applying it.

Complexity is not everything you need to know when you care about performance; you should benchmark anyway. It is not unusual for a higher-complexity algorithm to outperform a lower-complexity one in some conditions, which may happen to match yours. Even two algorithms with the same formal complexity estimate may perform very differently in practice. I remember the Boost.Regex docs had some benchmark results, although by now they are probably outdated.
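[Editorial note: a minimal sketch of the caching suggestion. The function name is hypothetical; the '-' is placed first in the class to avoid the range ambiguity discussed later in this thread, and a smaller repeat bound is used purely for illustration.]

```cpp
#include <regex>
#include <string>

// Compile the pattern once per process (function-local statics are
// initialised thread-safely since C++11) instead of on every call.
bool looks_sane(const std::string& s) {
    static const std::regex re("[-A-Za-z0-9_/]{1,64}");
    return std::regex_match(s, re);
}
```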
On 21/07/2021 16:19, Phil Endecott via Boost wrote:
Dear Experts,
I recently wanted to add some input-sanitising code to a C++ web application, and as I had some existing Javascript code I decided to port its regex-based input checking to the C++ code. This didn't go as well as I had expected!
Here's an example of the sort of thing I have:
if (!regex_match(s, regex("[A-Za-z0-9-_/]{1,8192}"))) throw MalformedInput();
I would definitely cache the regex object in a case like that - constructing and analyzing the expression is quite a significant task.
The Javascript equivalent seems to be fine. In C++, with libstdc++'s regex implementation, it seems to take about 2 seconds to run. Boost.Regex didn't like the '-' after the '9' but when I fixed that the execution time became negligible. Boost.Xpressive seems also to be fast.
On investigation it seems that this is due to there being two different algorithms, one of which (DFA) is linear in the string size but exponential in the pattern size (which in this case seems to mean exponential in the max repeat value of 8192), and the other (backtracking) is linear in the pattern size but exponential in the string size.
It seems that the standard deliberately doesn't specify complexity guarantees for std::regex, leaving the choice of algorithm to the implementation. I find this a bit surprising. It's true that both algorithms have their strengths, but not knowing which you are going to get is the worst of both worlds - you would have to assume that it will be exponential in both pattern and string sizes in portable code! (Imagine a container that doesn't disclose whether it has vector or list complexity!)
Complexity for regular expressions is really, really hard to specify - in particular, matching Perl-style regular expressions is known to be NP-complete in the worst case, but fortunately the worst case is rarely encountered, and in most normal usage Perl regexes are reasonably performant.
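[Editorial note: one concrete way to see why the Perl-style extensions resist a DFA treatment is that a backreference lets a pattern recognise a non-regular language. A small illustration with std::regex; the helper name is hypothetical.]

```cpp
#include <regex>
#include <string>

// (.+)\1 matches exactly strings of the form ww (a word repeated twice),
// a classic non-regular language - so no DFA can recognise it, and
// engines must fall back on backtracking for patterns like this.
bool is_doubled_word(const std::string& s) {
    static const std::regex re("(.+)\\1");
    return std::regex_match(s, re);
}
```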
So I'm wondering if it would be useful for regex libraries to somehow be explicit about which algorithm they are and what the complexity is.
Compare, for example, with the text search algorithms (Boyer-Moore etc) which indicate the algorithm in their names. Could we have regex_dfa / regex_backtracking, or something?
Now having said that, I think that what I really need for my application is something that is worst-case linear in both the pattern size and the input string. I'm not sure if the apparently-fast Boost.Regex algorithm has a worst-case input that is slow. ("Regular Expression Denial Of Service").
What Boost.Regex has that other libraries do not (so far as I know) is a cap on the maximum number of machine states that may be visited while attempting a match. This is quite deliberate, to prevent DoS attacks, and was built into the library from the start. The limit is set at approximately O(N^2) states before a runtime error is triggered. HTH, John.
It's clear that my example can be written without a regex as a simple linear loop, and maybe some regex libraries have heuristics that do that.
Thoughts anyone?
Regards, Phil.
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Have you tried Russ Cox's RE2 from https://github.com/google/re2 ?
"The result, RE2, provides most of the functionality of PCRE using a C++ interface very close to PCRE's, but it *guarantees linear time execution* and a fixed stack footprint."
On 22/07/2021 3:19 am, Phil Endecott wrote:
Here's an example of the sort of thing I have:
if (!regex_match(s, regex("[A-Za-z0-9-_/]{1,8192}"))) throw MalformedInput();
The Javascript equivalent seems to be fine. In C++, with libstdc++'s regex implementation, it seems to take about 2 seconds to run. Boost.Regex didn't like the '-' after the '9' but when I fixed that the execution time became negligible. Boost.Xpressive seems also to be fast.
In pretty much all regexp languages, if you want to match '-' inside a character set then you must specify it as the first character (immediately following the '['), otherwise it is confused as a range specifier.
I would have thought that '-' would only get confused as a range specifier when it follows an opening atom. Here it follows a closing atom (the '9' in '0-9').

I did not think, for example, that "a-g-z" could possibly be equivalent to "a-z"; it should only be able to match a, b, c, d, e, f, g, '-' and 'z'.
-- Soronel Haetir soronel.haetir@gmail.com
On 22/07/2021 5:56 pm, Soronel Haetir wrote:
I would have thought that '-' would only get confused as a range specifier when it follows an opening atom. Here it follows a closing atom (the '9' in '0-9').

I did not think, for example, that "a-g-z" could possibly be equivalent to "a-z"; it should only be able to match a, b, c, d, e, f, g, '-' and 'z'.
That's not unreasonable, but it's not how the specification is worded. So you might find that it works on a particular implementation, but it's risky. The text of most regex specifications says that the only valid positions for a minus character intended to represent itself are either immediately following the [ or immediately preceding the ]. Of those, the former is a bit more traditional and hence safer. (Although if you want to include ] as well, then ] must be first and so - must be last.)

But there are lots of implementation-defined holes in regexes, so YMMV. For example, some will accept it anywhere if you escape it with a backslash; others don't support backslash escapes inside character sets at all. https://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html specifically calls out a construct such as "a-g-z" as undefined behaviour.
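[Editorial note: a quick sketch of the two portable spellings described above, using std::regex in its default ECMAScript grammar. The helper names are hypothetical.]

```cpp
#include <regex>
#include <string>

// '-' immediately after '[' is a literal dash (the traditional form).
bool match_dash_first(const std::string& s) {
    static const std::regex re("[-A-Za-z0-9_/]+");
    return std::regex_match(s, re);
}

// '-' immediately before ']' is also a literal dash.
bool match_dash_last(const std::string& s) {
    static const std::regex re("[A-Za-z0-9_/-]+");
    return std::regex_match(s, re);
}
```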
Thanks all for your replies.

John Maddock wrote:

Complexity for regular expression is really really hard to specify

I disagree. Pretty much by definition, regular expressions can be matched in time linear in the length of the string, and that's what I'd expect a std C++ spec to require, just as it requires sorting to be N log N etc. The fact that Perl >bleurgh< chose to provide what I'd call "irregular expressions" 25 years ago doesn't mean that we had to copy that. I reject the idea that we have to support back-references in patterns because "people expect that" - I've never used them.

Dominique Devienne wrote:

Have you tried Russ Cox's RE2 from https://github.com/google/re2

Thanks for the pointer. He has a series of three really good documents explaining the history of regular expression implementations and how we've slipped into this hole where the most common implementations have poor complexity guarantees. See: https://swtch.com/~rsc/regexp/

This also prompted me to look at CTRE, which compiles regular expressions from strings at compile time and even works pre-C++20. See: https://github.com/hanickadot/compile-time-regular-expressions

Gavin Lambert wrote:

In pretty much all regexp languages, if you want to match '-' inside a character set then you must specify it as the first character

I was looking at the cppreference docs here: https://en.cppreference.com/w/cpp/regex/ecmascript

The grammar they give includes:

CharacterClass ::
    [ [lookahead ∉ {^}] ClassRanges ]
    [ ^ ClassRanges ]
ClassRanges ::
    [empty]
    NonemptyClassRanges
NonemptyClassRanges ::
    ClassAtom
    ClassAtom NonemptyClassRangesNoDash
    ClassAtom - ClassAtom ClassRanges
ClassAtom ::
    -
    ClassAtomNoDash
    ClassAtomExClass (C++ only)
    ClassAtomCollatingElement (C++ only)
    ClassAtomEquivalence (C++ only)
ClassAtomNoDash ::
    SourceCharacter but not one of \ or ] or -
    \ ClassEscape

I believe that allows [A-Z0-9-_/], doesn't it?

Anyway, all this prompted me to do some more investigation and some benchmarking. The libraries that I have tried are libstdc++ (as supplied with g++ 8.3, so rather old), Boost.Regex, Boost.Xpressive (with run-time expression strings, not the Spirit-like compile-time mode) (both Boost version 1.75), RE2, and CTRE.

What I'm trying to do is to sanitise the input to an internet-exposed process, to reject malicious input like "'); drop table users;". As an example I'll look at input that is supposed to be base-64 encoded and no more than a couple of kilobytes long. Typical-case performance doesn't matter much as this runs once per process invocation (and hence also caching the compiled regex doesn't help), but I do want to be sure that it doesn't have bad worst-case complexity in the face of pathological input. So my first test is a quick check with a regular expression that might trigger worst-case behaviour in a non-linear implementation:

a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa

matched against: aaaaaaaaaaaaaaaaaaaa

The execution times were:

CTRE:        1.1 us
RE2:         148 us
Xpressive:   27286 us
Boost.Regex: 31 us
libstdc++:   88632 us

Based on that, Xpressive and libstdc++ can be rejected immediately. (Of course this doesn't prove that the others use exclusively-linear algorithms; they may have heuristics that handle that case, or just got lucky; this is why I believe there should be complexity guarantees.)

Here are the patterns that I have benchmarked for my base64 test:

1. [A-Za-z0-9/+=]{0,8192}
2. [A-Za-z0-9/+=]*
3. (?:[A-Za-z0-9/+]{4}){0,2048}(?:|(?:[A-Za-z0-9/+]{2}==)|(?:[A-Za-z0-9/+]{3}=))
4. (?:[A-Za-z0-9/+]{4})*(?:|(?:[A-Za-z0-9/+]{2}==)|(?:[A-Za-z0-9/+]{3}=))

Recall that base64 has chunks of four printable characters, with the final chunk using = to pad. Variants 3 & 4 strictly check the padding. Variants 1 and 3 check for excessive length, while 2 & 4 require a separate check to do that. Note that I'm using the "non-capturing" syntax (?: ) rather than ( ) because I only need the boolean match result.

First a note on compatibility. I noted before that expressions like [A-Za-z0-9-_/] were accepted by some libraries but not others. I found two other issues: only libstdc++ would accept [A-Z]{4}*, while the others all required ([A-Z]{4})*. Then RE2 rejected the {0,8192} and {0,2048} repeats - it limits them to some smaller value.

A note on compile times (g++ 8.3 -O3): there was substantial variation, with RE2 and CTRE being the fastest, Boost.Regex and libstdc++ in the middle, and Boost.Xpressive slowest. The difference from fastest to slowest was about 10X. It was interesting that the "Compile Time Regular Expression" library CTRE was one of the fastest to compile!

Regarding run-time performance, testing with about 3 kbytes of input data: CTRE was fastest; RE2 was second in the two expressions that it did not reject; Boost.Regex was last.

My conclusion is that CTRE is the best choice, and I would recommend it unless (a) you need to specify the regular expression at runtime, or (b) you need some of the "irregular" Perl extension syntax.

I hope that is of interest.

Regards, Phil.

P.S. I am subscribed to the list digest, which used to flush whatever had been posted at least once per day; now it doesn't seem to send anything until it has reached its threshold. Do others see this or is it just me? Can it be fixed? I would have replied earlier, and separately to the other replies, if I had received a digest at some point.
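[Editorial note: the a?…a worst-case probe described above can be reproduced along these lines with std::regex. This is a sketch; n is kept small so that even a backtracking engine finishes quickly, and the function name is invented.]

```cpp
#include <chrono>
#include <regex>
#include <string>

// Build the classic pathological pattern a?^n a^n and match it against
// a^n; returns whether it matched and reports the elapsed microseconds.
// Backtracking engines take time exponential in n on this family.
bool probe_pathological(int n, long long& microseconds) {
    std::string input(n, 'a');
    std::string pattern;
    for (int i = 0; i < n; ++i) pattern += "a?";
    pattern += input;
    std::regex re(pattern);
    auto t0 = std::chrono::steady_clock::now();
    bool matched = std::regex_match(input, re);
    auto t1 = std::chrono::steady_clock::now();
    microseconds =
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    return matched;
}
```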
Phil Endecott,

Match time linear in the length of the string works fine if the regex doesn't have alternations, but breaks down completely if they are present.

Imagine you have the regex "auto|bool|const|continue|...|([A-Za-z_][A-Za-z0-9_]*)" set up to match any C token (the ... in this case being the missing keywords, integers, etc.) but your input is "continuation". This regex will try each alternate, get almost all the way with "continue", but then have to backtrack and try the other alternates. Only at the very last alternate will the string match. And it would be possible to contrive even worse examples (a regex match over a dictionary, for example).

John Maddock was describing regex in general as not being able to have its complexity specified, rather than your specific example.
On 23/07/2021 16:23, Soronel Haetir via Boost wrote:
Phil Endecott,
Match time linear in the length of the string works fine if the regex doesn't have alternations, but breaks down completely if they are present.
Imagine you have the regex "auto|bool|const|continue|...|([A-Za-z_][A-Za-z0-9_]*)" set up to match any C token (the ... in this case being the missing keywords, integers, etc.) but your input is "continuation". This regex will try each alternate, get almost all the way with "continue", but then have to backtrack and try the other alternates. Only at the very last alternate will the string match. And it would be possible to contrive even worse examples (a regex match over a dictionary, for example).
We're getting well off topic here, but in your example backtracking is NOT required, since the pattern can be converted into a DFA that matches in linear time. However, the full Perl syntax definitely cannot be converted to a DFA (and it's not just the backreferences, either). So complaining that a Perl-compatible engine isn't O(N) is like complaining that your orange doesn't taste much like an apple ;)

With regard to performance testing, it's very, very hard to make a general comparison between libraries; indeed, I would tend to suggest that it's folly to even try - there are just too many variables in play. However, if your problem domain is more limited, then you can (and indeed should, as Phil has done) do so, and Phil's results are exactly what *he* needs, while others may come to other conclusions for their particular case.

In short, the orange grove is happy that Phil has found a nice apple :)

Best, John.
Phil Endecott wrote:
What I'm trying to do is to sanitise the input to an internet-exposed process, to reject malicious input like "'); drop table users;". As an example I'll look at input that is supposed to be base-64 encoded and no more than a couple of kilobytes long.
Typical-case performance doesn't matter much as this runs once per process invocation (and hence also caching the compiled regex doesn't help), but I do want to be sure that it doesn't have bad worst-case complexity in the face of pathological input. So my first test is a quick check with a regular expression that might trigger worst-case behaviour in a non-linear implementation:
a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa
But that's a separate case. This is a pathological regexp, not a "pathological" input string. If your regexes don't come from an external source, the performance of a pathological regex is not a potential security issue.
Hi Peter, Peter Dimov wrote:
Phil Endecott wrote:
What I'm trying to do is to sanitise the input to an internet-exposed process, to reject malicious input like "'); drop table users;". As an example I'll look at input that is supposed to be base-64 encoded and no more than a couple of kilobytes long.
Typical-case performance doesn't matter much as this runs once per process invocation (and hence also caching the compiled regex doesn't help), but I do want to be sure that it doesn't have bad worst-case complexity in the face of pathological input. So my first test is a quick check with a regular expression that might trigger worst-case behaviour in a non-linear implementation:
a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa
But that's a separate case. This is a pathological regexp, not a "pathological" input string. If your regexes don't come from an external source, the performance of a pathological regex is not a potential security issue.
There are certain pathological regexp matching algorithms that have certain pathological regexp inputs that have certain pathological input strings. How do I know if a regexp that I've written in my code is a pathological one for the particular regexp library that I'm using?
Are these results available? Apologies if you've already posted them.
I was going to post some numbers, but then I spotted that they got slower as the test system warmed up! Fundamentally I'm only interested in the complexity order, not even factors of 10X, so I'm not trying to produce very detailed numbers.

Previously I made some positive comments about the compile-time regular expression library CTRE. I had looked at a couple of presentations from CppCon 2018 & 2019, which presented the compile-time conversion of the regexp string into an NFA and then - according to the 2019 presentation - into a DFA, with O(N) matching complexity. But having looked at the code, it seems that unless I'm missing something the NFA-to-DFA conversion is not present! This is yet another backtracking NFA implementation. Boo.

My recommendation is therefore RE2.

Regards, Phil.
Phil Endecott wrote:
Here are the patterns that I have benchmarked for my base64 test:
1. [A-Za-z0-9/+=]{0,8192}
2. [A-Za-z0-9/+=]*
3. (?:[A-Za-z0-9/+]{4}){0,2048}(?:|(?:[A-Za-z0-9/+]{2}==)|(?:[A-Za-z0-9/+]{3}=))
4. (?:[A-Za-z0-9/+]{4})*(?:|(?:[A-Za-z0-9/+]{2}==)|(?:[A-Za-z0-9/+]{3}=))
Are these results available? Apologies if you've already posted them.
On 23.07.21 16:45, Phil Endecott via Boost wrote:
What I'm trying to do is to sanitise the input to an internet-exposed process, to reject malicious input like "'); drop table users;". As an example I'll look at input that is supposed to be base-64 encoded and no more than a couple of kilobytes long.

I'm going off on a tangent here, but I hope you're not actually trying to prevent SQL injection attacks by validating inputs with regular expressions. That would be a brittle and unnecessarily complex approach which would almost certainly either reject valid input, or fail to reject all attacks, or both. The only correct way to prevent SQL injection attacks is to always use parametrized statements.
-- Rainer Deyke (rainerd@eldwood.com)
participants (8)
- Andrey Semashev
- Dominique Devienne
- Gavin Lambert
- John Maddock
- Peter Dimov
- Phil Endecott
- Rainer Deyke
- Soronel Haetir