Re: [boost] Regexp performance guarantees

21 Jul 2021

      On 7/21/21 6:19 PM, Phil Endecott via Boost wrote:
...
Dear Experts,
I recently wanted to add some input-sanitising code to a C++
web application, and as I had some existing Javascript code
I decided to port its regex-based input checking to the C++
code. This didn't go as well as I had expected!
Here's an example of the sort of thing I have:
if (!regexp_match(s, regex("[A-Za-z0-9-_/]{1,8192}"))) throw 
MalformedInput();
The Javascript equivalent seems to be fine. In C++, with
libstdc++'s regex implementation, it seems to take about 2
seconds to run. Boost.Regex didn't like the '-' after the '9'
but when I fixed that the execution time became negligible.
Boost.Xpressive seems also to be fast.
I haven't benchmarked regex implementations for a few years, but my past 
experience was that libstdc++ std::regex indeed was very slow. I don't 
remember the numbers, but my general impression was that all std::regex 
implementations were slow compared to Boost.Regex, which seems to be one 
of the fastest implementations. Boost.Xpressive is slower than 
Boost.Regex, but faster than std::regex. Its main advantage was that it 
is header-only. Basically, use Boost.Regex unless you really can't.
...
On investigation it seems that this is due to there being two
different algorithms, one of which (DFA) is linear in the string
size but exponential in the pattern size (which in this case seems
to mean exponential in the max repeat value of 8192), and the
other (backtracking) is linear in the pattern size but exponential
in the string size.
It seems that the standard deliberately doesn't specify complexity
guarantees for std::regex, leaving the choice of algorithm to the
implementation. I find this a bit surprising. It's true that
both algorithms have their strengths, but not knowing which
you are going to get is the worst of both worlds - you would
have to assume that it will be exponential in both pattern
and string sizes in portable code! (Imagine a container that
doesn't disclose whether it has vector or list complexity!)
So I'm wondering if it would be useful for regex libraries to
somehow be explicit about which algorithm they are and what
the complexity is.
Compare, for example, with the text search algorithms
(Boyer-Moore etc) which indicate the algorithm in their names.
Could we have regex_dfa / regex_backtracking, or something?
Now having said that, I think that what I really need for my
application is something that is worst-case linear in both the
pattern size and the input string. I'm not sure if the
apparently-fast Boost.Regex algorithm has a worst-case input
that is slow. ("Regular Expression Denial Of Service"). It's
clear that my example can be written without a regex as a
simple linear loop, and maybe some regex libraries have
heuristics that do that.
Thoughts anyone?
Have you tried to cache the regex object? I suspect, parsing the regex 
might have a non-negligible cost.

As for the complexity specification, it obviously depends on the regex. 
I don't think it is possible to specify an abstract complexity estimate 
for a regex library. The best you could have is to specify the 
complexity of various elements of the expression - both when 
constructing a regex and when applying it.

Complexity is not everything you need to know when you care about 
performance. You should benchmark it anyway. It is not unusual to have a 
higher complexity algorithm outperform a lower complexity one in some 
conditions, which may happen to match yours. Even two algorithms having 
the same formal complexity estimate may perform very differently in 
practice. I remember Boost.Regex docs had some benchmark results, 
although by now they are probably outdated.

Re: [boost] Regexp performance guarantees

Andrey Semashev