On 7/21/21 6:19 PM, Phil Endecott via Boost wrote:
Dear Experts,
I recently wanted to add some input-sanitising code to a C++ web application, and as I had some existing Javascript code I decided to port its regex-based input checking to the C++ code. This didn't go as well as I had expected!
Here's an example of the sort of thing I have:
if (!regexp_match(s, regex("[A-Za-z0-9-_/]{1,8192}"))) throw MalformedInput();
The Javascript equivalent seems to be fine. In C++, with libstdc++'s regex implementation, it seems to take about 2 seconds to run. Boost.Regex didn't like the '-' after the '9' but when I fixed that the execution time became negligible. Boost.Xpressive seems also to be fast.
I haven't benchmarked regex implementations for a few years, but my past experience was that libstdc++ std::regex indeed was very slow. I don't remember the numbers, but my general impression was that all std::regex implementations were slow compared to Boost.Regex, which seems to be one of the fastest implementations. Boost.Xpressive is slower than Boost.Regex, but faster than std::regex. Its main advantage was that it is header-only. Basically, use Boost.Regex unless you really can't.
On investigation it seems that this is due to there being two different algorithms, one of which (DFA) is linear in the string size but exponential in the pattern size (which in this case seems to mean exponential in the max repeat value of 8192), and the other (backtracking) is linear in the pattern size but exponential in the string size.
It seems that the standard deliberately doesn't specify complexity guarantees for std::regex, leaving the choice of algorithm to the implementation. I find this a bit surprising. It's true that both algorithms have their strengths, but not knowing which you are going to get is the worst of both worlds - you would have to assume that it will be exponential in both pattern and string sizes in portable code! (Imagine a container that doesn't disclose whether it has vector or list complexity!)
So I'm wondering if it would be useful for regex libraries to somehow be explicit about which algorithm they are and what the complexity is.
Compare, for example, with the text search algorithms (Boyer-Moore etc) which indicate the algorithm in their names. Could we have regex_dfa / regex_backtracking, or something?
Now having said that, I think that what I really need for my application is something that is worst-case linear in both the pattern size and the input string. I'm not sure if the apparently-fast Boost.Regex algorithm has a worst-case input that is slow. ("Regular Expression Denial Of Service"). It's clear that my example can be written without a regex as a simple linear loop, and maybe some regex libraries have heuristics that do that.
Thoughts anyone?
Have you tried to cache the regex object? I suspect, parsing the regex might have a non-negligible cost. As for the complexity specification, it obviously depends on the regex. I don't think it is possible to specify an abstract complexity estimate for a regex library. The best you could have is to specify the complexity of various elements of the expression - both when constructing a regex and when applying it. Complexity is not everything you need to know when you care about performance. You should benchmark it anyway. It is not unusual to have a higher complexity algorithm outperform a lower complexity one in some conditions, which may happen to match yours. Even two algorithms having the same formal complexity estimate may perform very differently in practice. I remember Boost.Regex docs had some benchmark results, although by now they are probably outdated.