I'm doing some performance comparisons between Rogue Wave's Regexp class (http://www.roguewave.com/support/docs/tlsref/rwcregexp.cfm) and boost::regex using simple expressions like "^metal.*" and "[0-9]+" with regex_search(). boost::regex is slower by a factor of 6, measuring CPU time (on a SunBlade 100 running Solaris 8). Is there anything I can do to speed it up? I tried using different values of boost::regbase::flag_type_ to adjust how the regular expressions are interpreted, but with little effect. Any other ideas? (I'm using boost_1_27_0.) Thanks, -- Paul M. Dubuc
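For reference, a CPU-time comparison of the kind described could be set up roughly as follows. This is only a sketch under stated assumptions: it uses std::regex as a stand-in so that it has no dependency on Boost or Rogue Wave, and the names BenchResult and bench_search are hypothetical, not taken from the original post.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type: what one benchmark run reports.
struct BenchResult {
    std::size_t matches;  // number of input lines containing a match (one pass)
    double cpu_seconds;   // CPU time for all rounds, measured with std::clock
};

// Hypothetical helper: searches every line for `pattern`, `rounds` times over.
BenchResult bench_search(const std::string& pattern,
                         const std::vector<std::string>& lines,
                         int rounds)
{
    assert(rounds > 0);
    const std::regex re(pattern);  // compile once, outside the timed loop
    std::size_t matches = 0;
    const std::clock_t start = std::clock();
    for (int r = 0; r < rounds; ++r) {
        matches = 0;
        for (const std::string& line : lines)
            if (std::regex_search(line, re))
                ++matches;
    }
    const std::clock_t stop = std::clock();
    return { matches, double(stop - start) / CLOCKS_PER_SEC };
}
```

Calling bench_search("^metal.*", lines, 1000) and bench_search("[0-9]+", lines, 1000) on the same input, once per library under test, gives figures that can be compared directly. Keeping the regex compilation outside the timed loop means only the search cost is measured.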
On Sat, 13 Apr 2002 00:38:09 +0200, Paul Dubuc wrote:
Is there anything I can do to speed it up?
In my experience, the run-time performance of the boost regular expression machinery depends a lot on the allocator configured in the boost::match_results you pass to regex_match. I started a thread on this in the developer mailing list some months ago: http://lists.boost.org/MailArchives/boost/msg20512.php [it's kind of hard to search the archive, isn't it]
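To see how much allocation a search actually triggers, the allocator parameter of match_results can be swapped for one that counts. The sketch below is an assumption-laden illustration, not the allocator from the linked thread: it uses std::match_results (which takes its allocator as the second template parameter, as boost::match_results does) so that it builds without Boost, and CountingAllocator, g_allocations and allocations_for_search are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <regex>

// Hypothetical global counter, incremented on every allocate() call.
static std::size_t g_allocations = 0;

// Minimal allocator that forwards to operator new/delete and counts calls.
template <class T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) {}
    T* allocate(std::size_t n) {
        ++g_allocations;
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { ::operator delete(p); }
};
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

using SubMatch = std::sub_match<const char*>;
using CountingMatch = std::match_results<const char*, CountingAllocator<SubMatch>>;

// Returns how many allocations one regex_search made through the
// match_results allocator.
std::size_t allocations_for_search(const char* first, const char* last,
                                   const std::regex& re)
{
    assert(first <= last);
    CountingMatch m;
    const std::size_t before = g_allocations;
    std::regex_search(first, last, m, re);
    return g_allocations - before;
}
```

If the count per search is high, replacing the counting allocator with one that hands out memory from a reusable pool is the kind of change the advice above points at. The exact count depends on the library implementation, so it is best treated as a diagnostic rather than a fixed number.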
Thanks for the information. I didn't think to search the developer mailing list archives. I didn't see much in the way of replies to your messages. Is your allocator, or another one that improves performance, available for use? The performance of regex searching is very important to my application, but it tends to use very simple expressions. If there is any way I can simplify my use of boost::regex to improve performance, I would appreciate the help very much. Otherwise, I will probably have to look for another alternative :-{ Thanks, Paul Dubuc -- Paul M. Dubuc
Is there anything I can do to speed it up?
Other than using a custom allocator as suggested, probably not. One unfortunate side effect of regular expressions is that you pay for everything that you *may* use rather than just what you do. In particular, if you don't want: a) wide character regexes, b) support for backreferences, c) support for marked subexpressions, then there are significantly faster algorithms that can be used; of course these don't work if you do want those features :-( C based libraries can also use alloca, which generally gives at least a 2x performance increase. As far as I know, Rogue Wave's lib uses all of the above - that gives better performance at the expense of features. John Maddock http://ourworld.compuserve.com/homepages/john_maddock/index.htm
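For patterns as simple as the two in the original question, none of the features listed above are needed, and a hand-written scan can replace the regex entirely. The following is a sketch of that idea, not a faster regex algorithm: starts_with_metal, first_digit_run and Span are hypothetical names, and the functions only mirror what a search for "^metal.*" or "[0-9]+" would report on a single line of narrow-character text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Mirrors a search for "^metal.*": true if the text begins with "metal".
bool starts_with_metal(const std::string& text) {
    static const std::string prefix = "metal";
    return text.compare(0, prefix.size(), prefix) == 0;
}

// Position and length of a match; pos is std::string::npos if there is none.
struct Span {
    std::size_t pos;
    std::size_t len;
};

// Mirrors a search for "[0-9]+": finds the first run of decimal digits.
Span first_digit_run(const std::string& text) {
    const char* digits = "0123456789";
    const std::size_t begin = text.find_first_of(digits);
    if (begin == std::string::npos)
        return { std::string::npos, 0 };
    std::size_t end = text.find_first_not_of(digits, begin);
    if (end == std::string::npos)
        end = text.size();
    assert(end > begin);  // a run found by find_first_of has at least one digit
    return { begin, end - begin };
}
```

This trades generality for speed: each function handles exactly one pattern, so it only makes sense when the set of expressions is small and known in advance.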
John, Speaking as a satisfied user of the regular expression library, always looking to help make it better: I was under the impression that point (a) didn't cost anything in Boost::Regex because it was templatized on the character type. Am I mistaken? Case (b) is fairly rarely used, but (c) is common. In any event, it is certainly true that after compiling the regular expression, you know whether these are needed. So if there are faster algorithms for these special cases, could they be incorporated into the library without much overhead?
C based libraries can also use alloca, which generally gives at least a 2x performance increase.
I know that alloca is not 'officially' available in portable C++. But I think most C++ compilers will handle C-like usages for this construct. I know we use it successfully on the compilers we use (gcc, Sun CC). So if there is someplace it would be useful, you could almost certainly get away with it, probably #ifdef'd around for safety. George Heintzelman georgeh@aya.yale.edu
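The "#ifdef'd around for safety" idea might look something like the sketch below. Everything in it is an assumption made for illustration: the macro HAVE_ALLOCA_SKETCH and the function count_digit_runs are hypothetical, the <alloca.h> header name is assumed for gcc and Sun CC on Unix-like systems, and the scratch buffer merely stands in for the temporary state a matcher would need. It does not show how any of the libraries discussed here use alloca.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: gcc and Sun CC provide alloca via <alloca.h>.
#if defined(__GNUC__) || defined(__SUNPRO_CC)
#  include <alloca.h>
#  define HAVE_ALLOCA_SKETCH 1
#else
#  define HAVE_ALLOCA_SKETCH 0
#endif

// Counts runs of decimal digits in [text, text + len). The start offset of
// each run is recorded in a scratch buffer that lives on the stack when
// alloca is available and the input is small, and on the heap otherwise.
std::size_t count_digit_runs(const char* text, std::size_t len) {
    assert(text != nullptr || len == 0);
    const std::size_t kStackLimit = 1024;   // max elements taken from the stack
    std::vector<std::size_t> heap_scratch;  // fallback storage
    std::size_t* starts = nullptr;
#if HAVE_ALLOCA_SKETCH
    // alloca memory is released when this function returns, not at the end
    // of the enclosing block, so the pointer stays valid below.
    if (len > 0 && len <= kStackLimit)
        starts = static_cast<std::size_t*>(alloca(len * sizeof(std::size_t)));
#endif
    if (starts == nullptr) {
        heap_scratch.resize(len > 0 ? len : 1);
        starts = heap_scratch.data();
    }
    std::size_t runs = 0;
    bool in_run = false;
    for (std::size_t i = 0; i < len; ++i) {
        const bool digit = text[i] >= '0' && text[i] <= '9';
        if (digit && !in_run)
            starts[runs++] = i;  // remember where this run began
        in_run = digit;
    }
    return runs;
}
```

The size cap matters: alloca has no way to report failure, so large requests have to go to the heap path regardless of compiler support.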
Speaking as a satisfied user of the regular expression library, always looking to help make it better:
I was under the impression that point (a) didn't cost anything in Boost::Regex because it was templatized on the character type. Am I mistaken? Case (b) is fairly rarely used, but (c) is common. In any event, it is certainly true that after compiling the regular expression, you know whether these are needed. So if there are faster algorithms for these special cases, could they be incorporated into the library without much overhead?
The point is that there is a wide range of differing state machine representations available. To make "automatic" use of these, one would have to effectively implement several different regex state machines and switch between them based on run time detection (what kind of expression you have); this is a lot of work as well as adding code bloat. With respect to (a), it is true that narrow character regexes make some optimisations now, but many more are available, mainly when in combination with (b) and (c).
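A much reduced version of "switch between them based on run time detection" can be sketched with just two back ends: a plain substring search for patterns that contain no metacharacters, and a general regex for everything else. This is only an illustration of the dispatch idea, assuming std::regex as the general engine; DispatchingSearcher is a hypothetical class, and a real library would have to choose between several state machine representations rather than two.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical wrapper that picks a matching strategy when it is constructed.
class DispatchingSearcher {
public:
    explicit DispatchingSearcher(const std::string& pattern)
        : pattern_(pattern),
          // "Run time detection": a pattern without any of these characters
          // is treated as a plain literal.
          literal_(pattern.find_first_of("\\^$.|?*+()[]{}") == std::string::npos)
    {
        if (!literal_)
            regex_.assign(pattern);  // compile only when the regex is needed
    }

    bool is_literal() const { return literal_; }

    // True if the pattern occurs anywhere in `text`.
    bool search(const std::string& text) const {
        if (literal_)
            return text.find(pattern_) != std::string::npos;
        return std::regex_search(text, regex_);
    }

private:
    std::string pattern_;
    bool literal_;
    std::regex regex_;
};
```

Even this two-way split shows the cost mentioned above: every additional back end is more code to write, test and maintain, and the detection step has to be conservative so that it never picks an engine that cannot handle the pattern.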
C based libraries can also use alloca, which generally gives at least a 2x performance increase.
I know that alloca is not 'officially' available in portable C++. But I think most C++ compilers will handle C-like usages for this construct. I know we use it successfully on the compilers we use (gcc, Sun CC). So if there is someplace it would be useful, you could almost certainly get away with it, probably #ifdef'd around for safety.
Point taken, however it means a complete rewrite (and adds to the maintenance a lot - more config options to test etc). Personally I would rather see a separate regex type with limited usefulness, but better performance when it can be used. John Maddock http://ourworld.compuserve.com/homepages/john_maddock/index.htm
On Fri, 12 Apr 2002 15:38:09 -0700, Paul Dubuc wrote: Slightly off-topic, but ... ... have a look at GRETA, my own regular expression template library. I have found it to be as much as 7x faster than boost regex. It's free for non-commercial use. You can download the source code from: http://research.microsoft.com/projects/greta I'd love any feedback you might have. The interface is not nearly as nice or as rich as Dr. Maddock's, but it's just about the fastest C/C++ regex library I've tested. Wide chars, backreferences, full Perl 5 syntax, etc., etc. Eric P.S. I work for Microsoft, but my posts are my own and do not express the views of Microsoft.
Eric Niebler wrote:
Slightly off-topic, but ...
Would you be interested in getting this back on topic? <evil grin>
... have a look at GRETA, my own regular expression template library. I have found it to be as much as 7x faster than boost regex. It's free for non-commercial use.
Does this mean that it (or something derived from it) couldn't be boostified?
I'd love any feedback you might have. The interface is not nearly as nice or as rich as Dr. Maddock's, but it's just about the fastest C/C++ regex library I've tested. Wide chars, backreferences, full Perl 5 syntax, etc., etc.
IMHO, it would be nice to have sort of regex light (less rich functionality, better performance) in boost, and GRETA might be a good starting point. It would be helpful if the parts of the interface common to both libraries were harmonized. Do you think this would be possible?
Just a naive question: If the regular expression parsing performance is of paramount importance to you, why don't you use the native implementation, regcomp(), regexec(), and friends? The API isn't that nice for C++, but doubtlessly you can get along, and doubtlessly, these routines are pretty fast.
Peter Simons wrote:
Just a naive question: If the regular expression parsing performance is of paramount importance to you,
What I said is that I might like to see something like "regex light" in Boost. IIRC John Maddock, the creator of the Boost Regular Expressions Library, recently wrote something similar either here or in the developers list.
why don't you use the native implementation, regcomp(), regexec(), and friends?
Who says I don't use them? I prefer platform and C++ implementation independent libraries because I like my code to be portable.
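For readers who have not used the POSIX routines from C++, a thin wrapper along the following lines takes most of the awkwardness out of the C API. It is a sketch under assumptions: <regex.h> is a POSIX header rather than part of standard C++, which is exactly the portability concern raised above, and the class name PosixRegex is hypothetical.

```cpp
#include <cassert>
#include <regex.h>  // POSIX regcomp/regexec/regfree, not standard C++
#include <string>

// Hypothetical RAII wrapper around a compiled POSIX extended regex.
class PosixRegex {
public:
    explicit PosixRegex(const char* pattern)
        : ok_(regcomp(&re_, pattern, REG_EXTENDED) == 0) {}
    ~PosixRegex() { if (ok_) regfree(&re_); }

    PosixRegex(const PosixRegex&) = delete;
    PosixRegex& operator=(const PosixRegex&) = delete;

    // True if the pattern compiled successfully.
    bool valid() const { return ok_; }

    // Searches `text`; on success optionally copies the matched substring.
    bool search(const std::string& text, std::string* matched = nullptr) const {
        assert(ok_);
        regmatch_t m[1];
        if (regexec(&re_, text.c_str(), 1, m, 0) != 0)
            return false;
        if (matched != nullptr)
            matched->assign(text, m[0].rm_so, m[0].rm_eo - m[0].rm_so);
        return true;
    }

private:
    regex_t re_;
    bool ok_;
};
```

The destructor calls regfree only when compilation succeeded, and copying is disabled because regex_t owns resources that cannot be duplicated by a plain member copy.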
"Thomas Maeder"
Eric Niebler writes:
Slightly off-topic, but ...
Would you be interested in getting this back on topic? <evil grin>
Getting there. :-) See below.
... have a look at GRETA, my own regular expression template library. I have found it to be as much as 7x faster than boost regex. It's free for non-commercial use.
Does this mean that it (or something derived from it) couldn't be boostified?
Since you asked ... I fought tooth and nail to get permission to submit to Boost, but our lawyers wouldn't budge. But the winds are changing in the VC++ group -- there are some who would like to see this happen. I'd still like very much to combine the best of boost regex with the best of GRETA and make it publicly available, either in boost or with a boost-like license. We'll see.
I'd love any feedback you might have. The interface is not nearly as nice or as rich as Dr. Maddock's, but it's just about the fastest C/C++ regex library I've tested. Wide chars, backreferences, full Perl 5 syntax, etc., etc.
IMHO, it would be nice to have sort of regex light (less rich functionality, better performance) in boost, and GRETA might be a good starting point. It would be helpful if the parts of the interface common to both libraries were harmonized. Do you think this would be possible?
The interfaces are vastly different, unfortunately. I would like to put a more boost-ish interface on GRETA, because the boost interface is nicer. And it would give me perspective on regex interface issues that will be valuable during the standardization process, which I plan to be involved in. If I can find the time, I'll do it. (GRETA is a labor of love that I pursue in my spare time -- it has next to nothing to do with my job.) Note that "harmonizing" boost regex and GRETA would be difficult because their flavors of regular expressions have slightly different semantics. Boost has POSIX-like semantics, which makes stronger guarantees about the "left-most longest" rule. GRETA has Perl-like semantics, which makes weaker guarantees, but lets me cut some corners. FYI -- GRETA isn't "regex lite." It's a backtracking regex engine, so it's heavier-weight than, say, a DFA-based regex engine that doesn't do backreferences. It's faster than boost, but that's more because I use a different implementation, not a fundamentally different algorithm per se. Eric
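The difference between the two semantics shows up with an alternation such as "a|ab" searched in "ab". A Perl-like engine takes the first alternative that succeeds, while the POSIX leftmost-longest rule asks for the longest match starting at the leftmost position. The sketch below uses the two grammars of std::regex as stand-ins for the two styles; it does not involve boost regex or GRETA, and first_match is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the text of the first overall match, or an empty string if the
// pattern does not match anywhere in `text`.
std::string first_match(const std::string& text,
                        const std::string& pattern,
                        std::regex::flag_type grammar)
{
    assert(!pattern.empty());
    const std::regex re(pattern, grammar);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

With std::regex::ECMAScript (Perl-like), first_match("ab", "a|ab", ...) stops at the first alternative. With std::regex::extended (POSIX), the C++ standard specifies the leftmost-longest rule, so a conforming implementation should prefer the longer alternative. Guaranteeing that longest match is the extra work that a Perl-style engine is allowed to skip.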
Hi, I downloaded your software and have a question concerning the license. I'd like to build a template library for PHP and therefore need regex matching. I don't want to sell this library; I think I'll use a BSD style licence. Am I allowed to use GRETA for this purpose? Greetings, Uwe -- Dr. rer. nat. Uwe Schmitt, uwe.schmitt@procoders.net, http://www.procoders.net "Computer science is no more about computers than astronomy is about telescopes." (Dijkstra)
I regret having posted any information about greta to this list -- it's pretty off-topic. My apologies. But to answer your question, no the license does not allow this. (Don't blame me, I wrote the code not the license.) You should stick with boost regex. If you are concerned about performance, I know that John Maddock is hard at work improving the performance of regex++. If you have any more questions, please send them to me directly. Happy grepping, Eric Niebler ericne@microsoft.com http://research.microsoft.com/projects/greta
participants (7)
- Eric Niebler
- George A. Heintzelman
- John Maddock
- Paul Dubuc
- Peter Simons
- Thomas Maeder
- Uwe Schmitt