regex performance


Paul Dubuc

12 Apr 2002

10:38 p.m.

I'm doing some performance comparisons between Rogue Wave's Regexp class (http://www.roguewave.com/support/docs/tlsref/rwcregexp.cfm) and boost::regex, using simple expressions like "^metal.*" and "[0-9]+" with regex_search(). boost::regex is slower by a factor of 6, measuring CPU time (on a SunBlade 100 running Solaris 8).

Is there anything I can do to speed it up? I tried using different values of boost::regbase::flag_type_ to adjust how the regular expressions are interpreted, but with little effect. Any other ideas? (I'm using boost_1_27_0.)

Thanks,
--
Paul M. Dubuc
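For anyone wanting to reproduce this kind of comparison, a minimal timing sketch follows. It is not the code used for the measurement above: it assumes a C++11 compiler and uses std::regex as a stand-in for boost::regex (the regex_search() call has the same shape), and the names BenchResult and time_regex_search are made up for illustration. CPU time is taken from std::clock().

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type: number of successful searches and CPU seconds used.
struct BenchResult {
    std::size_t hits;
    double cpu_seconds;
};

// Runs regex_search() over every line, `repeats` times, and reports CPU time.
// The pattern is compiled once, outside the timed loop, so only searching
// is measured.
BenchResult time_regex_search(const std::string& pattern,
                              const std::vector<std::string>& lines,
                              int repeats) {
    assert(repeats > 0);
    const std::regex re(pattern);

    std::size_t hits = 0;
    const std::clock_t start = std::clock();
    for (int r = 0; r < repeats; ++r) {
        for (const std::string& line : lines) {
            if (std::regex_search(line, re)) {
                ++hits;
            }
        }
    }
    const std::clock_t stop = std::clock();

    return BenchResult{hits, static_cast<double>(stop - start) / CLOCKS_PER_SEC};
}
```

Running the same loop with "^metal.*" and "[0-9]+" against each library, over a representative set of input lines, should give CPU figures that can be compared directly.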


Thomas Maeder

13 Apr 2002

9:49 a.m.

On Sat, 13 Apr 2002 00:38:09 +0200, Paul Dubuc wrote:

Is there anything I can do to speed it up?

In my experience, the run-time performance of the boost regular expression machinery depends a lot on the allocator configured in the boost::match_results you pass to regex_match. I started a thread on this in the developer mailing list some months ago:

http://lists.boost.org/MailArchives/boost/msg20512.php

[it's kind of hard to search the archive, isn't it]
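To see where the allocator enters the picture, here is a small sketch. It assumes a C++11 compiler and uses std::match_results, whose second template parameter is an allocator for the sub_match objects, in place of boost::match_results. CountingAllocator, g_allocations and first_number are hypothetical names. The allocator only counts requests and forwards them to std::allocator, so it shows that a search goes through the configured allocator, not how to make it faster.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <regex>
#include <string>

// Hypothetical global counter of allocation requests made through the allocator.
static std::size_t g_allocations = 0;

// Minimal C++11 allocator that counts requests and forwards to std::allocator.
template <class T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() noexcept = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        ++g_allocations;
        return std::allocator<T>().allocate(n);
    }
    void deallocate(T* p, std::size_t n) noexcept {
        std::allocator<T>().deallocate(p, n);
    }
};

template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept {
    return true;
}
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept {
    return false;
}

// match_results whose sub_match storage comes from the counting allocator.
using SubMatch = std::sub_match<const char*>;
using CountedResults = std::match_results<const char*, CountingAllocator<SubMatch>>;

// Returns the first run of digits in `text`, or an empty string if there is none.
std::string first_number(const char* text) {
    assert(text != nullptr);
    static const std::regex digits("[0-9]+");
    const char* end = text + std::char_traits<char>::length(text);

    CountedResults m;
    if (std::regex_search(text, end, m, digits)) {
        return m[0].str();
    }
    return std::string();
}
```

Replacing the counting allocator with a pooling or arena allocator is the kind of change the message above is describing; whether it helps, and by how much, would have to be measured.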

Paul Dubuc

15 Apr 2002

1:56 p.m.

New subject: [Boost-Users] Re: regex performance

Thomas Maeder wrote:

On Sat, 13 Apr 2002 00:38:09 +0200, Paul Dubuc wrote:

Is there anything I can do to speed it up?

In my experience, the run-time performance of the boost regular expression machinery depends a lot on the allocator configured in the boost::match_results you pass to regex_match.

I started a thread on this in the developer mailing list some months ago:

http://lists.boost.org/MailArchives/boost/msg20512.php

[it's kind of hard to search the archive, isn't it]

Thanks for the information. I didn't think to search the developer mailing list archives. I didn't see much in the way of replies to your messages. Is your allocator, or another one that improves performance, available for use?

The performance of regex searching is very important to my application, but it tends to use very simple expressions. If there is any way I can simplify my use of boost::regex to improve performance, I would appreciate the help very much. Otherwise, I will probably have to look for another alternative :-{

Thanks,
Paul Dubuc
--
Paul M. Dubuc

John Maddock

16 Apr 2002

11:19 a.m.

New subject: [Boost-Users] Re: regex performance

Is there anything I can do to speed it up?

Other than using a custom allocator as suggested, probably not. One unfortunate side effect of regular expressions is that you pay for everything that you *may* use rather than just what you do. In particular, if you don't want:

a) wide character regexes.
b) support for backreferences.
c) support for marked subexpressions.

then there are significantly faster algorithms that can be used; of course these don't work if you do want those features :-(

C based libraries can also use alloca, which generally gives at least a 2x performance increase. As far as I know, Rogue Wave's lib uses all of the above - that gives better performance at the expense of features.

John Maddock
http://ourworld.compuserve.com/homepages/john_maddock/index.htm
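As an illustration of what "paying only for what you use" can look like, here is a hand-written scanner that does the same job as searching for "[0-9]+" when no captures, backreferences or wide characters are needed. It is only a sketch: Span and search_digits are hypothetical names, and it is not taken from Boost or Rogue Wave. It behaves as a two-state automaton (outside a digit run, inside a digit run), makes a single pass over the input and allocates nothing.

```cpp
#include <cassert>
#include <cstddef>

// Half-open range [begin, end) of the match; both null when nothing matched.
struct Span {
    const char* begin;
    const char* end;
};

inline bool is_digit(char c) {
    return c >= '0' && c <= '9';
}

// Finds the leftmost, longest run of ASCII digits in [first, last).
// Equivalent in effect to searching for "[0-9]+", but with no capture
// bookkeeping and no memory allocation.
Span search_digits(const char* first, const char* last) {
    assert(first <= last);
    for (const char* p = first; p != last; ++p) {
        if (is_digit(*p)) {
            // State change: now inside a digit run; extend it as far as possible.
            const char* q = p + 1;
            while (q != last && is_digit(*q)) {
                ++q;
            }
            return Span{p, q};
        }
    }
    return Span{nullptr, nullptr};
}
```

A general engine cannot make these shortcuts unless it knows, after compiling the expression, that none of the heavier features are in play.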

George A. Heintzelman

1:06 p.m.

New subject: [Boost-Users] Re: regex performance

Is there anything I can do to speed it up?

Other than using a custom allocator as suggested, probably not. One unfortunate side effect of regular expressions is that you pay for everything that you *may* use rather than just what you do. In particular, if you don't want:

a) wide character regexes.
b) support for backreferences.
c) support for marked subexpressions.

John,

Speaking as a satisfied user of the regular expression library, always looking to help make it better:

I was under the impression that point (a) didn't cost anything in Boost::Regex because it was templatized on the character type. Am I mistaken? Case (b) is fairly rarely used, but (c) is common. In any event, it is certainly true that after compiling the regular expression, you know whether these are needed. So if there are faster algorithms for these special cases, could they be incorporated into the library without much overhead?

C based libraries can also use alloca, which generally gives at least a 2x performance increase.

I know that alloca is not 'officially' available in portable C++. But I think most C++ compilers will handle C-like usages for this construct. I know we use it successfully on the compilers we use (gcc, Sun CC). So if there is someplace it would be useful, you could almost certainly get away with it, probably #ifdef'd around for safety.

George Heintzelman
georgeh@aya.yale.edu
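A sketch of what "#ifdef'd around for safety" might look like is given below. Everything in it is an assumption made for illustration: the SKETCH_HAVE_ALLOCA macro, the 1024-byte limit, and the contains_simple function with its mini-syntax (literal characters plus '.' as a wildcard). Only the GCC/Clang builtin __builtin_alloca is enabled; other compilers, including Sun CC, take the heap path here. The matcher keeps two state sets whose size depends on the pattern length, which is the kind of scratch memory a stack allocation can serve.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

#if defined(__GNUC__)            // GCC and Clang provide __builtin_alloca
#  define SKETCH_HAVE_ALLOCA 1
#else
#  define SKETCH_HAVE_ALLOCA 0   // unknown compiler: always use the heap
#endif

// Returns true if `pattern` occurs anywhere in `text`.
// Hypothetical mini-syntax: every character is a literal except '.',
// which matches any single character.
bool contains_simple(const char* pattern, const char* text) {
    assert(pattern != nullptr && text != nullptr);
    const std::size_t m = std::strlen(pattern);
    const std::size_t bytes = 2 * (m + 1);   // two state sets of m + 1 flags

    std::vector<unsigned char> heap;         // used only on the fallback path
    unsigned char* scratch = nullptr;
#if SKETCH_HAVE_ALLOCA
    if (bytes <= 1024) {
        // Stack memory: released automatically when this function returns.
        scratch = static_cast<unsigned char*>(__builtin_alloca(bytes));
    }
#endif
    if (scratch == nullptr) {
        heap.resize(bytes);
        scratch = heap.data();
    }

    unsigned char* cur = scratch;            // cur[i]: first i pattern chars matched
    unsigned char* next = scratch + (m + 1);
    std::memset(cur, 0, m + 1);

    for (const char* p = text;; ++p) {
        cur[0] = 1;                          // a match may start at any position
        if (cur[m]) return true;             // whole pattern matched
        if (*p == '\0') return false;
        std::memset(next, 0, m + 1);
        for (std::size_t i = 0; i < m; ++i) {
            if (cur[i] && (pattern[i] == '.' || pattern[i] == *p)) {
                next[i + 1] = 1;
            }
        }
        std::swap(cur, next);
    }
}
```

The size cap matters: alloca gives no failure indication, so large requests are better sent to the heap.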

John Maddock

18 Apr 2002

11:24 a.m.

New subject: [Boost-Users] Re: regex performance

Speaking as a satisfied user of the regular expression library, always looking to help make it better:

I was under the impression that point (a) didn't cost anything in Boost::Regex because it was templatized on the character type. Am I mistaken? Case (b) is fairly rarely used, but (c) is common. In any event, it is certainly true that after compiling the regular expression, you know whether these are needed. So if there are faster algorithms for these special cases, could they be incorporated into the library without much overhead?

The point is that there is a wide range of differing state machine representations available. To make "automatic" use of these, one would have to effectively implement several different regex state machines and switch between them based on run-time detection (what kind of expression you have). This is a lot of work, as well as adding code bloat. With respect to (a), it is true that narrow character regexes make some optimisations now, but many more are available, mainly when in combination with (b) and (c).

C based libraries can also use alloca, which generally gives at least a 2x performance increase.

I know that alloca is not 'officially' available in portable C++. But I think most C++ compilers will handle C-like usages for this construct. I know we use it successfully on the compilers we use (gcc, Sun CC). So if there is someplace it would be useful, you could almost certainly get away with it, probably #ifdef'd around for safety.

Point taken; however, it means a complete rewrite (and adds a lot to the maintenance - more config options to test etc). Personally I would rather see a separate regex type with limited usefulness, but better performance when it can be used.

John Maddock
http://ourworld.compuserve.com/homepages/john_maddock/index.htm

Eric Niebler

20 Apr 2002

5:53 a.m.

Slightly off-topic, but ...

... have a look at GRETA, my own regular expression template library. I have found it to be as much as 7x faster than boost regex. It's free for non-commercial use. You can download the source code from:

http://research.microsoft.com/projects/greta

I'd love any feedback you might have. The interface is not nearly as nice or as rich as Dr. Maddock's, but it's just about the fastest C/C++ regex library I've tested. Wide chars, backreferences, full Perl 5 syntax, etc., etc.

Eric

P.S. I work for Microsoft, but my posts are my own and do not express the views of Microsoft.

On Fri, 12 Apr 2002 15:38:09 -0700, Paul Dubuc wrote:

I'm doing some performance comparisons between Rogue Wave's Regexp class (http://www.roguewave.com/support/docs/tlsref/rwcregexp.cfm) and boost::regex using simple expressions like "^metal.*" and "[0-9]+" with regex_search(). regex is slower by a factor of 6, measuring cpu time (on a SunBlade 100 running Solaris 8). Is there anything I can do to speed it up? I tried using different values of boost::regbase::flag_type_ to adjust how the regular expressions are interpreted, but with little effect. Any other ideas? (I'm using boost_1_27_0.)

Thanks,

Thomas Maeder

8:44 a.m.

Eric Niebler <yg-boost-users@m.gmane.org> writes:

Slightly off-topic, but ...

Would you be interested in getting this back on topic? <evil grin>

... have a look at GRETA, my own regular expression template library. I have found it to be as much as 7x faster than boost regex. It's free for non-commercial use.

Does this mean that it (or something derived from it) couldn't be boostified?

I'd love any feedback you might have. The interface is not nearly as nice or as rich as Dr. Maddock's, but it's just about the fastest C/C++ regex library I've tested. Wide chars, backreferences, full Perl 5 syntax, etc., etc.

IMHO, it would be nice to have a sort of "regex light" (less rich functionality, better performance) in boost, and GRETA might be a good starting point. It would be helpful if the parts of the interface common to both libraries were harmonized. Do you think this would be possible?

Peter Simons

3:43 p.m.

New subject: [Boost-Users] Re: regex performance

Just a naive question: If the regular expression parsing performance is of paramount importance to you, why don't you use the native implementation, regcomp(), regexec(), and friends? The API isn't that nice for C++, but doubtlessly you can get along, and doubtlessly, these routines are pretty fast.
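For reference, a minimal sketch of wrapping the POSIX calls mentioned here so they are a little less awkward from C++. It assumes a POSIX platform that provides <regex.h>; the class name PosixRegex is made up for illustration. REG_NOSUB is passed because only a yes/no answer is wanted, so no subexpression positions are reported.

```cpp
#include <cassert>
#include <regex.h>   // POSIX header, not part of standard C++

// Hypothetical RAII wrapper: compiles once with regcomp(), frees with regfree().
class PosixRegex {
public:
    explicit PosixRegex(const char* pattern)
        : ok_(regcomp(&re_, pattern, REG_EXTENDED | REG_NOSUB) == 0) {
        assert(pattern != nullptr);
    }

    ~PosixRegex() {
        if (ok_) {
            regfree(&re_);   // only free a successfully compiled expression
        }
    }

    PosixRegex(const PosixRegex&) = delete;
    PosixRegex& operator=(const PosixRegex&) = delete;

    // True if the pattern compiled without error.
    bool valid() const { return ok_; }

    // True if the pattern matches anywhere in the NUL-terminated text.
    bool search(const char* text) const {
        return ok_ && regexec(&re_, text, 0, nullptr, 0) == 0;
    }

private:
    regex_t re_;
    bool ok_;
};
```

With this, PosixRegex("^metal.*").search(line) plays the role of the regex_search() call in the original question. How fast it is depends on the platform's C library, so it would need to be measured like any other candidate.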

Thomas Maeder

4:51 p.m.

Peter Simons <simons+boost@cryp.to> writes:

Just a naive question: If the regular expression parsing performance is of paramount importance to you,

What I said is that I might like to see something like "regex light" in Boost. IIRC John Maddock, the creator of the Boost Regular Expressions Library, recently wrote something similar either here or in the developers list.

why don't you use the native implementation, regcomp(), regexec(), and friends?

Who says I don't use them? I prefer platform and C++ implementation independent libraries because I like my code to be portable.

Eric Niebler

5:29 p.m.

"Thomas Maeder" <yg-boost-users@m.gmane.org> wrote in message news:m3hem67oa3.fsf@madbox.local...

Eric Niebler <yg-boost-users@m.gmane.org> writes:

Slightly off-topic, but ...

Would you be interested in getting this back on topic? <evil grin>

Getting there. :-) See below.

... have a look at GRETA, my own regular expression template library. I have found it to be as much as 7x faster than boost regex. It's free for non-commercial use.

Does this mean that it (or something derived from it) couldn't be boostified?

Since you asked ... I fought tooth and nail to get permission to submit to Boost, but our lawyers wouldn't budge. But the winds are changing in the VC++ group -- there are some who would like to see this happen. I'd still like very much to combine the best of boost regex with the best of GRETA and make it publicly available, either in boost or with a boost-like license. We'll see.

I'd love any feedback you might have. The interface is not nearly as nice or as rich as Dr. Maddock's, but it's just about the fastest C/C++ regex library I've tested. Wide chars, backreferences, full Perl 5 syntax, etc., etc.

IMHO, it would be nice to have sort of regex light (less rich functionality, better performance) in boost, and GRETA might be a good starting point. It would be helpful if the parts of the interface common to both libraries were harmonized. Do you think this would be possible?

The interfaces are vastly different, unfortunately. I would like to put a more boost-ish interface on GRETA, because the boost interface is nicer. And it would give me perspective on regex interface issues that will be valuable during the standardization process, which I plan to be involved in. If I can find the time, I'll do it. (GRETA is a labor of love that I pursue in my spare time -- it has next to nothing to do with my job.)

Note that "harmonizing" boost regex and GRETA would be difficult because their flavors of regular expressions have slightly different semantics. Boost has POSIX-like semantics, which makes stronger guarantees about the "left-most longest" rule. GRETA has Perl-like semantics, which makes weaker guarantees, but lets me cut some corners.

FYI -- GRETA isn't "regex lite." It's a backtracking regex engine, so it's heavier-weight than, say, a DFA-based regex engine that doesn't do backreferences. It's faster than boost, but that's more because I use a different implementation, not a fundamentally different algorithm per se.

Eric
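The difference between the two semantics is easiest to see with an alternation. The sketch below uses neither Boost nor GRETA: it assumes a C++11 compiler and uses std::regex, whose ECMAScript grammar follows Perl-like "first alternative that works" rules and whose extended grammar is specified to follow the POSIX leftmost-longest rule. first_match is a hypothetical helper name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the text of the first match of `pattern` in `text` under the given
// grammar, or an empty string if there is no match.
std::string first_match(const std::string& pattern,
                        const std::string& text,
                        std::regex::flag_type grammar) {
    assert(!pattern.empty());
    const std::regex re(pattern, grammar);
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        return m.str(0);
    }
    return std::string();
}
```

Searching "ab" for "a|ab" is the classic case: under std::regex::ECMAScript the first alternative wins and the match is "a", while under std::regex::extended a conforming implementation should report the longer "ab". An engine with Perl-like rules can stop at the first alternative that succeeds; a POSIX-style engine has to consider the others as well, which is one of the corners mentioned above.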

Uwe Schmitt

14 Aug 2002

12:30 p.m.

Eric Niebler <yg-boost-users@m.gmane.org> wrote:

I'm doing some performance comparisons between Rogue Wave's Regexp class (http://www.roguewave.com/support/docs/tlsref/rwcregexp.cfm) and boost::regex using simple expressions like "^metal.*" and "[0-9]+" with regex_search(). regex is slower by a factor of 6, measuring cpu time (on a SunBlade 100 running Solaris 8). Is there anything I can do to speed it up? I tried using different values of boost::regbase::flag_type_ to adjust how the regular expressions are interpreted, but with little effect. Any other ideas? (I'm using boost_1_27_0.)

Hi,

I downloaded your software and have a question concerning the license. I'd like to build a template library for PHP and therefore need regex matching. I don't want to sell this library; I think I'll use a BSD-style licence. Am I allowed to use GRETA for this purpose?

Greetings, Uwe

--
Dr. rer. nat. Uwe Schmitt
uwe.schmitt@procoders.net
http://www.procoders.net
Computer science is no more about computers than astronomy is about telescopes. (Dijkstra)

Eric Niebler

17 Aug 2002

6:13 a.m.

Hi, I downloaded your software and have a question concerning the license. I'd like to build a template library for PHP and therefore need regex matching. I don't want to sell this library; I think I'll use a BSD-style licence. Am I allowed to use GRETA for this purpose?

Greetings, Uwe

I regret having posted any information about GRETA to this list -- it's pretty off-topic. My apologies. But to answer your question: no, the license does not allow this. (Don't blame me, I wrote the code, not the license.) You should stick with boost regex. If you are concerned about performance, I know that John Maddock is hard at work improving the performance of regex++.

If you have any more questions, please send them to me directly.

Happy grepping,
Eric Niebler
ericne@microsoft.com
http://research.microsoft.com/projects/greta
