Re: [Boost-users] help extracting TAG with boost::regex

13 Mar 2007

      llwaeva@21cn.com wrote:
...
hi there,
 I am working with a TAG-oriented text with boost::regex. For example,
the following pattern might occur in the text
<before> <pre><p>Some Text</p></pre> <after> <pre> ddd </pre>
In this case, I would like to extract everything between <pre> </pre>.
Meanwhile, everything outside <pre> </pre> should be unchanged except
that < is replaced by &lt; and > is replaced by &gt;
For that purpose, I tried the following code
I don't see anything obviously wrong at a quick glance, except that
\s* should be written \\s* inside a C++ string literal.

If that doesn't fix things, post a self-contained test case and I'll take a
look.
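
To illustrate the escaping point, here is a minimal sketch. It uses std::regex rather than boost::regex purely so that it compiles without Boost installed; the string-literal rules are the same for both. The function names are made up for this example and are not from the original code, which was not included in the post.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `text` contains an opening <pre ...> tag,
// allowing whitespace after the '<'.
bool has_pre_open_tag(const std::string& text) {
    // In an ordinary string literal every regex backslash is written twice,
    // so the regex engine receives <\s*pre[^>]*>
    static const std::regex escaped("<\\s*pre[^>]*>");
    return std::regex_search(text, escaped);
}

// Same check, written with a C++11 raw string literal, which passes the
// backslash through unchanged and so needs no doubling.
bool has_pre_open_tag_raw(const std::string& text) {
    static const std::regex raw(R"(<\s*pre[^>]*>)");
    return std::regex_search(text, raw);
}
```

Both functions build the same expression. Writing "\s*" with a single backslash in an ordinary literal is an unrecognised escape sequence, so the compiler will typically warn and the regex engine will not see the backslash at all.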
...
In a far more complicated case, a nested <pre></pre> might occur as
follows
<before> <pre><pre><p>Some Text</p></pre></pre> <after>  <pre> ddd
</pre>
For this case, I only want to handle the outermost <pre></pre> and
keep everything inside it unchanged, i.e., the inner <pre></pre> will
be extracted as common text.
Hmmmm, traditional regexes don't handle that all that well. How deep will
the nesting go?  You can handle a finite number of nested occurrences using
something like:

<\s*pre[^>]*>(<\s*pre[^>]*>.*?</\s*pre\s*>|.)*?</\s*pre\s*>

and so on, but remember to double those \'s if you embed this in a C++ 
string.
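
As a rough sketch of how the expression above could be applied to the original task, the code below keeps each outermost <pre> block verbatim and escapes < and > everywhere else. It is only an illustration under some assumptions: it uses std::regex in place of boost::regex so that it builds without Boost, it handles at most one level of nesting (the limit of the pattern as written), and the helper names are hypothetical. Note also that in std::regex the '.' does not match line terminators, so a <pre> block spanning several lines would need extra handling.

```cpp
#include <regex>
#include <string>
#include <vector>

// The pattern from the message above, as a raw string literal so the
// backslashes need no doubling. Handles one level of nested <pre>.
static const std::regex kOuterPre(
    R"(<\s*pre[^>]*>(<\s*pre[^>]*>.*?</\s*pre\s*>|.)*?</\s*pre\s*>)");

// Hypothetical helper: replaces '<' and '>' with HTML entities.
std::string escape_angle_brackets(const std::string& text) {
    std::string out;
    for (char c : text) {
        if (c == '<')      out += "&lt;";
        else if (c == '>') out += "&gt;";
        else               out += c;
    }
    return out;
}

// Hypothetical helper: collects each outermost <pre>...</pre> block.
std::vector<std::string> outermost_pre_blocks(const std::string& text) {
    std::vector<std::string> blocks;
    for (std::sregex_iterator it(text.begin(), text.end(), kOuterPre), end;
         it != end; ++it) {
        blocks.push_back(it->str());
    }
    return blocks;
}

// Hypothetical helper: copies outermost <pre> blocks unchanged and escapes
// the angle brackets in everything between them.
std::string escape_outside_pre(const std::string& text) {
    std::string out;
    auto tail = text.begin();  // start of the text not yet copied
    for (std::sregex_iterator it(text.begin(), text.end(), kOuterPre), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out += escape_angle_brackets(std::string(tail, m[0].first));
        out += m.str();        // the <pre> block itself, untouched
        tail = m[0].second;
    }
    out += escape_angle_brackets(std::string(tail, text.end()));
    return out;
}
```

Deeper nesting would need one more nested alternative in the pattern per extra level. If the depth is not bounded, a small hand-written scanner that counts open and close tags is probably a better fit than a regex.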

John.

John Maddock