[regex] Why does a pattern anchored at buffer end ($) also match before FF (\x0c) and CR (\x0d)?
Hello,

As per the PCRE /m modifier, $ will match the pattern before any newline. However, boost_regex also gives back matches before form feed (\x0c) and carriage return (\x0d). The same is observed for ^, which matches after them.

Could anybody please let me know what the right PCRE behaviour is? I have attached an example that demonstrates the case using boost_regex.

Regards,
Chandan

    #include <iostream>
    #include <string>
    #include <boost/regex.hpp> // Boost.Regex lib

    int main()
    {
        // Regular expression: the character '0' followed by an end-of-line anchor
        std::string sre = "\x30$";
        // String to search: control characters \x01..\x0f interleaved with printable ones
        std::string s = "\x01\x30\x02\x31\x03\x32\x04\x33\x05\x34\x06\x35\x07\x36\x08\x37\x09\x30\x0a\x39\x0b\x30\x0c\x30\x0d\x3c\x0e\x3d\x0f";

        boost::regex re;
        re.assign(sre, boost::regex_constants::perl);

        boost::sregex_iterator it(s.begin(), s.end(), re);
        boost::sregex_iterator it_end;
        for (; it != it_end; ++it)
        {
            // Print the matched text, the offsets of its first and last character, and its length
            std::cout << "Found Pattern:start:" << std::string((*it)[0].first, (*it)[0].second) << ":end:\n"
                      << "Pattern starts at :" << (*it)[0].first - s.begin()
                      << "\nPattern ends at: " << ((*it)[0].second - 1) - s.begin()
                      << "\nPattern length:" << (*it)[0].length() << std::endl;
        }
        return 0;
    }

    /**************************
    Results:
    Found Pattern:start:0:end:
    Pattern starts at :17
    Pattern ends at: 17
    Pattern length:1
    Found Pattern:start:0:end:
    Pattern starts at :21
    Pattern ends at: 21
    Pattern length:1
    Found Pattern:start:0:end:
    Pattern starts at :23
    Pattern ends at: 23
    Pattern length:1
    ****************************/
Chandan Nilange wrote:
Hello,
As per the PCRE /m modifier, $ will match the pattern before any newline. However, boost_regex also gives back matches before form feed (\x0c) and carriage return (\x0d). The same is observed for ^, which matches after them.
Yes, it regards any "vertical whitespace" as being suitable for a $ or ^ match; however, if you have a \r\n sequence, then it will *not* match in the middle of that sequence.

I do recall extending the list of characters that will form a line boundary recently, but not whether it was for Perl or Unicode compatibility :-(

Ah, here we are: http://unicode.org/unicode/reports/tr18/#Line_Boundaries

Remember that all the sequences given really do indicate a line boundary in most cases.

John.
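The rule described above can be illustrated without Boost. The following is only a rough sketch, not Boost.Regex's actual implementation: the helper names (is_separator, at_line_end, find_char_before_line_end) are hypothetical, and the separator set is assumed to be just LF, FF and CR, since those are the characters the posted results show. The real list in Boost.Regex or in the Unicode report may well be longer. Only the $ case is covered.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// ASSUMPTION: only LF, FF and CR count as line separators here, because
// those are the characters the posted results show matches in front of.
bool is_separator(char c)
{
    return c == '\n' || c == '\f' || c == '\r';
}

// True if a multi-line '$' could match at offset pos of s under the rule
// described in the thread: at the end of the buffer, or just before a
// separator, but never between the CR and the LF of a "\r\n" pair.
bool at_line_end(const std::string& s, std::size_t pos)
{
    if (pos >= s.size())
        return pos == s.size();
    if (!is_separator(s[pos]))
        return false;
    if (s[pos] == '\n' && pos > 0 && s[pos - 1] == '\r')
        return false; // in the middle of "\r\n"
    return true;
}

// Offsets of every occurrence of ch that is immediately followed by a
// line end, i.e. a hand-written stand-in for the pattern "ch$".
std::vector<std::size_t> find_char_before_line_end(const std::string& s, char ch)
{
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (s[i] == ch && at_line_end(s, i + 1))
            hits.push_back(i);
    return hits;
}
```

Calling find_char_before_line_end(s, '0') on the string from the original post gives the offsets 17, 21 and 23, the same three positions reported there: the '0' before \x0a, the one before \x0c, and the one before \x0d.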
Hello John,
I am not very clear.
So do you mean that $ and ^ matching the pattern before/after any FF (\x0c) or CR (\x0d) is the right PCRE /m modifier behaviour?
Regards,
Chandan
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Chandan Nilange wrote:
Hello John,
I am not very clear.
So do you mean that $ and ^ matching the pattern before/after any FF (\x0c) or CR (\x0d) is the right PCRE /m modifier behaviour?
I believe so, but deciding what the right behaviour should be is not necessarily easy: Perl has the luxury that it controls file IO as well as regexes; in contrast, Boost.Regex is designed to accept text from all kinds of sources and platforms.

Most applications will in any case regard an isolated \r or \n as a line break, as well as \r\n. Likewise, in what sense does a form-feed *not* start a new line? :-)

HTH,
John.
Hi,

I have followed the instructions at http://blogs.sun.com/sga/entry/boost_mini_howto but I am not able to run a Boost program on Solaris. I have "Sun WorkShop Compilers C/C++ 5.0" on OS version 5.8, and I am using Boost version 1.34.

Also, please let me know whether installing bjam is necessary. Can't we simply include the headers and use Boost? I also have a problem installing bjam in this case.

Kindly help.

Thanks,
Abhishek Vyas
Tata Consultancy Services
participants (3)
- Abhishek V
- Chandan Nilange
- John Maddock