
This is both a regex syntax question and a Boost question, so here goes...

I've got the following code to strip out all <a>, <frame>, and <iframe> tags from a webpage and parse them for their href or src attributes (yes, I realize that it can potentially grab an <a src=""> or an <iframe href="">, but that's OK for this project). Surprise, surprise: it doesn't work quite as I'd hoped, and I was wondering if you could help me ascertain the problem.

pageSource is a pointer to a string containing the source of the page. The project specifications allow the attribute value to be written with single quotes, double quotes, or no quotes around the actual URL. The code correctly finds each tag and attribute, but it grabs the URL and also the "> that follows it. How can I get rid of the closing ">?

    void Page::parseLinks() {
        boost::regex linkTagRegex("(?i)<(a|i?frame)[^>]*>");
        boost::regex linkRegex("(?i)(href|src)\\s*?=[\\w]*?([\\W]*?)[\\w]+?");
        boost::sregex_token_iterator p(pageSource->begin(), pageSource->end(), linkTagRegex, 0);
        boost::sregex_token_iterator end;
        for (; p != end; p++) {
            string tag(p->first, p->second);
            boost::cmatch matches;
            if (boost::regex_search(tag.c_str(), matches, linkRegex)) {
                string * newLink = new string(matches[2].first);
                URL * foundLink = new URL(newLink);
                delete newLink;
                foundLink->resolveWithRespectTo(pageURL);
                foundLinks->add(foundLink);
            }
        }
    }

Thanks!
Dave
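For reference, here is a minimal sketch of one way the trailing "> could be avoided. Two things in the code above look relevant. First, `new string(matches[2].first)` builds the string from a bare `const char*`, so it runs to the end of the tag instead of stopping at `matches[2].second`. Second, group 2 is `[\W]*?`, which captures the quote characters rather than the URL. The sketch uses `std::regex` instead of Boost so that it compiles on its own; `std::regex` has no inline `(?i)`, so the `icase` flag is used, and the patterns are assumed to carry over to `boost::regex` unchanged. The function name `extractLinks` is hypothetical, and the `\b` after the tag name is an addition (it keeps `<abbr>` from matching as `<a>`).

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper, not part of the original Page class: returns the
// href/src values found in <a>, <frame>, and <iframe> tags.
std::vector<std::string> extractLinks(const std::string& pageSource) {
    // Tag pattern from the question, plus \b so that <abbr> is not taken for <a>.
    static const std::regex linkTagRegex(R"re(<(a|i?frame)\b[^>]*>)re",
                                         std::regex::icase);
    // One alternative per quoting style, each with its own capture group:
    //   group 2: double-quoted, group 3: single-quoted, group 4: unquoted.
    static const std::regex linkRegex(
        R"re((href|src)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)))re",
        std::regex::icase);

    std::vector<std::string> links;
    std::sregex_iterator it(pageSource.begin(), pageSource.end(), linkTagRegex);
    const std::sregex_iterator end;
    for (; it != end; ++it) {
        const std::string tag = it->str();
        std::smatch m;
        if (!std::regex_search(tag, m, linkRegex)) {
            continue;  // tag has no href/src attribute
        }
        // Copy only the matched group; str() respects both ends of the
        // sub-match, so the closing quote and '>' are left out.
        for (std::size_t g = 2; g <= 4; ++g) {
            if (m[g].matched) {
                links.push_back(m[g].str());
                break;
            }
        }
    }
    return links;
}
```

Calling `extractLinks` on a page containing `<a href="http://example.com/a">`, `<IFRAME SRC='b.html'>`, and `<frame src=c.html>` should yield the three bare URLs. In the Boost version, the equivalent of `m[g].str()` would be `matches[g].str()` or `string(matches[g].first, matches[g].second)`.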