An invalid XML character (Unicode: 0x8) problem because of property_tree::xml_parser::write_xml
Hi,

I have used the following C++ code to generate the XML:

    boost::property_tree::ptree ptResponse;
    // Populate the tree from the Microsoft Outlook contacts
    std::stringstream buf;
    const std::string enc("utf-8");
    boost::property_tree::xml_writer_settings<char> settings(' ', 0, enc);
    boost::property_tree::xml_parser::write_xml(buf, ptResponse, settings);

This works fine. But on one customer's machine, when reading this XML content in a Java program, I get the following error:

    An invalid XML character (Unicode: 0x8) was found in the element content of the document.

Any help in solving this is appreciated.

Regards,
Rohan
On 02/03/2015 05:59, Rohan Shetty wrote:
> [...] on one customer's machine, when reading this XML content in a Java
> program, I get the following error:
> An invalid XML character (Unicode: 0x8) was found in the element content
> of the document.
> Any help in solving this is appreciated.
I don't understand, the error message is quite explicit: your data isn't utf-8 even though you said it was. What were you expecting to happen? Also this would probably be more suited to the boost-users mailing list.
Hi Mathias,
Thanks for your response.
I was expecting write_xml (with "utf-8") to escape markup characters (e.g. < replaced with &lt;) or strip any invalid characters (i.e. anything other than #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]).
Is this part of write_xml()?
Do let me know if this is not clear.
Regards,
Rohan
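The character range listed in the message above is the Char production from the XML 1.0 specification. As a minimal sketch of what a validity check against that range could look like (the helper name is_xml10_char is hypothetical and is not part of Boost.PropertyTree), assuming the text has already been decoded into Unicode code points:

```cpp
// Hypothetical helper, not a Boost API: returns true if the code point
// `cp` matches the XML 1.0 Char production
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
inline bool is_xml10_char(char32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

With this check, U+0008 (the backspace character named in the Java error message) is rejected, while tab, line feed and carriage return are accepted.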
This mailing-list uses bottom- and inline-posting, please lay out your responses accordingly.

On 03/03/2015 04:11, Rohan Shetty wrote:
> I was expecting write_xml (with "utf-8") to escape markup characters
> (e.g. < replaced with &lt;) or strip any invalid characters (i.e. anything
> other than #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> [#x10000-#x10FFFF]).
> Is this part of write_xml()?
It is not reasonable to expect that the write_xml function would silently drop data by default. If you want invalid data to be removed, you'll have to do this yourself prior to calling the function.

This signature of write_xml doesn't actually do anything encoding-wise: it outputs your data as-is, and marks the data as being in the encoding you specified.

It might be more sensible to set up the encoding correctly though, or to convert your data to the right encoding. There is another overload of write_xml that can imbue a locale when writing the data, which can be used for transparent transcoding.
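To illustrate the "do this yourself prior to calling the function" suggestion, here is a minimal sketch that uses only the standard library. The helper names are hypothetical and not part of Boost. It assumes the input string is already valid UTF-8 and handles only the C0 control characters (such as 0x8) that XML 1.0 forbids. It does not check for surrogates, U+FFFE or U+FFFF.

```cpp
#include <string>

// True for single bytes that XML 1.0 forbids in content: C0 control
// characters other than tab (0x09), line feed (0x0A) and CR (0x0D).
inline bool is_forbidden_xml_control(unsigned char c)
{
    return c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
}

// Hypothetical helper: returns a copy of `in` without forbidden control
// bytes. Assumption: `in` is valid UTF-8. In UTF-8 every byte below 0x80
// is a complete character, so multi-byte sequences are left untouched.
std::string strip_invalid_xml_chars(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (!is_forbidden_xml_control(c))
            out.push_back(static_cast<char>(c));
    }
    return out;
}
```

One possible use would be to pass each contact value through strip_invalid_xml_chars before storing it in the ptree, so that write_xml never sees the offending bytes. Whether dropping the characters or rejecting the input is the right policy depends on the application.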
On 03/03/2015 5:28 PM, Mathias Gaunard wrote:
> [...]
> There is another overload of write_xml that can imbue a locale when
> writing the data, which can be used for transparent transcoding.

Thanks Mathias.
On 03/03/2015 04:11 AM, Rohan Shetty wrote:
> I was expecting write_xml (with "utf-8") to escape markup characters
> (e.g. < replaced with &lt;) or strip any invalid characters (i.e. anything
> other than #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> [#x10000-#x10FFFF]).
> Is this part of write_xml()?
Please read the documentation:

"RapidXML does not fully support the XML standard; it is not capable of parsing DTDs and therefore cannot do full entity substitution.

[...]

Please note that RapidXML does not understand the encoding specification. If you pass it a character buffer, it assumes the data is already correctly encoded; if you pass it a filename, it will read the file using the character conversion of the locale you give it (or the global locale if you give it none). This means that, in order to parse a UTF-8-encoded XML file into a wptree, you have to supply an alternate locale, either directly or by replacing the global one."

http://www.boost.org/doc/html/boost_propertytree/parsers.html
On 03/03/2015 5:48 PM, Bjorn Reese wrote:
> Please read the documentation:
> [...]
> http://www.boost.org/doc/html/boost_propertytree/parsers.html

Thanks Bjorn.