Re: [Boost-users] [boost.serialization] How does boost.serialization deal with BOM
Thanks for the quick response. BOM is Windows specific. In my opinion, the BOM is not really related to how you encode the text (and thus not related to utf8_codecvt_facet), but to how you mark what your encoding is, so that any text editor gets a hint about how to handle the text. It is implemented by inserting a few bytes at the very beginning of the file, bytes which are never used in the chosen encoding of the text that follows. In the case of UTF-8, "EF BB BF" is used; in the UTF-8 encoding table, "EF BB BF" should correspond to no character (I did not check, this is just a guess).

As this relates to general text files, not specifically to XML files, basic_text_iarchive might be a better place to address the issue. I am thinking that just detecting "EF BB BF" and discarding those bytes if they exist would solve the issue. But I am not sure which method needs to be overridden; can you please advise?

Thanks, tom
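On the guess about "EF BB BF": those three bytes are the UTF-8 encoding of the code point U+FEFF, so they do decode to a character rather than to nothing. The sketch below checks this with a hand-written encoder; encode_utf8 is a hypothetical helper written for this illustration, not part of Boost or of utf8_codecvt_facet, and it does no validation of its input.

```cpp
#include <cstdint>
#include <string>

// Illustrative helper (an assumption for this example, not a Boost API):
// encode one Unicode code point as UTF-8. No validation is performed.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encode_utf8(0xFEFF) yields the bytes EF BB BF. This means a reader that does not strip the BOM will see one extra character ahead of the first real one in the file.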
This is news to me.
The wide character text/xml archives use UTF-8. They do this by creating a stream with the utf8_codecvt_facet. I used this facet, it worked great, and I moved on. So you're way ahead of me on this.
This would probably be easy to address in the xml_iarchive code or perhaps the xml_grammar - but, as I said, I don't know anything about it.
Robert Ramey
Tan, Tom (Shanghai) wrote:
what is BOM?
Probably "Byte Order Mark", see http://en.wikipedia.org/wiki/Byte-order_mark
Yes, that's what I meant.
I was testing demo_xml_load.cpp and demo_xml_save.cpp from the boost.serialization examples. After simply opening the demo_save.xml produced by demo_xml_save.exe with XML Copy Editor (http://xml-copy-editor.sourceforge.net/) and saving it back, demo_xml_load.exe would crash. I compared the two files with WinMerge, which reported them as identical.
By studying the hex view, I later found that this is because the 3-byte UTF-8 BOM had been inserted at the beginning of the file. It does not change the data, and in many cases it is ignored by text editors.
I think that Boost.serialization should also handle this for all text files, including XML.
Tom
basic_text_iprimitive ..
This is used by all text archives including xml
Robert Ramey
As it's related to general text files, not specific to xml files, basic_text_iarchive might be a better place to address the issue.
I am thinking just detecting "EF BB BF" and discarding them if they exist would solve the issue.
Robert Ramey wrote:
basic_text_iprimitive ..
This is used by all text archives including xml
Robert Ramey
As it's related to general text files, not specific to xml files, basic_text_iarchive might be a better place to address the issue.
I am thinking just detecting "EF BB BF" and discarding them if they exist would solve the issue.
The BOM is not Windows-specific. See this link for an explanation.
participants (3)
- Kurt Kohler
- Robert Ramey
- Tan, Tom (Shanghai)