Standard C++ XML parser API (Boost.XML)
Hey guys,
As part of this year's GSoC, I am interested in developing a standard C++ XML parser for Boost (the Boost.XML project on the Ideas page).
From what I have seen so far, there are many libraries that can parse XML files, but there is no standard XML API in C++ for doing so, unlike in languages such as Java or Python. I think this is an appropriate place to discuss such an API and then develop it.
Over the past couple of days I have been reading the existing Boost.XML (sandbox project) and the Arabica project. I now think there is no need to develop such an API from scratch, which would be a waste of our resources; instead we could build on the existing code.
We can take the multiple-backend support from the Arabica API and merge it with our existing Boost.XML to make such a standard API, so that it can be adapted to a wide range of existing XML libraries.
Also, the SAX implementation of Boost.XML (i.e. the reader API) has to be modified, using Arabica::SAX as a point of comparison.
The DOM API is working fine.
So, as I see it, the possible tasks are:
1. To enhance or change the SAX implementation of Boost.XML
2. To provide multiple backends to this API.
But for now, I want to know what you guys think about the idea.
Thanks, Saksham Maheshwari
PS: I want to develop an API, not a library :-)
On 2014-03-08 15:03, Saksham Maheshwari wrote:
Hey guys,
As part of this year's GSOC, I am interested in developing a Standard c++ XML parser for Boost (Boost.XML project in Ideas Page).
As indicated on that page, I would be happy to mentor such a project.
But for now, I want to know what you guys think about the idea.
Just for the record: I have been collaborating with Saksham over the last couple of weeks to refine ideas that could be cast into a formal project submission. It would be really great if someone else would provide feedback, too.
Regards, Stefan
--
...ich hab' noch einen Koffer in Berlin...
On 03/17/2014 10:03 PM, Stefan Seefeld wrote:
Just for the record: I have been collaborating with Saksham over the last couple of weeks to refine ideas that could be cast into a formal project submission.
Where is the latest proposal?
It would be really great if someone else would provide feedback, too.
I would like to see how the proposal relates to other Boost libraries. For instance, how well does it integrate with Boost.Serialization or Boost.Fusion? Can it be used to replace the XML parser inside Boost.PropertyTree?
The remaining comments are related to the GitHub code, as I suspect that you want it to be used in the GSoC project:
https://github.com/stefanseefeld/boost.xml
The code could be made easier to understand with documentation.
Will it be possible to output streaming XML? (xmlTextWriter)
The DOM and Reader parsers assume that input is in a file. What if I want to process a buffer in memory?
What is the purpose of the S template argument?
What is the purpose of the convert trait?
How are different XML encodings handled?
token_base::get_token() returns information about the current token, but it seems to be invalidated (or updated to the new token) after calling parser::next(). Is that correct?
On 03/18/2014 11:19 AM, Bjorn Reese wrote:
On 03/17/2014 10:03 PM, Stefan Seefeld wrote:
Just for the record: I have been collaborating with Saksham over the last couple of weeks to refine ideas that could be cast into a formal project submission.
Where is the latest proposal?
I believe Saksham is working on a formal submission right now.
It would be really great if someone else would provide feedback, too.
I would like to see how the proposal relates to other Boost libraries. For instance, how well does it integrate with Boost.Serialization or Boost.Fusion? Can it be used to replace the XML parser inside Boost.PropertyTree?
The idea is to provide complete but modular XML APIs. Complete in the sense that it can handle any well-formed XML, and modular in the sense that orthogonal pieces of functionality are kept separate so users can pull in only the pieces they need. I don't see any reason why such an XML API wouldn't be usable by other Boost libraries.
The remaining comments are related to the GitHub code, as I suspect that you want it to be used in the GSoC project:
I suggest you look at that as a proof-of-concept, not something finished. In fact, when I originally wrote that library, I used libxml2 as its only backend. However, doing that bears the risk that the entire API will be closely tied to the idiosyncrasies of that one backend, so adding support to more backends (such as Xerces) will help validate that the API is in fact backend-agnostic.
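A minimal sketch of that validation idea, under stated assumptions: find_backend, scan_backend, document and backends_agree are hypothetical names invented for illustration, not part of Boost.XML, and the two "backends" are toy stand-ins for real wrappers around libxml2 or Xerces. The user-facing class is written once against a small backend contract, and the same check is then instantiated with each backend.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical backend #1: locates the root tag name with std::string searches.
// Toy code: it does not handle XML declarations, comments or malformed input.
struct find_backend
{
  static std::string root_name(std::string const &xml)
  {
    std::size_t open = xml.find('<');
    if (open == std::string::npos) return std::string();
    std::size_t end = xml.find_first_of(" \t\r\n/>", open + 1);
    if (end == std::string::npos) return std::string();
    return xml.substr(open + 1, end - open - 1);
  }
};

// Hypothetical backend #2: same contract, different implementation (manual scan).
struct scan_backend
{
  static std::string root_name(std::string const &xml)
  {
    std::string name;
    std::size_t i = 0;
    while (i < xml.size() && xml[i] != '<') ++i; // skip to the first '<'
    for (++i; i < xml.size(); ++i)
    {
      char c = xml[i];
      if (c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '/' || c == '>')
        break;
      name += c;
    }
    return name;
  }
};

// User-facing class, written once against the backend contract above.
template <typename Backend>
class document
{
public:
  explicit document(std::string xml) : xml_(std::move(xml)) {}
  std::string root_name() const { return Backend::root_name(xml_); }
private:
  std::string xml_;
};

// The same user code instantiated with each backend must give the same answer.
bool backends_agree(std::string const &xml)
{
  return document<find_backend>(xml).root_name()
      == document<scan_backend>(xml).root_name();
}
```

Any place where the two instantiations disagree, or where one backend cannot implement the contract at all, points at a backend idiosyncrasy that has leaked into the API.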
The code could be made easier to understand with documentation.
Agreed.
Will it be possible to output streaming XML? (xmlTextWriter)
That's a nice idea.
The DOM and Reader parsers assume that input is in a file. What if I want to process a buffer in memory?
Right, that should be possible. (I know libxml2 supports that, so at least for that backend it seems trivial to add the missing wrapper.)
What is the purpose of the S template argument?
To keep the concern for Unicode or any other string type orthogonal to the XML library, i.e. to allow Boost.XML to interact with different Unicode implementations. In the existing demos I'm restricting content to ASCII, so I can get away with using std::string. This is a good example of the "modularity" design goal I mentioned above: don't force anything on users they don't actually need.
What is the purpose of the convert trait?
To allow conversion between the backend's own string representation and the string type that is used with Boost.XML.
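A rough sketch of how a string parameter S and a convert trait could fit together. Only the names S, convert and convert::in() appear in this thread; backend_char, convert::out() and the element class are assumptions made for illustration, and the actual signatures in the Boost.XML code may differ. backend_char stands in for a backend character type (libxml2's xmlChar is an unsigned char).

```cpp
#include <cassert>
#include <string>

// Stand-in for a backend character type (assumption, modeled on xmlChar).
typedef unsigned char backend_char;

// Primary template left undefined: each string type S needs its own specialization.
template <typename S> struct convert;

// Specialization for std::string: bytes are passed through unchanged, so the
// std::string simply holds whatever encoding the backend hands out.
template <>
struct convert<std::string>
{
  // backend -> user string (copies the NUL-terminated buffer)
  static std::string in(backend_char const *s)
  {
    return s ? std::string(reinterpret_cast<char const *>(s)) : std::string();
  }
  // user string -> backend view (hypothetical); valid only while 's' is alive
  static backend_char const *out(std::string const &s)
  {
    return reinterpret_cast<backend_char const *>(s.c_str());
  }
};

// Toy element wrapper parameterized on the string type S (hypothetical).
template <typename S>
class element
{
public:
  explicit element(backend_char const *name) : name_(name) {}
  // All string traffic crosses the backend boundary through convert<S>.
  S name() const { return convert<S>::in(name_); }
private:
  backend_char const *name_; // storage owned by the backend, not by this wrapper
};
```

Supporting another string class would then mean adding one more convert specialization, without touching element itself.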
How are different XML encodings handled?
Can you be more specific? I suspect the answer is that this is handled by the Unicode component to which Boost.XML gets bound by means of the string template parameter.
token_base::get_token() returns information about the current token, but it seems to be invalidated (or updated to the new token) after calling parser::next(). Is that correct?
Yes.
Stefan
--
...ich hab' noch einen Koffer in Berlin...
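The practical consequence for callers can be shown with a toy pull parser. This is only an illustration of the "updated to the new token" behaviour: word_parser is a hypothetical class that splits on spaces, and it borrows nothing from Boost.XML except the member names next() and get_token().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Toy pull parser: yields space-separated words as "tokens".
class word_parser
{
public:
  explicit word_parser(std::string input) : input_(std::move(input)), pos_(0) {}

  // Advances to the next token and returns false at end of input.
  // Overwrites the storage that get_token() refers to.
  bool next()
  {
    while (pos_ < input_.size() && input_[pos_] == ' ') ++pos_;
    if (pos_ == input_.size()) return false;
    std::size_t end = input_.find(' ', pos_);
    if (end == std::string::npos) end = input_.size();
    current_.assign(input_, pos_, end - pos_);
    pos_ = end;
    return true;
  }

  // Reference to the current token; its contents change when next() succeeds.
  std::string const &get_token() const { return current_; }

private:
  std::string input_;
  std::size_t pos_;
  std::string current_;
};
```

A caller that needs a token beyond the next call to next() has to copy it (std::string copy = p.get_token();) rather than hold on to the reference.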
On 03/18/2014 04:46 PM, Stefan Seefeld wrote:
I don't see any reason why such an XML API wouldn't be usable by other Boost libraries.
It should be part of the GSoC project to verify this for the most common use cases (XML serialization is the most obvious one.)
What is the purpose of the S template argument?
To keep the concern for Unicode or any other string type orthogonal to the XML library, i.e. to allow Boost.XML to interact with different Unicode implementations. In the existing demos I'm restricting content to ASCII, so I can get away with using std::string. This is a good example of the "modularity" design goal I mentioned above: don't force anything on users they don't actually need.
I agree with the goal, but I am not sure that the S type solves the problem. I must admit that I am having difficulty understanding exactly how you envision it should work for other encodings, because std::string is orthogonal to encoding (locale is usually attached to the I/O stream.)
What encoding is used for std::string? ASCII, UTF-8, or "whatever the XML library gives me"? This should be documented as part of the API regardless of the answer.
Should I define a new string type if I want to use Latin-1 or another encoding in my application? What if the rest of my application uses std::string for Latin-1 encodings? (I am wondering how this will work with the current convert trait specialization for std::string.)
How does the convert trait know the XML document encoding so that it is able to convert between this and the application encoding?
I suggest that you adopt the libxml2 design decision to always use UTF-8 for std::string (and UTF-16 for std::wstring if needed.) See the design rationale here:
http://xmlsoft.org/encoding.html
Any backend that does not provide UTF-8 will have to be wrapped.
With such a design decision, the S template parameter becomes superfluous (or should be changed to CharT if you wish to support both std::string and std::wstring.)
Conversion between UTF-8 and application encodings would have to be done explicitly in the application.
At any rate, encoding should be addressed in the GSoC project.
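To make the "explicit conversion in the application" point concrete, here is a minimal sketch of what a Latin-1 application would do before handing text to an API that always expects UTF-8 in std::string. The function name latin1_to_utf8 is an assumption for illustration; it is not part of Boost.XML or libxml2.

```cpp
#include <cassert>
#include <string>

// Converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// Latin-1 covers U+0000..U+00FF, so every input byte becomes one or two bytes.
std::string latin1_to_utf8(std::string const &in)
{
  std::string out;
  out.reserve(in.size());
  for (std::string::size_type i = 0; i < in.size(); ++i)
  {
    unsigned char c = static_cast<unsigned char>(in[i]);
    if (c < 0x80)
    {
      out += static_cast<char>(c);                 // ASCII passes through
    }
    else
    {
      out += static_cast<char>(0xC0 | (c >> 6));   // lead byte: 110000xx
      out += static_cast<char>(0x80 | (c & 0x3F)); // continuation: 10xxxxxx
    }
  }
  return out;
}
```

Under this design both the Latin-1 input and the UTF-8 result are plain std::string objects, which is why the encoding has to be a documented convention of the API rather than something the string type can express.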
What is the purpose of the convert trait?
To allow conversion between the backend's own string representation and the string type that is used with Boost.XML.
Ok. You should, however, make sure that the strings are converted correctly:
http://xmlsoft.org/html/libxml-xmlstring.html
For instance, convert::in() does not take libxml2 custom allocators into account:
http://xmlsoft.org/html/libxml-xmlmemory.html
On 03/20/2014 04:34 AM, Bjorn Reese wrote:
On 03/18/2014 04:46 PM, Stefan Seefeld wrote:
I don't see any reason why such an XML API wouldn't be usable by other Boost libraries.
It should be part of the GSoC project to verify this for the most common use cases (XML serialization is the most obvious one.)
I don't entirely understand your point. The goal is to define an XML API, and implement it, which complies with all related standards. As long as the existing Boost components (e.g. Boost.Serialization) work with standard XML tools, we should be compatible. I don't think, however, that we should be constrained to be API-compatible with existing tools, as otherwise the whole exercise to define a new API would be pointless. On the other hand, making minor adjustments to those libraries to work with Boost.XML would be fine. I just don't think we should make this part of the proposal, as it isn't even clear what existing Boost components would be affected, whether they are actively maintained / developed, etc.
What is the purpose of the S template argument?
To keep the concern for Unicode or any other string type orthogonal to the XML library, i.e. to allow Boost.XML to interact with different Unicode implementations. In the existing demos I'm restricting content to ASCII, so I can get away with using std::string. This is a good example of the "modularity" design goal I mentioned above: don't force anything on users they don't actually need.
I agree with the goal, but I am not sure that the S type solves the problem. I must admit that I am having difficulty understanding exactly how you envision it should work for other encodings, because std::string is orthogonal to encoding (locale is usually attached to the I/O stream.)
You are right, encoding and string type are (mostly) orthogonal. I have never said anything else. :-)
What encoding is used for std::string? ASCII, UTF-8, or "whatever the XML library gives me"? This should be documented as part of the API regardless of the answer.
Yes.
Should I define a new string type if I want to use Latin-1 or another encoding in my application? What if the rest of my application uses std::string for Latin-1 encodings? (I am wondering how this will work with the current convert trait specialization for std::string.)
How does the convert trait know the XML document encoding so that it is able to convert between this and the application encoding?
I suggest that you adopt the libxml2 design decision to always use UTF-8 for std::string (and UTF-16 for std::wstring if needed.) See the design rationale here:
http://xmlsoft.org/encoding.html
Any backend that does not provide UTF-8 will have to be wrapped.
With such a design decision, the S template parameter becomes superfluous (or should be changed to CharT if you wish to support both std::string and std::wstring.)
Conversion between UTF-8 and application encodings would have to be done explicitly in the application.
At any rate, encoding should be addressed in the GSoC project.
I agree, and this is in fact part of the proposal. To be specific, one of the first steps is to add tests that instantiate the XML classes with existing Unicode string classes (such as Glib::ustring or Qt's QString), and demonstrate how to use them.
What is the purpose of the convert trait?
To allow conversion between the backend's own string representation and the string type that is used with Boost.XML.
Ok. You should, however, make sure that the strings are converted correctly:
http://xmlsoft.org/html/libxml-xmlstring.html
For instance, convert::in() does not take libxml2 custom allocators into account:
Good point. As I said, the existing Boost.XML was meant to be a proof-of-concept.
Thanks for your feedback,
Stefan
--
...ich hab' noch einen Koffer in Berlin...
On 17/03/2014 22:03, Stefan Seefeld wrote:
Just for the record: I have been collaborating with Saksham over the last couple of weeks to refine ideas that could be cast into a formal project submission.
It would be really great if someone else would provide feedback, too.
I don't know how useful this might prove, but you may want to take a look at http://xeumeuleu.sourceforge.net/
Regards, MAT.
@Manish: there is nothing wrong with the Xerces parser, but it cannot serve as the standard API, because code written against Xerces cannot be built against, say, libxml2, i.e. it does not support multiple backends. Such code depends directly on Xerces, which limits its usefulness.
@MAT: the same goes for xeumeuleu: it has many important features, but it still depends on Apache Xerces.
What I want to do is define or change the base interface of Boost.XML so that it can support multiple backends. Any feedback?
Thanks
Hello everyone,
Kindly find my proposal for Boost.XML here:
https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/...
I think my proposal is ready for first submission and I'm eager to refine it further. I would like your valuable comments on it. Please tell me what I should do/add to make it a strong proposal.
Thanks, Saksham
This is the public URL:
http://www.google-melange.com/gsoc/proposal/public/google/gsoc2014/sam_1993/...
Thanks
What's wrong with xerces XML parsers?
-Manish
participants (5): Bjorn Reese, Mathieu Champlon, munsingh, Saksham Maheshwari, Stefan Seefeld