Standard C++ XML parser API (Boost.XML)

Saksham Maheshwari

8 Mar 2014

8:03 p.m.

Hey guys,

As part of this year's GSoC, I am interested in developing a standard C++ XML parser for Boost (the Boost.XML project on the Ideas page).

From what I have seen so far, there are many libraries that can parse XML files, but C++ has no standard XML API for doing so, unlike languages such as Java or Python. I think this list is an appropriate place to discuss such an API and then develop it.

For the past couple of days, I have been reading the existing Boost.XML (sandbox project) and the Arabica project. I now think there is no need to develop such an API from scratch, which would waste our resources; instead, we could build it on the existing code.

We can take the multiple-backend support of the Arabica API and merge it with our existing Boost.XML to make a standard API that can be adapted to a wide range of existing XML libraries.

Also, the SAX implementation of Boost.XML (i.e. the reader API) has to be changed, using Arabica::SAX as a point of comparison.

The DOM API works fine.

So, in my view, the possible tasks are:
1. To enhance or change the SAX implementation of Boost.XML
2. To provide multiple backends for this API

But for now, I want to know what you guys think about the idea.

Thanks,
Saksham Maheshwari

PS: I want to develop an API, not a library :-)

Stefan Seefeld

17 Mar 2014

9:03 p.m.

On 2014-03-08 15:03, Saksham Maheshwari wrote:

...

Hey guys,

As part of this year's GSoC, I am interested in developing a standard C++ XML parser for Boost (the Boost.XML project on the Ideas page).

As indicated on that page, I would be happy to mentor such a project.

...

From what I have seen so far, there are many libraries that can parse XML files, but C++ has no standard XML API for doing so, unlike languages such as Java or Python. I think this list is an appropriate place to discuss such an API and then develop it.

For the past couple of days, I have been reading the existing Boost.XML (sandbox project) and the Arabica project. I now think there is no need to develop such an API from scratch, which would waste our resources; instead, we could build it on the existing code.

We can take the multiple-backend support of the Arabica API and merge it with our existing Boost.XML to make a standard API that can be adapted to a wide range of existing XML libraries.

Also, the SAX implementation of Boost.XML (i.e. the reader API) has to be changed, using Arabica::SAX as a point of comparison.

The DOM API works fine.

So, in my view, the possible tasks are:
1. To enhance or change the SAX implementation of Boost.XML
2. To provide multiple backends for this API

But for now, I want to know what you guys think about the idea.

Just for the record: I have been collaborating with Saksham over the last couple of weeks to refine ideas that could be cast into a formal project submission.

It would be really great if someone else would provide feedback, too.

Regards,
Stefan

--
...ich hab' noch einen Koffer in Berlin...

Bjorn Reese

18 Mar 2014

3:19 p.m.

On 03/17/2014 10:03 PM, Stefan Seefeld wrote:

...

Just for the record: I have been collaborating with Saksham over the last couple of weeks to refine ideas that could be cast into a formal project submission.

Where is the latest proposal?

...

It would be really great if someone else would provide feedback, too.

I would like to see how the proposal relates to other Boost libraries. For instance, how well does it integrate with Boost.Serialization or Boost.Fusion? Can it be used to replace the XML parser inside Boost.PropertyTree?

The remaining comments are related to the GitHub code, as I suspect that you want it to be used in the GSoC project:

https://github.com/stefanseefeld/boost.xml

The code could be made easier to understand with documentation.

Will it be possible to output streaming XML? (xmlTextWriter)

The DOM and Reader parsers assume that input is in a file. What if I want to process a buffer in memory?

What is the purpose of the S template argument?

What is the purpose of the convert trait?

How are different XML encodings handled?

token_base::get_token() returns information about the current token, but it seems to be invalidated (or updated to the new token) after calling parser::next(). Is that correct?

Stefan Seefeld

3:46 p.m.

On 03/18/2014 11:19 AM, Bjorn Reese wrote:

...

On 03/17/2014 10:03 PM, Stefan Seefeld wrote:

...
Just for the record: I have been collaborating with Saksham over the last couple of weeks to refine ideas that could be cast into a formal project submission.

Where is the latest proposal?

I believe Saksham is working on a formal submission right now.

...

...
It would be really great if someone else would provide feedback, too.

I would like to see how the proposal relates to other Boost libraries. For instance, how well does it integrate with Boost.Serialization or Boost.Fusion? Can it be used to replace the XML parser inside Boost.PropertyTree?

The idea is to provide complete but modular XML APIs. Complete in the sense that it can handle any well-formed XML, and modular in the sense that orthogonal pieces of functionality are kept separate so users can pull in only the pieces they need. I don't see any reason why such an XML API wouldn't be usable by other Boost libraries.

...

The remaining comments are related to the GitHub code, as I suspect that you want it to be used in the GSoC project:

https://github.com/stefanseefeld/boost.xml

I suggest you look at that as a proof-of-concept, not something finished. In fact, when I originally wrote that library, I used libxml2 as its only backend. However, doing that bears the risk that the entire API will be closely tied to the idiosyncrasies of that one backend, so adding support to more backends (such as Xerces) will help validate that the API is in fact backend-agnostic.

...

The code could be made easier to understand with documentation.

Agreed.

...

Will it be possible to output streaming XML? (xmlTextWriter)

That's a nice idea.
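As a rough illustration of what streaming output could mean here, the following is a minimal sketch of a writer that emits XML directly to a `std::ostream` without building a DOM first. The class `stream_writer` and its member names are hypothetical; they are not part of the Boost.XML sandbox code or of libxml2's xmlTextWriter, and attributes, namespaces, and encodings are deliberately left out.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical streaming writer: every call writes to the stream immediately,
// so memory use does not grow with document size (only with nesting depth).
class stream_writer
{
public:
  explicit stream_writer(std::ostream &os) : os_(os) {}

  // Write an opening tag and remember the name for the matching close.
  void start_element(const std::string &name)
  {
    os_ << '<' << name << '>';
    open_.push_back(name);
  }

  // Write character data, escaping the characters that are markup in XML.
  void text(const std::string &s)
  {
    for (char c : s)
    {
      switch (c)
      {
        case '<': os_ << "&lt;"; break;
        case '>': os_ << "&gt;"; break;
        case '&': os_ << "&amp;"; break;
        default:  os_ << c;
      }
    }
  }

  // Close the most recently opened element; does nothing if none is open.
  void end_element()
  {
    if (open_.empty()) return;
    os_ << "</" << open_.back() << '>';
    open_.pop_back();
  }

private:
  std::ostream &os_;
  std::vector<std::string> open_; // stack of currently open element names
};
```

A caller would construct the writer over any output stream (a file, a socket buffer, a `std::ostringstream`) and interleave `start_element`, `text`, and `end_element` calls as data becomes available.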

...

The DOM and Reader parsers assume that input is in a file. What if I want to process a buffer in memory?

Right, that should be possible. (I know libxml2 supports that, so at least for that backend it seems trivial to add the missing wrapper.)

...

What is the purpose of the S template argument?

To keep the concern for Unicode or any other string type orthogonal to the XML library, i.e. to allow Boost.XML to interact with different Unicode implementations. In fact, in the existing demos I'm restricting content to ASCII, so I can get away with using std::string. This is a good example of the "modularity" design goal I mentioned above: Don't force anything on users they don't actually need.

...

What is the purpose of the convert trait?

To allow conversion between the backend's own string representation and the string type that is used with Boost.XML.
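To make the two answers above more concrete, here is a minimal sketch of how a string template parameter and a conversion trait can fit together. All names (`backend_char`, `converter`, `element`) are hypothetical and chosen for illustration; this is not the actual Boost.XML sandbox code. The backend string is modeled as a NUL-terminated `unsigned char` buffer, which is how libxml2 defines `xmlChar`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for a backend's character type (assumption: byte-oriented,
// like libxml2's xmlChar).
using backend_char = unsigned char;

// Primary template is only declared: each user string type S has to
// provide its own specialization.
template <typename S> struct converter;

// Specialization for std::string. Bytes are passed through unchanged, so
// the std::string carries whatever encoding the backend uses.
template <> struct converter<std::string>
{
  // backend string -> user string
  static std::string in(const backend_char *s)
  {
    return s ? std::string(reinterpret_cast<const char *>(s)) : std::string();
  }

  // user string -> NUL-terminated backend buffer owned by the caller
  static std::vector<backend_char> out(const std::string &s)
  {
    std::vector<backend_char> buf(s.begin(), s.end());
    buf.push_back(0);
    return buf;
  }
};

// Hypothetical node wrapper parameterized on the user's string type S.
// It never touches S directly; all conversions go through converter<S>.
template <typename S>
class element
{
public:
  explicit element(const backend_char *name) : name_(name) {}
  S name() const { return converter<S>::in(name_); }

private:
  const backend_char *name_; // would be owned by the backend in a real binding
};
```

Supporting another string class (for example a Unicode-aware one) would then mean writing one more `converter` specialization, without changing `element`.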

...

How are different XML encodings handled?

Can you be more specific? I suspect the answer is that this is handled by the Unicode component to which Boost.XML gets bound by means of the string template parameter.

...

token_base::get_token() returns information about the current token, but it seems to be invalidated (or updated to the new token) after calling parser::next(). Is that correct?

Yes.

Stefan

--
...ich hab' noch einen Koffer in Berlin...
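The token lifetime confirmed above has a practical consequence for callers: anything obtained from the current token has to be copied before advancing. The toy reader below is only a sketch of that contract; `toy_reader` and `token` are hypothetical stand-ins operating on a pre-tokenized sequence, not the sandbox's `parser` or `token_base` classes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token: a kind ("start", "end", ...) and an element name.
struct token
{
  std::string kind;
  std::string name;
};

// Toy pull reader. get_token() exposes the reader's current token, and
// next() overwrites that state in place.
class toy_reader
{
public:
  explicit toy_reader(std::vector<token> tokens) : tokens_(std::move(tokens)) {}

  // Advance to the next token; returns false once the input is exhausted.
  bool next()
  {
    if (pos_ >= tokens_.size()) return false;
    current_ = tokens_[pos_++];
    return true;
  }

  // Reference into the reader: its contents change on the next successful
  // call to next().
  const token &get_token() const { return current_; }

private:
  std::vector<token> tokens_;
  std::size_t pos_ = 0;
  token current_;
};
```

Holding on to the reference returned by `get_token()` across a call to `next()` silently yields the new token, whereas a copy taken beforehand keeps the old values. Documenting this explicitly would help users of any reader-style API.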

Bjorn Reese

20 Mar 2014

8:34 a.m.

On 03/18/2014 04:46 PM, Stefan Seefeld wrote:

...

I don't see any reason why such an XML API wouldn't be usable by other Boost libraries.

It should be part of the GSoC project to verify this for the most common use cases (XML serialization is the most obvious one.)

...

...
What is the purpose of the S template argument?

To keep the concern for Unicode or any other string type orthogonal to the XML library, i.e. to allow Boost.XML to interact with different Unicode implementations. In fact, in the existing demos I'm restricting content to ASCII, so I can get away with using std::string. This is a good example of the "modularity" design goal I mentioned above: Don't force anything on users they don't actually need.

I agree with the goal, but I am not sure that the S type solves the problem. I must admit that I am having difficulty understanding exactly how you envision it should work for other encodings, because std::string is orthogonal to encoding (locale is usually attached to the I/O stream.)

What encoding is used for std::string? ASCII, UTF-8, or "whatever the XML library gives me"? This should be documented as part of the API regardless of the answer.

Should I define a new string type if I want to use Latin-1 or another encoding in my application? What if the rest of my application uses std::string for Latin-1 encodings? (I am wondering how this will work with the current convert trait specialization for std::string.)

How does the convert trait know the XML document encoding so that it is able to convert between this and the application encoding?

I suggest that you adopt the libxml2 design decision to always use UTF-8 for std::string (and UTF-16 for std::wstring if needed.) See the design rationale here:

http://xmlsoft.org/encoding.html

Any backend that does not provide UTF-8 will have to be wrapped.

With such a design decision, the S template parameter becomes superfluous (or should be changed to CharT if you wish to support both std::string and std::wstring.)

Conversion between UTF-8 and application encodings would have to be done explicitly in the application.

At any rate, encoding should be addressed in the GSoC project.
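Under the "always UTF-8" convention suggested above, an application that keeps Latin-1 data in std::string would convert at the boundary itself. A minimal sketch of such an application-side helper follows; the function name `latin1_to_utf8` is an assumption for illustration and is not part of any library mentioned in this thread. It relies on the fact that Latin-1 byte values map one-to-one to the code points U+0000 through U+00FF.

```cpp
#include <cassert>
#include <string>

// Convert ISO-8859-1 (Latin-1) bytes to UTF-8.
// Bytes below 0x80 are ASCII and are copied unchanged; every other byte
// becomes a two-byte UTF-8 sequence 110xxxxx 10xxxxxx.
std::string latin1_to_utf8(const std::string &in)
{
  std::string out;
  out.reserve(in.size());
  for (unsigned char c : in)
  {
    if (c < 0x80)
    {
      out.push_back(static_cast<char>(c));
    }
    else
    {
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));   // top 2 bits
      out.push_back(static_cast<char>(0x80 | (c & 0x3F))); // low 6 bits
    }
  }
  return out;
}
```

The application would call this on its own strings before handing them to the XML API, and apply the inverse conversion (with a policy for code points above U+00FF) on the way back.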

...

...
What is the purpose of the convert trait?

To allow conversion between the backend's own string representation and the string type that is used with Boost.XML.

Ok. You should, however, make sure that the strings are converted correctly:

http://xmlsoft.org/html/libxml-xmlstring.html

For instance, convert::in() does not take libxml2 custom allocators into account:

http://xmlsoft.org/html/libxml-xmlmemory.html
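One way to respect a backend's allocator when converting owned strings is to tie the buffer to the backend's own deallocation function with a scoped guard. The sketch below uses a stand-in namespace `fake_backend` so that it compiles without libxml2; with libxml2 the release call would be `xmlFree`. The names `dup_string`, `release`, and `take_ownership` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <string>

namespace fake_backend
{
  using char_type = unsigned char;
  int live_allocations = 0; // bookkeeping so the example can be checked

  // Stand-in for a backend function that returns a string the caller owns.
  char_type *dup_string(const char *s)
  {
    std::size_t n = std::strlen(s) + 1;
    char_type *p = static_cast<char_type *>(std::malloc(n));
    if (!p) return nullptr;
    std::memcpy(p, s, n);
    ++live_allocations;
    return p;
  }

  // Stand-in for the backend's deallocation function (xmlFree in libxml2).
  void release(char_type *p)
  {
    if (p)
    {
      std::free(p);
      --live_allocations;
    }
  }
}

// Deleter that routes deallocation through the backend, never through
// delete or a directly called std::free.
struct backend_deleter
{
  void operator()(fake_backend::char_type *p) const { fake_backend::release(p); }
};
using backend_string = std::unique_ptr<fake_backend::char_type, backend_deleter>;

// Copy an owned backend string into std::string. The guard releases the
// buffer with the backend's function even if the copy throws.
std::string take_ownership(fake_backend::char_type *raw)
{
  backend_string guard(raw);
  return raw ? std::string(reinterpret_cast<const char *>(raw)) : std::string();
}
```

Strings that the backend merely lends out (for example a node name that stays owned by the document) must not be wrapped this way; only buffers whose ownership is transferred to the caller belong in the guard.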

Stefan Seefeld

11:55 a.m.

On 03/20/2014 04:34 AM, Bjorn Reese wrote:

...

On 03/18/2014 04:46 PM, Stefan Seefeld wrote:

...
I don't see any reason why such an XML API wouldn't be usable by other Boost libraries.

It should be part of the GSoC project to verify this for the most common use cases (XML serialization is the most obvious one.)

I don't entirely understand your point. The goal is to define an XML API, and implement it, which complies with all related standards. As long as the existing Boost components (e.g. Boost.Serialization) work with standard XML tools, we should be compatible. I don't think, however, that we should be constrained to be API-compatible with existing tools, as otherwise the whole exercise to define a new API would be pointless. On the other hand, making minor adjustments to those libraries to work with Boost.XML would be fine. I just don't think we should make this part of the proposal, as it isn't even clear what existing Boost components would be affected, whether they are actively maintained / developed, etc.

...

...
...
What is the purpose of the S template argument?

To keep the concern for Unicode or any other string type orthogonal to the XML library, i.e. to allow Boost.XML to interact with different Unicode implementations. In fact, in the existing demos I'm restricting content to ASCII, so I can get away with using std::string. This is a good example of the "modularity" design goal I mentioned above: Don't force anything on users they don't actually need.

I agree with the goal, but I am not sure that the S type solves the problem. I must admit that I am having difficulty understanding exactly how you envision it should work for other encodings, because std::string is orthogonal to encoding (locale is usually attached to the I/O stream.)

You are right, encoding and string type are (mostly) orthogonal. I have never said anything else. :-)

...

What encoding is used for std::string? ASCII, UTF-8, or "whatever the XML library gives me"? This should be documented as part of the API regardless of the answer.

Yes.

...

Should I define a new string type if I want to use Latin-1 or another encoding in my application? What if the rest of my application uses std::string for Latin-1 encodings? (I am wondering how this will work with the current convert trait specialization for std::string.)

How does the convert trait know the XML document encoding so that it is able to convert between this and the application encoding?

I suggest that you adopt the libxml2 design decision to always use UTF-8 for std::string (and UTF-16 for std::wstring if needed.) See the design rationale here:

http://xmlsoft.org/encoding.html

Any backend that does not provide UTF-8 will have to be wrapped.

With such a design decision, the S template parameter becomes superfluous (or should be changed to CharT if you wish to support both std::string and std::wstring.)

Conversion between UTF-8 and application encodings would have to be done explicitly in the application.

At any rate, encoding should be addressed in the GSoC project.

I agree, and this is in fact part of the proposal. To be specific, one of the first steps is to add tests that instantiate the XML classes with existing Unicode string classes (such as Glib::ustring or Qt's QString), and demonstrate how to use them.

...

...
...
What is the purpose of the convert trait?

To allow conversion between the backend's own string representation and the string type that is used with Boost.XML.

Ok. You should, however, make sure that the strings are converted correctly:

http://xmlsoft.org/html/libxml-xmlstring.html

For instance, convert::in() does not take libxml2 custom allocators into account:

http://xmlsoft.org/html/libxml-xmlmemory.html

Good point. As I said, the existing Boost.XML was meant to be a proof-of-concept.

Thanks for your feedback,
Stefan

--
...ich hab' noch einen Koffer in Berlin...

Mathieu Champlon

18 Mar 2014

7:11 p.m.

On 17/03/2014 22:03, Stefan Seefeld wrote:

...

Just for the record: I have been collaborating with Saksham over the last couple of weeks to refine ideas that could be cast into a formal project submission.

It would be really great if someone else would provide feedback, too.

I don't know how useful this might prove, but you may want to take a look at http://xeumeuleu.sourceforge.net/

Regards,
MAT.

Saksham Maheshwari

9:45 p.m.

@Manish: There is nothing wrong with the Xerces parser, but it cannot serve as a standard, because code written against Xerces cannot be built against, say, libxml2; in other words, it does not support multiple backends. Such code depends directly on Xerces, which limits its usefulness.

@MAT: The same goes for xeumeuleu. It has many important features, but it still depends on Apache Xerces.

What I want to do is define or change the base interface of Boost.XML so that it can support multiple backends.

Any feedback?

Thanks

Saksham Maheshwari

19 Mar 2014

6:44 a.m.

Hello everyone,

Kindly find my proposal for Boost.XML here:

https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/...

I think my proposal is ready for a first submission, and I'm eager to refine it further. I would value your comments on it. Please tell me what I should do or add to make it a strong proposal.

Thanks,
Saksham

Saksham Maheshwari

7:26 a.m.

This is the public URL: http://www.google-melange.com/gsoc/proposal/public/google/gsoc2014/sam_1993/...

Thanks

munsingh

18 Mar 2014

3:39 p.m.

What's wrong with xerces XML parsers?

-Manish
