[GSOC] XML library of Boost


Amos Ji

28 Apr 2013

6:08 p.m.

Hi All,

I'm a graduate student from Fudan University in China, and I hope to contribute some code to Boost during the GSOC.

I've scanned the ideas page. The ideas on the page are all very interesting and challenging, but what interests me most is the XML library, which is at the bottom of the page. I think XML is the most popular standard format for storing information, so it's important for Boost to have a good XML library.

However, I know Boost contains RapidXML in the property_tree library to parse XML files. So what I want to make sure of is that I need to implement a new XML parser in this project instead of making enhancements to RapidXML. Am I correct? If so, I have some ideas to share with you.

In my opinion, an XML parser must be able to do these things:

1. Iterate over the DOM node tree;
2. Access the values of nodes and their attributes quickly;
3. Insert or delete nodes, or attributes of a specific node, easily;
4. Generate new XML from the structure in which the library stores the XML.

And there are some optional functions too:

1. Support XPath;
2. Validate whether the XML file is valid;
3. Support various encodings;
4. Manage memory better;
5. Support regular expressions.

The ideas are not mature yet. I'll improve them in my proposal. In fact, it's not easy to implement a perfect XML parser, but I'll do my best.

And I have another question: who will mentor this XML library project?

Thanks for your patience in reading this email.

Sincerely,
Mingchao

--
Amos Ji (Mingchao Ji)
Master in School of Information Science and Engineering, Fudan University
MSN: jmc891205@hotmail.com
E-mail: jmc891205@gmail.com


Andrey Semashev

28 Apr

7:11 p.m.

On Monday 29 April 2013 02:08:01 Amos Ji wrote:

...

Hi All,

I'm a graduate student from Fudan University in China. And I hope to contribute some code to Boost during the GSOC.

I've scanned the idea page. The ideas in the page are all very interesting and challenging. But what I'm interested in most is XML library, which is at the bottom of page. I think the XML format is the most popular standard for storing information so it's important for Boost to have a good XML library.

Agreed, a good and fast XML library would be most useful.

...

In my opinion, an XML parser must be able to do these things:

1. To Iterate over DOM nodes tree; 2. To access the values of nodes and their attributes quickly; 3. To insert or delete nodes or attribute of an exact node easily; 4. To generate new XML from the structure which stores XML in the library.

IMHO, support for SAX is also mandatory. I would even say that SAX should be the first and the primary thing to be implemented in Boost.XML, as DOM can be added later on top of it if the time for GSOC allows.

You also listed support for encodings in the optional features. I agree with it being optional, with the exception of Unicode. At least UTF-8 should be supported from the very start. But that should not be a problem now that we have Boost.Locale.

Marshall Clow

8:53 p.m.

On Apr 28, 2013, at 12:11 PM, Andrey Semashev <andrey.semashev@gmail.com> wrote:

...

On Monday 29 April 2013 02:08:01 Amos Ji wrote:

...
Hi All,

I'm a graduate student from Fudan University in China. And I hope to contribute some code to Boost during the GSOC.

I've scanned the idea page. The ideas in the page are all very interesting and challenging. But what I'm interested in most is XML library, which is at the bottom of page. I think the XML format is the most popular standard for storing information so it's important for Boost to have a good XML library.

Agreed, a good and fast XML library would be most useful.

Yes, yes, please. I would add "with a good native C++ interface". Get the interface right first.

...

...
In my opinion, an XML parser must be able to do these things:

1. To Iterate over DOM nodes tree; 2. To access the values of nodes and their attributes quickly; 3. To insert or delete nodes or attribute of an exact node easily; 4. To generate new XML from the structure which stores XML in the library.

IMHO, support for SAX is also mandatory. I would even say that SAX should be the first and the primary thing to be implemented in Boost.XML, as DOM can be added later on top of it if the time for GSOC allows.

You also listed support for encodings in the optional features. I agree with it being optional with exception for Unicode. At least UTF-8 should be supported from the very start. But that should not be a problem now that we have Boost.Locale.

I am looking forward to seeing your design.

-- Marshall

Marshall Clow Idio Software <mailto:mclow.lists@gmail.com>

A.D. 1517: Martin Luther nails his 95 Theses to the church door and is promptly moderated down to (-1, Flamebait). -- Yu Suzuki

Amos Ji

9:16 p.m.

2013/4/29 Marshall Clow <mclow.lists@gmail.com>

...

On Apr 28, 2013, at 12:11 PM, Andrey Semashev <andrey.semashev@gmail.com> wrote:

...
On Monday 29 April 2013 02:08:01 Amos Ji wrote:

...
Hi All,

I'm a graduate student from Fudan University in China. And I hope to contribute some code to Boost during the GSOC.

I've scanned the idea page. The ideas in the page are all very interesting and challenging. But what I'm interested in most is XML library, which is at the bottom of page. I think the XML format is the most popular standard for storing information so it's important for Boost to have a good XML library.

Agreed, a good and fast XML library would be most useful.

Yes, yes, please. I would add "with a good native C++ interface".

Get the interface right first.

What does "native C++ interface" mean? Does it mean that other Boost libraries are not allowed to be used?

...

...
...
In my opinion, an XML parser must be able to do these things:

1. To Iterate over DOM nodes tree; 2. To access the values of nodes and their attributes quickly; 3. To insert or delete nodes or attribute of an exact node easily; 4. To generate new XML from the structure which stores XML in the library.

IMHO, support for SAX is also mandatory. I would even say that SAX should be the first and the primary thing to be implemented in Boost.XML, as DOM can be added later on top of it if the time for GSOC allows.

You also listed support for encodings in the optional features. I agree with it being optional with exception for Unicode. At least UTF-8 should be supported from the very start. But that should not be a problem now that we have Boost.Locale.

I am looking forward to seeing your design.

Thank you. I have a long way towards success. BTW, will you mentor this XML library project?

...

-- Marshall

Marshall Clow Idio Software <mailto:mclow.lists@gmail.com>

A.D. 1517: Martin Luther nails his 95 Theses to the church door and is promptly moderated down to (-1, Flamebait). -- Yu Suzuki

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Marshall Clow

29 Apr

6:23 p.m.

On Apr 28, 2013, at 2:16 PM, Amos Ji <jmc891205@gmail.com> wrote:

...

2013/4/29 Marshall Clow <mclow.lists@gmail.com>

...
On Apr 28, 2013, at 12:11 PM, Andrey Semashev <andrey.semashev@gmail.com> wrote:

...
On Monday 29 April 2013 02:08:01 Amos Ji wrote:

...
Hi All,

I'm a graduate student from Fudan University in China. And I hope to contribute some code to Boost during the GSOC.

I've scanned the idea page. The ideas in the page are all very interesting and challenging. But what I'm interested in most is XML library, which is at the bottom of page. I think the XML format is the most popular standard for storing information so it's important for Boost to have a good XML library.

Agreed, a good and fast XML library would be most useful.

Yes, yes, please. I would add "with a good native C++ interface".

Get the interface right first.

What does "native C++ interface" mean?

A "C++ interface", in my opinion, uses the same principles/concepts as the standard library. As opposed to, say, looking like Java or Python.

...

Are the other libraries in Boost are not allowed to use?

No, no - that's not what I meant.

...

...
...
...
In my opinion, an XML parser must be able to do these things:

1. To Iterate over DOM nodes tree; 2. To access the values of nodes and their attributes quickly; 3. To insert or delete nodes or attribute of an exact node easily; 4. To generate new XML from the structure which stores XML in the library.

IMHO, support for SAX is also mandatory. I would even say that SAX should be the first and the primary thing to be implemented in Boost.XML, as DOM can be added later on top of it if the time for GSOC allows.

You also listed support for encodings in the optional features. I agree with it being optional with exception for Unicode. At least UTF-8 should be supported from the very start. But that should not be a problem now that we have Boost.Locale.

I am looking forward to seeing your design.

Thank you. I have a long way towards success. BTW, will you mentor this XML library project?

Sadly, I do not have time to be a mentor this year :-( I would be happy to look over your design and code and offer suggestions, though.

-- Marshall

Marshall Clow Idio Software <mailto:mclow.lists@gmail.com>

A.D. 1517: Martin Luther nails his 95 Theses to the church door and is promptly moderated down to (-1, Flamebait). -- Yu Suzuki

Martin Desharnais

9:33 p.m.

2013/4/29 Marshall Clow <mclow.lists@gmail.com>

...

A "C++ interface", in my opinion, uses the same principles/concepts as the standard library.

As opposed to, say, looking like Java or Python.

I think that value semantics should be considered for XML elements, as opposed to what most other libraries do. Now that we have move semantics, it should be possible to get good performance, and it would "use the same principles" as the standard library containers.

How about something like this?

    xml::element root("root");
    root.push_back(xml::comment("Comment"));
    root.push_back(xml::element("Element"));
    root.push_back(xml::text("Text"));
    root.push_back(root);

Since xml::element has value semantics, its current state will be copied and appended to itself, resulting in the following structure:

    <root>
      <!--Comment-->
      <Element />
      Text
      <root>
        <!--Comment-->
        <Element />
        Text
      </root>
    </root>

-- Martin Desharnais
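To make the value-semantics proposal above concrete, here is a minimal sketch. Everything in it (the `xml_sketch` namespace, the single `node` struct, the `to_string` helper) is hypothetical and is not an existing Boost API; it ignores attributes and does no escaping. It assumes C++17, where `std::vector` may be declared as a member with an incomplete element type. Its only purpose is to show why `root.push_back(root)` is well defined when children are stored by value.

```cpp
#include <string>
#include <utility>
#include <vector>

namespace xml_sketch {  // hypothetical namespace, for illustration only

enum class kind { element, comment, text };

// One node type with value semantics: children are owned by value, so
// copying a node deep-copies its whole subtree.
struct node {
    kind type;
    std::string value;           // tag name, comment body, or text content
    std::vector<node> children;  // only meaningful for kind::element

    // The argument is taken by value, so the copy (or move) is complete
    // before the child list is modified. That makes root.push_back(root)
    // safe: the parameter is an independent snapshot of root.
    void push_back(node child) { children.push_back(std::move(child)); }
};

inline node element(std::string name) { return node{kind::element, std::move(name), {}}; }
inline node comment(std::string body) { return node{kind::comment, std::move(body), {}}; }
inline node text(std::string body) { return node{kind::text, std::move(body), {}}; }

// Compact serialization without whitespace or escaping (sketch only).
inline std::string to_string(const node& n) {
    switch (n.type) {
    case kind::comment: return "<!--" + n.value + "-->";
    case kind::text:    return n.value;
    case kind::element: break;
    }
    if (n.children.empty()) return "<" + n.value + "/>";
    std::string out = "<" + n.value + ">";
    for (const node& c : n.children) out += to_string(c);
    return out + "</" + n.value + ">";
}

// Mirrors the sequence of calls from the message above.
inline std::string demo() {
    node root = element("root");
    root.push_back(comment("Comment"));
    root.push_back(element("Element"));
    root.push_back(text("Text"));
    root.push_back(root);  // appends a snapshot of root's current state
    return to_string(root);
}

}  // namespace xml_sketch
```

Because temporaries bind to the by-value parameter and are then moved into the child list, calls such as `root.push_back(element("Element"))` never copy the new node; only the self-append pays for a deep copy.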

Daniel Pfeifer

9:32 p.m.

2013/4/28 Andrey Semashev <andrey.semashev@gmail.com>

...

IMHO, support for SAX is also mandatory. I would even say that SAX should be the first and the primary thing to be implemented in Boost.XML, as DOM can be added later on top of it if the time for GSOC allows.

Personally, I find pull-parsing much more convenient than SAX. But personal preferences put aside, there are generally three approaches to parsing XML:

1. DOM
2. SAX, push-parsing, callback-driven
3. StAX, pull-parsing, streamreader

Each approach is better than the two others in some way. We need all three of them in Boost (and then in the standard). I believe that these three approaches may share some code, but don't need to be based upon each other.

You might want to look at pugixml [1], a "Light-weight, simple and fast XML parser for C++ with XPath support". There might be a reason why it is not built on SAX. Concerning pull-parsing, llamaxml [2] and the streamreader from Qt [3] may be of interest. I also wrote a simple stream reader (and writer) that you might find helpful [4].

[1] http://pugixml.org/
[2] http://llamaxml.berlios.de/
[3] http://qt-project.org/doc/qt-4.8/qxmlstreamreader.html
[4] https://github.com/purpleKarrot/xml

cheers, Daniel
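The difference between push (SAX-like) and pull (StAX-like) parsing described above is mostly about who owns the loop. The following sketch illustrates that under heavy simplification: there is no real tokenizer, only a hard-coded event list, and all names (`parse_sketch`, `handler`, `push_parse`, `pull_reader`) are made up for this example rather than taken from any of the libraries mentioned.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

namespace parse_sketch {  // hypothetical names throughout

enum class event_kind { start_element, end_element, characters };

struct event {
    event_kind kind;
    std::string data;  // tag name or text content
};

// Stand-in for a tokenizer: the events of <a><b>hi</b></a>.
inline std::vector<event> sample_events() {
    return {{event_kind::start_element, "a"},
            {event_kind::start_element, "b"},
            {event_kind::characters, "hi"},
            {event_kind::end_element, "b"},
            {event_kind::end_element, "a"}};
}

// Push style: the parser owns the loop and calls back into user code.
struct handler {
    std::function<void(const std::string&)> on_start;
    std::function<void(const std::string&)> on_end;
    std::function<void(const std::string&)> on_text;
};

inline void push_parse(const std::vector<event>& src, const handler& h) {
    for (const event& e : src) {
        switch (e.kind) {
        case event_kind::start_element: if (h.on_start) h.on_start(e.data); break;
        case event_kind::end_element:   if (h.on_end) h.on_end(e.data); break;
        case event_kind::characters:    if (h.on_text) h.on_text(e.data); break;
        }
    }
}

// Pull style: user code owns the loop and asks for the next event.
class pull_reader {
public:
    explicit pull_reader(std::vector<event> src) : src_(std::move(src)) {}

    // Advances to the next event; returns false once the input is exhausted.
    bool next() {
        if (pos_ >= src_.size()) return false;
        ++pos_;
        return true;
    }

    // Precondition: the most recent call to next() returned true.
    const event& current() const { return src_[pos_ - 1]; }

private:
    std::vector<event> src_;
    std::size_t pos_ = 0;
};

inline int count_elements_push(const std::vector<event>& src) {
    int n = 0;
    handler h;
    h.on_start = [&n](const std::string&) { ++n; };
    push_parse(src, h);
    return n;
}

inline int count_elements_pull(const std::vector<event>& src) {
    int n = 0;
    pull_reader reader(src);
    while (reader.next())
        if (reader.current().kind == event_kind::start_element) ++n;
    return n;
}

}  // namespace parse_sketch
```

With the pull reader, the caller can stop, skip, or hand the reader to a helper function at any point, which is one common reason people find it more convenient; with the push parser, state has to be carried in the callbacks.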

Bjorn Reese

30 Apr

10:40 a.m.

On 04/29/2013 11:32 PM, Daniel Pfeifer wrote:

...

I also wrote a simple stream reader (and writer) that you might find helpful [4]. [...] [4] https://github.com/purpleKarrot/xml

Interesting little project. Have you considered using Boost.Regex instead of re2c?

Daniel Pfeifer

11:59 a.m.

2013/4/30 Bjorn Reese <breese@mail1.stofanet.dk>

...

On 04/29/2013 11:32 PM, Daniel Pfeifer wrote:

I also wrote a simple stream reader (and writer) that you might find helpful [4].

[...]

[4] https://github.com/purpleKarrot/xml

Interesting little project. Have you considered using Boost.Regex instead of re2c?

Boost.Regex serves a different purpose than re2c. The mapping goes:

pcre -> Boost.Regex
re2c -> Boost.Spirit

cheers, Daniel

Bjorn Reese

2 May

9:35 a.m.

On 04/30/2013 01:59 PM, Daniel Pfeifer wrote:

...

2013/4/30 Bjorn Reese <breese@mail1.stofanet.dk>

...

...
Interesting little project. Have you considered using Boost.Regex instead of re2c?

Boost.Regex solves a different purpose than re2c. The mapping goes:

pcre -> Boost.Regex
re2c -> Boost.Spirit

Ok, so let me rephrase my question. Have you considered Boost.Regex or Boost.Spirit instead of re2c?

Simon Siemens

1 May

7:05 a.m.

Please also look at http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2101.html .

As an XML document has a tree structure, I guess an XML DOM interface should implement a C++ tree container interface. The above paper suggests some important principles for tree-like containers in C++. Maybe someone could comment on the state of this paper.

Regards
Simon

On Monday, 29.04.2013, at 23:32 +0200, Daniel Pfeifer wrote:

...

2013/4/28 Andrey Semashev <andrey.semashev@gmail.com>

...
IMHO, support for SAX is also mandatory. I would even say that SAX should be the first and the primary thing to be implemented in Boost.XML, as DOM can be added later on top of it if the time for GSOC allows.

Personally, I find pull-parsing much more convenient than SAX. But personal preferences put aside, there are generally three approaches to parsing XML:

1. DOM 2. SAX, push-parsing, callback-driven 3. StAX, pull-parsing, streamreader

Each approach is better than the two others in some way. We need them all three in Boost (and then in the standard). I believe that these three approaches may share some code, but don't need to be based upon each other.

You might want to look at pugixml [1], a "Light-weight, simple and fast XML parser for C++ with XPath support". There might be a reason why it is not built on SAX. Concerning pull-parsing, llamaxml [2] and the streamreader from Qt [3] and may be of interest. I also wrote a simple stream reader (and writer) that you might find helpful [4].

[1] http://pugixml.org/ [2] http://llamaxml.berlios.de/ [3] http://qt-project.org/doc/qt-4.8/qxmlstreamreader.html [4] https://github.com/purpleKarrot/xml

cheers, Daniel


Rene Rivera

2:59 p.m.

On Wed, May 1, 2013 at 2:05 AM, Simon Siemens <simon.siemens@arcor.de> wrote:

...

Please also look at

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2101.html .

As an XML document has a tree structure, I guess an XML DOM interface should implement a C++ tree container interface. The above paper suggests some important principles for tree-like containers in C++. Maybe someone could comment on the state of this paper.

I am reworking the paper to make it C++11 compliant, in the hope that it can be discussed at the fall C++ meeting. Since the meeting is in my home town, I'll be attending for this purpose. You can see the slow incremental progress directly at <http://tinyurl.com/claew33>, and the overall tree project at <https://github.com/grafikrobot/boost-tree>.

--
-- Grafik - Don't Assume Anything
-- Redshift Software, Inc. - http://redshift-software.com
-- rrivera/acm.org - grafik/redshift-software.com
-- 102708583/icq - grafikrobot/aim - grafikrobot/yahoo

Stefan Seefeld

3:15 p.m.

On 2013-04-28 14:08, Amos Ji wrote:

...

I've scanned the idea page. The ideas in the page are all very interesting and challenging. But what I'm interested in most is XML library, which is at the bottom of page. I think the XML format is the most popular standard for storing information so it's important for Boost to have a good XML library.

However, I know Boost contains RapidXML in property_tree library to parse XML file. So what I want to make sure is I need to implement a new XML parser in this project instead of make enhancement for RapidXML. Am I correct? If so, I have some ideas to share with you.

I don't think your assumptions are entirely correct. First, I think the "XML" project on the ideas page is mis-classified, as it implies a misunderstanding. XML isn't a parser, nor is it a file format - it is in fact quite a bit more.

As I have argued many times before on this list, I think it would be foolish to try to reimplement all the functionality to support XML. There already are quite a few decent implementations available, written in different languages (mostly C and Java), so it might be more appropriate to reuse them.

I agree with others that in the context of boost this should be about defining a good XML API, and then mapping that to existing libraries. In fact, I did that a long time ago by wrapping libxml2. You can still see the code in the sandbox at http://svn.boost.org/svn/boost/sandbox/xml/.

...

In my opinion, an XML parser must be able to do these things:

1. To Iterate over DOM nodes tree; 2. To access the values of nodes and their attributes quickly; 3. To insert or delete nodes or attribute of an exact node easily; 4. To generate new XML from the structure which stores XML in the library.

And there are some optional function too:

1. To support XPATH; 2. To validate whether the XML file is valid; 3. To support various encoding; 4. To manage memory better; 5. To support regular expression.

I agree with all of the above. Still, I think trying to reimplement this as a "pure" boost library is the wrong approach. Focus on the API, then map it to an existing library.

...

The ideas are not mature now. I'll improve them in my proposal. In fact, it's not easy to implement a perfect XML parser but I'll do my best.

And I have another question. Who will mentor this XML library project?

I would be happy to mentor this.

Stefan

--
...ich hab' noch einen Koffer in Berlin...

姬明超

7:25 p.m.

On 2013-5-1, at 11:15 PM, Stefan Seefeld <stefan@seefeld.name> wrote:

...

On 2013-04-28 14:08, Amos Ji wrote:

...
I've scanned the idea page. The ideas in the page are all very interesting and challenging. But what I'm interested in most is XML library, which is at the bottom of page. I think the XML format is the most popular standard for storing information so it's important for Boost to have a good XML library.

However, I know Boost contains RapidXML in property_tree library to parse XML file. So what I want to make sure is I need to implement a new XML parser in this project instead of make enhancement for RapidXML. Am I correct? If so, I have some ideas to share with you.

I don't think your assumptions are entirely correct. First, I think the "XML" project on the ideas page is mis-classified, as it implies a misunderstanding. XML isn't a parser, and neither a file format - it is in fact quite a bit more.

As I have argued many times before on this list, I think it would be foolish to try to reimplement all the functionality to support XML. There already are quite a few decent implementations available, written in different languages (mostly C and Java), so it might be more appropriate to reuse them.

I agree with others that in the context of boost this should be about defining a good XML API, and then map that to existing libraries. In fact, I have done that a long time ago by wrapping libxml2. You can still see the code in the sandbox at http://svn.boost.org/svn/boost/sandbox/xml/.

Thanks for your comments.

I agree that it isn't clever to "reinvent the wheel". libxml2 and expat are both great XML libraries. It will be much easier to offer an API on top of the existing libraries than to implement a new XML library.

But what if the users don't have libxml2 on their computer? Do we just tell them to download libxml2 before they use boost::xml? Or do we include libxml2 in the release? I think neither option is friendly to users.

...

...
In my opinion, an XML parser must be able to do these things:

1. To Iterate over DOM nodes tree; 2. To access the values of nodes and their attributes quickly; 3. To insert or delete nodes or attribute of an exact node easily; 4. To generate new XML from the structure which stores XML in the library.

And there are some optional function too:

1. To support XPATH; 2. To validate whether the XML file is valid; 3. To support various encoding; 4. To manage memory better; 5. To support regular expression.

I agree with all of the above. Still, I think trying to reimplement this as a "pure" boost library is the wrong approach. Focus on the API, then map it to an existing library.

If 3rd-party libraries are allowed in boost, I'll be glad to work in accordance with your advice. I hope to hear more opinions.

...

...
The ideas are not mature now. I'll improve them in my proposal. In fact, it's not easy to implement a perfect XML parser but I'll do my best.

And I have another question. Who will mentor this XML library project?

I would be happy to mentor this.

Thank you very much! I'm glad to work with you if I'm accepted.

Sincerely,
Mingchao

legalize+jeeves＠mail.xmission.com

7:34 p.m.

[Please do not mail me a copy of your followup]

boost@lists.boost.org spake the secret code <E6C80D62-CECD-4250-B50A-3CC9AA006940@gmail.com> thusly:

...

I agree that it isn't clever to "reinvent the wheels". The libxml2 and expat are both great xml libraries. It will be much easier to offer API of the existing libraries than to implement a new xml library.

But what if the users don't have libxml2 on their computer? Just tell the users to download libxml2 before they want to use boost::xml? Or include the libxml2 in release version? I think neither of them are friendly to users.

I think boost.multiprecision has a good example to follow: you specify a back-end as a template argument, where the backends can be an existing library or a boost licensed implementation. The boost licensed implementation needn't be as fast or memory efficient as the existing libraries; the emphasis should be on the simplest implementation that can satisfy the backend requirements of the front end library, without being too slow or memory intensive.

--
"The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline>
The Computer Graphics Museum <http://computergraphicsmuseum.org>
The Terminals Wiki <http://terminals.classiccmp.org>
Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>
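As a rough illustration of the back-end-as-template-argument idea, the sketch below shows a front end that forwards to whichever back end it is instantiated with. It is only loosely modeled on the pattern described above; the names `backend_sketch`, `map_backend`, `vector_backend` and `basic_attributes` are invented for this example, and neither back end wraps a real third-party library.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

namespace backend_sketch {  // hypothetical names throughout

// Stand-in for a simple, permissively licensed built-in back end.
class map_backend {
public:
    void set(const std::string& key, const std::string& value) { data_[key] = value; }
    std::string get(const std::string& key) const {
        auto it = data_.find(key);
        return it == data_.end() ? std::string() : it->second;
    }
    std::size_t size() const { return data_.size(); }

private:
    std::map<std::string, std::string> data_;
};

// Stand-in for a wrapper around an existing library. A real wrapper would
// forward to that library's calls; here it is just a different container.
class vector_backend {
public:
    void set(const std::string& key, const std::string& value) {
        for (auto& kv : data_) {
            if (kv.first == key) { kv.second = value; return; }
        }
        data_.emplace_back(key, value);
    }
    std::string get(const std::string& key) const {
        for (const auto& kv : data_) {
            if (kv.first == key) return kv.second;
        }
        return std::string();
    }
    std::size_t size() const { return data_.size(); }

private:
    std::vector<std::pair<std::string, std::string>> data_;
};

// Front end: the user-facing interface is the same for every back end.
// The implicit back-end requirements are set(), get() and size().
template <class Backend>
class basic_attributes {
public:
    void set(const std::string& key, const std::string& value) { backend_.set(key, value); }
    std::string get(const std::string& key) const { return backend_.get(key); }
    std::size_t size() const { return backend_.size(); }

private:
    Backend backend_;
};

}  // namespace backend_sketch
```

User code written against `basic_attributes<Backend>` compiles unchanged for either back end, at the cost of the back end appearing in every type name that depends on it.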

姬明超

7:53 p.m.

On 2013-5-2, at 3:34 AM, legalize+jeeves@mail.xmission.com (Richard) wrote:

...

I think boost.multiprecision has a good example to follow: you specify a back-end as a template argument, where the backends can be an existing library or a boost licensed implementation. The boost licensed implementation needn't be as fast or memory efficient as the existing libraries, the emphasis should be on simplest implementation that can satisfy the backend requirements of the front end library, without beeing too slow or memory intensive.

Thanks for your helpful instruction! I'll look into boost.multiprecision to see how it solves this problem.

But according to your explanation, boost still needs its own xml library implementation. Am I correct?

Currently, boost uses rapidXML, a 3rd-party xml library, to read or write xml files.

Then I think this project can be divided into two parts.

1. Define a good API and map it to some 3rd-party libraries such as libxml2.
2. Implement a simple xml library for boost.

Sincerely,
Mingchao

Stefan Seefeld

8:16 p.m.

On 2013-05-01 15:53, 姬明超 wrote:

...

On 2013-5-2, at 3:34 AM, legalize+jeeves@mail.xmission.com (Richard) wrote:

...
I think boost.multiprecision has a good example to follow: you specify a back-end as a template argument, where the backends can be an existing library or a boost licensed implementation. The boost licensed implementation needn't be as fast or memory efficient as the existing libraries; the emphasis should be on the simplest implementation that can satisfy the backend requirements of the front end library, without being too slow or memory intensive.

Thanks for your helpful instruction! I'll look into boost.multiprecision to see how it solves this problem.

I honestly doubt that making the choice of backend a template parameter is an appropriate approach, at least for boost.xml. Why would someone want to control this choice by means of a template parameter, as opposed to a configuration / build system flag ? Would you want to instantiate multiple backend bindings in the same application ?
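For comparison, the build-flag alternative raised here could look roughly like the sketch below. The macro `XML_SKETCH_USE_EXTERNAL` and the types `builtin_backend`, `external_backend` and `document` are hypothetical; a real build system would define the macro, and a real external back end would wrap an actual library.

```cpp
#include <string>

namespace config_sketch {  // hypothetical names throughout

struct builtin_backend {
    static std::string name() { return "builtin"; }
};

struct external_backend {
    static std::string name() { return "external"; }
};

// The choice is made once, when the library is configured and built.
// XML_SKETCH_USE_EXTERNAL is an assumed macro set by the build system.
#if defined(XML_SKETCH_USE_EXTERNAL)
using backend = external_backend;
#else
using backend = builtin_backend;
#endif

// The user-facing type has no template parameter, so client code does
// not mention the back end at all.
class document {
public:
    std::string backend_name() const { return backend::name(); }
};

}  // namespace config_sketch
```

The trade-off is that only one back end can be active per build, which is exactly the question being asked: whether anyone needs several bindings in the same application.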

...

But according to your explanation, boost still need its own xml library implementation. Am I correct?

That's exactly what I was referring to as a "foolish idea": XML is big, and implementing a conforming API is a highly non-trivial (multi-year) task.

...

Currently, boost use rapidXML, a 3rd-party xml library, to read or write xml files.

I think boost uses a couple of implementations, none of which are truly XML compliant (mostly because they only care about subsets of the spec, depending on what use-case they are targeting).

...

Then I think this project can be divided into two parts. 1. Define a good API and map it to some 3rd-party libraries such as libxml2. 2. Implement a simple xml library for boost.

I think 2. is far too ambitious for a GSoC project. I'd thus focus on 1. Specifically, I would start with existing C++ XML APIs (including my boost.xml sandbox project, as well as arabica), and improve and refine them, as appropriate.

Stefan

--
...ich hab' noch einen Koffer in Berlin...

Boris Schaeling

9:38 p.m.

On Wed, 01 May 2013 22:16:02 +0200, Stefan Seefeld <stefan@seefeld.name> wrote:

...

[...]

...
Then I think this project can be divided into two parts. 1. Define a good API and map it to some 3rd-party libraries such as libxml2. 2. Implement a simple xml library for boost.

I think 2. is far too ambitious for a GSoC project. I'd thus focus on 1. Specifically, I would start with existing C++ XML APIs (including my boost.xml sandbox project, as well as arabica), and improve and refine them, as appropriate.

I agree with Stefan. Your application is good enough if it "only" covers item 1, Mingchao. If you happen to be done with item 1 before the GSoC project ends, you could still work on item 2 if you wanted. But I would rather overperform than risk failing to deliver item 2. :)

Boris

legalize+jeeves＠mail.xmission.com

2 May

1:24 a.m.

[Please do not mail me a copy of your followup]

boost@lists.boost.org spake the secret code <539DD973-0901-46B0-AAA4-3B7ABB32264C@gmail.com> thusly:

...

On 2013-5-2, at 3:34 AM, legalize+jeeves@mail.xmission.com (Richard) wrote:

...
I think boost.multiprecision has a good example to follow: you specify a back-end as a template argument, where the backends can be an existing library or a boost licensed implementation. The boost licensed implementation needn't be as fast or memory efficient as the existing libraries, the emphasis should be on simplest implementation that can satisfy the backend requirements of the front end library, without beeing too slow or memory intensive.

Thanks for your helpful instruction! I'll look into boost.multiprecision to see how it solve this problem. But according to your explanation, boost still need its own xml library implementation. Am I correct?

I'm sorry I wasn't clear. I was describing how boost.multiprecision does things: you can choose between several existing open source back ends, or you can select the boost backend. The boost license is more permissive with respect to commercial use than the license on GMP, for instance.

It's not the only option, simply an example that's already in boost that solves the problem of how do you target multiple different libraries that have similar features.

--
"The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline>
The Computer Graphics Museum <http://computergraphicsmuseum.org>
The Terminals Wiki <http://terminals.classiccmp.org>
Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>

Stefan Seefeld

1 May

7:46 p.m.

On 2013-05-01 15:25, 姬明超 wrote:

...

I agree that it isn't clever to "reinvent the wheels". The libxml2 and expat are both great xml libraries. It will be much easier to offer API of the existing libraries than to implement a new xml library.

But what if the users don't have libxml2 on their computer?

Then (s)he has to install the prerequisite libraries, as is custom with all software that is being used today.

...

Just tell the users to download libxml2 before they want to use boost::xml? Or include the libxml2 in release version? I think neither of them are friendly to users.

I agree that usability and user-friendliness should be a concern. However, most of these aspects can be taken care of by proper packaging or package management. For example, on Linux you would typically use package managers such as rpm or deb to manage packages, and managing such dependencies is all handled very conveniently.

For avoidance of doubt: I'm not suggesting that libxml2 has to be the backend-of-choice. In fact, I would argue that a much better course of action is to add more backends, which has a couple of benefits:

* supporting multiple backends helps to make sure the API is itself not tied to any particular backend by accident.
* it gives more choices to package managers who build packages for a variety of platforms, and who then have to choose how to configure the library, and specifically, what backend to pick.

As a good reference, I suggest you have a look at arabica (http://www.jezuk.co.uk/cgi-bin/view/arabica), which has already done that. Perhaps some of it can even be reused and incorporated into such a boost.xml library.

...

If the 3rd-party libraries are allowed in boost, I'll be glad to work in accordance with your advice. I hope to hear more voice.

We have had this discussion a couple of times in the past on this list. I doubt there is any fundamental disagreement on this approach. Other boost libraries use similar approaches if they cover a sufficiently complex domain. For example, consider boost.mpi or boost.python.

...

...
I would be happy to mentor this.

Thank you very much! I'm glad to work with you if I'm accepted.

Great, then let's get started on some project details (we still have two days ! :-) ) We still need to establish clear (achievable) goals, as well as a realistic schedule. Feel free to follow up offlist if you want to discuss any of this.

Regards,
Stefan

--
...ich hab' noch einen Koffer in Berlin...

Bjorn Reese

5 May

9:53 a.m.

On 05/01/2013 05:15 PM, Stefan Seefeld wrote:

...

As I have argued many times before on this list, I think it would be foolish to try to reimplement all the functionality to support XML.

I went back to read those discussions, and I think that there are good points on both sides. In my experience, most people only need a simple XML parser without all the extra features of XML Schema, XSLT, etc. Others also need fairly simple extensions such as XPath, and finally some need the full monty. It all depends on their use cases.

...

I agree with others that in the context of boost this should be about defining a good XML API, and then map that to existing libraries. In

Here is an alternative suggestion. Parsing the XML syntax is fairly simple, so we could provide a basic XML parser for those with simple needs. This XML parser should be based on the Builder design pattern (somewhat reminiscent of a SAX interface). The default builder will create our own tree/DOM, with which you can do nothing but simple tree manipulation. In case we need more complex features, such as XML Schema, we could provide a builder for, say, libxml2, which generates libxml2 trees that can be used, via the XML APIs you envision, to invoke the libxml2 XML Schema validation.

This approach allows us to have a simple XML parser for projects like Boost.Serialization and Boost.PropertyTree without the need for an external dependency. It also allows us to do more advanced stuff with external dependencies.
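A minimal sketch of the builder-based design suggested above might look as follows. It is an illustration under strong assumptions: the toy `parse` function handles only bare tags and text (no attributes, comments, entities or self-closing tags), and all names (`builder_sketch`, `tree_builder`, `counting_builder`) are invented here. A builder for libxml2 would implement the same three member functions by calling into that library, which is not shown.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

namespace builder_sketch {  // hypothetical names throughout

struct node {
    std::string name;
    std::string text;
    std::vector<node> children;  // requires C++17 (incomplete element type)
};

// Default builder: creates a simple tree owned by the builder.
class tree_builder {
public:
    tree_builder() { stack_.push_back(&root_); }
    tree_builder(const tree_builder&) = delete;  // stack_ points into root_
    tree_builder& operator=(const tree_builder&) = delete;

    void start_element(const std::string& name) {
        node* parent = stack_.back();
        parent->children.push_back(node{name, "", {}});
        // Only ancestors of the new node are on the stack, and adding a
        // child never relocates its ancestors, so these pointers stay valid.
        stack_.push_back(&parent->children.back());
    }
    void end_element(const std::string&) {
        if (stack_.size() > 1) stack_.pop_back();
    }
    void characters(const std::string& t) { stack_.back()->text += t; }

    const node& document() const { return root_; }  // synthetic root node

private:
    node root_;
    std::vector<node*> stack_;
};

// Alternative builder: same interface, no tree at all.
struct counting_builder {
    int elements = 0;
    std::size_t text_bytes = 0;
    void start_element(const std::string&) { ++elements; }
    void end_element(const std::string&) {}
    void characters(const std::string& t) { text_bytes += t.size(); }
};

// Toy parser: recognizes <name>, </name> and text, and reports each to
// the builder. It does not check that tags are properly nested.
template <class Builder>
void parse(const std::string& in, Builder& b) {
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '<') {
            std::size_t close = in.find('>', i);
            if (close == std::string::npos) throw std::runtime_error("unterminated tag");
            std::string tag = in.substr(i + 1, close - i - 1);
            if (!tag.empty() && tag[0] == '/') b.end_element(tag.substr(1));
            else b.start_element(tag);
            i = close + 1;
        } else {
            std::size_t lt = in.find('<', i);
            if (lt == std::string::npos) lt = in.size();
            b.characters(in.substr(i, lt - i));
            i = lt;
        }
    }
}

}  // namespace builder_sketch
```

The parser itself never allocates a tree; what gets built (if anything) is entirely the builder's decision, which is what lets the same front end serve both dependency-free and library-backed uses.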

Andrey Semashev

10:50 a.m.

On Sunday 05 May 2013 11:53:14 Bjorn Reese wrote:

...

On 05/01/2013 05:15 PM, Stefan Seefeld wrote:

...
As I have argued many times before on this list, I think it would be foolish to try to reimplement all the functionality to support XML.

I went back to read those discussions, and I think that there are good points on both sides. In my experience, most people only need a simple XML parser without all the extra features of XML Schema, XSLT, etc. Others also need fairly simple extensions such as XPath, and finally some need the full monty. It all depends on their use cases.

...
I agree with others that in the context of boost this should be about defining a good XML API, and then map that to existing libraries. In

Here is an alternative suggestion. Parsing the XML syntax is fairly simple, so we could provide a basic XML parser for those with simple needs. This XML parser should be based on the Builder design pattern (somewhat reminiscent of a SAX interface). The default builder will create our own tree/DOM, with which you can do nothing but simple tree manipulation. In case we need more complex features, such as XML Schema, we could provide a builder for, say, libxml2, which generates libxml2 trees that can be used, via the XML APIs you envision, to invoke the libxml2 XML Schema validation.

This approach allows us to have a simple XML parser for projects like Boost.Serialization and Boost.PropertyTree without the need for an external dependency. It also allows us to do more advanced stuff with external dependencies.

+1 for this approach. I agree with Stefan that XML as a whole is too large a project, and this approach allows us to work on it gradually. A SAX-like parser (with the builder API) is big enough by itself for a GSOC project, IMHO, and it lays a good foundation for further additions like DOM and XSLT while being very useful on its own.

My only wish for the DOM/XSLT/XPath part is that it should provide a portable and consistent C++ API regardless of the backend used, be that libxml2, xerces/xalan or MSXML. The same applies to the SAX parser part, if it is backed by a third-party library.

Speaking of third-party dependencies, care must be taken to choose the backend with the least restrictive license. Permissive licensing of Boost is one of the keys to its success, IMHO. In fact, the licensing issue may be one reason to have our own BSL-licensed implementation, at least of some subset of XML (my preference would be at least the SAX parser).

Stefan Seefeld

5 p.m.

On 05/05/2013 05:53 AM, Bjorn Reese wrote:

...

Here is an alternative suggestion. Parsing the XML syntax is fairly simple, so we could provide a basic XML parser for those with simple needs.

Define "simple needs". I bet there are as many different expectations for that as you ask people. How would you package boost.xml, to offer these different implementations with varying feature sets ? I don't see any reasonable way to achieve that. In contrast, there are a couple of well-established APIs to deal with XML (notably SAX, XMLReader, and DOM), it just so happens that none of them are available as standard C++ APIs. I strongly believe that boost.xml should support APIs to parse XML documents (SAX and XMLReader, say), as well as to navigate and manipulate XML Infosets (DOM). I agree that the APIs should be simple and modular, but I don't see any way to let a single library implementation generate multiple differing DOM trees, for example. That would turn into a nightmare for library maintainers, packagers, and users alike. Stefan -- ...ich hab' noch einen Koffer in Berlin...

Bjorn Reese

6 May 6 May

9:43 a.m.

On 05/05/2013 07:00 PM, Stefan Seefeld wrote:

...

Define "simple needs". I bet there are as many different expectations for that as you ask people.

But that does not mean that we should ignore their needs. We do not have to look further than Boost to find use cases wherein XML is used as an encoding format and nothing else. Boost.PropertyTree has a tree data structure that can be saved in XML, JSON, INI, or its own file format. It therefore needs to parse an XML document into its own data structure. Boost.Serialization has an XML archive that needs to parse an XML document into user-defined data structures. XmlReader would be a perfect fit for Boost.Serialization. With a builder design pattern both can be handled directly without any intermediate DOM data structure. I am going to elaborate on that below.

...

How would you package boost.xml, to offer these different implementations with varying feature sets ? I don't see any reasonable way to achieve that.

In the same way that you intend to support wrappers for libxml2 and Xerces.

...

In contrast, there are a couple of well-established APIs to deal with XML (notably SAX, XMLReader, and DOM), it just so happens that none of them are available as standard C++ APIs.

I must have expressed myself badly, if I left you with the impression that I am against these APIs or C++ versions thereof. Quite to the contrary. Let me outline how I would approach this project:

Start with an XML lexer. This simply returns the next token (start tag, attribute, data, etc.) when called.

Put the XML lexer in a loop, and you get a SAX parser.

Pair the XML lexer with a parent stack, and you get an XmlReader.

Base the DOM parser on the SAX parser to create its tree. This is how libxml2 does it, and how it reuses the tree generator for parsing other formats such as HTML and DocBook.

By default, I would provide our own tree, although this is not terribly important. If I want to use XML Schema or XSLT, I would instead replace the builder (the SAX callbacks) with one for libxml2, and then use libxml2 for validation or transformation. Creating such a libxml2 builder is straightforward, because libxml2 already supplies it in its API: xmlDefaultSAXHandler. No maintenance nightmare here.

Another advantage of using this interpreter/builder split is that it gives our users the freedom to create new frontends for alternative XML encodings, such as binary XML or SXML (XML as S-expressions). This would not be possible if we only created a wrapper for libxml2.
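To make the first two steps of that outline concrete, here is a minimal sketch of a lexer that returns one token per call, and a loop around it that dispatches SAX-style callbacks. Everything in it is an assumption made for illustration: the names `token`, `lexer`, `sax_handler`, `sax_parse` and `trace_handler` are hypothetical, and the lexer only understands a tiny XML subset (elements, double-quoted attributes, character data; no comments, CDATA, entities, processing instructions or error reporting).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class token_kind { start_tag, attribute, end_tag, text, end_of_input };

struct token {
    token_kind kind;
    std::string name;   // tag or attribute name
    std::string value;  // attribute value or character data
};

// Hypothetical lexer for a tiny XML subset; returns the next token per call.
class lexer {
public:
    explicit lexer(std::string input) : in_(std::move(input)) {}

    token next() {
        if (in_tag_) return next_in_tag();
        if (pos_ >= in_.size()) return {token_kind::end_of_input, "", ""};
        if (in_[pos_] != '<') {  // character data up to the next tag
            std::size_t end = in_.find('<', pos_);
            if (end == std::string::npos) end = in_.size();
            std::string data = in_.substr(pos_, end - pos_);
            pos_ = end;
            return {token_kind::text, "", data};
        }
        if (pos_ + 1 < in_.size() && in_[pos_ + 1] == '/') {  // </name>
            pos_ += 2;
            std::string name = read_name();
            skip_past('>');
            return {token_kind::end_tag, name, ""};
        }
        ++pos_;  // <name ...
        in_tag_ = true;
        current_ = read_name();
        return {token_kind::start_tag, current_, ""};
    }

private:
    token next_in_tag() {
        while (pos_ < in_.size() && std::isspace(static_cast<unsigned char>(in_[pos_]))) ++pos_;
        if (pos_ >= in_.size()) { in_tag_ = false; return {token_kind::end_of_input, "", ""}; }
        if (in_[pos_] == '>') { ++pos_; in_tag_ = false; return next(); }
        if (in_[pos_] == '/') {  // "/>" closes the element that was just opened
            skip_past('>');
            in_tag_ = false;
            return {token_kind::end_tag, current_, ""};
        }
        std::string name = read_name();
        skip_past('"');  // skips '=' and the opening quote
        std::size_t end = in_.find('"', pos_);
        if (end == std::string::npos) end = in_.size();
        std::string value = in_.substr(pos_, end - pos_);
        pos_ = end < in_.size() ? end + 1 : end;
        return {token_kind::attribute, name, value};
    }
    std::string read_name() {
        std::size_t begin = pos_;
        while (pos_ < in_.size() &&
               (std::isalnum(static_cast<unsigned char>(in_[pos_])) ||
                in_[pos_] == '_' || in_[pos_] == '-' || in_[pos_] == ':' || in_[pos_] == '.'))
            ++pos_;
        return in_.substr(begin, pos_ - begin);
    }
    void skip_past(char c) {
        std::size_t p = in_.find(c, pos_);
        pos_ = (p == std::string::npos) ? in_.size() : p + 1;
    }

    std::string in_;
    std::size_t pos_ = 0;
    bool in_tag_ = false;
    std::string current_;  // name of the start tag being lexed
};

// Hypothetical SAX-style callback interface.
struct sax_handler {
    virtual ~sax_handler() = default;
    virtual void start_element(const std::string&) {}
    virtual void attribute(const std::string&, const std::string&) {}
    virtual void characters(const std::string&) {}
    virtual void end_element(const std::string&) {}
};

// "Put the XML lexer in a loop": each token becomes one callback.
inline void sax_parse(const std::string& input, sax_handler& h) {
    lexer lx(input);
    for (token t = lx.next(); t.kind != token_kind::end_of_input; t = lx.next()) {
        switch (t.kind) {
            case token_kind::start_tag: h.start_element(t.name); break;
            case token_kind::attribute: h.attribute(t.name, t.value); break;
            case token_kind::text:      h.characters(t.value); break;
            case token_kind::end_tag:   h.end_element(t.name); break;
            case token_kind::end_of_input: break;
        }
    }
}

// Example handler that records the event stream.
struct trace_handler : sax_handler {
    std::vector<std::string> events;
    void start_element(const std::string& n) override { events.push_back("start " + n); }
    void attribute(const std::string& n, const std::string& v) override { events.push_back("attr " + n + "=" + v); }
    void characters(const std::string& d) override { events.push_back("text " + d); }
    void end_element(const std::string& n) override { events.push_back("end " + n); }
};
```

A tree builder or a backend-specific builder would simply be another `sax_handler`; the lexer and the loop stay the same.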

Stefan Seefeld

1:18 p.m.

On 05/06/2013 05:43 AM, Bjorn Reese wrote:

...

On 05/05/2013 07:00 PM, Stefan Seefeld wrote:

...
How would you package boost.xml, to offer these different implementations with varying feature sets ? I don't see any reasonable way to achieve that.

In the same way that you intend to support wrappers for libxml2 and Xerces.

Well, I don't expect any packager to package both. Or perhaps they might, so users have the choice to install, say, `yum install boost-xml-xerces` or `yum install boost-xml-libxml2`. Still, for these two the provided functionality should be mostly the same, while you are advocating a 'boost-xml' package offering a reduced API. I'm not convinced that will solve any real problem.

...

...
In contrast, there are a couple of well-established APIs to deal with XML (notably SAX, XMLReader, and DOM), it just so happens that none of them are available as standard C++ APIs.

I must have expressed myself badly, if I left you with the impression that I am against these APIs or C++ versions thereof. Quite to the contrary. Let me outline how I would approach this project:

Start with an XML lexer. This simply returns the next token (start tag, attribute, data, etc.) when called.

[....] Fine, so you insist on writing your own XML implementation. That's obviously up to you, and as long as your implementation is complete and validates, there should be no problem using that as backend for the (to be defined) boost.xml.

...

By default, I would provide our own tree, although this is not terribly important.

Can you elaborate ? Each backend library has its own data structure to keep content and associated state. Whatever of that state is made visible through boost.xml needs to be done through a portable and public API. Or do you expect users to access the backend directly ? The two existing implementations that come close to what I think is a good model to follow are the boost.xml sandbox project (which I now moved to https://github.com/stefanseefeld/boost.xml), as well as arabica (www.jezuk.co.uk/arabica). By mapping to a range of implementations (libxml2, xerces, MSXML, etc.) they prove that the API is robust. I suggest you base your critique on those APIs, and indicate what you think isn't working there. Thanks, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Bjorn Reese

8 May 8 May

11:21 a.m.

On 05/06/2013 03:18 PM, Stefan Seefeld wrote:

...

Well, I don't expect any packager to package both. Or perhaps they might, so users have the choice to install, say, `yum install boost-xml-xerces` or `yum install boost-xml-libxml2`. Still, for these two the provided functionality should be mostly the same, while you are advocating a 'boost-xml' package offering a reduced API. I'm not convinced that will solve any real problem.

I am not really advocating anything with regards to packaging; I was responding to your question. Linux packages are a peripheral issue to me because they are usually too old (the latest version in my distribution is Boost 1.49.) So the following is just a brainstorm. A solution could be to have a 'boost-xml' package with the Boost.Xml codebase -- that is, the APIs, the wrappers for libxml2 and Xerces, and whatever implementation we write. This package does not contain any package dependencies on libxml2 or Xerces, because neither may be used by the user. If such dependencies are important (although Boost.Multiprecision seems to live happily without), then I would add a 'boost-xml-libxml2' package without any content except dependencies on the boost-xml and libxml2 packages, and similar for Xerces.

...

Fine, so you insist on writing your own XML implementation. That's obviously up to you, and as long as your implementation is complete and validates, there should be no problem using that as backend for the (to be defined) boost.xml.

I am not really sure why you keep insisting on validation. It may be important for your applications, but not every application needs it. This includes Boost.Serialization and Boost.PropertyTree. W3C clearly acknowledges this. The XML standard explicitly allows both validating and non-validating XML processors (see section 5.1.) Having said that, my alternative proposal did offer a solution for validation, so I do not really see the problem.

...

Can you elaborate ? Each backend library has its own data structure to keep content and associated state. Whatever of that state is made visible through boost.xml needs to be done through a portable and public API. Or do you expect users to access the backend directly ?

I was not arguing against a C++ API for DOM. I was talking about whether or not we should provide our own implementation thereof.

...

The two existing implementations that come close to what I think is a good model to follow are the boost.xml sandbox project (which I now moved to https://github.com/stefanseefeld/boost.xml), as well as arabica (www.jezuk.co.uk/arabica). By mapping to a range of implementations (libxml2, xerces, MSXML, etc.) they prove that the API is robust.

I suggest you base your critique on those APIs, and indicate what you think isn't working there.

Why? I have not made any arguments for or against how the APIs should actually look. I am addressing a different problem.

Stefan Seefeld

12:08 p.m.

On 05/08/2013 07:21 AM, Bjorn Reese wrote:

...

On 05/06/2013 03:18 PM, Stefan Seefeld wrote:

...
Well, I don't expect any packager to package both. Or perhaps they might, so users have the choice to install, say, `yum install boost-xml-xerces` or `yum install boost-xml-libxml2`. Still, for these two the provided functionality should be mostly the same, while you are advocating a 'boost-xml' package offering a reduced API. I'm not convinced that will solve any real problem.

A solution could be to have a 'boost-xml' package with the Boost.Xml codebase -- that is, the APIs, the wrappers for libxml2 and Xerces, and whatever implementation we write.

You are evading the question. A user may not even care how boost.xml is implemented, as long as the functionality is there. If I'm such a user, I don't want to be confronted with the question of what backend to pick.

...

...
Fine, so you insist on writing your own XML implementation. That's obviously up to you, and as long as your implementation is complete and validates, there should be no problem using that as backend for the (to be defined) boost.xml.

I am not really sure why you keep insisting on validation.

Sorry, bad choice of words on my part. By "validates" I was referring to some functional requirements (such as a test suite) which an implementation is measured with for correctness. But since you are talking about it...

...

It may be important for your applications, but not every application needs it. This includes Boost.Serialization and Boost.PropertyTree. W3C clearly acknowledges this. The XML standard explicitly allows both validating and non-validating XML processors (see section 5.1.)

Right. But again, I think you are making life much harder than it needs to be for users. As a user I want to use the boost.xml library in my own project. Do you really anticipate there to be a bunch of different backends being offered to end-users to pick from, depending on what functionality he requires ? What a drag ! Just give him a single library with easy instructions on how to call, link, and execute it. Being forced to look at backends totally defeats the purpose of having an abstraction layer in the first place.

...

...
Can you elaborate ? Each backend library has its own data structure to keep content and associated state. Whatever of that state is made visible through boost.xml needs to be done through a portable and public API. Or do you expect users to access the backend directly ?

I was not arguing against a C++ API for DOM. I was talking about whether or not we should provide our own implementation thereof.

What I'm saying is that, once you impose "our own implementation", you eliminate the majority of existing backends, including libxml2 and xerces, because they have their own. And so, to support such backends, it's best to keep the choice of implementation for that data structure close to the backend, an "implementation detail". Stefan -- ...ich hab' noch einen Koffer in Berlin...

Bjorn Reese

9 May 9 May

10 a.m.

On 05/08/2013 02:08 PM, Stefan Seefeld wrote:

...

You are evading the question. A user may not even care how boost.xml is implemented, as long as the functionality is there. If I'm such a user, I don't want to be confronted with the question of what backend to pick.

Then create a 'boost-xml-standalone' package without dependencies, and let the 'boost-xml' package depend on the 'boost-xml-standalone' and 'libxml2' packages. Problem solved. Let me quote some conventional wisdom about external dependencies from the discussion: "Then (s)he has to install the prerequisite libraries, as is custom with all software that is being used today."

...

Right. But again, I think you are making life much harder than it needs to be for users. As a user I want to use the boost.xml library in my own project. Do you really anticipate there to be a bunch of different backends being offered to end-users to pick from, depending on what functionality he requires ? What a drag ! Just give him a single

I thought that this was part of the GSoC proposal, which states: "Then I’ll define some APIs which boost.xml doesn’t support currently and map them to libxml2. Then I also want to add support for xerces." I have not seen any complaints from your side about adding support for Xerces. Having said that, with the proper defaults, the user does not have to do anything. Only if he wants to do something different does he need to include another header, pass an extra argument, or whatever. This is how the rest of Boost handles variation. Why has this suddenly become much harder?
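One common way to get "proper defaults" of this kind in C++ is a defaulted template parameter. The following is only a sketch under that assumption; `native_backend`, `libxml2_backend`, `basic_document` and `document` are hypothetical names, and the backend types are empty placeholders that do not call into any real library.

```cpp
#include <string>

// Hypothetical backend tags. These are placeholders for illustration and
// do not wrap any real parser.
struct native_backend  { static const char* name() { return "native"; } };
struct libxml2_backend { static const char* name() { return "libxml2"; } };

// Hypothetical document type. The backend defaults to the native one, so
// users who do not care never have to mention a backend.
template <typename Backend = native_backend>
class basic_document {
public:
    std::string backend_name() const { return Backend::name(); }
};

// What the user who "does not have to do anything" would write.
using document = basic_document<>;
```

A user who wants something different passes the extra argument, e.g. `basic_document<libxml2_backend>`, while everyone else just uses `document`.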

...

library with easy instructions on how to call, link, and execute it. Being forced to look at backends totally defeats the purpose of having an abstraction layer in the first place.

I am not sure that I follow you here. Why do the users need to look at backends?

...

What I'm saying is that, once you impose "our own implementation", you eliminate the majority of existing backends, including libxml2 and xerces, because they have their own. And so, to support such backends,

No, you have got it completely backwards. My proposal does not have this limitation. Quite to the contrary, and in the case of libxml2 the integration is even trivial. Please go back and re-read my proposal, and pay attention to the builders. With my proposal we get more flexibility, not less. For instance, we can write a binary XML frontend, and have it generate a libxml2 tree. This allows us to validate or transform binary XML with libxml2 even though it is not supported by libxml2. It also allows us to convert between textual and binary XML quite easily.

Stefan Seefeld

2:26 p.m.

Bjorn, we are going in circles, which is in part because we still are talking past each other. In particular, it seems you aren't distinguishing between users and developers. On 05/09/2013 06:00 AM, Bjorn Reese wrote:

...

On 05/08/2013 02:08 PM, Stefan Seefeld wrote:

...
You are evading the question. A user may not even care how boost.xml is implemented, as long as the functionality is there. If I'm such a user, I don't want to be confronted with the question of what backend to pick.

Then create a 'boost-xml-standalone' package without dependencies, and let the 'boost-xml' package depend on the 'boost-xml-standalone' and 'libxml2' packages. Problem solved.

Sorry, what problem is solved ?

...

...
Right. But again, I think you are making life much harder than it needs to be for users. As a user I want to use the boost.xml library in my own project. Do you really anticipate there to be a bunch of different backends being offered to end-users to pick from, depending on what functionality he requires ? What a drag ! Just give him a single

I thought that this was part of the GSoC proposal, which states:

[...] You are citing out of context. Implementing multiple backends has many benefits for *developers*, for example as it helps to guarantee that the API isn't tied to a particular backend. It should not affect in any way *users*, who will only use the boost.xml API (and library), without any concern for any particular implementation choice.

...

Having said that, with the proper defaults, the user does not have to do anything. Only if he wants to do something different does he need to include another header, pass an extra argument, or whatever. This is how the rest of Boost handles variation. Why has this suddenly become much harder?

It hasn't, and when expressed that way, I actually agree. What I don't agree with is this:

...

Start with an XML lexer. This simply returns the next token (start tag, attribute, data, etc.) when called.

Put the XML lexer in a loop, and you get a SAX parser.

Pair the XML lexer with a parent stack, and you get an XmlReader.

Base the DOM parser on the SAX parser to create its tree. This is how libxml2 does it, and how it reuses the tree generator for parsing other formats such as HTML and DocBook.

By default, I would provide our own tree, although this is not terribly important.

While the layering you describe pretty much matches a typical implementation, this doesn't have any consequences for users, as these layers can't be exchanged. You can't mix a layer from one backend and combine it with another layer from a different backend. So why care, on an API level ? I believe your point was that you want to be able to implement only the "XML lexer", but neither the SAX nor DOM APIs, and still be able to call the result "boost.xml", yes ? I still think this is a bad idea. Otherwise, as long as the full functionality is provided, I don't care about the implementation, and in particular, whether someone will fancy to rewrite it "natively" instead of building on top of existing third-party libs. Stefan -- ...ich hab' noch einen Koffer in Berlin...

Roger Martin

11:24 p.m.

This is a fun topic. How should c++ play 'catchup' to other languages on xml handling? What applications will develop from such an XML API? Xml editors and xml creators/modifiers? Data flow and communications between apps, web services? What can be leveraged in c++ to do something new/faster with xml? If there was a way to dynamically load a shared library (compiled at run time) at run time, then some pretty nifty things could be achieved with metaprogramming and expression templates.

I'm not sure there are any strong backend candidates to provide satisfaction to c++ developers and users at this time, but there have to be needs besides mine. Xerces is poor at large xml documents. As far as DOM goes, is rearranging xml elements/attributes being pursued? http://xalan.apache.org/ is xslt 1.0, and after 2.0 no one wants to go back to 1.0.

Binding is an important area for me. xmlbeanscxx, which is based on Xerces, couldn't satisfy my needs for binding data into my applications (because the underlying DOM wasn't helpful in the task of binding). Xml schema constraints are a must for binding. The http://sourceforge.net/projects/pion/ project could really use a binder inside its RESTful web service. In other languages compact http://relaxng.org/ is getting addressed too. I just saw http://code.google.com/p/xplus-xsd2cpp/ recently and have yet to test it. (If you do try it, do so outside of any of your own code and in its own folder.)

To give examples, I use cml, mathml, graphml, svg, bibtexml and a number of custom xml formats. Each of these has its quirks and is difficult to bind.

Haven't tried http://vtd-xml.sourceforge.net/ for a while because its license doesn't work for my company. With custom code I've been doing something similar for simply reading data from xml documents.

On 05/09/2013 10:26 AM, Stefan Seefeld wrote:

...

Bjorn,

we are going in circles, which is in part because we still are talking past each other.

In particular, it seems you aren't distinguishing between users and developers.

On 05/09/2013 06:00 AM, Bjorn Reese wrote:

...
On 05/08/2013 02:08 PM, Stefan Seefeld wrote:

...
You are evading the question. A user may not even care how boost.xml is implemented, as long as the functionality is there. If I'm such a user, I don't want to be confronted with the question of what backend to pick. Then create a 'boost-xml-standalone' package without dependencies, and let the 'boost-xml' package depend on the 'boost-xml-standalone' and 'libxml2' packages. Problem solved. Sorry, what problem is solved ?

...
...
Right. But again, I think you are making life much harder than it needs to be for users. As a user I want to use the boost.xml library in my own project. Do you really anticipate there to be a bunch of different backends being offered to end-users to pick from, depending on what functionality he requires ? What a drag ! Just give him a single I thought that this was part of the GSoC proposal, which states: [...]

You are citing out of context. Implementing multiple backends has many benefits for *developers*, for example as it helps to guarantee that the API isn't tied to a particular backend. It should not affect in any way *users*, who will only use the boost.xml API (and library), without any concern for any particular implementation choice.

...
Having said that, with the proper defaults, the user does not have to do anything. Only if he wants to do something different does he need to include another header, pass an extra argument, or whatever. This is how the rest of Boost handles variation. Why has this suddenly become much harder? It hasn't, and when expressed that way, I actually agree. What I don't agree with is this:

...
Start with an XML lexer. This simply returns the next token (start tag, attribute, data, etc.) when called.

Put the XML lexer in a loop, and you get a SAX parser.

Pair the XML lexer with a parent stack, and you get an XmlReader.

Base the DOM parser on the SAX parser to create its tree. This is how libxml2 does it, and how it reuses the tree generator for parsing other formats such as HTML and DocBook.

By default, I would provide our own tree, although this is not terribly important. While the layering you describe pretty much matches a typical implementation, this doesn't have any consequences for users, as these layers can't be exchanged. You can't mix a layer from one backend and combine it with another layer from a different backend. So why care, on an API level ?

I believe your point was that you want to be able to implement only the "XML lexer", but neither the SAX nor DOM APIs, and still be able to call the result "boost.xml", yes ? I still think this is a bad idea. Otherwise, as long as the full functionality is provided, I don't care about the implementation, and in particular, whether someone will fancy to rewrite it "natively" instead of building on top of existing third-party libs.

Stefan

legalize+jeeves＠mail.xmission.com

10 May 10 May

12:59 a.m.

[Please do not mail me a copy of your followup] boost@lists.boost.org spake the secret code <518C302F.6000707@quantumbioinc.com> thusly:

...

What applications will develop from such an XML API?

One area where I can tell you that we had difficulty with C++ was attempting to operate on COLLADA documents. <http://en.wikipedia.org/wiki/COLLADA> The documents are very, very large and achieving efficient import/export for large 3D models was difficult. We switched between several different XML libraries, first starting with free ones and then switching to commercial libraries. In the end, we made it work reasonably well, but it was a very painful experience, and the memory usage and runtime performance of several libraries were poor. -- "The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline> The Computer Graphics Museum <http://computergraphicsmuseum.org> The Terminals Wiki <http://terminals.classiccmp.org> Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>

Bjorn Reese

12 May 12 May

11:35 a.m.

On 05/10/2013 01:24 AM, Roger Martin wrote:

...

This is a fun topic. How should c++ play 'catchup' to other languages on xml handling.

I agree that the use cases and requirements could be more clear.

...

What applications will develop from such an XML API? Xml editors and xml creators/modifiers? Data flow and communications between apps, web services? What can be leveraged in c++ to do something new/faster with xml? If

I believe Stefan's ultimate goal is the C++ standardization of XML APIs. This is similar to how std::thread provides a threading API that can be implemented with, say, pthreads (ok, maybe threading is a bad example because it did require substantial changes to the C++ standard). You can think of Boost.Xml as providing the primitives needed to parse (and generate) an XML document. The more advanced features such as XML Schema validation, XSLT transforms, and XQuery are currently out of scope, but they should be easier to build on top of the primitives.

...

Binding is an important area for me. xmlbeanscxx which is based on

It seems to me that there are three levels of bindings:

1. Schema-less binding where the structure of the XML document is handcoded into the application. This can be handled by Boost.Serialization.
2. Compile-time schema binding where a schema is used to generate the binding code. The xmlbeanscxx and xplus-xsd2cpp projects that you refer to are examples of this kind of binding.
3. Run-time schema binding where new schemas can be loaded on-the-fly. Something like Microsoft InfoPath springs to mind.

...

Haven't tried http://vtd-xml.sourceforge.net/ for a while because its license doesn't work for my company. With custom code I've been doing something similar for simply reading data from xml documents.

With the right parsing primitive (an XML lexer), non-extracting parsing would be quite simple.
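For readers unfamiliar with the term: "non-extracting" means tokens are recorded as offset/length pairs into the original buffer instead of being copied into separate strings. Here is a minimal sketch of that idea, under the assumption of a tiny XML subset without comments or CDATA sections; `span`, `index_start_tags` and `materialize` are hypothetical names, not part of VTD-XML or of any proposed Boost.Xml API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token record: where a token lives in the source buffer.
// No character data is copied while indexing.
struct span {
    std::size_t offset;
    std::size_t length;
};

// Records the element name of every start tag as an (offset, length) pair.
// Assumes well-formed input in which '<' only ever begins a tag.
inline std::vector<span> index_start_tags(const std::string& xml) {
    std::vector<span> out;
    for (std::size_t i = 0; i < xml.size(); ++i) {
        if (xml[i] != '<') continue;
        std::size_t begin = i + 1;
        // Skip end tags, declarations and processing instructions.
        if (begin >= xml.size() || xml[begin] == '/' || xml[begin] == '!' || xml[begin] == '?')
            continue;
        std::size_t end = xml.find_first_of(" \t\r\n/>", begin);
        if (end == std::string::npos) end = xml.size();
        out.push_back({begin, end - begin});
    }
    return out;
}

// Copies a token out of the buffer only when the caller actually needs it.
inline std::string materialize(const std::string& xml, span s) {
    return xml.substr(s.offset, s.length);
}
```

The index stays valid only as long as the source buffer is kept alive and unchanged, which is the usual trade-off of this technique.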

Bjorn Reese

11 a.m.

On 05/09/2013 04:26 PM, Stefan Seefeld wrote:

...

In particular, it seems you aren't distinguishing between users and developers.

In our case there is overlap between them. Some users do not care about the implementation details. Other users want to use most of the library, but want to change certain aspects, e.g. adding their own binary XML frontend, or providing their own DOM. Having said that, I believe I have satisfied all the "pure" user requirements that you have mentioned so far.

...

...
Then create a 'boost-xml-standalone' package without dependencies, and let the 'boost-xml' package depend on the 'boost-xml-standalone' and 'libxml2' packages. Problem solved.

Sorry, what problem is solved ?

The problem you mentioned: "A user may not even care how boost.xml is implemented, as long as the functionality is there. If I'm such a user, I don't want to be confronted with the question of what backend to pick."

...

You are citing out of context. Implementing multiple backends has many benefits for *developers*, for example as it helps to guarantee that the API isn't tied to a particular backend. It should not affect in any way *users*, who will only use the boost.xml API (and library), without any concern for any particular implementation choice.

So if we want to offer both libxml2 and Xerces wrappers, this would result in two separate Boost libraries?

...

While the layering you describe pretty much matches a typical implementation, this doesn't have any consequences for users, as these layers can't be exchanged. You can't mix a layer from one backend and combine it with another layer from a different backend. So why care, on an API level ?

I was demonstrating how easy it would be to provide our own implementation of the well-established APIs (or their C++ equivalents.) It is possible to mix SAX and DOM from different backends, as I have indicated for libxml2, which is the only kind of mixing that I have argued for.

...

I believe your point was that you want to be able to implement only the "XML lexer", but neither the SAX nor DOM APIs, and still be able to call the result "boost.xml", yes ?

No, the XML lexer is more an implementation detail. My proposal includes the whole range: XML lexer, SAX, XmlReader, and DOM. I have never argued that Boost.Xml should contain anything less. XmlReader will be ideal for Boost.Serialization. A clean C++ separation (builder pattern) between SAX and DOM would give us the flexibility to plug in different DOMs. This will be ideal for Boost.PropertyTree, which has its own "DOM". The separation will also allow us to plug in different SAXs, such as binary XML parsers. Come to think of it, the XML lexer could be useful for indexing (see Roger Martin's reference to VTD-XML).

legalize+jeeves＠mail.xmission.com

6 May 6 May

9:37 p.m.

[Please do not mail me a copy of your followup] boost@lists.boost.org spake the secret code <51877B4B.5040900@mail1.stofanet.dk> thusly:

...

Start with an XML lexer. This simply returns the next token (start tag, attribute, data, etc.) when called.

Boost.Spirit has a "mini xml" parser example that may be a useful starting point. -- "The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline> The Computer Graphics Museum <http://computergraphicsmuseum.org> The Terminals Wiki <http://terminals.classiccmp.org> Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>

Simon Siemens

9 May 9 May

9:30 a.m.

Am Sonntag, den 05.05.2013, 11:53 +0200 schrieb Bjorn Reese:

...

On 05/01/2013 05:15 PM, Stefan Seefeld wrote:

...
As I have argued many times before on this list, I think it would be foolish to try to reimplement all the functionality to support XML.

I went back to read those discussions, and I think that there are good points on both sides. In my experience, most people only need a simple XML parser without all the extra features of XML Schema, XSLT, etc. Others also need fairly simple extensions such as XPath, and finally some need the full monty. It all depends on their use cases.

...
I agree with others that in the context of boost this should be about defining a good XML API, and then map that to existing libraries. In

Here is an alternative suggestion. Parsing the XML syntax is fairly simple, so we could provide a basic XML parser for those with simple needs. This XML parser should be based on the Builder design pattern (somewhat reminiscent of a SAX interface). The default builder will create our own tree/DOM, with which you can do nothing but simple tree manipulation. In case we need more complex features, such as XML Schema, we could provide a builder for, say, libxml2, which generates libxml2 trees that can be used, via the XML APIs you envision, to invoke the libxml2 XML Schema validation.

This approach allows us to have a simple XML parser for projects like Boost.Serialization and Boost.PropertyTree without the need for an external dependency. It also allows us to do more advanced stuff with external dependencies.
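The builder idea can be sketched roughly as follows. This is only an illustration under assumptions: `Builder`, `Node`, `TreeBuilder`, `ElementCounter`, and `emit_sample_events` are hypothetical names, and `emit_sample_events` stands in for a real parser by emitting the events for `<config version="1"><name>boost</name></config>` by hand. A libxml2-backed builder would implement the same interface but is not shown here.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical builder interface: the parser reports events,
// the builder decides what (if anything) to construct from them.
struct Builder {
    virtual ~Builder() = default;
    virtual void start_element(const std::string& name) = 0;
    virtual void attribute(const std::string& name, const std::string& value) = 0;
    virtual void text(const std::string& data) = 0;
    virtual void end_element(const std::string& name) = 0;
};

// Very small tree ("DOM") node used by the default builder.
struct Node {
    std::string name;
    std::string text;
    std::vector<std::pair<std::string, std::string>> attributes;
    std::vector<std::unique_ptr<Node>> children;
};

// Default builder: constructs a simple in-memory tree.
class TreeBuilder : public Builder {
public:
    TreeBuilder() : root_(new Node) { stack_.push_back(root_.get()); }

    void start_element(const std::string& name) override {
        std::unique_ptr<Node> child(new Node);
        child->name = name;
        Node* raw = child.get();
        stack_.back()->children.push_back(std::move(child));
        stack_.push_back(raw);  // new element becomes the current node
    }
    void attribute(const std::string& name, const std::string& value) override {
        stack_.back()->attributes.emplace_back(name, value);
    }
    void text(const std::string& data) override { stack_.back()->text += data; }
    void end_element(const std::string&) override {
        if (stack_.size() > 1) stack_.pop_back();  // never pop the document node
    }

    // Synthetic document node; its children are the top-level elements.
    const Node& document() const { return *root_; }

private:
    std::unique_ptr<Node> root_;
    std::vector<Node*> stack_;
};

// Alternative builder: builds no tree at all, only counts elements.
struct ElementCounter : Builder {
    std::size_t count = 0;
    void start_element(const std::string&) override { ++count; }
    void attribute(const std::string&, const std::string&) override {}
    void text(const std::string&) override {}
    void end_element(const std::string&) override {}
};

// Stand-in for a parser (assumption): emits a fixed event sequence.
void emit_sample_events(Builder& b) {
    b.start_element("config");
    b.attribute("version", "1");
    b.start_element("name");
    b.text("boost");
    b.end_element("name");
    b.end_element("config");
}
```

The same event source drives both builders unchanged, which is the property that would let a dependency-free default tree and an external-library tree coexist.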

I would prefer seeing it from the following point of view. I separate the GSOC project from what the library can be in the *best* case.

Mingchao is doing a GSOC project. That project must find a reasonable result in a limited time. To succeed is important for him. The best way to reach this goal is to prioritize requirements on the project result. For me, the prioritized requirements are

1. Define various interfaces (StAX style, SAX style, DOM style) in C++ standard library style. This is difficult but most important.
2. Implement the library with one backend. Only with this step can other people really test the library and comment on the interface. Thus steps 1 and 2 cannot be separated. Rather, Mingchao can start with one interface and do steps 1 and 2. Then he proceeds with the next interface, performing steps 1 and 2, ...
3. Add more backends.

When Mingchao has a good interface and a good implementation with one backend for one, two or all three interface styles, he has a reasonable project result. If he has some time left, he can implement some more backends. Thus the question of implementing a native Boost backend is simply a matter of how far he gets down the prioritized requirements list.

Because he already designs a backend component in the architecture, anyone else could also implement the native Boost backend. This feature is independent of the GSOC project. It is just important to have a generic backend interface.

Cheers
Simon

Bjorn Reese

10:03 a.m.

On 05/09/2013 11:30 AM, Simon Siemens wrote:

...

Mingchao is doing a GSOC project. That project must find a reasonable result in a limited time. To succeed is important for him. The best way to reach this goal is to prioritize requirements on the project result.

I agree. Most of the discussion between Stefan and me is really about a potential Boost.Xml library, more than it is about the GSoC proposal.

Frédéric Bron

1 May 1 May

7:37 p.m.

...

In my opinion, an XML parser must be able to do these things: 1. To Iterate over DOM nodes tree; 2. To access the values of nodes and their attributes quickly; 3. To insert or delete nodes or attribute of an exact node easily; 4. To generate new XML from the structure which stores XML in the library.

Your work would clearly be useful. What I miss in the current implementation in property_tree is the ability to access the line number and column number of nodes and attributes, to be able to display useful error messages that are not related to XML but to the values stored in the attributes.

Frédéric
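One way a parser could support this, sketched under assumptions: it records the byte offset of each node or attribute while parsing and converts offsets to line/column on demand. `SourcePosition` and `position_at` are hypothetical names, not property_tree APIs, and columns here count bytes rather than Unicode code points.

```cpp
#include <cstddef>
#include <string>

// 1-based line and column, as usually shown in error messages.
struct SourcePosition {
    std::size_t line;
    std::size_t column;
};

// Converts a byte offset into `text` to a line/column pair by counting
// the newlines that precede it. Offsets past the end are clamped.
SourcePosition position_at(const std::string& text, std::size_t offset) {
    SourcePosition pos{1, 1};
    for (std::size_t i = 0; i < offset && i < text.size(); ++i) {
        if (text[i] == '\n') {
            ++pos.line;
            pos.column = 1;
        } else {
            ++pos.column;
        }
    }
    return pos;
}
```

An application that rejects the value of an attribute could then report where that attribute appeared in the file, even though the XML itself was well-formed.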

