Feature request for boost::filesystem

Delfin Rojas

22 Jun 2004

6:24 p.m.

Hello everybody,

I'm new to this mailing list. The first thing I should say is that I love Boost; I think it is a great library and I'm really thankful to all its contributors. That being said, I'm sending this email to ask about a feature in the boost::filesystem library that I believe would be valuable to have.

Currently the only library that supports file and directory operations in Boost is boost::filesystem. Thanks to that library I don't have to use platform-specific file manipulation routines for things like copying a file, deleting all files in a directory, iterating through a directory, etc. All the operations in boost::filesystem take a boost::filesystem::path, an object encapsulating a platform-independent "path" to a file or a directory. My problem is that this path can only be built from a single-byte char (ANSI) character string. This prevents me from using boost::filesystem on Windows with Unicode support, since I cannot convert a wide string (UTF-16) path to a boost::filesystem::path object. I believe it should not be too difficult to add Unicode path support to boost::filesystem, since I think only the "path" class would need to be modified.

Another small feature I think could be interesting is a convenience function to set the current path (cd operation). There is already a function to get the current path but none to set it.

I do not know if this is the right group to ask for these features. If not, please point me in the right direction.

Thanks a lot

Delfin Rojas Delfin@moodlogic.com


Keith MacDonald

25 Jun

4:11 p.m.

I decided not to use boost::filesystem, because it does not support Unicode. There's a thread in the archives about it, explaining that Unicode was ignored, because it was specific to Windows, and this is intended to be a portable library. However, I think the developers have missed the point that the Windows file system uses Unicode natively, so boost::filesystem is not really portable to it. A more useful solution, in my opinion, would be one that allowed the user to choose which char type to use, like boost::regex. Keith MacDonald
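A rough sketch of what a character-type parameter could look like, in the spirit of how std::basic_string is parameterised. The names basic_path, path and wpath are hypothetical here and are not the boost::filesystem interface; the class only stores and joins strings and never touches the file system.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: a path type parameterised on the character type.
template <class CharT>
class basic_path {
public:
    typedef std::basic_string<CharT> string_type;

    basic_path() {}
    explicit basic_path(const string_type& s) : str_(s) {}

    // Append one element, inserting a '/' separator when needed.
    basic_path& operator/=(const string_type& element) {
        if (!str_.empty() && str_[str_.size() - 1] != CharT('/'))
            str_ += CharT('/');
        str_ += element;
        return *this;
    }

    const string_type& string() const { return str_; }

private:
    string_type str_;
};

typedef basic_path<char>    path;   // narrow flavour
typedef basic_path<wchar_t> wpath;  // wide flavour

// Usage: the same code shape works for either character type.
inline void usage_example() {
    path p("docs");
    p /= "readme.txt";
    assert(p.string() == "docs/readme.txt");

    wpath w(L"docs");
    w /= L"r\u00e9sum\u00e9.txt";
    assert(w.string() == L"docs/r\u00e9sum\u00e9.txt");
}
```

With this shape the application picks path or wpath once, at build time, which is the kind of switch asked for above.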


Duane Murphy

5:23 p.m.

--- At Fri, 25 Jun 2004 17:11:12 +0100, Keith MacDonald wrote:

...

I decided not to use boost::filesystem, because it does not support Unicode. There's a thread in the archives about it, explaining that Unicode was ignored, because it was specific to Windows, and this is intended to be a portable library. However, I think the developers have missed the point that the Windows file system uses Unicode natively, so boost::filesystem is not really portable to it. A more useful solution, in my opinion, would be one that allowed the user to choose which char type to use, like boost::regex.

I thought I would point out that Mac OS X also has a Unicode file system. Naturally the future of all file systems will be toward Unicode. Unicode support ought to be addressed. ...Duane

Delfin Rojas

29 Jun

8:56 p.m.

New subject: Possible is_shared_ptr type_traits function

Hi everybody, I am writing a template class that processes several types. However my problem is that when the type of the template is boost::shared_ptr (and only in that case) I need to do a boost::dynamic_pointer_cast inside a method. So, I searched boost::type_traits library but could not find anything to detect a shared_ptr type at compile time. I would appreciate any ideas of how this could be implemented. Thanks -delfin

John Maddock

28 Jun

1:24 p.m.

New subject: Possible is_shared_ptr type_traits function

...

I am writing a template class that processes several types. However my problem is that when the type of the template is boost::shared_ptr (and only in that case) I need to do a boost::dynamic_pointer_cast inside a method.

So, I searched boost::type_traits library but could not find anything to detect a shared_ptr type at compile time. I would appreciate any ideas of how this could be implemented.

Easy: not a traits class, but a function overload:

    template <class To, class From>
    To* my_cast(From* p) { return dynamic_cast<To*>(p); }

    template <class To, class From>
    shared_ptr<To> my_cast(shared_ptr<From> p) { return dynamic_pointer_cast<To>(p); }

Then call my_cast<Target>(p); rather than dynamic_cast<Target*>(p);

John.
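A self-contained sketch of the overload approach, for anyone who wants to try it. It assumes std::shared_ptr and std::dynamic_pointer_cast (C++11) in place of the Boost equivalents, and the Base and Derived types are made up for the example.

```cpp
#include <cassert>
#include <memory>

// Hypothetical types, only here to exercise the casts.
struct Base { virtual ~Base() {} };
struct Derived : Base {
    int value;
    Derived() : value(42) {}
};

// Raw pointers: forward to dynamic_cast.
template <class To, class From>
To* my_cast(From* p) { return dynamic_cast<To*>(p); }

// Shared pointers: forward to dynamic_pointer_cast.
template <class To, class From>
std::shared_ptr<To> my_cast(const std::shared_ptr<From>& p) {
    return std::dynamic_pointer_cast<To>(p);
}

// Usage: the call site looks the same for both pointer kinds.
inline void usage_example() {
    Derived d;
    Base* raw = &d;
    assert(my_cast<Derived>(raw)->value == 42);

    std::shared_ptr<Base> shared = std::make_shared<Derived>();
    assert(my_cast<Derived>(shared)->value == 42);
}
```

Overload resolution picks the right version from the argument type, so the calling template does not need to know which kind of pointer it holds.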

Jonathan Turkanis

29 Jun

9:36 p.m.

New subject: Possible is_shared_ptr type_traits function

"Delfin Rojas" <drojas@moodlogic.com> wrote in message news:200406292056.i5TKuKjj027844@patti.moodlogic.com...

...

Hi everybody,

I am writing a template class that processes several types. However my problem is that when the type of the template is boost::shared_ptr (and only in that case) I need to do a boost::dynamic_pointer_cast inside a method.

So, I searched boost::type_traits library but could not find anything to detect a shared_ptr type at compile time. I would appreciate any ideas of how this could be implemented.

Here's one way:

    template<typename T>
    struct is_shared_ptr : mpl::false_ { };

    template<typename P>
    struct is_shared_ptr< shared_ptr<P> > : mpl::true_ { };

You can get approximately the same effect without partial specialization, if necessary. Jonathan

David Abrahams

30 Jun

12:23 a.m.

New subject: Possible is_shared_ptr type_traits function

"Delfin Rojas" <drojas@moodlogic.com> writes:

...

Hi everybody,

I am writing a template class that processes several types. However my problem is that when the type of the template is boost::shared_ptr (and only in that case) I need to do a boost::dynamic_pointer_cast inside a method.

So, I searched boost::type_traits library but could not find anything to detect a shared_ptr type at compile time. I would appreciate any ideas of how this could be implemented.

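    #include <boost/mpl/bool.hpp>
    #include <boost/shared_ptr.hpp>

    // Primary template: any type that is not a shared_ptr.
    template <class T>
    struct is_shared_ptr : boost::mpl::false_ {};

    // Partial specialization: matches boost::shared_ptr<T> for any T.
    template <class T>
    struct is_shared_ptr<boost::shared_ptr<T> > : boost::mpl::true_ {};

-- Dave Abrahams Boost Consulting http://www.boost-consulting.com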
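One possible way to put such a trait to work is tag dispatch: the trait selects an overload, so the dynamic_pointer_cast is only compiled for shared_ptr types. This sketch assumes the standard-library equivalents (std::shared_ptr, std::true_type, std::false_type) rather than Boost, and holds_derived, Base and Derived are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

template <class T>
struct is_shared_ptr : std::false_type {};
template <class T>
struct is_shared_ptr<std::shared_ptr<T> > : std::true_type {};

// Hypothetical types, only here to exercise the trait.
struct Base { virtual ~Base() {} };
struct Derived : Base {};

// Chosen when T is a shared_ptr: the cast is valid here.
template <class T>
bool holds_derived(const T& p, std::true_type) {
    return std::dynamic_pointer_cast<Derived>(p) != nullptr;
}

// Chosen for every other type: no cast is attempted.
template <class T>
bool holds_derived(const T&, std::false_type) { return false; }

// Entry point: dispatches on the trait at compile time.
template <class T>
bool holds_derived(const T& x) {
    return holds_derived(x, typename is_shared_ptr<T>::type());
}

inline void usage_example() {
    std::shared_ptr<Base> p = std::make_shared<Derived>();
    assert(holds_derived(p));
    assert(!holds_derived(5));
}
```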

Jeff Wang

7 Jul

8:08 p.m.

New subject: shared_ptr Rationale

Hello All, I have some old code like the following:

    struct A { int a; double X; };
    struct B { int b; double Y; A* pA; };

May I use boost::shared_ptr to define a smart pointer to struct B, like boost::shared_ptr<B> sB(new B)? Thanks Jeff

Vladimir Prus

30 Jun

7:44 a.m.

Keith MacDonald wrote:

...

I decided not to use boost::filesystem, because it does not support Unicode. There's a thread in the archives about it, explaining that Unicode was ignored, because it was specific to Windows, and this is intended to be a portable library. However, I think the developers have missed the point that the Windows file system uses Unicode natively, so boost::filesystem is not really portable to it. A more useful solution, in my opinion, would be one that allowed the user to choose which char type to use, like boost::regex.

Keith MacDonald

2:04 p.m.

...

And what would a POSIX system do with basic_path<wchar_t>?

As it's not possible to write an application that's binary portable between Windows and POSIX systems, does it matter? When I build my code, I know what it's targeted at, so could set the template parameter appropriately. That's all I require. Keith MacDonald "Vladimir Prus" <ghost@cs.msu.su> wrote in message news:cbtr09$co8$1@sea.gmane.org...

...

Keith MacDonald wrote:

...
I decided not to use boost::filesystem, because it does not support Unicode. There's a thread in the archives about it, explaining that Unicode was ignored, because it was specific to Windows, and this is intended to be a portable library. However, I think the developers have missed the point that the Windows file system uses Unicode natively, so boost::filesystem is not really portable to it. A more useful solution, in my opinion, would be one that allowed the user to choose which char type to use, like boost::regex.

And what would a POSIX system do with basic_path<wchar_t>? I don't know exactly what Beman thinks on the matter, but I'd prefer some single class which works everywhere. Maybe you could summarize all the use cases you need and post them to the developers' list?

- Volodya

Vladimir Prus

1 Jul

6:20 a.m.

Keith MacDonald wrote:

...

...
And what would a POSIX system do with basic_path<wchar_t>?

As it's not possible to write an application that's binary portable between Windows and POSIX systems, does it matter? When I build my code, I know what it's targeted at, so could set the template parameter appropriately. That's all I require.

But if I write code which deals with paths, I don't want it all to be templated. Really. So I want a boost::filesystem::path which can support *both* ascii and unicode. That's why I was interested in specific use cases -- could you still provide them? For example, you surely want to create boost::path from a Unicode string. Do you want to create them from an ascii string? What encoding should that ascii string be in? (E.g. you get the "base" path in unicode and then get a relative path from some non-unicode-aware source.) Would you want to convert a unicode path back to ascii? And so on... - Volodya

Peter Dimov

11:34 a.m.

Vladimir Prus wrote:

...

For example, you surely want to create boost::path from Unicode string. Do you want to create them from ascii string?

Yes.

...

What encoding that ascii string should be in?

Implementation (OS) defined (by 'ASCII string' you mean narrow string, I presume, since ASCII _is_ an encoding).

...

Would you want to convert unicode path back to ascii?

There is no such thing as an "Unicode path". A path is a path. A path constructed from a wide string should be able to give you a narrow string, but whether this narrow string will construct the same path is implementation defined.

Vladimir Prus

12:34 p.m.

Peter Dimov wrote:

...

Vladimir Prus wrote:

...
For example, you surely want to create boost::path from Unicode string. Do you want to create them from ascii string?

Yes.

Ok.

...

...
What encoding that ascii string should be in?

Implementation (OS) defined (by 'ASCII string' you mean narrow string, I presume, since ASCII _is_ an encoding).

(Yes, I meant narrow string). That's a kind of answer I hoped for. This means that just basic_path<wchar_t> is not so good, since it won't have conversion from char*.

...

...
Would you want to convert unicode path back to ascii?

There is no such thing as an "Unicode path". A path is a path.

Unless there are two classes, basic_path<char> and basic_path<wchar_t>, of course, in which case they are 8-bit path and unicode path.

...

A path constructed from a wide string should be able to give you a narrow string, but whether this narrow string will construct the same path is implementation defined.

Do you think it should be able to *always* give you a narrow string? Even if OS-defined 8-bit encoding cannot represent some of the characters in the original wide string. Won't throwing be more appropriate? - Volodya
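The throwing alternative raised here can be sketched as follows. This is only an illustration under a loud assumption: the OS-defined 8-bit encoding is taken to be plain ASCII, which is rarely the whole truth. The function name to_narrow_or_throw is hypothetical.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Sketch only: pretend the OS-defined 8-bit encoding is plain ASCII.
// Any wide character outside that range cannot be represented, so we throw
// instead of silently producing a different path.
inline std::string to_narrow_or_throw(const std::wstring& wide) {
    std::string narrow;
    narrow.reserve(wide.size());
    for (std::wstring::size_type i = 0; i < wide.size(); ++i) {
        const unsigned long c = static_cast<unsigned long>(wide[i]);
        if (c > 0x7F)
            throw std::range_error("path has no narrow representation");
        narrow += static_cast<char>(c);
    }
    return narrow;
}

inline void usage_example() {
    assert(to_narrow_or_throw(L"C:/temp") == "C:/temp");

    bool threw = false;
    try {
        to_narrow_or_throw(L"\u65e5\u672c");  // two CJK characters
    } catch (const std::range_error&) {
        threw = true;
    }
    assert(threw);
}
```

The trade-off is the one under discussion: a conversion that always succeeds may yield a narrow string naming a different file, while a throwing one forces the caller to handle the failure.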

Delfin Rojas

30 Jun

6:26 p.m.

Well, my point is precisely that this class does _not_ work everywhere. In an OS with a Unicode file system I cannot use the boost::filesystem library to represent a path to a Unicode directory. If the characters in the name of the directory are in the current locale then it is possible to use the name, but if I am using an English locale and the name of the directory is in Japanese then I cannot point boost::filesystem to it using an ANSI string.

However this is not the main problem. The main problem arises with iteration. There is even a point in the boost::filesystem documentation that says an open issue with the library is what happens when a Unicode file name is found during directory iteration. As you can understand, if I cannot retrieve the names of all the files in the directory when the directory is iterated then the library is not working for me. If the library attempted to convert the name to ANSI and I were not running the right locale then the conversion would produce garbage.

So what I ask is a library that works everywhere, just like you say.

I understand your point about POSIX file systems but since the library is compiled for Windows _or_ for POSIX systems I think it would be possible to compile for single char strings or double byte strings (UTF16). Windows systems solve this problem with the concept of TCHAR, a type that is defined as a char or wchar_t depending on a preprocessor definition. Then, boost::filesystem::path could accept std::basic_string<TCHAR> instead of std::basic_string<char>. That would solve the problem and everybody would be happy ;)

Thanks -delfin

-----Original Message-----
From: boost-users-bounces@lists.boost.org [mailto:boost-users-bounces@lists.boost.org] On Behalf Of Vladimir Prus
Sent: Wednesday, June 30, 2004 12:44 AM
To: boost-users@lists.boost.org
Subject: [Boost-users] Re: Feature request for boost::filesystem

Keith MacDonald wrote:

...

I decided not to use boost::filesystem, because it does not support Unicode. There's a thread in the archives about it, explaining that Unicode was ignored, because it was specific to Windows, and this is intended to be a portable library. However, I think the developers have missed the point that the Windows file system uses Unicode natively, so boost::filesystem is not really portable to it. A more useful solution, in my opinion, would be one that allowed the user to choose which char type to use, like boost::regex.

And what would a POSIX system do with basic_path<wchar_t>? I don't know exactly what Beman thinks on the matter, but I'd prefer some single class which works everywhere. Maybe you could summarize all the use cases you need and post them to the developers' list? - Volodya
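The compile-time switch described in this message can be sketched roughly as follows. The macro SKETCH_UNICODE, the literal helper SKETCH_T, and the names tchar, tstring and join are all stand-ins invented for the example; on Windows the real counterparts are TCHAR and the TEXT() macro from the platform headers.

```cpp
#include <cassert>
#include <string>

// Hypothetical names, modelled on the TCHAR idea: one preprocessor
// definition selects the character type for the whole build.
#ifdef SKETCH_UNICODE
typedef wchar_t tchar;
#define SKETCH_T(s) L##s
#else
typedef char tchar;
#define SKETCH_T(s) s
#endif

typedef std::basic_string<tchar> tstring;

// Code written against tstring compiles unchanged in either mode.
inline tstring join(const tstring& dir, const tstring& leaf) {
    return dir + SKETCH_T("/") + leaf;
}

inline void usage_example() {
    assert(join(SKETCH_T("docs"), SKETCH_T("a.txt")) == SKETCH_T("docs/a.txt"));
}
```

The limitation raised elsewhere in the thread still applies: a library compiled one way cannot be linked into an application compiled the other way without conversions at the boundary.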

Jeff Garland

1 Jul

2:21 a.m.

On Wed, 30 Jun 2004 11:26:14 -0700, Delfin Rojas wrote

...

... snip ... preprocessor definition. Then, boost::filesystem::path could accept std::basic_string<TCHAR> instead of std::basic_string<char>. That would solve the problem and everybody would be happy ;)

I'm guessing it's a bit more complicated than that. In February Beman (author of filesystem) was working on a proposal for this which was to be discussed at the Sydney meeting. See http://lists.boost.org/MailArchives/boost/msg60389.php You can find a copy of it here: http://anubis.dkuug.dk/jtc1/sc22/wg21/docs/papers/2004/n1576.html Apparently he didn't finish in time because he left internationalization as an open issue (see the bottom of the previous link). I'm not certain what's happened to Beman -- he hasn't posted lately -- hopefully he or someone else can update the state of the work. Jeff

Edward Diener

3:04 a.m.

Jeff Garland wrote:

...

On Wed, 30 Jun 2004 11:26:14 -0700, Delfin Rojas wrote

...
... snip ... preprocessor definition. Then, boost::filesystem::path could accept std::basic_string<TCHAR> instead of std::basic_string<char>. That would solve the problem and everybody would be happy ;)

I'm guessing it's a bit more complicated than that. In February Beman (author of filesystem) was working on a proposal for this which was to be discussed at the Sydney meeting. See

http://lists.boost.org/MailArchives/boost/msg60389.php

You can find a copy of it here:

http://anubis.dkuug.dk/jtc1/sc22/wg21/docs/papers/2004/n1576.html

Apparently he didn't finish in time because he left internationalization as an open issue (see the bottom of the previous link). I'm not certain what's happened to Beman -- he hasn't posted lately -- hopefully he or someone else can update the state of the work.

I went through long arguments on comp.std.c++ trying to push for wide character filenames to be added into the current C++ standard library where only narrow character filenames are allowed. I wasn't attempting to specify what those wide character filenames should mean on operating systems which supported them, but suggesting it should be completely implementation specific. Nor was I attempting to suggest that a wide character filename be connected in any way to a particular Unicode representation. Merely on the basis of orthogonality, and the fact that a number of operating systems supported it, I argued that wide character filenames be allowed. In the C++ standard library the only thing known about filenames is that they are currently a sequence of narrow characters, with all other meanings and usage being purely implementation defined. My suggestion was that wide character filenames be added to the C++ standard library with the single proviso that they be a sequence of wide characters, with all other meanings being purely implementation defined. Needless to say, my suggestion was rejected. In the case of the boost::filesystem the situation is a great deal more complicated since the notion of a filename is much more specifically defined than in the C++ standard library. So I do wish Mr. Dawes good luck in attempting to come up with a good definition of wide character filenames in his library.

David Abrahams

4:28 p.m.

"Edward Diener" <eddielee@tropicsoft.com> writes:

...

In the C++ standard library the only thing known about filenames is that they are currently a sequence of narrow characters, with all other meanings and usage being purely implementation defined. My suggestion was that wide character filenames be added to the C++ standard library with the single proviso that they be a sequence of wide characters, with all other meanings being purely implementation defined. Needless to say, my suggestion was rejected.

IIRC some people agreed with you, and some didn't. No? -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

Edward Diener

2 Jul

12:50 a.m.

David Abrahams wrote:

...

"Edward Diener" <eddielee@tropicsoft.com> writes:

...
In the C++ standard library the only thing known about filenames is that they are currently a sequence of narrow characters, with all other meanings and usage being purely implementation defined. My suggestion was that wide character filenames be added to the C++ standard library with the single proviso that they be a sequence of wide characters, with all other meanings being purely implementation defined. Needless to say, my suggestion was rejected.

IIRC some people agreed with you, and some didn't. No?

That is true, but I would say that those people who were part of the C++ standard committee generally didn't. In particular Mr. Plauger didn't, and he seems to be a leading force behind internationalization efforts on the committee. I received a strong feeling that the committee had already rejected the argument previously to add implementation defined wide character file name support to the standard library, and that no argument could change their mind. What still bothers me is how clear my argument was and how much resistance existed to it. Hopefully my argument, and that of others on the NG who also argued for the support, will lead to adding it to the C++ standard library in the future.

David Abrahams

3:43 a.m.

"Edward Diener" <eddielee@tropicsoft.com> writes:

...

David Abrahams wrote:

...
IIRC some people agreed with you, and some didn't. No?

That is true, but I would say that those people who were part of the C++ standard committee generally didn't. In particular Mr. Plauger didn't, and he seems to be a leading force behind internationalization efforts on the committee.

I don't think he is. There really aren't many internationalization efforts AFAICT.

...

I received a strong feeling that the committee had already rejected the argument previously to add implementation defined wide character file name support to the standard library, and that no argument could change their mind.

Take care not to interpret one or two peoples' strongly-stated opinions as the judgement of the committee as a whole.

...

What still bothers me is how clear my argument was and how much resistance existed to it. Hopefully my argument, and that of others on the NG who also argued for the support, will lead to adding it to the C++ standard library in the future.

Not likely. As I've said over and over; proposals and follow-through get stuff added. Arguments on newsgroups typically do not. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

Vladimir Prus

1 Jul

6:34 a.m.

Delfin Rojas wrote:

So what I ask is a library that works everywhere, just like you say.

...

I understand your point about POSIX file systems but since the library is compiled for Windows _or_ for POSIX systems I think it would be possible to compile for single char strings or double byte strings (UTF16). Windows systems solve this problem with the concept of TCHAR, a type that is defined as a char or wchar_t depending on a preprocessor definition. Then, boost::filesystem::path could accept std::basic_string<TCHAR> instead of std::basic_string<char>. That would solve the problem and everybody would be happy ;)

I'm very much opposed to the idea of having templated classes for unicode support.

Let me explain. Suppose I write a library which does something with strings, paths, whatever in the interface. I have these choices:

1. Make the library interface templated.
2. Use narrow classes: e.g. string
3. Use wide classes: e.g. wstring
4. Have some class which works with ascii and unicode.

The first approach is bad for code size reasons. If the library does substantial work (e.g. HTTP library), it better be dynamic library, so that applications don't have include all the code. Of course, you might want static linking, but dynamic linking should be possible too.

The second approach is what's commonly done. As the result, unicode is not very supported in the standard C++.

The third approach looks reasonable. However, Mr. Random Library Writer might think: "I don't have a need for unicode now, so wstring is overhead". So he might discard this approach and use std::string.

Even if wstring is used, there are some issues. E.g. std::wstring does not have a constructor taking either char* or std::string, so if you ever get ascii string, you're in trouble.

The last solution is what looks most reasonable to me. It would be nice to have unicode string which can be created from ascii or unicode, converted back to either representation, and manipulated without regard to encoding.

For fs::path it would be that unicode will be just supported. The templated solution will bring us back to the above choice. And if you have a library which has boost::filesystem::basic_path<char> in the interface, it does not matter if your whole application uses basic_path<charT> -- you'd need conversions somewhere, so why don't have single fs::path which can do all conversions.

Just an illustration:

    class path {
    public:
        path(const std::string& s);   // The 's' is in the local 8-bit encoding
        path(const std::wstring& s);  // The 's' is in unicode

        template<class charT>
        std::basic_string<charT> native_file_path();

        // other operations, independent of charT
    };

- Volodya
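To make the single-class idea concrete, here is a minimal sketch that compiles on its own. It is not the boost::filesystem interface. Two assumptions are made purely for illustration: narrow input is treated as ASCII and widened character by character, and converting back to a narrow string replaces anything outside ASCII with '?' (a real design would have to choose between that, throwing, or a proper encoding conversion).

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of one path class accepting both string kinds.
class path {
public:
    // Narrow input: assumed to be ASCII here, widened character by character.
    path(const std::string& s) : str_(s.begin(), s.end()) {}
    // Wide input: stored as-is.
    path(const std::wstring& s) : str_(s) {}

    // Conversion back out, selected by the requested character type.
    template <class CharT>
    std::basic_string<CharT> native_file_path() const;

private:
    std::wstring str_;  // single internal representation
};

template <>
inline std::wstring path::native_file_path<wchar_t>() const { return str_; }

template <>
inline std::string path::native_file_path<char>() const {
    std::string out;
    for (std::wstring::size_type i = 0; i < str_.size(); ++i) {
        const unsigned long c = static_cast<unsigned long>(str_[i]);
        out += (c <= 0x7F) ? static_cast<char>(c) : '?';  // lossy by assumption
    }
    return out;
}

inline void usage_example() {
    path from_narrow("dir/file.txt");
    assert(from_narrow.native_file_path<wchar_t>() == L"dir/file.txt");

    path from_wide(L"dir/file.txt");
    assert(from_wide.native_file_path<char>() == "dir/file.txt");
}
```

Because the class itself is not a template, a compiled library can expose it in its interface and callers can still pass whichever string type they have.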

Keith MacDonald

2:10 p.m.

...

Delfin Rojas wrote:

So what I ask is a library that works everywhere, just like you say.

I understand your point about POSIX file systems but since the library is compiled for Windows _or_ for POSIX systems I think it would be possible to compile for single char strings or double byte strings (UTF16). Windows systems solve this problem with the concept of TCHAR, a type that is defined as a char or wchar_t depending on a preprocessor definition. Then, boost::filesystem::path could accept std::basic_string<TCHAR> instead of std::basic_string<char>. That would solve the problem and everybody would be happy ;)

I'm very much opposed to the idea of having templated classes for unicode support.

Let me explain. Suppose I write a library which does something with strings, paths, whatever in the interface. I have these choices:

1. Make the library interface templated.
2. Use narrow classes: e.g. string
3. Use wide classes: e.g. wstring
4. Have some class which works with ascii and unicode.

The first approach is bad for code size reasons. If the library does substantial work (e.g. HTTP library), it better be dynamic library, so that applications don't have include all the code. Of course, you might want static linking, but dynamic linking should be possible too.

I can't see why this has to be so complicated. I either build my Windows apps for DBCS or for Unicode, so use the narrow or wide Win32 file system API accordingly. All that I want from boost::filesystem is a simple switch that sets its mode at compile time. Doing that with a template parameter is not going to cause any code bloat, and is neater than Microsoft's #ifdef _UNICODE method. I suppose some people may want to use narrow and wide APIs within a single application, but they can't use boost::filesystem now anyway, so just keep it simple. - Keith MacDonald

"Vladimir Prus" <ghost@cs.msu.su> wrote in message news:cc0ba6$iqe$1@sea.gmane.org...

...

The second approach is what's commonly done. As the result, unicode is not very supported in the standard C++.

The third approach looks reasonable. However, Mr. Random Library Writer might think: "I don't have a need for unicode now, so wstring is overhead". So he might discard this approach and use std::string.

Even if wstring is used, there are some issues. E.g. std::wstring does not have a constructor taking either char* or std::string, so if you ever get ascii string, you're in trouble.

The last solution is what looks most reasonable to me. It would be nice to have unicode string which can be created from ascii or unicode, converted back to either representation, and manipulated without regard to encoding.

For fs::path it would be that unicode will be just supported. The templated solution will bring us back to the above choice. And if you have a library which has boost::filesystem::basic_path<char> in the interface, it does not matter if your whole application uses basic_path<charT> -- you'd need conversions somewhere, so why don't have single fs::path which can do all conversions.

Just an illustration:

    class path {
    public:
        path(const std::string& s);   // The 's' is in the local 8-bit encoding
        path(const std::wstring& s);  // The 's' is in unicode

        template<class charT>
        std::basic_string<charT> native_file_path();

        // other operations, independent of charT
    };

- Volodya

Vladimir Prus

2:42 p.m.

Keith MacDonald wrote:

...

I can't see why this has to be so complicated. I either build my Windows apps for DBCS or for Unicode, so use the narrow or wide Win32 file system API accordingly. All that I want from boost::filesystem is a simple switch that sets its mode at compile time. Doing that with a template parameter is not going to cause any code bloat, and is neater than Microsoft's #ifdef _UNICODE method.

I have tried to explain this in great detail. Let me rephrase again: neither a template nor a define is really attractive when used in a library interface.

...

I suppose some people may want to use narrow and wide APIs within a single application, but they can't use boost::filesystem now anyway, so just keep it simple.

If we're going to change the boost::path interface, it's better to change it in a way which will be OK for everyone, not only for those who consistently use only one kind of character. Otherwise, some time later we'd have to change the interface yet again. - Volodya

Russell Hind

3:42 p.m.

Vladimir Prus wrote:

...

I have tried to explain this in great detail. Let me rephrase again: neither a template nor a define is really attractive when used in a library interface.

Maybe I'm missing something, but couldn't it be like std::string and std::wstring? Thanks Russell

Vladimir Prus

3:57 p.m.

Russell Hind wrote:

...

Vladimir Prus wrote:

...
I have tried to explain this in great detail. Let me rephrase again: neither a template nor a define is really attractive when used in a library interface.

Maybe I'm missing something, but couldn't it be like std::string and std::wstring?

I think std::string and std::wstring have exactly the same drawbacks. On a library interface (when the library is a compiled one, not header-only), you have to use either string or wstring. If there were a single std::string which supported wide characters, there would be no choice to make, and most C++ libraries would be at least half-ready for Unicode. As an example, there are two environments with a single string type: Qt and Java, and in both there's no issue of Unicode any more, AFAICT. - Volodya

David Abrahams

4:25 p.m.

Vladimir Prus <ghost@cs.msu.su> writes:

...

Russell Hind wrote:

...
Vladimir Prus wrote:

...
I have tried to explain this in great detail. Let me rephrase again: neither a template nor a define is really attractive when used in a library interface.

Maybe I'm missing something, but couldn't it be like std::string and std::wstring?

I think std::string and std::wstring have exactly the same drawbacks. On a library interface (when the library is a compiled one, not header-only), you have to use either string or wstring. If there were a single std::string which supported wide characters, there would be no choice to make, and most C++ libraries would be at least half-ready for Unicode.

As an example, there are two environments with a single string type: Qt and Java, and in both there's no issue of Unicode any more, AFAICT.

Har! Java "unicode" is utf-16, I think. Unicode now has at least 32 bits per character, IIUC, so I don't think any simplistic interface choices can make a non-issue of Unicode. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

Vladimir Prus

2 Jul

7:12 a.m.

David Abrahams wrote:

...

...
As an example, there are two environments with a single string type: Qt and Java, and in both there's no issue of Unicode any more, AFAICT.

Har!

Java "unicode" is utf-16, I think. Unicode now has at least 32 bits per character, IIUC, so I don't think any simplistic interface choices can make a non-issue of Unicode.

Huh, UTF-16 is a 16-bit *encoding* for 32-bit Unicode, it's not 16-bit Unicode. There are so-called surrogate pairs, which allow 32-bit values to be represented. According to http://java.sun.com/j2se/1.5.0/docs/guide/intl/enhancements.html and http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ this is not what Java did in 1.4, but with the 1.5 release it really supports the 32-bit encoding. - Volodya
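For illustration, here is a minimal sketch of the surrogate-pair mechanism mentioned above. The function name `to_utf16` is made up for this example and does not come from Java or Boost. It assumes the input is a valid code point in the range U+0000 to U+10FFFF.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Hypothetical helper for illustration only.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);               // beyond the Unicode range
    assert(cp < 0xD800 || cp > 0xDFFF);   // surrogate values are not characters
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit is enough.
        return { static_cast<std::uint16_t>(cp) };
    }
    cp -= 0x10000;                        // 20 significant bits remain
    std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (cp >> 10));
    std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF));
    return { high, low };                 // high surrogate first, then low
}
```

With this sketch, U+0041 stays a single unit, while U+1D11E becomes the pair 0xD834 0xDD1E, so "one 16-bit unit per character" no longer holds.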

David Abrahams

1:07 p.m.

Vladimir Prus <ghost@cs.msu.su> writes:

...

David Abrahams wrote:

...
...
As an example, there are two environment with a single string type: Qt and Java, and in both there's no issue of Unicode any more, AFAICT.

Har!

Java "unicode" is utf-16, I think. Unicode now has at least 32 bits per character, IIUC, so I don't think any simplistic interface choices can make a non-issue of Unicode.

Huh, the utf-16 is 16-bit *encoding* for 32-unicode, it's not 16-bit unicode. There are so called surrogate pairs which allows to represent 32-bit values.

Sure; by the same token we could also use utf-8 and encode your Unicode in narrow strings. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

John Meinel

3:22 p.m.

David Abrahams wrote:

...

Sure; by the same token we could also use utf-8 and encode your Unicode in narrow strings.

Actually, I was wondering why this isn't used? The "big" advantage for UTF-16 was that it followed the one char->one code point. But then that was broken with the new UNICODE spec. So why not stick with utf-8. I know that most Linux file systems will support utf-8 (if your terminal supports it, then you see the nice characters, otherwise you see really bad "ASCII" ones.)

I know there is a gnome library with a Glib::ustring that I believe internally uses a utf-8 string.

However, isn't utf-8 fully compatible with std::string? Provided that you understand some "characters" take more than one char? But that only matters when you are trying to interpret what the string means, which is done by the OS, or by something that is rendering it on the screen.

I suppose you still have to convert whenever you call one of the OpenFileW commands. And probably that is what all this is about. Someone feels that everything should be handled in the "native" format (which on Win32 is some sort of wchar_t, and on other platforms is char (though a UTF-8 char)).

My personal vote is to have the library convert to whatever internal representation is considered "preferred", and then have the convenience functions for converting to whatever the user wants. (native_file_wstring).

I don't think templated makes sense, since boost::filesystem is a library, not just a collection of headers.

John =:->
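To make the point about "characters" taking more than one char concrete, here is a small sketch. The helper name `utf8_code_points` is invented for this example, and the input is assumed to be well-formed UTF-8; no validation is attempted.

```cpp
#include <cstddef>
#include <string>

// Count code points in a std::string assumed to hold well-formed UTF-8.
// Hypothetical helper for illustration only.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new code point.
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the five-byte string "caf\xC3\xA9" (the last two bytes encode U+00E9), `size()` reports 5 while the sketch counts 4 code points. Storing and copying such a string works unchanged; only code that interprets individual characters has to care.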

Aaron W. LaFramboise

8:42 p.m.

John Meinel wrote:

...

David Abrahams wrote:

...
Sure; by the same token we could also use utf-8 and encode your Unicode in narrow strings.

Actually, I was wondering why this isn't used? The "big" advantage for UTF-16 was that it followed the one char->one code point. But then that was broken with the new UNICODE spec. So why not stick with utf-8. I know that most Linux file systems will support utf-8 (if your terminal supports it, then you see the nice characters, otherwise you see really bad "ASCII" ones.)

I know there is a gnome library with a Glib::ustring that I believe internally uses a utf-8 string.

However, isn't utf-8 fully compatible with std::string? Provided that you understand some "characters" take more than one char? But that only matters when you are trying to interpret what the string means, which is done by the OS, or by something that is rendering it on the screen.

(I am not an expert.) Unfortunately, utf8 and similar encodings do not work correctly in C++ for many common cases. For example, the thousands separator in C++ is mandated by the standard to be only a single character, but in some locales, the utf8 sequence to represent the preferred character is more than one char. utf8 is great for simply storing and copying strings, but it will fail quickly if you try to do any character-level direct manipulation on it without outside help.
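The single-character restriction can be seen in the signature of `std::numpunct<char>::do_thousands_sep`, which returns one `char`. The sketch below uses a made-up facet, `comma_grouping`, and a made-up helper, `format_grouped`, to show where that limit sits. A separator such as U+00A0 NO-BREAK SPACE, which needs two bytes in UTF-8, has no way to fit into that return value.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: groups digits in threes with ','.
struct comma_grouping : std::numpunct<char> {
protected:
    // The return type is a single char, so a multi-byte UTF-8
    // separator cannot be expressed here.
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Format a number through a stream imbued with the facet above.
std::string format_grouped(long value) {
    std::ostringstream out;
    // The locale takes ownership of the facet.
    out.imbue(std::locale(std::locale::classic(), new comma_grouping));
    out << value;
    return out.str();
}
```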

...

I suppose you still have to convert whenever you call one of the OpenFileW commands. And probably that is what all this is about. Someone feels that everything should be handled in the "native" format (which on Win32 is some sort of wchar_t, and on other platforms is char (though a UTF-8 char)).

My personal vote is to have the library convert to whatever internal representation is considered "preferred", and then have the convenience functions for converting to whatever the user wants. (native_file_wstring).

I agree. I think the interface should have both narrow and wide versions, provided as normal functions without templates or other character polymorphism. On operating systems that only use char, we can do the same conversion that std::wcout presently does on these systems. On operating systems such as Win32 that have the unique ability to take both narrow and wide operands natively, no conversion will be necessary. I don't think this will do the wrong thing in any reasonable case. Aaron W. LaFramboise

Vladimir Prus

3 Jul 3 Jul

8:12 a.m.

David Abrahams wrote:

...

...
...
Java "unicode" is utf-16, I think. Unicode now has at least 32 bits per character, IIUC, so I don't think any simplistic interface choices can make a non-issue of Unicode.

Huh, the utf-16 is 16-bit *encoding* for 32-unicode, it's not 16-bit unicode. There are so called surrogate pairs which allows to represent 32-bit values.

Sure; by the same token we could also use utf-8 and encode your Unicode in narrow strings.

Yes. This is a very reasonable approach for implementing support for Unicode without doubling the library size. And it's exactly what program_options does, BTW. - Volodya

David Abrahams

1 Jul 1 Jul

4:23 p.m.

Vladimir Prus <ghost@cs.msu.su> writes:

...

Delfin Rojas wrote:

So what I ask is a library that works everywhere, just like you say.

...
I understand your point about POSIX file systems but since the library is compiled for Windows _or_ for POSIX systems I think it would be possible to compile for single char strings or double byte strings (UTF16). Windows systems solve this problem with the concept of TCHAR, a type that is defined as a char or wchar_t depending on a preprocessor definition. Then, boost::filesystem::path could accept std::basic_string<TCHAR> instead of std::basic_string<char>. That would solve the problem and everybody would be happy ;)

I'm very much opposed to the idea of having templated classes for unicode support.

Let me explain. Suppose I write a library which does something with strings, paths, whatever in the interface. I have these choices:

1. Make the library interface templated.
2. Use narrow classes: e.g. string
3. Use wide classes: e.g. wstring
4. Have some class which works with ascii and unicode.

The first approach is bad for code size reasons.

It doesn't have to be. There can be a library object with explicit instantiations of the wide and narrow classes. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

Vladimir Prus

4:28 p.m.

David Abrahams wrote:

...

...
1. Make the library interface templated. 2. Use narrow classes: e.g. string 3. Use wide classes: e.g. wstring 4. Have some class which works with ascii and unicode.

The first approach is bad for code size reasons.

It doesn't have to be. There can be a library object with explicit instantiations of the wide and narrow classes.

Which doubles the size of shared library itself. - Volodya

David Abrahams

4:45 p.m.

Vladimir Prus <ghost@cs.msu.su> writes:

...

David Abrahams wrote:

...
...
1. Make the library interface templated. 2. Use narrow classes: e.g. string 3. Use wide classes: e.g. wstring 4. Have some class which works with ascii and unicode.

The first approach is bad for code size reasons.

It doesn't have to be. There can be a library object with explicit instantiations of the wide and narrow classes.

Which doubles the size of shared library itself.

It depends; the narrow specialization might be implemented in terms of the wide one ;-) -- Dave Abrahams Boost Consulting http://www.boost-consulting.com
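As an illustration of the idea, here is a minimal sketch of a library source file in which only the wide code is instantiated and the narrow version forwards to it. The class `basic_worker` and its member `depth` are invented for this example, and the narrow-to-wide step is a naive widening that is only correct for ASCII input; a real library would need a proper conversion there.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical worker class, templated on the character type.
template <class charT>
struct basic_worker {
    // Number of '/' separators in a path-like string.
    static std::size_t depth(const std::basic_string<charT>& p);
};

// Generic implementation; in this sketch only wchar_t ever uses it.
template <class charT>
std::size_t basic_worker<charT>::depth(const std::basic_string<charT>& p) {
    return static_cast<std::size_t>(
        std::count(p.begin(), p.end(), charT('/')));
}

// The one explicit instantiation compiled into the library object.
template struct basic_worker<wchar_t>;

// Narrow specialization: a thin wrapper that converts and forwards,
// so the real work is not duplicated for char.
template <>
std::size_t basic_worker<char>::depth(const std::string& p) {
    std::wstring w(p.begin(), p.end());   // assumption: ASCII-only input
    return basic_worker<wchar_t>::depth(w);
}
```

The cost of the narrow interface is then one conversion per call plus a small forwarding function, rather than a second copy of the implementation.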

Delfin Rojas

8:01 p.m.

It seems most people post here at night PST. I never thought my posting would generate so many interesting discussions.

Vladimir, from your 4 options I agree #1 and #4 (or a combination of both) would be good for most of the cases. However I still think a #define based option would be the best. Let me explain myself:

I have been taking a look at the library code and certainly the only thing that would need to change is to use a preprocessor define to turn on/off wide character strings and everywhere in the code use TChar strings. When the code is being compiled for POSIX systems this Unicode define should be turned off. In the Windows specific code all the calls to the Windows API would need to change from "FunctionCallA" to "FunctionCall" since internally the Windows API also works with TChar.

The caller could also use the TChar idea to have its code talk to the library seamlessly.

String constants can also be expressed in TChars (_T("my string") in Windows).

Even in Windows 9X and Me, where the Windows API is not Unicode natively, this approach will work if the Microsoft redistributable DLL unicows.dll is placed in the directory where the application runs. This DLL transforms all the wide string API calls to narrow strings and converts the responses back to wide strings.

As far as a library that can be passed both single char and double char strings it is also a possibility that would play along well with the scenario I just described. The library can perform a string_cast<TChar> always to make sure the string is converted to the string type being used by the library. If the library is compiled to use wide strings internally then string_cast<TChar> would convert char strings to wchar_t strings and wchar_t strings would remain unchanged. The contrary occurs when the Unicode define is turned off.

However, I feel this interface is not the best since it would allow the caller to mix single char strings and double char strings and this is not a good practice generally. Converting strings back and forth is not a fast process and conversions may not always result in what you expect, especially if you are a novice working with encodings.

Somebody mentioned Java doesn't have this problem. This is because all strings in Java are UTF-16 (wchar_t) strings.

Let me know what you guys think of all this.

Thanks

-delfin

-----Original Message----- From: boost-users-bounces@lists.boost.org [mailto:boost-users-bounces@lists.boost.org] On Behalf Of David Abrahams Sent: Thursday, July 01, 2004 9:46 AM To: boost-users@lists.boost.org Subject: [Boost-users] Re: Feature request for boost::filesystem

Vladimir Prus <ghost@cs.msu.su> writes:

...

David Abrahams wrote:

...
...
1. Make the library interface templated. 2. Use narrow classes: e.g. string 3. Use wide classes: e.g. wstring 4. Have some class which works with ascii and unicode.

The first approach is bad for code size reasons.

It doesn't have to be. There can be a library object with explicit instantiations of the wide and narrow classes.

Which doubles the size of shared library itself.

It depends; the narrow specialization might be implemented in terms of the wide one ;-) -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

Vladimir Prus

2 Jul 2 Jul

7:36 a.m.

Delfin Rojas wrote:

...

It seems most people post here at night PST. I never thought my posting would generate so many interesting discussions.

Well.. night PST is evening GMT+3, which explains at least my postings ;-)

...

I have been taking a look at the library code and certainly the only thing that would need to change is to use a preprocessor define to turn on/off wide character strings and everywhere in the code use TChar strings. When the code is being compiled for POSIX systems this Unicode define should be turned off. In the Windows specific code all the calls to the Windows API would need to change from "FunctionCallA" to "FunctionCall" since internally the Windows API also works with TChar.

Yes, that would work. But note that you might want to use wide string even on Linux -- so you get two versions, narrow and wide.
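A minimal sketch of the #define approach under discussion may help. The names `SKETCH_UNICODE`, `SKETCH_T`, `tstring` and `join` are all hypothetical and stand in for the Windows TCHAR and _T machinery; the character type is chosen once, at compile time, for the whole build.

```cpp
#include <string>

// Hypothetical build switch: define SKETCH_UNICODE for the wide variant.
#ifdef SKETCH_UNICODE
typedef wchar_t TChar;
#define SKETCH_T(s) L##s      // pastes the L prefix onto the literal
#else
typedef char TChar;
#define SKETCH_T(s) s         // leaves the literal narrow
#endif

// String type that follows the build-wide choice.
typedef std::basic_string<TChar> tstring;

// Hypothetical library function written once against TChar.
tstring join(const tstring& dir, const tstring& leaf) {
    return dir + SKETCH_T("/") + leaf;
}
```

A caller writes `join(SKETCH_T("a"), SKETCH_T("b"))` in either build. The source compiles both ways, but each build yields a different binary interface, which is why two library variants result.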

...

The caller could also use the TChar idea to have its code talk to the library seamlessly.

Yes, that's OK for an application, where the decision to use Unicode is global. But suppose you write another library which uses the first one. Then it also must have two variants. This is what bothers me: every library would have to come in a unicode and a non-unicode variant, even if the differences can probably be hidden somewhere inside the implementation.

...

String constants can also be expressed in TChars (_T("my string") in Windows).

If I understand correctly, this expands to L"my string" -- i.e. a string constant. Then I think it's still necessary to have a portable string->wstring conversion which respects the current locale.

...

As far as a library that can be passed both single char and double char strings it is also a possibility that would play along well with the scenario I just described. The library can perform a string_cast<TChar> always to make sure the string is converted to the string type being used by the library. If the library is compiled to use wide strings internally then string_cast<TChar> would convert char strings to wchar_t strings and wchar_t strings would remain unchanged. The contrary occurs when Unicode define is turned off.

Yes, that's what I find right. The question is whether you ever need two versions of the library. Supposing that conversions are optimized enough, or that the performance does not matter much (e.g. for boost::path, access to files via the OS might cost much more than any conversion), then you can have just one version of the compiled library. The users don't have to worry which one to obtain/install/link to.

...

However, I feel this interface is not the best since it would allow the caller to mix single char strings and double char strings and this is not a good practice generally. Converting strings back and forth is not a fast process and conversions may not always result in what you expect, especially if you are a novice working with encodings.

This is where we disagree. For example, I want to support Unicode on Linux. All filesystem functions accept char*, so I *have* to do conversion. Another question is that many other functions only return char*, so again I need conversions. Why can't they be done by boost::path? E.g.:

    boost::path p(L".......");
    p /= argv[1];
    p /= to_wstring(argv[1]);

I don't really think the latter is better than the former. - Volodya

...

Somebody mentioned Java doesn't have this problem. This is because all strings in Java are UTF-16 (wchar_t) strings.

Let me know what you guys think of all this.

Thanks

-delfin

-----Original Message----- From: boost-users-bounces@lists.boost.org [mailto:boost-users-bounces@lists.boost.org] On Behalf Of David Abrahams Sent: Thursday, July 01, 2004 9:46 AM To: boost-users@lists.boost.org Subject: [Boost-users] Re: Feature request for boost::filesystem

Vladimir Prus <ghost@cs.msu.su> writes:

...
David Abrahams wrote:

...
...
1. Make the library interface templated. 2. Use narrow classes: e.g. string 3. Use wide classes: e.g. wstring 4. Have some class which works with ascii and unicode.

The first approach is bad for code size reasons.

It doesn't have to be. There can be a library object with explicit instantiations of the wide and narrow classes.

Which doubles the size of shared library itself.

It depends; the narrow specialization might be implemented in terms of the wide one ;-)

Vladimir Prus

7:48 a.m.

David Abrahams wrote:

...

...
...
...
1. Make the library interface templated. 2. Use narrow classes: e.g. string 3. Use wide classes: e.g. wstring 4. Have some class which works with ascii and unicode.

The first approach is bad for code size reasons.

It doesn't have to be. There can be a library object with explicit instantiations of the wide and narrow classes.

Which doubles the size of shared library itself.

It depends; the narrow specialization might be implemented in terms of the wide one ;-)

Yes, and that would be absolutely reasonable. I think I'm not 100% against templated classes, I'd only want that:

1. the wide version and the narrow version are freely convertible to each other
2. there's only one price to pay: if you use either version you need to link one library, which is 100K or 100M in size. If you use two versions you don't have any additional costs.

In this way each library can use whatever interface is more *convenient*, but it will still be unicode aware. For example, the design for boost::path that I have in mind is:

    class path {
        std::string native_file_string();
        std::wstring native_file_wstring();
    };

the templated interface might be:

    template<class charT>
    class path {
        std::basic_string<charT> native_file_string();
    };

It's in fact more convenient, because if you work only with wide strings, you save one character when converting a path to a string. I don't know if this convenience justifies asking the user which kind of path he wants. Maybe it does. But if my two requirements above are met, it's only a matter of convenience.

BTW, probably a single path can be more convenient in other situations:

    p = p / L"foo" / "bar";

- Volodya
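The single-class design sketched above can be tried out in a few lines. What follows is a toy, not boost::filesystem: the class `path` here, the helpers `widen` and `narrow`, and the choice of a wide internal representation are all assumptions made for the example. Conversions go through `std::mbstowcs` and `std::wcstombs`, so they follow the current C locale.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Locale-dependent narrow -> wide conversion (hypothetical helper).
std::wstring widen(const std::string& s) {
    std::vector<wchar_t> buf(s.size() + 1);   // never more wide chars than bytes
    std::size_t n = std::mbstowcs(buf.data(), s.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    return std::wstring(buf.data(), n);
}

// Locale-dependent wide -> narrow conversion (hypothetical helper).
std::string narrow(const std::wstring& w) {
    std::vector<char> buf(w.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(buf.data(), w.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("character not representable");
    return std::string(buf.data(), n);
}

// Toy path class: one internal representation, two kinds of input.
class path {
public:
    path() {}
    path(const char* s) : rep_(widen(s)) {}
    path(const std::string& s) : rep_(widen(s)) {}
    path(const wchar_t* w) : rep_(w) {}
    path(const std::wstring& w) : rep_(w) {}

    path& operator/=(const path& rhs) {
        if (!rep_.empty() && !rhs.rep_.empty()) rep_ += L'/';
        rep_ += rhs.rep_;
        return *this;
    }

    std::string native_file_string() const { return narrow(rep_); }
    std::wstring native_file_wstring() const { return rep_; }

private:
    std::wstring rep_;   // assumption: wide storage, converted on demand
};

path operator/(path lhs, const path& rhs) { lhs /= rhs; return lhs; }
```

With this toy, `p = p / L"foo" / "bar";` compiles, because each operand converts to `path` through a single constructor, and the caller can ask for either string type afterwards.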

Edward Diener

12:53 a.m.

Vladimir Prus wrote:

...

Delfin Rojas wrote:

So what I ask is a library that works everywhere, just like you say.

...
I understand your point about POSIX file systems but since the library is compiled for Windows _or_ for POSIX systems I think it would be possible to compile for single char strings or double byte strings (UTF16). Windows systems solve this problem with the concept of TCHAR, a type that is defined as a char or wchar_t depending on a preprocessor definition. Then, boost::filesystem::path could accept std::basic_string<TCHAR> instead of std::basic_string<char>. That would solve the problem and everybody would be happy ;)

I'm very much opposed to the idea of having templated classes for unicode support.

Let me explain. Suppose I write a library which does something with strings, paths, whatever in the interface. I have these choices:

1. Make the library interface templated. 2. Use narrow classes: e.g. string 3. Use wide classes: e.g. wstring 4. Have some class which works with ascii and unicode.

The first approach is bad for code size reasons. If the library does substantial work (e.g. HTTP library), it better be dynamic library, so that applications don't have include all the code. Of course, you might want static linking, but dynamic linking should be possible too.

One could quite easily provide separate libraries for different character types, if a library was necessary in the first place.

Vladimir Prus

7:18 a.m.

Edward Diener wrote:

...

...
1. Make the library interface templated. 2. Use narrow classes: e.g. string 3. Use wide classes: e.g. wstring 4. Have some class which works with ascii and unicode.

The first approach is bad for code size reasons. If the library does substantial work (e.g. HTTP library), it better be dynamic library, so that applications don't have include all the code. Of course, you might want static linking, but dynamic linking should be possible too.

One could quite easily provide separate libraries for different character types, if a library was necessary in the first place.

Yea, but whether as two separate files or one file, the size is still twice as large. E.g. if on a typical Linux system just one application uses the wide version, you have to install both the wide and the narrow version. Here on my box, the size of /usr/lib is 1.2G. Making it into 2.4G does not seem right ;-) - Volodya

Keith MacDonald

8:49 a.m.

I have an MP3 player with a 40G hard drive, which puts your concerns about a 2.4G library into perspective. That's got to be a secondary issue, compared with the requirements for convenience and functionality, when designing a library.

Regarding the overhead of narrow to wide conversion, that's what any Windows app suffers, every time it calls a Win32 API with a narrow string parameter. Inside the kernel, everything is in Unicode (ignoring Win9x/ME).

- Keith MacDonald

"Vladimir Prus" <ghost@cs.msu.su> wrote in message news:cc327o$irp$2@sea.gmane.org...

...

Edward Diener wrote:

...
...
1. Make the library interface templated. 2. Use narrow classes: e.g. string 3. Use wide classes: e.g. wstring 4. Have some class which works with ascii and unicode.

The first approach is bad for code size reasons. If the library does substantial work (e.g. HTTP library), it better be dynamic library, so that applications don't have include all the code. Of course, you might want static linking, but dynamic linking should be possible too.

One could quite easily provide separate libraries for different character types, if a library was necessary in the first place.

Yea, but whether as two separate file or one file, the size is still twice as large. E.g. if on a typical Linux system, just one application uses wide version, you have to install both wide and narrow version. Here on my box, the size of /usr/lib in 1.2G. Making it into 2.4G does not seem right ;-)

- Volodya

Vladimir Prus

3 Jul 3 Jul

9:20 a.m.

Keith MacDonald wrote:

...

I have an MP3 player with a 40G hard drive, which puts your concerns about a 2.4G library into perspective.

Not quite. Whenever you do a network install (which I happened to do), or update packages (which is done regularly on Linux boxes here), the size matters much more than for a local install.

...

That's got to be a secondary issue, compared with the requirements for convenience and functionality, when designing a library.

This issue should be considered, at least. Speaking about convenience and functionality, what's wrong with single path class which can accept both narrow and wide strings?

...

Regarding the overhead of narrow to wide conversion, that's what any Windows app suffers, every time it calls a Win32 API with a narrow string parameter. Inside the kernel, everything is in Unicode (ignoring Win9x/ME).

I know; do you mean this is an argument in favour of single implementation (that's what windows kernel does), or two separate implementations? - Volodya

Edward Diener

3:33 a.m.

Vladimir Prus wrote:

...

Edward Diener wrote:

...
...
1. Make the library interface templated. 2. Use narrow classes: e.g. string 3. Use wide classes: e.g. wstring 4. Have some class which works with ascii and unicode.

The first approach is bad for code size reasons. If the library does substantial work (e.g. HTTP library), it better be dynamic library, so that applications don't have include all the code. Of course, you might want static linking, but dynamic linking should be possible too.

One could quite easily provide separate libraries for different character types, if a library was necessary in the first place.

Yea, but whether as two separate file or one file, the size is still twice as large. E.g. if on a typical Linux system, just one application uses wide version, you have to install both wide and narrow version. Here on my box, the size of /usr/lib in 1.2G. Making it into 2.4G does not seem right ;-)

Why would you install both a narrow character version and a wide character version if you are only going to use one or the other ? Of course if you have applications which use both, you need to install both, but that doesn't make every application twice as large.

Vladimir Prus

8:09 a.m.

Edward Diener wrote:

...

...
Yea, but whether as two separate file or one file, the size is still twice as large. E.g. if on a typical Linux system, just one application uses wide version, you have to install both wide and narrow version. Here on my box, the size of /usr/lib in 1.2G. Making it into 2.4G does not seem right ;-)

Why would you install both a narrow character version and a wide character version if you are only going to use one or the other ? Of course if you have applications which use both, you need to install both, but that doesn't make every application twice as large.

I had in mind situation where there are two applications -- one which uses narrow version and another which uses wide version. If each library is used by several apps -- which is likely, and not all those apps agree and narrow/wide question, you really need to install two versions for each library. - Volodya

Edward Diener

12:50 p.m.

Vladimir Prus wrote:

...

Edward Diener wrote:

...
...
Yea, but whether as two separate file or one file, the size is still twice as large. E.g. if on a typical Linux system, just one application uses wide version, you have to install both wide and narrow version. Here on my box, the size of /usr/lib in 1.2G. Making it into 2.4G does not seem right ;-)

Why would you install both a narrow character version and a wide character version if you are only going to use one or the other ? Of course if you have applications which use both, you need to install both, but that doesn't make every application twice as large.

I had in mind situation where there are two applications -- one which uses narrow version and another which uses wide version. If each library is used by several apps -- which is likely, and not all those apps agree and narrow/wide question, you really need to install two versions for each library.

I don't understand what you are saying in the penultimate clause "and not all those apps agree and narrow/wide question". If an application uses both the narrow and wide versions of a library, then of course it will have to include both of them. My own experience is that most applications will use one or the other.

An application built for the international market will probably use the wide character implementation, else if it is built only for languages whose encoding can be represented by the 256 code points of the narrow character set, it will use the narrow character implementation. This is the normal way libraries are used. One pays for what one uses.

I think C++ should have template specializations for all of its native character types in its standard libraries whenever a character is being used properly as a native character type. Currently C++ has two native character types, 'char' and 'wchar_t'. In the future who knows whether or not other native character types will be added, perhaps a specific Unicode type. Using templates, and having specializations of its native character types, makes it much easier for C++ to adopt other future character types as native character types.

Even when other native character types are not added to C++, creating one's own implementations of character types is much easier when templates and specializations are used. We have this wonderful facility in the C++ language, templates. Not using it, because an application might have to use different character types and include more than one specialization, seems illogical to me.

Vladimir Prus

5 Jul 5 Jul

8:10 a.m.

Edward Diener wrote:

...

...
I had in mind situation where there are two applications -- one which uses narrow version and another which uses wide version. If each library is used by several apps -- which is likely, and not all those apps agree and narrow/wide question, you really need to install two versions for each library.

I don't understand what you are saying in the penultimate clause "and not all those apps agree and narrow/wide question". If an application uses both the narrow and wide versions of a library, then of course it will have to include both of them. My own experience is that most applications will use one or the other.

Probably we misunderstand each other because I have in mind the Linux model of application installation. Each application and each library is a separate package. Typically, a library is used by several applications. Now, I have a number of library packages. Say no application uses the wide char interface at the moment, so only narrow libraries are installed. Now, a single application decides to use the wide interface, so its package now depends on wide libraries. As a result, I have to install, in addition to some narrow libraries, their wide equivalents. After enough applications decide to use Unicode, most libraries will have to be installed in two flavours.

...

I think C++ should have template specializations for all of its native character types in its standard libraries whenever a character is being used properly as a native character type type. Currently C++ has two native character types, 'char' and wchar_t'. In the future who knows whether ot not other native character types will be added, perhaps a specific Unicode type. Using templates, and having specializations of its native character types, makes it much easier for C++ to adapt other future character types as native character types.

I don't think it is very likely that new character types will appear.

...

Even when other native character types are not added to C++, creating one's own implementations of character types is much easier when templates and specializations are used. We have this wonderful facility in the C++ language, templates. Not using it, because an application might have to use different character types and include more than one specialization, seems illogical to me.

I'm actually worried that when using templates in a straightforward way, all libraries will have to come in two variants or be twice as large, which is bad because of:

- code size reasons,
- configuration reasons (just one more configuration variant to worry about),
- interoperability/convenience? (what if I use unicode paths and want to pass a narrow string to one of the operators?)

With a bit of additional design, it's possible to make a library use one representation internally, and have either a non-templated interface or a tiny templated facade. E.g.:

    boost::path p;
    p = p / L"foo" / "bar";

does not seem all that bad to me. - Volodya

Eddie Diener

11:21 p.m.

Vladimir Prus wrote:

...

Edward Diener wrote:

...
...
I had in mind situation where there are two applications -- one which uses narrow version and another which uses wide version. If each library is used by several apps -- which is likely, and not all those apps agree and narrow/wide question, you really need to install two versions for each library.

I don't understand what you are saying in the penultimate clause "and not all those apps agree and narrow/wide question". If an application uses both the narrow and wide versions of a library, then of course it will have to include both of them. My own experience is that most applications will use one or the other.

Probably we misunderstand each other because I have in mind Linux model of application installation. Each application and each library is a separate package. Typically, a library is used by several applications.

That's normal.

...

Now, I have a number of library packages. Say no application uses wide char interface at the moment, so only narrow libraries are installed, Now, a single application decides to use wide interface, so its package now depends on wide libraries. As the result, I have to install, in addition to some narrow libraries, their wide equivalents.

You only have to install the appropriate wide equivalents. There is nothing to say that a wide character application uses all wide character libraries. Obviously whatever wide character implementations it uses should be a package which can be installed as part of the application distribution.

...

After enough applications decide to use Unicode, most libraries will have to be installed in two flavours.

How is this worse than having a single version which has both wide and narrow character equivalents ? You are not saving anything in this latter way, and you are definitely worse than if the libraries were separate and you only used one or the other versions in your applications.

...

...
I think C++ should have template specializations for all of its native character types in its standard libraries whenever a character is being used properly as a native character type. Currently C++ has two native character types, 'char' and 'wchar_t'. In the future who knows whether or not other native character types will be added, perhaps a specific Unicode type. Using templates, and having specializations of its native character types, makes it much easier for C++ to adopt other future character types as native character types.

I don't think this is very likely for new character types to appear.

I do. I would be very surprised if C++ does not adopt new character types in the years to come. Do you really think that if the programming world settles on other standard character representations that C++ will adamantly ignore it ? Even now a number of programmers would like to see C++ support one of the Unicode standards natively, most likely UTF-32.

...

...
Even when other native character types are not added to C++, creating one's own implementations of character types is much easier when templates and specializations are used. We have this wonderful facility in the C++ language, templates. Not using it, because an application might have to use different character types and include more than one specialization, seems illogical to me.

I'm actually worried that when using templates in a straightforward way, all libraries will have to come in two variants or be twice as large, which is bad because of:

No. There is nothing saying that a library must support more than one character type. But if it does, isolating each character type in its own header files and libraries is the right design.

...

- code size reasons,
- configuration reasons (just one more configuration variant to worry about),
- interoperability/convenience? (what if I use unicode paths and want to pass a narrow string to one of the operators?)

None of your reasons holds much weight. Code size wouldn't be affected since each implementation is in its own library. There is nothing to configure since character types are part of the C++ standard. If you need to pass a unicode path to a narrow string operator, you the programmer are either doing something wrong or, if there is a valid conversion, you can make it yourself ( like wcstombs ).

...

With a bit of additional design, it's possible to make library use one representation internally, and have either non-templated interface, or a tiny templated facade. E.g:

boost::path p; p = p / L"foo" / "bar";

does not seem all that bad thing for me.

It is possible to do that if you can convert all character types into your internal representation. Even here I am paying for conversions back and forth I may not need. I therefore would prefer separate templated libraries. Why make headaches for oneself ? I am always in favor of designs which are clear and understandable over all other considerations. The headaches one brings about in the future by trying to save a little size or speed in the present are innumerable. If there is a conversion between character types, I don't mind if a library for a particular character type supports it. When I say different libraries for different character types it doesn't preclude conversion routines.

Vladimir Prus

6 Jul

6:33 a.m.

Eddie Diener wrote:

...

...
Now, I have a number of library packages. Say no application uses the wide char interface at the moment, so only narrow libraries are installed. Now, a single application decides to use the wide interface, so its package now depends on wide libraries. As a result, I have to install, in addition to some narrow libraries, their wide equivalents.

You only have to install the appropriate wide equivalents. There is nothing to say that a wide character application uses all wide character libraries.

But a few wide applications can span a lot of libraries.

...

...
After enough applications decide to use Unicode, most libraries will have to be installed in two flavours.

How is this worse than having a single version which has both wide and narrow character equivalents ? You are not saving anything in this latter way, and you are definitely worse than if the libraries were separate and you only used one or the other versions in your applications.

I never talked about "single version which has both wide and narrow character equivalents". What I'm after is a single version which supports both wide and narrow interface/operations. But internally, it's just one code.

...

...
I don't think this is very likely for new character types to appear.

I do. I would be very surprised if C++ does not adopt new character types in the years to come. Do you really think that if the programming world settles on other standard character representations that C++ will adamantly ignore it ?

Isn't there one representation already?

...

Even now a number of programmers would like to see C++ support one of the Unicode standards natively, most likely UTF-32.

If you recall the Unicode discussions we had on the main list (hmm.... we probably should have this discussion there as well), there were two problems:

1. wchar_t is 16 bit on some platforms
2. even if it's 32 bit, wchar_t represents only a code point, and a complete character with all the accents and other marks might take several code points.

The second problem is actually the most serious, and I'm really not sure that the right solution would be yet another character type.
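The two problems above can be illustrated with string lengths. This is a small sketch that assumes a C++11 compiler (for `char16_t`, `char32_t` and the `u""`/`U""` literals, none of which existed when this thread was written); the variable names are made up for the example.

```cpp
#include <cassert>
#include <string>

// Problem 1: with 16-bit units, a code point outside the Basic Multilingual
// Plane needs two units (a surrogate pair). U+1F600 is such a code point.
const std::u16string utf16_astral = u"\U0001F600";

// Problem 2: even with 32-bit units, one user-visible character can span
// several code points. Both strings below render as "e with acute accent".
const std::u32string precomposed = U"\u00E9";   // single code point U+00E9
const std::u32string decomposed  = U"e\u0301";  // 'e' + combining acute U+0301
```

Here `precomposed.size()` is 1 while `decomposed.size()` is 2, and the two strings compare unequal unless they are normalized first, which is independent of how wide the character type is.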

...

...
I'm actually worried that when using templates in a straightforward way, all libraries will have to come in two variants or be twice as large, which is bad because of:

No. There is nothing saying that a library must support more than one character type. But if it does, isolating each character type in its own header files

I don't understand this. For templated implementation, you sure can't have wide and narrow version in different headers.

...

and libraries is the right design.

...

...
- code size reasons,
- configuration reasons (just one more configuration variant to worry about),
- interoperability/convenience? (what if I use unicode paths and want to pass a narrow string to one of the operators?)

None of your reasons holds much weight. Code size wouldn't be affected since each implementation is in its own library.

Only if you don't buy my argument about system-wide code size.

...

There is nothing to configure since character types are part of the C++ standard.

And? You still need to build two library variants, test them separately, make two packages. Current Boost build process creates a huge number of library variants (debug/release, MT/ST, stldebug ...). Is there a need to double that number for libraries which might need unicode?

...

If you need to pass a unicode path to a narrow string operator, you the programmer are either doing something wrong or, if there is a valid conversion, you can make it yourself ( like wcstombs ).

If I have basic_path<char> and want to convert it into basic_path<wchar_t>, do I really have to use mbstowcs? So, I need to iterate over all elements of a path, calling that function, and creating the path? Sorry, there should be a simpler way. And that simpler way is a converting constructor.
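A rough sketch of what such a converting constructor could look like. `converting_path` is a hypothetical stand-in for the `basic_path` template discussed here, not an existing Boost class; the conversion goes through `std::mbstowcs` / `std::wcstombs`, so the result depends on the current C locale, and embedded null characters are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical template for illustration; not an actual Boost class.
template <class CharT>
class converting_path {
public:
    converting_path() {}
    converting_path(const CharT* s) : rep_(s) {}

    // Converting constructor: build a path of one character type from
    // a path of the other, so callers never touch mbstowcs themselves.
    template <class OtherCharT>
    explicit converting_path(const converting_path<OtherCharT>& other)
        : rep_(convert(other.str())) {}

    const std::basic_string<CharT>& str() const { return rep_; }

private:
    // Narrow to wide: never yields more wide characters than input bytes.
    static std::wstring convert(const std::string& narrow) {
        std::vector<wchar_t> buf(narrow.size() + 1);
        std::size_t n = std::mbstowcs(&buf[0], narrow.c_str(), buf.size());
        if (n == static_cast<std::size_t>(-1))
            throw std::runtime_error("invalid multibyte sequence");
        return std::wstring(&buf[0], n);
    }

    // Wide to narrow: at most MB_CUR_MAX bytes per wide character.
    static std::string convert(const std::wstring& wide) {
        std::vector<char> buf(wide.size() * MB_CUR_MAX + 1);
        std::size_t n = std::wcstombs(&buf[0], wide.c_str(), buf.size());
        if (n == static_cast<std::size_t>(-1))
            throw std::runtime_error("character not representable");
        return std::string(&buf[0], n);
    }

    std::basic_string<CharT> rep_;
};
```

The constructor is marked `explicit` here so that a possibly lossy wide-to-narrow conversion never happens silently; whether that is the right default is a design choice the thread leaves open.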

...

...
With a bit of additional design, it's possible to make library use one representation internally, and have either non-templated interface, or a tiny templated facade. E.g:

boost::path p; p = p / L"foo" / "bar";

does not seem all that bad thing for me.

It is possible to do that if you can convert all character types into your internal representation. Even here I am paying for conversions back and forth I may not need.

If you want to append a narrow path element to a unicode string, you *need* to convert. Besides, I'm not all that sure this conversion is a performance bottleneck, given that boost::path needs to use OS services. A single 'stat' that fs::exists does might make the performance of conversion unimportant.

...

I therefore would prefer separate templated libraries. Why make headaches for oneself ?

The templated library is a much bigger headache than it seems. Unless you're willing to put template code in headers (which is bad for big libraries, and is really bad for boost::fs, which has to include system headers), you need to:

- declare templates in public headers
- define templates in private headers/sources
- explicitly instantiate the templates for char and wchar_t.

Not so nice.
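The three steps listed above can be sketched in a single file, with comments marking where each part would live in a real library. `instantiated_path` and `element_count` are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// --- Step 1: declaration, as it would appear in a public header ---
template <class CharT>
class instantiated_path {
public:
    explicit instantiated_path(const std::basic_string<CharT>& s);
    std::size_t element_count() const;  // number of '/'-separated elements
private:
    std::basic_string<CharT> rep_;
};

// --- Step 2: definitions, as they would appear in a private source file ---
template <class CharT>
instantiated_path<CharT>::instantiated_path(const std::basic_string<CharT>& s)
    : rep_(s) {}

template <class CharT>
std::size_t instantiated_path<CharT>::element_count() const {
    std::size_t count = 0;
    bool in_element = false;
    for (std::size_t i = 0; i < rep_.size(); ++i) {
        if (rep_[i] == CharT('/')) {
            in_element = false;
        } else if (!in_element) {
            in_element = true;  // first character of a new element
            ++count;
        }
    }
    return count;
}

// --- Step 3: explicit instantiation for the two supported character types ---
template class instantiated_path<char>;
template class instantiated_path<wchar_t>;
```

Client code that includes only the step 1 header can then link against the two instantiations, but cannot use any other character type without access to the definitions.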

...

I am always in favor of designs which are clear and understandable over all other considerations.

And what's so un-understandable about boost::path which has both narrow and wide methods? - Volodya

David Abrahams

9:42 a.m.

Vladimir Prus <ghost@cs.msu.su> writes:

...

...
I am always in favor of designs which are clear and understandable over all other considerations.

And what's so un-understandable about boost::path which has both narrow and wide methods?

I'm finding it hard to argue with that idea. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

Vladimir Prus

10:24 a.m.

David Abrahams wrote:

...

Vladimir Prus <ghost@cs.msu.su> writes:

...
...
I am always in favor of designs which are clear and understandable over all other considerations.

And what's so un-understandable about boost::path which has both narrow and wide methods?

I'm finding it hard to argue with that idea.

You mean it's hard to argue with the idea to have

class path { path(char*); path(wchar_t*); };

or with the idea to have

template<class charT> class basic_path { basic_path(charT*); };

or that designs should be clear and understandable? ;-)

- Volodya

David Abrahams

12:25 p.m.

Vladimir Prus <ghost@cs.msu.su> writes:

...

David Abrahams wrote:

...
Vladimir Prus <ghost@cs.msu.su> writes:

...
...
I am always in favor of designs which are clear and understandable over all other considerations.

And what's so un-understandable about boost::path which has both narrow and wide methods?

I'm finding it hard to argue with that idea.

You mean it's hard to argue with the idea to have

class path { path(char*); path(wchar_t*); };

Yes. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

Peter Dimov

10:35 a.m.

Eddie Diener wrote:

...

Vladimir Prus wrote:

...
With a bit of additional design, it's possible to make library use one representation internally, and have either non-templated interface, or a tiny templated facade. E.g:

boost::path p; p = p / L"foo" / "bar";

does not seem all that bad thing for me.

It is possible to do that if you can convert all character types into your internal representation. Even here I am paying for conversions back and forth I may not need. I therefore would prefer separate templated libraries.

I'm not sure which conversions back and forth you have in mind, and how separate templated libraries avoid them. The goal of Boost.Filesystem is to allow you to write OS-independent code. However the native path character is OS-dependent. So if you, for example, enumerate a directory, you will get back paths that are either narrow or wide, depending on the native character type. You cannot make the narrow vs wide decision at compile time (assuming that the code does not change), because (1) there are wide paths that do not have a narrow representation, and vice versa; (2) Boost.Filesystem can talk to several different filesystems from within a single program. It is possible to typedef basic_path<_Native_fs_char> path, but then p /= "images"; may fail - you'll need to use BOOST_FS_PATH("images") or something equally ugly.
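To make the last point concrete, here is a sketch of what a literal-wrapping macro in the spirit of the `BOOST_FS_PATH("images")` mentioned above might look like. Everything in it is an assumption made for illustration: the macro `SKETCH_FS_PATH`, the typedef `native_fs_char`, the class `native_path`, and the choice of `_WIN32` as the switch.

```cpp
#include <cassert>
#include <string>

// Assumption: the native path character is wchar_t on Windows, char elsewhere.
#if defined(_WIN32)
typedef wchar_t native_fs_char;
#define SKETCH_FS_PATH(lit) L##lit  // pastes an L prefix onto the literal
#else
typedef char native_fs_char;
#define SKETCH_FS_PATH(lit) lit     // leaves the literal narrow
#endif

// Hypothetical path type fixed to the native character type.
class native_path {
public:
    native_path& operator/=(const native_fs_char* element) {
        if (!rep_.empty()) rep_ += native_fs_char('/');
        rep_ += element;
        return *this;
    }
    const std::basic_string<native_fs_char>& str() const { return rep_; }
private:
    std::basic_string<native_fs_char> rep_;
};
```

With this typedef, `p /= "images";` compiles only where the native character is `char`, so portable code has to write `p /= SKETCH_FS_PATH("images");` everywhere, which is the ugliness being objected to.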

Delfin Rojas

5:49 p.m.

I think you bring up very good points. When you receive a path object back from a directory iteration and want to convert it to a string, it is really dangerous to expose both narrow and wide strings, because the caller doesn't know whether the string had a good conversion or not.

On the other hand, I don't agree that these arguments show the narrow vs wide decision cannot be made at compile time. If the programmer compiles the library for single-char strings then she/he knows what to expect. After all, a project in Windows gets compiled for either single-char or double-char strings. Boost::filesystem can talk to several filesystems from within a single program, but it currently has only two implementations: one for POSIX and one for Windows. On Windows all calls to any file system go through the Windows API, which takes narrow and wide strings. In the POSIX system there is no wide string support for the file system so the library should work with narrow strings internally.

-delfin

-----Original Message-----
From: boost-users-bounces@lists.boost.org [mailto:boost-users-bounces@lists.boost.org] On Behalf Of Peter Dimov
Sent: Tuesday, July 06, 2004 3:35 AM
To: boost-users@lists.boost.org
Subject: Re: [Boost-users] Re: Re: Feature request for boost::filesystem

Eddie Diener wrote:

...

Vladimir Prus wrote:

...
With a bit of additional design, it's possible to make library use one representation internally, and have either non-templated interface, or a tiny templated facade. E.g:

boost::path p; p = p / L"foo" / "bar";

does not seem all that bad thing for me.

It is possible to do that if you can convert all character types into your internal representation. Even here I am paying for conversions back and forth I may not need. I therefore would prefer separate templated libraries.

Vladimir Prus

7 Jul

6:03 a.m.

Delfin Rojas wrote:

...

In the POSIX system there is no wide string support for the file system so the library should work with narrow strings internally.

I don't think this is true. If you set the locale encoding to UTF-8 with

export LC_ALL=ru_RU.UTF-8

you should be able to store unicode strings. Therefore I'd expect

path p; p /= L"Документы";

to work there.

- Volodya
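A sketch of why this can work: under a UTF-8 locale, a wide string maps to a plain byte sequence that narrow POSIX file APIs can accept. To keep the example independent of which locales are installed, it uses a hand-written encoder (the hypothetical function `to_utf8`) rather than `wcstombs`, and it assumes each `wchar_t` holds a whole code point, so surrogate pairs on 16-bit `wchar_t` platforms are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encode each wchar_t as one UTF-8 sequence.
// Assumption: every wchar_t is a complete Unicode code point.
std::string to_utf8(const std::wstring& wide) {
    std::string out;
    for (std::size_t i = 0; i < wide.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(wide[i]);
        if (cp < 0x80) {                       // 1 byte: ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// The directory name from the message, spelled with escapes so the
// example does not depend on the source file's encoding.
const std::wstring documents =
    L"\u0414\u043E\u043A\u0443\u043C\u0435\u043D\u0442\u044B";
```

Each of the nine Cyrillic letters becomes two bytes, so the resulting narrow string is 18 bytes long and contains no embedded zero bytes, which is what lets it pass through `char*`-based system calls.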


participants (5)

Aaron W. LaFramboise
David Abrahams
Delfin Rojas
Duane Murphy
Eddie Diener
Edward Diener
Jeff Garland
Jeff Wang
John Maddock
John Meinel
Jonathan Turkanis
Keith MacDonald
Peter Dimov
Russell Hind
Vladimir Prus