Re: [boost] [review] Review of Nowide (Unicode)
Hi, this is my official review of the Nowide library. I haven't had much time to look at Nowide, but I have noticed some things that might be of interest, so here we go...
- What is your evaluation of the design?
Overall the design looks good. Some things I'd suggest to reconsider: * I'm a big fan of "try" names, so I'd suggest to rename the array versions of widen/narrow to try_widen/try_narrow. Makes it more clear that they can fail without throwing. Also I'm not sure about the benefit of returning the output buffer pointer or NULL. I'd simply return bool instead. * nowide_filesystem... to be honest I'm not sure what this does. If it does something related to UTF-8 on POSIX systems, then I smell a potential problem. If it's only for wide->narrow on Windows, it should be OK. * Also not sure what boost::nowide::cin/cout/cerr do. Do they call SetConsoleCP/SetConsoleOutputCP? Will they work with re-directed input/output streams? What will happen if the user uses them AND std::cin/cout/cerr or stdin/stdout/stderr in the same application? * I'm not a big fan of the basic_stackstring classes. I'd suggest to provide template function overloads for narrow/widen instead. Like: users_favorite::wstring_class wstr; if (boost::nowide::try_widen(wstr, str)) use(wstr); else too_bad(); They could write to the target-string using the usual duck-typing - e.g. via clear/push_back or clear/back_inserter. That way the user can use whatever SSO string class he likes and Nowide wouldn't have to provide string classes -- which IMO isn't its job. As a bonus, the user also gets an exception-free (except bad_alloc) function that doesn't require him to provide a raw array output buffer. * The Windows & invalid UTF-16 thing... I still think invalid UTF-16 should round-trip.
- What is your evaluation of the implementation?
Haven't had much time to check the implementation. What I did notice is that when windows.h is not used, the WinAPI prototypes are declared in the global namespace. This can cause issues when windows.h is also included before/after the Nowide headers, and the prototypes don't match exactly (e.g. short vs. wchar_t). Also it pollutes the global namespace. I think Nowide should do what other Boost libraries do, and declare the prototypes in a private detail namespace instead.
- What is your evaluation of the documentation?
Incomplete. Too few examples and the reference documentation leaves many things unclear. IMO the reference documentation should unequivocally define the contract of each API. Also I think the documentation should make it very clear what Nowide does and doesn't do on different platforms. Especially that it doesn't do anything Unicode-related on POSIX platforms and that using it does NOT mean the application will only have to deal with UTF-8 strings on POSIX platforms. Also parts of the documentation that suggest or even _state_ that systems like Linux internally use UTF-8 should be fixed.
- What is your evaluation of the potential usefulness of the library?
I'd say it's useful.
- Did you try to use the library? With what compiler? Did you have any problems?
No.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
Quick reading of the documentation + some very quick glances at parts of the implementation.
- Are you knowledgeable about the problem domain?
More or less. I have never used or written a "nowide" layer myself, but I have spent plenty of time "widening" Windows-only applications.
And most importantly: - Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.
I certainly wouldn't vote against it. I didn't spend enough time to understand every aspect of it though, so I don't feel qualified to vote for it either. I would however suggest to require a major update of the documentation before accepting it. Regards, Paul Groke
On Wed, Jun 21, 2017 at 3:18 AM, Groke, Paul via Boost
* I'm a big fan of "try" names, so I'd suggest to rename the array versions of widen/narrow to try_widen/try_narrow. Makes it more clear that they can fail without throwing. Also I'm not sure about the benefit of returning the output buffer pointer or NULL. I'd simply return bool instead.
These aren't try functions as they succeed in general (unless you are out of memory after conversion policy update) and returing char */wchar_t* makes code shorter.
* nowide_filesystem... to be honest I'm not sure what this does. If it does something related to UTF-8 on POSIX systems, then I smell a potential problem. If it's only for wide->narrow on Windows, it should be OK.
Probably documentation should be improved. Boost.Filesystem uses std::codecvt facet to convert between narrow and wide encodings. On windows the default is ANSI encoding. By starting integration it installs utf8/utf-16 codecvt facet forcing boost::filesystem to use utf-8 in its narrow strings on Windows.
* Also not sure what boost::nowide::cin/cout/cerr do. Do they call SetConsoleCP/SetConsoleOutputCP? Will they work with re-directed input/output streams? What will happen if the user uses them AND std::cin/cout/cerr or stdin/stdout/stderr in the same application?
SetConsoleCP is problematic and does not allow you to input UTF-8. What it does it uses native console Wide API to write/read to/from console: http://cppcms.com/files/nowide/html/index.html#technical_cio In case of redirection it just becomes alias of std::cin/cout (docs need to be updated a little)
* I'm not a big fan of the basic_stackstring classes. I'd suggest to provide template function overloads for narrow/widen instead. Like: users_favorite::wstring_class wstr; if (boost::nowide::try_widen(wstr, str)) use(wstr); else too_bad(); They could write to the target-string using the usual duck-typing - e.g. via clear/push_back or clear/back_inserter. That way the user can use whatever SSO string class he likes and Nowide wouldn't have to provide string classes -- which IMO isn't its job. As a bonus, the user also gets an exception-free (except bad_alloc) function that doesn't require him to provide a raw array output buffer.
These are mostly internal functions and since the "target" is WinAPI you need to provide finally a pointer to string. So I don't see advantage in iterator based approach,
* The Windows & invalid UTF-16 thing... I still think invalid UTF-16 should round-trip.
- What is your evaluation of the implementation?
Haven't had much time to check the implementation. What I did notice is that when windows.h is not used, the WinAPI prototypes are declared in the global namespace. This can cause issues when windows.h is also included before/after the Nowide headers, and the prototypes don't match exactly (e.g. short vs. wchar_t). Also it pollutes the global namespace. I think Nowide should do what other Boost libraries do, and declare the prototypes in a private detail namespace instead.
Ok I need to check this. It something new for me.
- What is your evaluation of the documentation?
Incomplete. Too few examples and the reference documentation leaves many things unclear. IMO the reference documentation should unequivocally define the contract of each API.
Also I think the documentation should make it very clear what Nowide does and doesn't do on different platforms.
Yes it was mentioned by several reviewers and will be addressed.
And most importantly: - Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.
I certainly wouldn't vote against it. I didn't spend enough time to understand every aspect of it though, so I don't feel qualified to vote for it either. I would however suggest to require a major update of the documentation before accepting it.
Regards, Paul Groke
Thank You for the review and the valuable comments! Artyom Beilis
Boost.Filesystem uses std::codecvt facet to convert between narrow and wide encodings. On windows the default is ANSI encoding. By starting integration it installs utf8/utf-16 codecvt facet forcing boost::filesystem to use utf-8 in its narrow strings on Windows.
2 questions about integration in filesystem: - is there a way to undo what boost::nowide::nowide_filesystem() does? - can do the same with std::filesystem that is coming soon? Frédéric
2 questions about integration in filesystem: - is there a way to undo what boost::nowide::nowide_filesystem() does?
Good question. You need to restore the locale that was imbued to path. I don't see an API to get one without replacement. So probably I need to return the previous locale - can do the same with std::filesystem that is coming soon?
Frédéric
It looks like std filesystem uses global locale... (Bad idea. ) Yes it shouldn't be a problem. Just the codecvt need to be installed to global locale Artyom
- is there a way to undo what boost::nowide::nowide_filesystem() does?
Good question. You need to restore the locale that was imbued to path. I don't see an API to get one without replacement.
So probably I need to return the previous locale
which is returned already by boost::filesystem::path::imbue, good?
בתאריך 23 ביוני 2017 12:05 אחה״צ, "Frédéric Bron"
- is there a way to undo what boost::nowide::nowide_filesystem() does?
Good question. You need to restore the locale that was imbued to path. I don't see an API to get one without replacement.
So probably I need to return the previous locale
which is returned already by boost::filesystem::path::imbue, good? Yes, this is what I meant
So probably I need to return the previous locale
which is returned already by boost::filesystem::path::imbue, good?
but then we get: auto old_loc = boost::nowide::nowide_filesystem(); // by the way, why repeat nowide? ... boost::filesystem::imbue(old_loc); What about the following symmetrical approach: auto old_loc = boost::filesystem::imbue(boost::nowide::locale()); ... boost::filesystem::imbue(old_loc); Frédéric
It looks like std filesystem uses global locale... (Bad idea. ) Does it? They finally succeeded in breaking it after so many tries? Oh dear.
apparently, with std::filesystem, you can create a path from UTF-8 string with u8path and from a path, you can get its UTF-8 representation with u8string. You do not need a locale for this. I looked into the current draft standard about ill-formed UTF-16. The draft standard says in 30.10.8.4.6 path native format observers [fs.path.native.obs]: "Remarks: Conversion, if any, is performed as specified by 30.10.8.2. The encoding of the string returned by u8string() is always UTF-8." Here u8string() must return a perfectly valid UTF-8 string (but its name implies so). And in 30.10.8.2.1 path argument format conversions [fs.path.fmt.cvt], I read: "Pathnames are converted as needed between the generic and native formats in an operating-system-dependent manner. Let G(n) and N(g) in a mathematical sense be the implementation’s functions that convert native- to-generic and generic-to-native formats respectively. If g=G(n) for some n, then G(N(g))=g; if n=N(g) for some g, then N(G(n))=n. [ Note: Neither G nor N need be invertible. — end note ]" It is funny how it is said: it does not say the roundtrip must work, it says the roundtrip must work on any converted paths. This is the case for boost.nowide as once converted to UTF-8, the round trip narrow->wide->narrow works. I also read in 30.10.8.2.2 path type and encoding conversions [fs.path.type.cvt] "If the encoding being converted to has no representation for source characters, the resulting converted characters, if any, are unspecified. Implementations should not modify member function arguments if already of type path::value_type." This means that it is implementation defined how to convert an ill-formed path. Therefore, using the replacement character U+FFFD would match the std::filesystem policy. Note also that std::filesystem just uses narrow string as is on Posix as boost.nowide does. char: The encoding is the native narrow encoding (30.10.4.9). The method of conversion, if any, is "For POSIX-based operating systems path::value_type is char so no conversion from char value type arguments or to char value type return values is performed." Frédéric Frédéric
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Artyom Beilis via Boost Sent: Donnerstag, 22. Juni 2017 22:43 To: boost@lists.boost.org Cc: Artyom Beilis
Subject: Re: [boost] [review] Review of Nowide (Unicode) On Wed, Jun 21, 2017 at 3:18 AM, Groke, Paul via Boost
wrote: * I'm a big fan of "try" names, so I'd suggest to rename the array versions of widen/narrow to try_widen/try_narrow. Makes it more clear that they can fail without throwing. Also I'm not sure about the benefit of returning the output buffer pointer or NULL. I'd simply return bool instead.
These aren't try functions as they succeed in general (unless you are out of memory after conversion policy update) and returing char */wchar_t* makes code shorter.
They will fail if the user-provided buffer isn't big enough.
* nowide_filesystem... to be honest I'm not sure what this does. If it does something related to UTF-8 on POSIX systems, then I smell a potential problem. If it's only for wide->narrow on Windows, it should be OK.
Probably documentation should be improved.
Boost.Filesystem uses std::codecvt facet to convert between narrow and wide encodings. On windows the default is ANSI encoding. By starting integration it installs utf8/utf-16 codecvt facet forcing boost::filesystem to use utf-8 in its narrow strings on Windows.
OK, in that case it should be fine.
What it does it uses native console Wide API to write/read to/from console:
http://cppcms.com/files/nowide/html/index.html#technical_cio
In case of redirection it just becomes alias of std::cin/cout (docs need to be updated a little)
OK. How about mixing use of the boost::nowide:: and the std:: versions? If that possible? Because that's something that I guess will be hard to avoid.
* I'm not a big fan of the basic_stackstring classes. I'd suggest to provide
template function overloads for narrow/widen instead. Like:
users_favorite::wstring_class wstr; if (boost::nowide::try_widen(wstr, str)) use(wstr); else too_bad(); They could write to the target-string using the usual duck-typing - e.g. via clear/push_back or clear/back_inserter. That way the user can use whatever SSO string class he likes and Nowide wouldn't have to provide string classes -- which IMO isn't its job. As a bonus, the user also gets an exception-free (except bad_alloc) function that doesn't require him to provide a raw array output buffer.
These are mostly internal functions and since the "target" is WinAPI you need to provide finally a pointer to string. So I don't see advantage in iterator based approach,
I don't see how they are "mostly internal". Every application that needs to do some Windows-specific things will require them. If it's internal it should go into a detail namespace and not be documented in the generated documentation (doc in the source is fine of course). Otherwise it's a public API and IMO deserves consideration.
- What is your evaluation of the implementation?
Haven't had much time to check the implementation. What I did notice is that when windows.h is not used, the WinAPI prototypes are declared in the global namespace. This can cause issues when windows.h is also included before/after the Nowide headers, and the prototypes don't match exactly (e.g. short vs. wchar_t). Also it pollutes the global namespace. I think Nowide should do what other Boost libraries do, and declare the prototypes in a private detail namespace instead.
Ok I need to check this. It something new for me.
Cool. You can find an example in
These aren't try functions as they succeed in general (unless you are out of memory after conversion policy update) and returing char */wchar_t* makes code shorter.
They will fail if the user-provided buffer isn't big enough.
I see. If some I'd rather expect to extend the boost.locale functionality rather boost.nowide - it seems to be more appropriate (since nowide functions finally call utf_to_utf functions from boost.locale library) Regarding iterator input/output interface. I must say that iterator based interface isn't right for encoding conversions. There was a long discussion regarding it during Boost.Locale review. The major reason is that you don't work on characters by themselves but rather character sequences and intermediate states. Since all conversion API gets some kind of "char *" and virtually every string class around has its "c_str()" variant I don't really see it necessary to provide arbitrary iterators (especially since under the hood I'll need to combine them into chunks of char * for processing) And if you want to process a stream you will be much better going with std::codecvt facet that works on chunks and stream.
What it does it uses native console Wide API to write/read to/from console:
http://cppcms.com/files/nowide/html/index.html#technical_cio
In case of redirection it just becomes alias of std::cin/cout (docs need to be updated a little)
OK. How about mixing use of the boost::nowide:: and the std:: versions? If that possible? Because that's something that I guess will be hard to avoid.
Shouldn't be a problem. Finally all goes to console. Artyom
They will fail if the user-provided buffer isn't big enough.
I see.
If some I'd rather expect to extend the boost.locale functionality rather boost.nowide - it seems to be more appropriate (since nowide functions finally call utf_to_utf functions from boost.locale library)
I think it's convenient to have the function that does the conversion in the Nowide library itself, even if it just calls come other function. That way the user doesn't have to know or care what kind of exact conversion rules Nowide uses - they can just call nowide::widen/narrow.
Regarding iterator input/output interface.
I must say that iterator based interface isn't right for encoding conversions. There was a long discussion regarding it during Boost.Locale review.
The major reason is that you don't work on characters by themselves but rather character sequences and intermediate states.
Since all conversion API gets some kind of "char *" and virtually every string class around has its "c_str()" variant I don't really see it necessary to provide arbitrary iterators (especially since under the hood I'll need to combine them into chunks of char * for processing)
I'm not suggesting an iterator-based interface. I'm just suggesting two changes. 1: change stack_wstring wstr; wstr.convert(input); to stack_wstring wstr; widen(wstr, input); I.e. move the conversion function out of the string class. And then 2: allow arbitrary (string) types instead of stack_wstring. Not via iterators, the widen/narrow functions could (should) simply use push_back(char) or append(char const*, char const*) to "build" the output string.
Shouldn't be a problem. Finally all goes to console.
Won't buffering cause trouble - especially for input?
On Fri, Jun 23, 2017 at 5:35 PM, Groke, Paul via Boost
I think it's convenient to have the function that does the conversion in the Nowide library itself, even if it just calls come other function. That way the user doesn't have to know or care what kind of exact conversion rules Nowide uses - they can just call nowide::widen/narrow.
I see
Regarding iterator input/output interface.
[snip]
I'm not suggesting an iterator-based interface. I'm just suggesting two changes. 1: change
stack_wstring wstr; wstr.convert(input);
to
stack_wstring wstr; widen(wstr, input);
I.e. move the conversion function out of the string class. And then 2: allow arbitrary (string) types instead of stack_wstring. Not via iterators, the widen/narrow functions could (should) simply use push_back(char) or append(char const*, char const*) to "build" the output string.
Now I understand what do you mean. I assume it can be done/
Shouldn't be a problem. Finally all goes to console.
Won't buffering cause trouble - especially for input?
It should be similar to behavior of std::cin as it uses tie call so apparently on the output side shoudn't be problem. (Also good point to test) Regarding input it of course will be much trickier. More testing would likely be needed and I see more potential problems. I'll need to check it as well and if limitation exist report them Artyom
participants (4)
-
Artyom Beilis
-
Frédéric Bron
-
Groke, Paul
-
Peter Dimov