On 23. mai 2015 15:50, Peter Dimov wrote:
Bjørn Roald wrote:
I think encoding is going to be a challenge.
On POSIX, I think you are right that one can assume the character encoding is defined by the system, and that it may be a multi-byte or single-byte character string, whatever the locale defines.
On POSIX, the system doesn't care about encodings. You get from getenv exactly the byte string you passed to setenv.
File paths in Windows are stored in double-byte character strings encoded as UCS-2, which is the fixed-width two-byte predecessor of UTF-16.
No, file paths on Windows are UTF-16.
OK, in that case that is good. One reference I found states that UTF-16 has been supported since Windows 2000, so I must have based my misled mind on some rather dated information. Possibly I also mixed it up with the fact that the two encodings are so similar in normal use that UCS-2 is often mistakenly referred to as UTF-16. So it is hard to know which statements to trust without testing; I am glad I put a disclaimer at the top.
I'm not quite sure how SetEnvironmentVariableA and SetEnvironmentVariableW interact, though; I don't see it documented. The typical behavior for an A/W pair is for the A function to be implemented in terms of the W one, using the current system code page for converting the strings.
The C runtime getenv/_putenv functions actually maintain two separate copies of the environment, one narrow, one wide.
https://msdn.microsoft.com/en-us/library/tehxacec.aspx
The problem therefore is that it's not quite possible to provide a portable interface.
One possible, but certainly not perfect, approach is for the interface to convert as needed between an external and an internal encoding. The external encoding is either explicitly requested by the user, or UTF-8 is assumed. The internal encoding is always UTF-16 on Windows and UTF-8 on POSIX. How bad would that be?

If the Windows implementation converts to/from UTF-16 when needed and then uses Set/GetEnvironmentVariableW, the Windows back-end is taken care of, simple enough. With this scheme, however, it is harder to give a formal guarantee of correctness on POSIX systems. But it is hard to see how simply assuming that stored environment variables are UTF-8 is any worse than the alternatives, unless you know the variable's producer used another encoding; and if you know, it should be possible to convert anyway. Non-UTF-8 variables will likely become a less and less common problem with time, and you still have the same ability to recover as with the current POSIX char* interface, which makes no statement about the expected encoding.

The external encoding (used in API parameters) can depend on the width of the character type used in the API; the library could have functions using both char- and wchar_t-based strings. The char-based parameters would assume UTF-8, and the wchar_t-based parameters UTF-16 or UTF-32, depending on how many bits wchar_t has on the platform.
On POSIX, programs have to use the char* functions, because they don't encode/decode and therefore guarantee a perfect round-trip.
Right, but I question how much value that perfect round-trip has if the consumer has to guess the encoding. That is basically saying that I kept the encoding, and therefore I am happy, even though I may have lost the correct value.
Using wchar_t* may fail if the contents of the environment do not correspond to the encoding that the library uses.
On Windows, programs have to use the wchar_t* versions, for the same reason. Using char* may give you a mangled result in the case the environment contains a file name that cannot be represented in the current encoding.
(If the library uses the C runtime getenv/_putenv functions, those will likely guarantee a perfect round-trip, but this will not solve the problem with a preexisting wide environment that is not representable.)
Many people - me included - have adopted a programming model in which char[] strings are assumed to be UTF-8 on Windows, and the char[] API calls the wide Windows API internally, then converts between UTF-16 and UTF-8 as appropriate. Since the OS X POSIX API is UTF-8 based and most Linux systems are transitioning or have already transitioned to UTF-8 as default, using UTF-8 and char[] results in reasonably portable programs.
I have also followed this pattern for portable code in the past, and I think it is a good pattern to support in a new library.
This however doesn't appeal to people who prefer to use another encoding, and makes the char[] API not correspond to the Windows char[] API (the A functions) as those use the "ANSI code page" which can't be UTF-8.
I thought at least some ANSI and ISO code pages were ASCII-based, are they not? Given that all values in the range 0 through 127 are the same as in ASCII, text using only those values is just as much UTF-8 as pure ASCII text.
Boost.Filesystem sidesteps the problem by letting you choose whatever encoding you wish. I don't particularly like this approach.
I guess it adds complexity to the API, which could discourage users who only need one or two common UTF encodings. A separate string conversion library could do the rest of the job when odd encodings are needed. Are there any other disadvantages? -- Bjørn