On 23. mai 2015 15:50, Peter Dimov wrote:
Bjørn Roald wrote:
I think encoding is going to be a challenge.
On POSIX, I think you are right that one can assume the character encoding is defined by the system, and that it may be a multi-byte or single-byte character string, whatever the locale defines.
On POSIX, the system doesn't care about encodings. You get from getenv exactly the byte string you passed to setenv.
File paths in Windows are stored in double-byte character strings encoded as UCS-2, which is the fixed-width two-byte predecessor of UTF-16.
No, file paths on Windows are UTF-16.
OK, in that case that is good. One reference I found states that UTF-16 has been supported since Windows 2000, so I must have based my misled mind on some rather dated information. Possibly I also mixed it up with the fact that the two encodings are so similar in normal use that UCS-2 is often mistakenly referred to as UTF-16. So it is hard to know which statements to trust without testing; I am glad I put a disclaimer at the top.
I'm not quite sure how SetEnvironmentVariableA and SetEnvironmentVariableW interact, though; I don't see it documented. The typical behavior for an A/W pair is for the A function to be implemented in terms of the W one, using the current system code page for converting the strings.
The C runtime getenv/_putenv functions actually maintain two separate copies of the environment, one narrow, one wide.
https://msdn.microsoft.com/en-us/library/tehxacec.aspx
The problem therefore is that it's not quite possible to provide a portable interface.
One possible, but certainly not perfect, approach is for the interface to convert as needed between an external and an internal encoding. The external encoding is either explicitly requested by the user, or UTF-8 is assumed. The internal encoding is always UTF-16 on Windows and UTF-8 on POSIX. How bad would that be?

If the Windows implementation converts to/from UTF-16 when needed and then uses Set/GetEnvironmentVariableW, the Windows back-end is taken care of, simple enough. With this scheme, however, it is harder to give a formal guarantee of correctness on POSIX systems. But it is hard to see how simply assuming that stored environment variables are UTF-8 is any worse than the alternatives, unless you know the variable's producer used another encoding; and if you know, it should be possible to convert anyway. Non-UTF-8 variables will likely become a less and less common problem with time, and you still have the same ability to recover as with the current POSIX char* interface, which makes no statement about the expected encoding.

The external encoding (used in API parameters) can depend on the width of the character type used in the API; the library could have functions using both char- and wchar_t-based strings. The char-based parameters would assume UTF-8, and the wchar_t-based parameters UTF-16 or UTF-32, depending on how many bits wchar_t has on the platform.
On POSIX, programs have to use the char* functions, because they don't encode/decode and therefore guarantee a perfect round-trip.
Right, but I question how much value that perfect round-trip has if the consumer has to guess the encoding. That is basically saying that I kept the encoding, and therefore I am happy, even though I may have lost the correct value.
Using wchar_t* may fail if the contents of the environment do not correspond to the encoding that the library uses.
On Windows, programs have to use the wchar_t* versions, for the same reason. Using char* may give you a mangled result in the case the environment contains a file name that cannot be represented in the current encoding.
(If the library uses the C runtime getenv/_putenv functions, those will likely guarantee a perfect round-trip, but this will not solve the problem with a preexisting wide environment that is not representable.)
Many people - me included - have adopted a programming model in which char[] strings are assumed to be UTF-8 on Windows, and the char[] API calls the wide Windows API internally, then converts between UTF-16 and UTF-8 as appropriate. Since the OS X POSIX API is UTF-8 based and most Linux systems are transitioning or have already transitioned to UTF-8 as default, using UTF-8 and char[] results in reasonably portable programs.
I have also followed this pattern for portable code in the past, and I think it is a good pattern to support in a new library.
This however doesn't appeal to people who prefer to use another encoding, and makes the char[] API not correspond to the Windows char[] API (the A functions) as those use the "ANSI code page" which can't be UTF-8.
I thought at least some ANSI and ISO code pages were ASCII-based, are they not? Given that all values in the range 0 through 127 are the same as in ASCII, text using only those values is just as much UTF-8 as pure ASCII text.
Boost.Filesystem sidesteps the problem by letting you choose whatever encoding you wish. I don't particularly like this approach.
I guess it adds complexity to the API, which could discourage users who only need one or two common UTF encodings. A separate string conversion library could do the rest of the job when odd encodings are needed. Are there any other disadvantages? -- Bjørn