Re: [boost] [review] Review of Nowide (Unicode) starts today
How would you therefore do this simple task:

1. get a directory from nowide::getenv -> base_dir (UTF-8 on Windows, unknown narrow encoding on POSIX)
2. create a file in base_dir whose name is file_name, encoded in UTF-8 (because it is created by the application).

If I understand correctly, I should NOT do this:

    auto f = nowide::ofstream((boost::filesystem::path(base_dir) / file_name).string());

because this is guaranteed to work only on Windows, where I have the guarantee that base_dir is UTF-8, right?

Actually, that is exactly what you should do. Once you turn on **nowide integration** with Boost.Filesystem, path::string does not perform any conversions on POSIX platforms. You can safely concatenate two different strings and a valid file will be created, even if the directory is in ISO 8859-1 and the file name is in UTF-8. The file will be valid even if its name is not representable in any encoding.

Under Windows you will get the correct conversion from narrow to wide strings through the UTF-8/UTF-16 facet installed by the Boost.Nowide/Filesystem integration, because Boost.Filesystem uses wide strings on Windows.

Artyom
And you can safely concatenate two different strings and valid file will be created. Even if dir is in ISO 8859-1 and file in UTF-8. The file will be valid even if not representable in any encoding.
but it seems to me that in this case, what I need is a UTF-8 -> ISO 8859-1 conversion of the file name before concatenation with the directory. Otherwise, OK, I will get a file, because the system just asks for a narrow string, but its name will be wrong in the OS user interface.

Frédéric
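For illustration only, the UTF-8 -> ISO 8859-1 conversion described above could look like the following minimal sketch. The helper name `utf8_to_latin1` is hypothetical and is not part of Boost.Nowide or Boost.Filesystem; the sketch assumes the target encoding really is ISO 8859-1, which is exactly the assumption questioned elsewhere in this thread.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Boost.Nowide API): converts UTF-8 to ISO 8859-1.
// Returns false if the input is malformed or contains a code point above
// U+00FF, which ISO 8859-1 cannot represent. On failure, `out` holds only
// the part converted so far.
bool utf8_to_latin1(const std::string& in, std::string& out)
{
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            // ASCII is identical in both encodings.
            out.push_back(static_cast<char>(c));
        } else if (c == 0xC2 || c == 0xC3) {
            // 0xC2 and 0xC3 are the only lead bytes for U+0080..U+00FF.
            if (i + 1 >= in.size()) return false;          // truncated sequence
            const unsigned char d = static_cast<unsigned char>(in[++i]);
            if ((d & 0xC0) != 0x80) return false;          // not a continuation byte
            out.push_back(static_cast<char>(((c & 0x03) << 6) | (d & 0x3F)));
        } else {
            // Malformed input, or a code point ISO 8859-1 cannot represent.
            return false;
        }
    }
    return true;
}
```

The conversion can fail for any file name containing a character outside ISO 8859-1, so a caller would still need a fallback for that case.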
On Fri, Jun 16, 2017 at 12:42 PM, Frédéric Bron wrote:
And you can safely concatenate two different strings and valid file will be created. Even if dir is in ISO 8859-1 and file in UTF-8. The file will be valid even if not representable in any encoding.
but it seems to me that in this case, what I need is a UTF-8 -> ISO 8859-1 conversion of the file name before concatenation with the directory. Otherwise, OK, I will get a file, because the system just asks for a narrow string, but its name will be wrong in the OS user interface.
Frédéric
You actually **assume** that the encoding of what you receive from the system (for example from getenv) matches the current locale encoding. But it is not necessarily the same:

1. The file/directory was created by a user running in a different locale.
2. The locale isn't defined properly or was modified.
3. You got these files and directories from some other location (for example, by unzipping an archive).

In reality the OS does not care about the encoding (most of the time). Unlike Windows, where wchar_t also defines the encoding (UTF-16), on POSIX platforms a "char *" can contain any encoding, and that encoding can change. Also, UTF-8 is the most common encoding on all modern Unix-like systems: Linux, BSD, Mac OS X.

So I don't think it is necessary to perform any conversions between UTF-8 and whatever "char *" encoding you get, because:

(a) You can't reliably know which encoding is in use.
(b) The same "char *" may contain parts in different encodings and still be a valid path.

Artyom
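Point (b) can be made concrete with a small sketch. The helpers `is_valid_utf8` and `join_path` below are hypothetical names introduced for illustration, not Boost APIs. The idea is that joining an ISO 8859-1 directory with a UTF-8 file name byte-for-byte gives a string that is not well-formed UTF-8 as a whole, yet is still an acceptable path on an encoding-neutral POSIX system, since it contains no NUL byte and uses '/' as separator.

```cpp
#include <cstddef>
#include <string>

// Strict UTF-8 well-formedness check: rejects truncated sequences,
// overlong forms, surrogates, and values above U+10FFFF.
bool is_valid_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        unsigned long min_cp;  // smallest code point allowed for this length
        if (c < 0x80)                { len = 1; cp = c;        min_cp = 0; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min_cp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min_cp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min_cp = 0x10000; }
        else return false;                       // stray continuation or invalid lead byte
        if (i + len > s.size()) return false;    // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char d = static_cast<unsigned char>(s[i + k]);
            if ((d & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (d & 0x3F);
        }
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}

// Hypothetical helper: joins directory and file name byte-for-byte,
// with no encoding conversion at all.
std::string join_path(const std::string& dir, const std::string& name)
{
    if (dir.empty() || dir[dir.size() - 1] == '/') return dir + name;
    return dir + "/" + name;
}
```

For example, with a directory `"/tmp/caf\xE9"` (ISO 8859-1) and a file name `"r\xC3\xA9sum\xC3\xA9.txt"` (UTF-8), `is_valid_utf8` accepts the file name, rejects the directory, and rejects the joined path, while the joined bytes are exactly the two inputs separated by '/'.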
Artyom Beilis wrote:
Actually it is what exactly you should do. Once you turn on **nowide integration** with boost filesystem
path::string does not perform any conversions on POSIX platforms.
And you can safely concatenate two different strings and valid file will be created. Even if dir is in ISO 8859-1 and file in UTF-8. The file will be valid even if not representable in any encoding.
"POSIX" is not correct here. File names under Mac OS X are UTF-8. Always UTF-8, not merely by convention, and no matter what locale is set. (They're stored in UTF-16 in the filesystem, just like in NTFS. Unlike NTFS, it's _valid_ UTF-16.)

For encoding-neutral Unix-es such as Linux and Solaris, the above is correct.

In any case, a library that attempts to reencode from the current locale to UTF-8 in the narrow case would be useless. (If the locale isn't UTF-8, nothing will work anyway; a library trying to handle this case would just make things worse.) This is well established empirically by now.
And you can safely concatenate two different strings and valid file will be created. Even if dir is in ISO 8859-1 and file in UTF-8. The file will be valid even if not representable in any encoding.
"POSIX" is not correct here. File names under Mac OS X are UTF-8. Always UTF-8, not merely by convention, and no matter what locale is set.
You were right but are now wrong, I am afraid, Peter:

https://mjtsai.com/blog/2017/03/24/apfss-bag-of-bytes-filenames/

Apple's shiny new APFS treats paths as a dumb sequence of bytes, and they are retrofitting HFS+ to behave the same to match.
(They're stored in UTF-16 in the filesystem, just like in NTFS. Unlike NTFS, it's _valid_ UTF-16.)
You should assume that invalid UTF will be present on all filing systems, because almost all of them treat paths as dumb byte strings. It's the only portable assumption. It's why AFIO v2 opts out of case insensitivity on Windows, which no doubt will surprise quite a few end users, but it does prevent many security attacks at the AFIO layer (by pushing the problem further up the stack).

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
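As a hedged illustration of what "invalid UTF" means in the UTF-16 case mentioned earlier in the thread, the sketch below checks a sequence of 16-bit code units for unpaired surrogates. The function name `is_valid_utf16` is hypothetical. A file system that stores names as arbitrary 16-bit units can hold names for which this check fails.

```cpp
#include <cstddef>
#include <string>

// Returns true if every surrogate code unit in the sequence is part of a
// correctly ordered high/low pair, i.e. the sequence is well-formed UTF-16.
bool is_valid_utf16(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High (lead) surrogate: must be followed by a low surrogate.
            if (i + 1 >= s.size()) return false;
            const char16_t v = s[i + 1];
            if (v < 0xDC00 || v > 0xDFFF) return false;
            ++i;  // skip the low surrogate that completes the pair
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            // Low (trail) surrogate without a preceding high surrogate.
            return false;
        }
    }
    return true;
}
```

Code that must round-trip such names would keep them as opaque code units (or bytes) rather than decoding them, in line with the "dumb byte strings" assumption above.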
participants (4)

- Artyom Beilis
- Frédéric Bron
- Niall Douglas
- Peter Dimov