Re: [boost] [review] Review of Nowide (Unicode) starts today

Artyom Beilis

16 Jun 2017

9:27 a.m.

How would you therefore do this simple task:

1. get a directory from nowide::getenv -> base_dir (UTF-8 on Windows, unknown narrow encoding on POSIX)
2. create a file in base_dir whose name is file_name, encoded in UTF-8 (because it is created by the application).

If I understand well, I should NOT do this:

auto f = nowide::ofstream((boost::filesystem::path(base_dir) / file_name).string());

because this is guaranteed to work only on Windows, where I have the guarantee that base_dir is UTF-8, right?

Actually, that is exactly what you should do. Once you turn on **nowide integration** with Boost.Filesystem, path::string does not perform any conversions on POSIX platforms. And you can safely concatenate two different strings and a valid file will be created. Even if dir is in ISO 8859-1 and file in UTF-8. The file will be valid even if its name is not representable in any encoding.

Under Windows you'll get the correct conversions from narrow to wide using the UTF-8/UTF-16 facet installed by the Boost.Nowide/Filesystem integration, because on Windows Boost.Filesystem uses wide strings.

Artyom
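The byte-level behaviour described for POSIX can be sketched with the standard library alone. This is only an illustration under stated assumptions: `join_path` is a hypothetical helper, not a Boost.Nowide or Boost.Filesystem function, and it assumes `/` as the separator. It joins the two strings without looking at their encoding, which is the "no conversion" behaviour attributed to path::string above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a Boost API): joins a directory and a file name
// as raw bytes. No decoding or re-encoding is performed, so the two parts
// may be in different encodings.
std::string join_path(const std::string& base_dir, const std::string& file_name)
{
    assert(!file_name.empty());
    if (base_dir.empty() || base_dir.back() == '/')
        return base_dir + file_name;      // no extra separator needed
    return base_dir + '/' + file_name;    // insert exactly one separator
}
```

For example, `join_path("/tmp/caf\xE9", "r\xC3\xA9sum\xC3\xA9.txt")` combines a directory name whose last byte is an ISO 8859-1 "é" with a UTF-8 file name. The result keeps every byte as given, and that byte string is what would be handed to the system on an encoding-neutral platform.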

Frédéric Bron

16 Jun

9:42 a.m.

...

And you can safely concatenate two different strings and a valid file will be created. Even if dir is in ISO 8859-1 and file in UTF-8. The file will be valid even if not representable in any encoding.

but it seems to me that in this case, what I need is a UTF-8 -> ISO 8859-1 conversion of the file name before concatenation with the directory. Otherwise, OK, I will get a file because the system just asks for a narrow string, but its name will be wrong in the OS user interface.

Frédéric

Artyom Beilis

12:20 p.m.

On Fri, Jun 16, 2017 at 12:42 PM, Frédéric Bron <frederic.bron@m4x.org> wrote:

...

...
And you can safely concatenate two different strings and a valid file will be created. Even if dir is in ISO 8859-1 and file in UTF-8. The file will be valid even if not representable in any encoding.

but it seems to me that in this case, what I need is a UTF-8 -> ISO 8859-1 conversion of the file name before concatenation with the directory. Otherwise, OK, I will get a file because the system just asks for a narrow string, but its name will be wrong in the OS user interface.

Frédéric

You actually **assume** that the encoding of what you received from the system (e.g. from getenv) matches the current locale encoding. But it is not necessarily the same:

1. The file/directory was created by a user running in a different locale
2. The locale isn't defined properly or was modified
3. You got these files/directories from some other location (e.g. unzipped some archive)

In reality the OS does not care about encoding (most of the time). Unlike Windows, where wchar_t also defines the encoding (UTF-16), under POSIX platforms "char *" can contain whatever encoding, and it can be changed.

Also, UTF-8 is the most common encoding on all modern Unix-like systems: Linux, BSD, Mac OS X.

So I don't think it is necessary to perform any conversions between UTF-8 and whatever "char *" encoding you get, because:

(a) You can't reliably know what kind of encoding you use.
(b) The same "char *" may contain parts in different encodings and still be a valid path.

Artyom
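Point (b) can be made concrete with a small check. The following is a minimal sketch, assuming a hand-written validator; `is_valid_utf8` is a hypothetical name and not part of Boost.Nowide. It shows that a path built from an ISO 8859-1 directory and a UTF-8 file name is not well-formed UTF-8 as a whole, even though one of its parts is, so there is no single source encoding a conversion could start from.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if s is well-formed UTF-8.
// Rejects truncated sequences, stray continuation bytes, overlong forms,
// surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;   // decoded code point
        unsigned long min = 0;  // smallest code point allowed for this length
        if (c < 0x80)                { len = 1; cp = c;        min = 0; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false;             // invalid lead byte
        if (n - i < len) return false; // sequence cut off at end of string
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min) return false;                       // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // out of range
        i += len;
    }
    return true;
}
```

With this check, the UTF-8 file name `"r\xC3\xA9sum\xC3\xA9.txt"` passes, while the ISO 8859-1 directory `"/tmp/caf\xE9"` and the concatenation of the two both fail. The concatenated bytes can still name a file on an encoding-neutral system, which is why validating or converting the whole path is not a reliable step.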

Peter Dimov

12:40 p.m.

Artyom Beilis wrote:

...

Actually it is what exactly you should do. Once you turn on **nowide integration** with boost filesystem path::string does not perform any conversions on POSIX platforms.

And you can safely concatenate two different strings and a valid file will be created. Even if dir is in ISO 8859-1 and file in UTF-8. The file will be valid even if not representable in any encoding.

"POSIX" is not correct here. File names under Mac OS X are UTF-8. Always UTF-8, not merely by convention, and no matter what locale is set. (They're stored in UTF-16 in the filesystem, just like in NTFS. Unlike NTFS, it's _valid_ UTF-16.)

For encoding-neutral Unixes such as Linux and Solaris, the above is correct.

In any case, a library that attempts to reencode from the current locale to UTF-8 in the narrow case would be useless. (If the locale isn't UTF-8, nothing will work anyway; the library trying to handle this case would just make things worse.) This is well established empirically by now.

Niall Douglas

2:15 p.m.

...

...
And you can safely concatenate two different strings and a valid file will be created. Even if dir is in ISO 8859-1 and file in UTF-8. The file will be valid even if not representable in any encoding.

"POSIX" is not correct here. File names under Mac OS X are UTF-8. Always UTF-8, not merely by convention, and no matter what locale is set.

You were right but are now wrong, I am afraid, Peter:

https://mjtsai.com/blog/2017/03/24/apfss-bag-of-bytes-filenames/

Apple's shiny new APFS treats paths as a dumb sequence of bytes, and they are retrofitting HFS+ to behave the same to match.

...

(They're stored in UTF-16 in the filesystem, just like in NTFS. Unlike NTFS, it's _valid_ UTF-16.)

You should assume that invalid UTF will be present on all filing systems, because almost all of them treat paths as dumb byte strings. It's the only portable assumption. It's why AFIO v2 opts out of case insensitivity on Windows, which no doubt will surprise quite a few end users, but it does prevent many security attacks at the AFIO layer (by pushing the problem further up the stack).

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
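One consequence of treating paths as byte strings is that code which only needs to show a path (in a log, for instance) cannot assume it decodes cleanly. A minimal sketch of one way to cope, assuming a hypothetical helper `escape_path_bytes` that is not taken from AFIO, Boost.Nowide, or any other library: printable ASCII passes through, and every other byte is written as a `\xNN` escape, so the original bytes stay recoverable and are never reinterpreted.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: renders a path for logs or diagnostics only.
// Printable ASCII bytes (except backslash) pass through unchanged; every
// other byte, including valid non-ASCII UTF-8, is written as \xNN.
// The original byte string should still be used for all file operations.
std::string escape_path_bytes(const std::string& path)
{
    std::string out;
    for (unsigned char c : path) {
        if (c >= 0x20 && c < 0x7F && c != '\\') {
            out += static_cast<char>(c);
        } else {
            char buf[5]; // backslash, 'x', two hex digits, terminating NUL
            std::snprintf(buf, sizeof buf, "\\x%02X", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}
```

This is deliberately lossless rather than pretty: it escapes valid UTF-8 as well, because the sketch makes no guess about the encoding. A real user interface would more likely attempt a decode first and fall back to escaping only for the bytes that fail.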

