What happened to Boost.Nowide?
Hi, I'm a user of (a fork of) Boost.Nowide. I have already fixed some issues I found and was looking into getting them upstream. I also wanted to know whether it has finally made it into Boost, but unfortunately that does not seem to be the case. I found https://lists.boost.org/Archives/boost//2017/06/236475.php, the mid-2017 review result accepting it into Boost, and nothing seems to have happened since. Does anyone know what the status of Boost.Nowide is? It seems the filestream parts are now incorporated into Boost.Filesystem, so only cin/cout/cerr, the args wrapper and the C-function wrappers (fopen, ...) are missing. Especially the first two are very useful for writing cross-platform code. Might those be integrated into some other Boost library? Thanks, Alex
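For readers who haven't used it, here is a rough sketch of the kind of code in question. The `boost::nowide` names below follow the library as proposed for Boost; a standalone fork may use a plain `nowide` namespace instead.

```cpp
// Sketch only: assumes the Boost.Nowide headers as proposed for Boost.
#include <boost/nowide/args.hpp>     // boost::nowide::args
#include <boost/nowide/cstdio.hpp>   // boost::nowide::fopen
#include <boost/nowide/iostream.hpp> // boost::nowide::cout / cerr
#include <cstdio>

int main(int argc, char** argv)
{
    // Re-encodes argv as UTF-8 on Windows; a no-op on other platforms.
    boost::nowide::args args(argc, argv);

    if (argc > 1) {
        // The path is UTF-8; on Windows it is converted to UTF-16 internally.
        if (std::FILE* f = boost::nowide::fopen(argv[1], "r")) {
            boost::nowide::cout << "Opened " << argv[1] << "\n";
            std::fclose(f);
        } else {
            boost::nowide::cerr << "Could not open " << argv[1] << "\n";
        }
    }
}
```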
On Wed, Nov 6, 2019 at 2:11 AM Alexander Grund via Boost < boost@lists.boost.org> wrote:
Does anyone know what the status of Boost.Nowide is? It seems the filestream parts are now incorporated into Boost.Filesystem, so only cin/cout/cerr, the args wrapper and the C-function wrappers (fopen, ...) are missing. Especially the first two are very useful for writing cross-platform code.
I don't know the status of Boost.Nowide. However, its usefulness has diminished with the introduction of UTF-8 codepage support in Windows 10 in May this year. See https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-cod... . -- Yakov Galka http://stannum.co.il/
Am 07.11.19 um 03:39 schrieb Yakov Galka:
On Wed, Nov 6, 2019 at 2:11 AM Alexander Grund via Boost <boost@lists.boost.org> wrote:
Does anyone know what the status of Boost.Nowide is? It seems the filestream parts are now incorporated into Boost.Filesystem, so only cin/cout/cerr, the args wrapper and the C-function wrappers (fopen, ...) are missing. Especially the first two are very useful for writing cross-platform code.
I don't know the status of Boost.Nowide. However, its usefulness has diminished with the introduction of UTF-8 codepage support in Windows 10 in May this year. See https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-cod....
Interesting, thanks! Great to see that someone at MS finally made the right decision, so that Windows is no longer the only OS not supporting UTF-8. However, it does require Windows 10 1903 at minimum and a change to the application manifest(s), so maybe Nowide still has some use. Alex
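As a small aside, here is a Windows-only sketch (mine, not from the linked docs) to check at runtime whether the UTF-8 code page actually took effect for the current process, e.g. after adding the activeCodePage entry to the manifest:

```cpp
// Windows-only sketch: report whether the process ANSI code page is UTF-8.
#include <windows.h>
#include <cstdio>

int main()
{
    const UINT acp = GetACP();
    if (acp == CP_UTF8)
        std::puts("ACP is UTF-8: narrow-char Win32 APIs accept UTF-8 here.");
    else
        std::printf("ACP is %u: the UTF-8 code page setting is not in effect.\n", acp);
}
```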
On Mon, Nov 11, 2019 at 1:09 AM Alexander Grund via Boost < boost@lists.boost.org> wrote:
However it does require Win10 1903 minimum and a change to the manifest(s). So maybe Nowide still has some use.
Looks like there is a way to set UTF-8 globally for existing applications: https://stackoverflow.com/q/56419639/277176 (though I haven't tried it yet). -- Yakov Galka http://stannum.co.il/
Am 07.11.19 um 03:39 schrieb Yakov Galka:
However, its usefulness has diminished with the introduction of UTF-8 codepage support in Windows 10 in May this year. See https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-cod....
I just noticed that it is very unfortunate that this didn't happen 3 years (or so) ago. Now not only does `boost::filesystem::path` use `wchar_t`, but so does the C++17 `std::filesystem::path`. So we now have costly conversions and waste half the space on Windows for no gain :/
On Mon, Nov 11, 2019 at 4:19 AM Alexander Grund via Boost < boost@lists.boost.org> wrote:
I just noticed that it is very unfortunate that this didn't happen 3 years (or so) ago. Now not only does `boost::filesystem::path` use `wchar_t`, but so does the C++17 `std::filesystem::path`. So we now have costly conversions and waste half the space on Windows for no gain :/
I raised this issue many years ago. In fact boost filesystem v2 was better in this respect, because it followed the established convention of having a templated basic_path<char>, thus not committing to a specific char type. Alas, v2 was deprecated and v3 was lobbied into WG21 for standardization. It was an unprecedented case of introducing a "char on some platforms, wchar_t on others" interface into the standard, which is a bad decision from a portability standpoint.

While we are at it, I would like to say that boost filesystem should have never introduced a path class in the first place. filesystem::path is just a glorified string with no extra invariants. Any string -> path conversion copies the data, even if it's already in the right encoding, even on operating systems that don't need any conversions at all. There goes your "don't pay for what you don't use" principle. Most can agree that C++'s spirit is to separate containers from algorithms. A proper design would introduce path manipulation functions that work on any string types, and let users use std::string or even char[] for storage. -- Yakov Galka http://stannum.co.il/
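For illustration, a minimal sketch of the storage-agnostic style being argued for; the function name and separator policy are made up for the example, not a proposed interface:

```cpp
// Hypothetical sketch of a container-agnostic path algorithm: the caller keeps
// ownership, storage type and encoding; the function only manipulates content.
#include <string>

template <class String>
void append_path(String& base, const String& component)
{
    using Char = typename String::value_type;
    const Char sep = Char('/'); // separator policy is hardcoded purely for the sketch
    if (!base.empty() && base.back() != sep)
        base += sep;
    base += component;
}

int main()
{
    std::string p = "/home/alex";
    append_path(p, std::string("docs"));   // "/home/alex/docs"

    std::wstring w = L"C:/Users/alex";
    append_path(w, std::wstring(L"docs")); // L"C:/Users/alex/docs"
}
```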
On 4/12/2019 05:25, Yakov Galka wrote:
I raised this issue many years ago. In fact boost filesystem v2 was better in this respect, because it followed the established convention of having a templated basic_path<char>, thus not committing to a specific char type. Alas, v2 was deprecated and v3 was lobbied into WG21 for standardization. It was an unprecedented case of introducing a "char on some platforms, wchar_t on others" interface into the standard, which is a bad decision from a portability standpoint.
While I agree in principle, the simple fact is that performing string transcoding on filesystem paths is a Very Bad Idea™, since both Windows and Linux treat them as opaque byte sequences -- but Windows' native encoding is UTF-16 and Linux' is (mostly) UTF-8. So, while unfortunate, v3 made the correct choice. Paths have to be kept in their original encoding between original source (command line, file, or UI) and file API usage, otherwise you can get weird errors when transcoding produces a different byte sequence that appears identical when actually rendered, but doesn't match the filesystem. Transcoding is only safe when you're going to do something with the string other than using it in a file API.
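To make the "appears identical when rendered" point concrete, here is a small example (mine, not from the thread) of two byte strings that display the same name but would stop matching the same directory entry if a path layer silently normalized one form into the other:

```cpp
// Example: the file name "café" in two different byte forms.
// If a path layer normalizes one into the other, the bytes handed to the
// filesystem no longer match the bytes stored in the directory entry.
#include <cassert>
#include <string>

int main()
{
    const std::string precomposed = "caf\xC3\xA9";  // U+00E9 (NFC form)
    const std::string decomposed  = "cafe\xCC\x81"; // 'e' + U+0301 (NFD form)

    assert(precomposed != decomposed); // byte-wise different, visually identical
}
```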
While we are at it, I would like to say that boost filesystem should have never introduced a path class in the first place. filesystem::path is just a glorified string with no extra invariants. Any string -> path conversion copies the data, even if it's already in the right encoding, even on operating systems that don't need any conversions at all. There goes your "don't pay for what you don't use" principle. Most can agree that C++'s spirit is to separate containers from algorithms. A proper design would introduce path manipulation functions that work on any string types, and let users use std::string or even char[] for storage.
While copying is unfortunate, these things are rarely on a performance-critical path, and the benefits of having consistent compose/decompose operations on paths vastly outweigh that, in my opinion. Combined with the need to maintain native encoding for paths, separated algorithms don't seem particularly useful -- just less convenient to use.
On Tue, Dec 3, 2019 at 2:19 PM Gavin Lambert via Boost < boost@lists.boost.org> wrote:
While I agree in principle, the simple fact is that performing string transcoding on filesystem paths is a Very Bad Idea™, since both Windows and Linux treat them as opaque byte sequences -- but Windows' native encoding is UTF-16 and Linux' is (mostly) UTF-8.
Unix paths can be stored in a narrow string already, where fopen() always magically worked for any text. Windows paths can be transcoded losslessly into WTF-8 and back.

So, while unfortunate, v3 made the correct choice. Paths have to be kept in their original encoding between original source (command line, file, or UI) and file API usage, otherwise you can get weird errors when transcoding produces a different byte sequence that appears identical when actually rendered, but doesn't match the filesystem. Transcoding is only safe when you're going to do something with the string other than using it in a file API.
See above, malformed UTF-16 can be converted to WTF-8 (a UTF-8 superset) and back losslessly. The unprecedented introduction of a platform specific interface into the standard was, still is, and will always be, a horrible mistake.
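For concreteness, a rough sketch (not taken from any particular library) of the encoding direction, i.e. potentially ill-formed UTF-16 from Windows into WTF-8. Unpaired surrogates are simply encoded like any other code point, which is what makes the round trip lossless:

```cpp
// Sketch: encode potentially ill-formed UTF-16 (e.g. a Windows file name)
// into WTF-8. Valid surrogate pairs become the usual 4-byte sequence; lone
// surrogates are encoded as ordinary 3-byte code points, so nothing is lost.
#include <cstdint>
#include <string>

std::string wtf8_encode(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid surrogate pair -> supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        // Generalized UTF-8 encoding; lone surrogates take the 3-byte branch.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```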
While copying is unfortunate, these things are rarely on a performance-critical path, and the benefits of having consistent compose/decompose operations on paths vastly outweigh that, in my opinion. Combined with the need to maintain native encoding for paths, separated algorithms don't seem particularly useful -- just less convenient to use.
The path parsing and modification functions could be storage agnostic. Some prefer the x.join(y) syntax over join(x,y), but that's just a preference originating from the OOP crowd. -- Yakov Galka http://stannum.co.il/
On 7/01/2020 14:58, Yakov Galka wrote:
So, while unfortunate, v3 made the correct choice. Paths have to be kept in their original encoding between original source (command line, file, or UI) and file API usage, otherwise you can get weird errors when transcoding produces a different byte sequence that appears identical when actually rendered, but doesn't match the filesystem. Transcoding is only safe when you're going to do something with the string other than using it in a file API.
See above, malformed UTF-16 can be converted to WTF-8 (a UTF-8 superset) and back losslessly. The unprecedented introduction of a platform specific interface into the standard was, still is, and will always be, a horrible mistake.
Given that WTF-8 is not itself supported by the C++ standard library (and the other formats are), that doesn't seem like a valid argument. You'd have to campaign for that to be added first.
The main problem though is that once you start allowing transcoding of any kind, it's a slippery slope to other conversions that can make lossy changes (such as applying different canonicalisation formats, or adding/removing layout codepoints such as RTL markers).
Also, if you read the WTF-8 spec, it notes that it is not legal to directly concatenate two WTF-8 strings (you either have to convert it back to UTF-16 first, or execute some special handling for the trailing characters of the first string), which immediately renders it a poor choice for a path storage format. And indeed a poor choice for any purpose. (I suspect many people who are using it have conveniently forgotten that part.)
Although on a related note, I think C++11/17 dropped the ball a bit on the new encoding-specific character types. It's definitely an improvement on the prior method, but it would have been better to do something like:

struct ansi_encoding_t;
struct utf_encoding_t;
typedef encoded_char
Gavin Lambert wrote:
The main problem though is that once you start allowing transcoding of any kind, it's a slippery slope to other conversions that can make lossy changes (such as applying different canonicalisation formats, or adding/removing layout codepoints such as RTL markers).
There's no such slippery slope, no canonicalization, no adding or removing anything. You just WTF-8 encode whatever Windows gives you, and WTF-8 decode the path before passing it to Windows.
Also, if you read the WTF-8 spec, it notes that it is not legal to directly concatenate two WTF-8 strings (you either have to convert it back to UCS-16 first, or execute some special handling for the trailing characters of the first string), which immediately renders it a poor choice for a path storage format.
Do you have a specific example in which concatenation won't work for the use outlined above? Because I can't think of any.
On Tue, Jan 7, 2020 at 3:17 PM Gavin Lambert via Boost < boost@lists.boost.org> wrote:
See above, malformed UTF-16 can be converted to WTF-8 (a UTF-8 superset) and back losslessly. The unprecedented introduction of a platform specific interface into the standard was, still is, and will always be, a horrible mistake.
Given that WTF-8 is not itself supported by the C++ standard library (and the other formats are), that doesn't seem like a valid argument. You'd have to campaign for that to be added first.
It doesn't need to be added to the standard. My claim was that instead of adding a wchar_t/char Heisenstring into the standard and proliferating the number of fstream constructors, one could stick to char interfaces and demand that "basic execution character set would be capable of storing any Unicode data". A Windows implementation could do that with WTF-8 to allow lossless transcoding.

The main problem though is that once you start allowing transcoding of any kind, it's a slippery slope to other conversions that can make lossy changes (such as applying different canonicalisation formats, or adding/removing layout codepoints such as RTL markers).
The truth is that there's already transcoding happening: mount a Windows partition on Unix or vice versa, and some breakage is expected if the filenames contain invalid sequences.
Also, if you read the WTF-8 spec, it notes that it is not legal to directly concatenate two WTF-8 strings (you either have to convert it back to UTF-16 first, or execute some special handling for the trailing characters of the first string), which immediately renders it a poor choice for a path storage format. And indeed a poor choice for any purpose. (I suspect many people who are using it have conveniently forgotten that part.)
Paths are, almost always, concatenated with ASCII separators (or other valid strings) in-between. Even when concatenating malformed strings directly, the issue isn't there if the result is passed immediately back to the "UTF-16" system.
Although on a related note, I think C++11/17 dropped the ball a bit on the new encoding-specific character types. [...]
C++11 over-engineered it, and you keep over-engineering it even further. Can't think of a time anybody had to mix ASCII, UTF-8, WTF-8 and EBCDIC strings in one program *at compile time*. -- Yakov Galka http://stannum.co.il/
On 8/01/2020 12:57, Yakov Galka wrote:
Paths are, almost always, concatenated with ASCII separators (or other valid strings) in-between. Even when concatenating malformed strings directly, the issue isn't there if the result is passed immediately back to the "UTF-16" system.
But the conversion from WTF-8 to UTF-16 can interpret the joining point as a different character, resulting in a different sequence. Unless I've misread something, this could occur if the first string ended in an unpaired high surrogate and the second started with an unpaired low surrogate (or rather the WTF-8 equivalents thereof). Unlikely, perhaps, but not impossible.
Although on a related note, I think C++11/17 dropped the ball a bit on the new encoding-specific character types. [...]
C++11 over-engineered it, and you keep over-engineering it even further. Can't think of a time anybody had to mix ASCII, UTF-8, WTF-8 and EBCDIC strings in one program *at compile time*.
You've just suggested cases where apps will contain both UTF-8 and WTF-8, which would be useful to distinguish between at compile time -- both to allow overloading to automatically select the correct conversion function and to give you compile errors if you accidentally try to pass a WTF-8 string to a function that expects pure UTF-8, or vice versa. The same applies for other cases. That's why C++20 introduced char8_t, so that you wouldn't accidentally pass UTF-8 strings to methods expecting other char formats. This could even be extended to other forms of two-way data encoding, such as UUEncoding or Base64. I don't think that's over-engineering, that's just basic data conversion and type safety.
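A toy sketch of the kind of compile-time distinction being described (the types here are hypothetical, not wording from any proposal): carrying the encoding in the type turns an accidental WTF-8-into-UTF-8 hand-off into a compile error and makes the conversion point explicit.

```cpp
// Hypothetical sketch: encoding carried in the type, conversion made explicit.
#include <string>

struct utf8_tag {};
struct wtf8_tag {};

template <class Encoding>
struct tagged_string {
    std::string bytes; // storage is plain bytes; the tag says how to interpret them
};

using utf8_string = tagged_string<utf8_tag>;
using wtf8_string = tagged_string<wtf8_tag>;

// Accepts only strict UTF-8; passing a wtf8_string does not compile.
void send_over_network(const utf8_string&) {}

// Explicit conversion point; a real version would reject or replace lone surrogates.
utf8_string to_utf8(const wtf8_string& s) { return utf8_string{s.bytes}; }

int main()
{
    wtf8_string path_from_windows{"example"};
    // send_over_network(path_from_windows);      // error: no matching overload
    send_over_network(to_utf8(path_from_windows)); // the conversion is visible
}
```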
Gavin Lambert wrote:
But the conversion from WTF-8 to UTF-16 can interpret the joining point as a different character, resulting in a different sequence. Unless I've misread something, this could occur if the first string ended in an unpaired high surrogate and the second started with an unpaired low surrogate (or rather the WTF-8 equivalents thereof).
I don't see why you think this would present a problem. The conversion of the first string will end in an unpaired high surrogate. The conversion of the second string will start with an unpaired low surrogate. The two, when concatenated, will form a valid UTF-16 encoding of a non-BMP character. Where is the issue here?
On Tue, Jan 7, 2020 at 5:16 PM Peter Dimov via Boost <boost@lists.boost.org> wrote:
Gavin Lambert wrote:
But the conversion from WTF-8 to UTF-16 can interpret the joining point as a different character, resulting in a different sequence. Unless I've misread something, this could occur if the first string ended in an unpaired high surrogate and the second started with an unpaired low surrogate (or rather the WTF-8 equivalents thereof).
I don't see why you think this would present a problem. The conversion of the first string will end in an unpaired high surrogate. The conversion of the second string will start with an unpaired low surrogate. The two, when concatenated, will form a valid UTF-16 encoding of a non-BMP character. Where is the issue here?
That's my point essentially. However Gavin refers to the fact that the current WTF-8 spec explicitly says that an encoding of high/low surrogate pairs is invalid in WTF-8. For example, the UTF-16 sequence d83d de09 should be encoded as the WTF-8 bytes f0 9f 98 89. But if one "UTF-16" string ended in d83d and the other started with de09, concatenating in WTF-8 would yield the "invalid WTF-8" bytes ed a0 bd ed b8 89. The spec explicitly prohibits this. The rationale behind this is to have a unique representation of any "UTF-16" stream, just like UTF-8 requires the shortest representation. It might be important for security reasons if you're going to compare those "invalid WTF-8" strings, but it is not an issue if the next thing you do is convert them back to UTF-16. -- Yakov Galka http://stannum.co.il/
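A sketch of the reverse direction (again illustrative only) shows why the concatenation artifact is harmless on the way back to the OS: a decoder that accepts generalized UTF-8, including surrogate encodings, turns the bytes ed a0 bd ed b8 89 back into d83d de09, exactly the valid pair the two halves came from.

```cpp
// Sketch: decode WTF-8-ish bytes (including the surrogate encodings produced
// by naive concatenation) back to UTF-16 code units. Feeding it the bytes
// ED A0 BD ED B8 89 yields D83D DE09, the original valid surrogate pair.
#include <cstdint>
#include <string>

std::u16string wtf8_decode(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if      (b0 < 0x80) { cp = b0;        len = 1; }
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
        else                { cp = b0 & 0x07; len = 4; }
        for (std::size_t k = 1; k < len; ++k) // sketch: assumes structurally valid input
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;

        if (cp >= 0x10000) { // supplementary plane -> surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp); // BMP code point or lone surrogate
        }
    }
    return out;
}
```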
Yakov Galka wrote:
That's my point essentially. However Gavin refers to the fact that the current WTF-8 spec explicitly says that an encoding of high/low surrogate pairs is invalid in WTF-8.
Ah that. Yes, concatenating two character sequences can result in technically invalid WTF-8. But that's not an issue unique to Windows. You can do the same on any non-Windows platform. It's still not clear how this prevents a `path` class from storing ~WTF-8 on Windows, or exposing a char-based API that ~WTF-8 decodes when passing to Windows, and encodes on the reverse trip.
On 8/01/2020 14:43, Peter Dimov wrote:
Yes, concatenating two character sequences can result in technically invalid WTF-8. But that's not an issue unique to Windows. You can do the same on any non-Windows platform. It's still not clear how this prevents a `path` class from storing ~WTF-8 on Windows, or exposing a char-based API that ~WTF-8 decodes when passing to Windows, and encodes on the reverse trip.
It could. And if you're only round-tripping it to file APIs and doing nothing else, then you can probably get away with that. But there's probably other code that wants to do manipulation on the path (swapping extensions, passing it to some UI, truncating the filename to 10 characters, etc). Now more parts of the system need to know that you have data in not-legal-WTF-8 format, and how to deal with that. (Or, more likely, you end up passing it to something that expects legal UTF-8 without telling it otherwise, and it mostly works -- until it doesn't.)
Gavin Lambert wrote:
On 8/01/2020 14:43, Peter Dimov wrote:
Yes, concatenating two character sequences can result in technically invalid WTF-8. But that's not an issue unique to Windows. You can do the same on any non-Windows platform. It's still not clear how this prevents a `path` class from storing ~WTF-8 on Windows, or exposing a char-based API that ~WTF-8 decodes when passing to Windows, and encodes on the reverse trip.
It could. And if you're only round-tripping it to file APIs and doing nothing else, then you can probably get away with that.
But there's probably other code that wants to do manipulation on the path (swapping extensions, passing it to some UI, truncating the filename to 10 characters, etc). Now more parts of the system need to know that you have data in not-legal-WTF-8 format, and how to deal with that.
No, there aren't any (new) problems with that. That is, there aren't problems you wouldn't have otherwise, on other platforms. Vanilla POSIX can have any NTBS at all as a path/file name; macOS has UTF-8 NFD paths/file names. Any code you have that tries to truncate the filename to 10 characters (for whatever definition of character) is already broken. This is simply not an operation that can be done portably on a path or file name. (And any code that assumes that a file name will roundtrip, or that two different file names can't refer to the same file/directory entry, is also broken.)
participants (4): Alexander Grund, Gavin Lambert, Peter Dimov, Yakov Galka