We've been discussing some UUID design issues in, well, issues in the Uuid repo, namely https://github.com/boostorg/uuid/issues/96 and, for context, https://github.com/boostorg/uuid/issues/113. This topic might be of wide interest to developers, so I thought it would be better to have this list know about it. Anyone who has an opinion on the matter is welcome to express it in the issue, or here.

In addition to the questions raised in #96, namely, what the constructors and the accessors of `uuid` should look like (once we somehow move away from the public `data` member), here are some more topics of interest:

* Should a default-constructed `uuid` remain uninitialized, or should the default constructor produce the nil UUID? (Uninitialized is not a good look in 2024, because of safety. On the other hand, `new uuid[ 1048576 ]` will start doing a `memset`.)

* At the moment, wide strings are processed by the name generators by converting every wchar_t to 32 bits, then hashing the bytes, zeroes and all. This doesn't strike me as correct. I think that the string should be converted to UTF-8 on the fly (with 16 bit wchar_t assumed UTF-16 and 32 bit wchar_t assumed UTF-32).
Rob Boehne wrote:
* At the moment, wide strings are processed by the name generators by converting every wchar_t to 32 bits, then hashing the bytes, zeroes and all. This doesn't strike me as correct. I think that the string should be converted to UTF-8 on the fly (with 16 bit wchar_t assumed UTF-16 and 32 bit wchar_t assumed UTF-32).
To my thinking, a string should just be treated as binary data and should not have its encoding changed; this should also be less work.
This behavior makes name UUIDs produced by e.g. "www.example.org" and L"www.example.org" different, which is unlikely to be what one wants in practice, and is against the recommendation of RFC 4122, which says:

o Convert the name to a canonical sequence of octets (as defined by the standards or conventions of its name space); put the name space ID in network byte order.

I don't think anyone can justify the choice of e.g. 0x41 0x00 0x00 0x00 as the "canonical sequence of octets" for U"A".
On 4/25/24 17:53, Peter Dimov via Boost wrote:
Rob Boehne wrote:
* At the moment, wide strings are processed by the name generators by converting every wchar_t to 32 bits, then hashing the bytes, zeroes and all. This doesn't strike me as correct. I think that the string should be converted to UTF-8 on the fly (with 16 bit wchar_t assumed UTF-16 and 32 bit wchar_t assumed UTF-32).
To my thinking, a string should just be treated as binary data and should not have its encoding changed; this should also be less work.
This behavior makes name UUIDs produced by e.g. "www.example.org" and L"www.example.org" different, which is unlikely to be what one wants in practice, and is against the recommendation of RFC 4122, which says
o Convert the name to a canonical sequence of octets (as defined by the standards or conventions of its name space); put the name space ID in network byte order.
I don't think anyone can justify the choice of e.g. 0x41 0x00 0x00 0x00 as the "canonical sequence of octets" for U"A".
Perhaps we should simply assume that whatever form of the string the user provided to the generator is the "canonical" form. That is, if the user wants "www.example.org" and L"www.example.org" to produce the same UUID, it is his responsibility to convert those strings to the same representation before passing them to the generator.

I think that in some regions Unicode might not be the first encoding of choice, and there are also incorrectly encoded strings that cannot be converted to UTF-8. I don't think that Boost.UUID should deal with those issues.
Andrey Semashev wrote:
On 4/25/24 17:53, Peter Dimov via Boost wrote:
This behavior makes name UUIDs produced by e.g. "www.example.org" and L"www.example.org" different, which is unlikely to be what one wants in practice, and is against the recommendation of RFC 4122, which says
o Convert the name to a canonical sequence of octets (as defined by the standards or conventions of its name space); put the name space ID in network byte order.
I don't think anyone can justify the choice of e.g. 0x41 0x00 0x00 0x00 as the "canonical sequence of octets" for U"A".
Perhaps we should simply assume that whatever form of the string the user provided to the generator is the "canonical" form. That is, if the user wants "www.example.org" and L"www.example.org" to produce the same UUID, it is his responsibility to convert those strings to the same representation before passing them to the generator.
I think that in some regions Unicode might not be the first encoding of choice, and there are also incorrectly encoded strings that cannot be converted to UTF-8. I don't think that Boost.UUID should deal with those issues.
The right way to not deal with these issues is to simply not take wide strings in the first place. This forces the user to supply "the canonical octet representation". Since we do take wide strings, we have implicitly accepted the responsibility to produce the canonical octet representation for them. And inserting zeroes randomly is simply wrong.
On 4/25/24 18:28, Peter Dimov wrote:
Andrey Semashev wrote:
On 4/25/24 17:53, Peter Dimov via Boost wrote:
This behavior makes name UUIDs produced by e.g. "www.example.org" and L"www.example.org" different, which is unlikely to be what one wants in practice, and is against the recommendation of RFC 4122, which says
o Convert the name to a canonical sequence of octets (as defined by the standards or conventions of its name space); put the name space ID in network byte order.
I don't think anyone can justify the choice of e.g. 0x41 0x00 0x00 0x00 as the "canonical sequence of octets" for U"A".
Perhaps we should simply assume that whatever form of the string the user provided to the generator is the "canonical" form. That is, if the user wants "www.example.org" and L"www.example.org" to produce the same UUID, it is his responsibility to convert those strings to the same representation before passing them to the generator.
I think that in some regions Unicode might not be the first encoding of choice, and there are also incorrectly encoded strings that cannot be converted to UTF-8. I don't think that Boost.UUID should deal with those issues.
The right way to not deal with these issues is to simply not take wide strings in the first place. This forces the user to supply "the canonical octet representation".
Since we do take wide strings, we have implicitly accepted the responsibility to produce the canonical octet representation for them. And inserting zeroes randomly is simply wrong.
Ok, so maybe we should simply deprecate the support for wide string inputs?
Andrey Semashev wrote:
The right way to not deal with these issues is to simply not take wide strings in the first place. This forces the user to supply "the canonical octet representation".
Since we do take wide strings, we have implicitly accepted the responsibility to produce the canonical octet representation for them. And inserting zeroes randomly is simply wrong.
Ok, so maybe we should simply deprecate the support for wide string inputs?
That's one possible way to deal with it, yes. Although I think that for char16_t and char32_t inputs the canonical representation is unambiguous.

This leaves wchar_t, and while nobody on POSIX will shed a tear, Windows users will probably be disappointed if we take that away. That's why I thought that treating wchar_t as char16_t or char32_t, depending on its size, was an acceptable compromise. (That's almost always true in practice, with the exception of weird IBM systems that use wide EBCDIC, which aren't exactly our target audience.)
On 4/25/24 18:48, Peter Dimov wrote:
Andrey Semashev wrote:
The right way to not deal with these issues is to simply not take wide strings in the first place. This forces the user to supply "the canonical octet representation".
Since we do take wide strings, we have implicitly accepted the responsibility to produce the canonical octet representation for them. And inserting zeroes randomly is simply wrong.
Ok, so maybe we should simply deprecate the support for wide string inputs?
That's one possible way to deal with it, yes.
Although I think that for char16_t and char32_t inputs the canonical representation is unambiguous.
If you mean that char16_t and char32_t strings will still be converted to UTF-8 internally, then you still have the issue of incorrect UTF-16/32 strings on input. I think the only input character type we should allow is char. And we should not care what encoding it is or whether it is valid or not. IOW, we should take it as an opaque sequence of bytes.
This leaves wchar_t and while nobody on POSIX will shed a tear, Windows users will probably be disappointed if we take that away.
Well, we do provide libraries for character encoding conversion, so users are free to use those. Sure, it adds a bit of work, but not that much. And I doubt the name generator is very popular.
On Thu, Apr 25, 2024 at 9:01 AM Andrey Semashev via Boost <boost@lists.boost.org> wrote:
I think the only input character type we should allow is char. And we should not care what encoding it is or whether it is valid or not. IOW, we should take it as an opaque sequence of bytes.
I don't know much about the specifics of Boost.UUID, but this solution appeals to me because it pushes all responsibility for figuring out Unicode or encoding equivalences onto the caller, where it belongs. Separation of concerns is better than forcing every library that interacts with strings to solve these same problems over and over again (usually inconsistently).

Thanks
From: Peter Dimov
* At the moment, wide strings are processed by the name generators by converting every wchar_t to 32 bits, then hashing the bytes, zeroes and all. This doesn't strike me as correct. I think that the string should be converted to UTF-8 on the fly (with 16 bit wchar_t assumed UTF-16 and 32 bit wchar_t assumed UTF-32).
To my thinking, a string should just be treated as binary data and should not have its encoding changed; this should also be less work.
This behavior makes name UUIDs produced by e.g. "www.example.org" and L"www.example.org" different, which is unlikely to be what one wants in practice, and is against the recommendation of RFC 4122, which says:

o Convert the name to a canonical sequence of octets (as defined by the standards or conventions of its name space); put the name space ID in network byte order.

I don't think anyone can justify the choice of e.g. 0x41 0x00 0x00 0x00 as the "canonical sequence of octets" for U"A".

Ok – I withdraw my comment.
On 25.04.24 16:33, Peter Dimov via Boost wrote:
We've been discussing some UUID design issues in, well, issues in the Uuid repo, namely
https://github.com/boostorg/uuid/issues/96 and, for context, https://github.com/boostorg/uuid/issues/113.
This topic might be of wide interest to developers, so I thought it would be better to have this list know about it.
Anyone who has an opinion on the matter is welcome to express it in the issue, or here.
In addition to the questions raised in #96, namely, what the constructors and the accessors of `uuid` should look like (once we somehow move away from the public `data` member), here are some more topics of interest:
* Should a default-constructed `uuid` remain uninitialized, or should the default constructor produce the nil UUID? (Uninitialized is not a good look in 2024, because of safety. On the other hand, `new uuid[ 1048576 ]` will start doing a `memset`.)
Initialize it. I can't imagine producing enough UUIDs that the performance cost of initialization could be significant.
* At the moment, wide strings are processed by the name generators by converting every wchar_t to 32 bits, then hashing the bytes, zeroes and all. This doesn't strike me as correct. I think that the string should be converted to UTF-8 on the fly (with 16 bit wchar_t assumed UTF-16 and 32 bit wchar_t assumed UTF-32).
When creating a UUID from a name instead of by a random process, the same input name is guaranteed to produce the same output UUID. Silently changing the current behavior breaks that guarantee, which can break user code. Better to either completely remove wchar_t support or to leave the current behavior in place.

--
Rainer Deyke (rainerd@eldwood.com)
participants (5)

- Andrey Semashev
- Peter Dimov
- Rainer Deyke
- Rob Boehne
- Vinnie Falco