[review] Review of Nowide (Unicode) continues - summary of the discussions so far
Dear all,

The formal review of Artyom Beilis' Nowide library continues until Wed. 21st of June.

Here is a summary of the discussions so far:

7 people have participated in the review (not counting Artyom and myself):
- Degski
- Niall Douglas
- Paul Groke
- Peter Dimov
- Vadim Zeitlin
- Yakov Galka
- Zach Laine

2 official reviews have been shared, both in favour of inclusion in Boost (Degski, Peter Dimov).

Discussions:

Everybody is positive about the usefulness of the library, but some major questions arose:
1. How should we handle invalid Unicode sequences (single bytes or multi-byte sequences)? Allow them? Throw an error?
2. Windows and Posix are treated asymmetrically:
   - on Windows, the conversion from UTF-16 checks for Unicode conformance and either returns valid UTF-8 or fails
   - on Posix, no check is performed that the input is a valid UTF-8 sequence.
3. In particular, on Windows, the round-trip conversion UTF-16 -> UTF-8 -> UTF-16 is not guaranteed if the initial string is non-conformant.

One main issue is that existing files may have non-conformant names and would therefore not be reachable through the Nowide API on Windows. On Posix platforms, non-conformant paths would work transparently.

Different proposals have been made to address this:
- convert from UTF-16 to a superset of UTF-8 so that the round-trip conversion is possible. This would mean that the library accepts non-conformant strings on Windows, as it already does on Posix. Question: which encoding should be used (Modified UTF-8, WTF-8, CESU-8)? WTF-8 has the issue of being difficult to handle with string concatenations.
- use the RtlUTF8ToUnicodeN family of functions, which replace ill-formed UTF-16 code units with U+FFFD; but is this acceptable for round-trips?
- see the GLib approach
- add a function to explicitly convert from wide strings to WTF-8

However, in the Posix case, the issue is that we cannot guarantee that the encoding is always UTF-8, so checking for conformance may be impossible. Hence the choice not to check on Posix.
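To make the round-trip question concrete, here is a minimal sketch (not Nowide code, and not any reviewer's proposal as such) comparing two policies for an unpaired UTF-16 surrogate: replacing it with U+FFFD, or keeping it in the 3-byte form that WTF-8 uses. The names append_utf8_like, lone_surrogate, encode_utf16 and decode_to_utf16 are hypothetical and chosen only for this illustration. The decoder does no validation and assumes its input came from the encoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Nowide): append cp using the UTF-8 bit
// layout. It does not reject surrogates, so U+D800..U+DFFF get a 3-byte form.
static void append_utf8_like(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Policy for an unpaired surrogate: replace it with U+FFFD (output is valid
// UTF-8) or keep it (output is a superset of UTF-8, as in WTF-8).
enum class lone_surrogate { replace, keep };

static std::string encode_utf16(const std::u16string& in, lone_surrogate policy) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool is_low = cp >= 0xDC00 && cp <= 0xDFFF;
        if (is_high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Well-formed pair: combine into one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if ((is_high || is_low) && policy == lone_surrogate::replace) {
            cp = 0xFFFD;  // the original code unit is lost here
        }
        append_utf8_like(out, cp);
    }
    return out;
}

// Inverse of encode_utf16. Assumption: the input was produced by
// encode_utf16, so sequences are complete and lead bytes are sane.
static std::u16string decode_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b0 < 0x80)      { cp = b0;        len = 1; }
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
        else                { cp = b0 & 0x07; len = 4; }
        assert(i + len <= in.size());  // precondition, not real validation
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);  // includes lone surrogates
        } else {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

With a name such as {u'a', 0xD800, u'z'}, the replace policy yields valid UTF-8 but decodes back to a different string, so the original file could no longer be addressed. The keep policy round-trips, at the price of producing bytes that are not valid UTF-8. The concatenation difficulty also shows up: encoding a lone high surrogate and a lone low surrogate separately and joining the bytes gives two 3-byte sequences, whereas encoding the joined UTF-16 string gives one 4-byte sequence.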
Minor points:
- Documentation is missing on what happens if invalid UTF-8 is provided (getenv, setenv, cout, cerr).
- ::setenv on Cygwin gives a compile-time error (Peter Dimov proposed a fix).
- Suggestion to add stat and readdir.

Frédéric
participants (1)
-
Frédéric Bron