On 09.10.2015 19:27, Peter Dimov wrote:
Andrey Semashev wrote:
string fn = get_file_name(); fopen( fn.c_str() );
What I'm saying is that the get_file_name implementation should not even mention UTF-8 anywhere, because the encoding it has to deal with is not UTF-8. Whatever the original encoding of the file name is (possibly broken UTF-16 obtained from WinAPI, or true UTF-8 obtained from the network or a file), the target encoding has to match what fopen expects.
'fopen' here is a function that decodes 'fn' and calls _wfopen, or CreateFileW, or whatever is appropriate.
get_file_name and fopen work in tandem to make it so that the file selected by the former is opened by the latter. And to do that, they may need to put invalid UTF-8 in 'fn'.
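To make the round-trip concrete, here is a minimal sketch of the encoding half of such a pair. It's a WTF-8-style scheme, and all names here are made up for illustration, not an existing API. Valid surrogate pairs become ordinary 4-byte UTF-8; unpaired UTF-16 surrogates are emitted as 3-byte sequences, which is exactly the "invalid UTF-8" in question, but the original UTF-16 can be recovered byte for byte:

#include <cstdint>
#include <string>

// Sketch: losslessly encode a UTF-16 name (as obtained from WinAPI)
// into bytes. Unpaired surrogates are encoded as if they were code
// points, which makes the result invalid UTF-8 but exactly reversible.
std::string encode_name(std::u16string const& w)
{
    std::string out;
    for (std::size_t i = 0; i < w.size(); ++i)
    {
        std::uint32_t cp = w[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < w.size() &&
            w[i + 1] >= 0xDC00 && w[i + 1] <= 0xDFFF)
        {
            // A valid surrogate pair: combine into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (w[++i] - 0xDC00);
        }
        // An unpaired surrogate falls through and is encoded below
        // as a 3-byte sequence -- the "invalid UTF-8" part.
        if (cp < 0x80)
            out += static_cast<char>(cp);
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

The decoding half in fopen's wrapper just reverses this before calling _wfopen, so the exact original name reaches the OS.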
Right. Just don't call it UTF-8 anymore.
There should be no such thing as 'least invalid' or 'almost valid' data.
There exists a legitimate notion of more valid or less valid UTF-8, because a sequence can be invalid in different ways, some more fundamental than others.
Could you point me to a definition of these degrees of validity? In my understanding, a string is valid if it can be decoded by a conforming parser. E.g., it must not contain invalid code points (i.e. those not allowed by the standard) or invalid sequences of them.
There are normalization and string collation algorithms to deal with this. What's important is that the input to these and other algorithms is valid.
This depends on the notion of 'valid'. UTF-8 that encodes code points in more bytes than necessary still corresponds to a valid code point sequence.
AFAIU, no, it is not a valid encoding. At least, not according to this: https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings
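For example, the classic overlong form of '/' is the byte pair C0 AF. A lax decoder maps it back to U+002F (this is the form that historically slipped '/' past path checks); strict handling rejects it, because a two-byte sequence must encode a code point of at least 0x80. The arithmetic is just the two-byte decoding rule:

#include <cstdint>

unsigned char seq[] = { 0xC0, 0xAF }; // overlong encoding of '/'
std::uint32_t cp = ((seq[0] & 0x1Fu) << 6) | (seq[1] & 0x3Fu);
// cp == 0x2F, but the minimal encoding of 0x2F is the single
// byte 0x2F, so a strict decoder rejects the two-byte form:
bool valid = (cp >= 0x80); // false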
Strict handling rejects it not because it decodes to invalid Unicode, but because it's not the minimal representation of the code point sequence. But the code point sequence itself can be non-canonical (e.g. decomposed rather than precomposed), and hence code that assumes that "validated" UTF-8 is canonical is wrong.
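To illustrate: both of the following are perfectly valid UTF-8 and both display as "é", yet they are different code point sequences, so validity alone guarantees nothing about canonical form:

#include <string>

std::string nfc = "\xC3\xA9";  // U+00E9, precomposed e-acute (NFC)
std::string nfd = "e\xCC\x81"; // U+0065 + U+0301 combining acute (NFD)
bool same = (nfc == nfd);      // false: both valid, different bytes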
Well, my Unicode kung fu is not very strong, but if the standard only allows the minimal encoding, then anything that doesn't follow it is not conforming and should be rejected. For convenience we could provide a separate tool that tolerates some deviations from the spec and produces valid UTF-8. But the user should have control over what exact deviations are allowed. Don't call the tool utf_to_utf, though, as that name doesn't make sense to me.
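As a sketch of what such a tool's interface might look like (everything here is hypothetical, not an existing Boost API), the caller would opt in to each deviation explicitly:

#include <string>

// Hypothetical interface: the caller chooses which deviations from
// strict UTF-8 are tolerated; everything else is rejected.
enum leniency_flags
{
    accept_overlong   = 1, // tolerate non-minimal sequences
    accept_surrogates = 2, // tolerate encoded UTF-16 surrogates
    replace_invalid   = 4  // emit U+FFFD instead of failing
};

// Returns strictly valid UTF-8, or fails if the input contains a
// deviation that 'flags' does not allow.
std::string sanitize_utf8(std::string const& in, unsigned flags);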
The policy of strict UTF-8 is not a bad idea in general, but it's merely a first line of defense as far as security is concerned. Properly written code should not need it.
You mean all string-related code should be prepared for invalid input? Seems like too much overhead to me.