Re: [boost] Change to guidelines for characters in C++ source files

29 Jun 2015

On 26/06/2015 11:33, Mateusz Loskot wrote:
...
On 26 June 2015 at 01:15, Andrey Semashev <andrey.semashev@gmail.com> wrote:
...
Why not just always assume UTF-8, whether there is BOM or not? I don't think
UTF-8 BOM makes much sense, and I don't think editors commonly insert one.
Also, http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf says:
"Use of a BOM is neither required nor recommended for UTF-8,
but may be encountered in contexts where UTF-8 data is converted from
other encoding forms that use a BOM or where the BOM is used as a
UTF-8 signature."

That last part is relevant here -- a UTF-8 BOM at the start of the file 
is used as a signature that the file contains UTF-8 content.  In the 
absence of that signature, any reader of the file must guess at the 
content encoding.
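
To make the "signature" idea concrete, here is a minimal sketch of the check a reader could do on the first three bytes. The function name has_utf8_bom is made up for illustration; it is not taken from any particular compiler or library.

```cpp
#include <string>

// Hypothetical helper: true if the buffer starts with the UTF-8 encoding of
// U+FEFF, i.e. the three bytes EF BB BF that act as the UTF-8 signature.
bool has_utf8_bom(const std::string& bytes) {
    static const std::string bom = "\xEF\xBB\xBF";
    // Buffers shorter than three bytes cannot carry the signature.
    return bytes.size() >= bom.size() &&
           bytes.compare(0, bom.size(), bom) == 0;
}
```

If this returns false, the bytes alone say nothing definite about the encoding, which is exactly the guessing situation described above.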

The problem is that various parties are inconsistent about what a file 
without a BOM actually means, but most commonly it means that the file 
is assumed to be in the encoding of the default system locale.  On 
modern Linux, that 
usually means UTF-8 anyway, but that is not universal, and it is never 
the case on Windows (it means that the file will be interpreted in 
whatever the user's chosen "language for non-Unicode programs" is, which 
will vary depending on the user's country, preferred languages, and 
whether they've been playing Japanese novel games recently or not).  As 
such it is vastly safer to include the BOM than to omit it.  (One 
exception might be for shell scripts and other text-like files that care 
about their first few bytes and aren't expecting BOMs.)
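
As a rough sketch of that policy (add the signature, but leave alone files whose first bytes matter), something like the following would do. The name add_utf8_bom is hypothetical, and treating a leading "#!" as the only exception is an assumption made for the example, not a complete rule.

```cpp
#include <string>

// Hypothetical helper: returns `content` with a UTF-8 BOM prepended, unless it
// already has one or starts with a "#!" line (whose first bytes must stay put).
std::string add_utf8_bom(const std::string& content) {
    static const std::string bom = "\xEF\xBB\xBF";
    const bool has_bom = content.compare(0, bom.size(), bom) == 0;
    const bool has_shebang = content.compare(0, 2, "#!") == 0;
    if (has_bom || has_shebang) {
        return content;  // nothing to add, or adding would break the file
    }
    return bom + content;
}
```

Applying it twice gives the same result as applying it once, so it is safe to run over files that may already carry the signature.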

In some cases the reader is expected to try to parse the file as UTF-8 
and then fall back to some other encoding if an invalid UTF-8 character 
sequence is encountered.  This is quite aggravating both for the people 
expected to write such software and also for the users who get their 
text misinterpreted by such heuristics, and whoever suggests that this 
is a sensible choice for a default action should get thwapped upside the 
head.  (As an explicit "try to recover unknown format document" option, 
sure.  But not a default.)
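
For reference, the heuristic in question looks roughly like the sketch below: validate the bytes as UTF-8 and fall back to a legacy encoding on the first ill-formed sequence. The names is_valid_utf8, GuessedEncoding and guess_encoding are invented for this example, and the byte ranges follow the Unicode Standard's table of well-formed UTF-8 sequences; no specific tool is claimed to work this way.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `bytes` is well-formed UTF-8. Rejects overlong
// forms, UTF-16 surrogates, and values above U+10FFFF.
bool is_valid_utf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead <= 0x7F) { ++i; continue; }      // ASCII
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;       // allowed range of 2nd byte
        if (lead >= 0xC2 && lead <= 0xDF)      { len = 2; }
        else if (lead == 0xE0)                 { len = 3; lo = 0xA0; }
        else if (lead >= 0xE1 && lead <= 0xEF) { len = 3; if (lead == 0xED) hi = 0x9F; }
        else if (lead == 0xF0)                 { len = 4; lo = 0x90; }
        else if (lead >= 0xF1 && lead <= 0xF3) { len = 4; }
        else if (lead == 0xF4)                 { len = 4; hi = 0x8F; }
        else { return false; }  // 80..C1 and F5..FF never start a sequence
        if (n - i < len) return false;            // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            const unsigned char first = (k == 1) ? lo : 0x80;
            const unsigned char last  = (k == 1) ? hi : 0xBF;
            if (c < first || c > last) return false;
        }
        i += len;
    }
    return true;
}

enum class GuessedEncoding { Utf8, LegacyFallback };

// The "try UTF-8, else fall back" guess described above.
GuessedEncoding guess_encoding(const std::string& bytes) {
    return is_valid_utf8(bytes) ? GuessedEncoding::Utf8
                                : GuessedEncoding::LegacyFallback;
}
```

Note how it goes wrong: the two bytes C3 A9 are "Ã©" in Windows-1252 but also a perfectly valid UTF-8 "é", so guess_encoding reports UTF-8 for both and one of the two authors gets their text misread.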

If you're looking for authority, you might want to read 
http://unicode.org/faq/utf_bom.html#BOM as well.  The key point is 
that the recommendation to not use BOMs is for situations in which the 
encoding is already known in advance (such as databases, or protocols 
that explicitly transmit an encoding in an envelope).  Files are not an 
example of that.

Gavin Lambert