On 26/06/2015 11:33, Mateusz Loskot wrote:
On 26 June 2015 at 01:15, Andrey Semashev wrote:
Why not just always assume UTF-8, whether there is BOM or not? I don't think UTF-8 BOM makes much sense, and I don't think editors commonly insert one.
Also, http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf says:
"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."
That last part is relevant here -- a UTF-8 BOM at the start of the file is used as a signature that the file contains UTF-8 content. In the absence of that signature, any reader of the file must guess at the content encoding.

The problem is that various parties are inconsistent about what a file without a BOM actually means, but most commonly it means that the file is assumed to be in some default system locale. On modern Linux, that usually means UTF-8 anyway, but that is not universal, and it is never the case on Windows (it means that the file will be interpreted in whatever the user's chosen "language for non-Unicode programs" is, which will vary depending on the user's country, preferred languages, and whether they've been playing Japanese novel games recently or not).

As such it is vastly safer to include the BOM than to omit it. (One exception might be for shell scripts and other text-like files that care about their first few bytes and aren't expecting BOMs.)

In some cases the reader is expected to try to parse the file as UTF-8 and then fall back to some other encoding if an invalid UTF-8 character sequence is encountered. This is quite aggravating both for the people expected to write such software and also for the users who get their text misinterpreted by such heuristics, and whoever suggests that was a sensible choice for a default action should get thwapped upside the head. (As an explicit "try to recover unknown format document" option, sure. But not a default.)

If you're looking for authority, you might want to read http://unicode.org/faq/utf_bom.html#BOM as well. The key point being that the recommendation to not use BOMs is for situations in which the encoding is already known in advance (such as databases, or protocols that explicitly transmit an encoding in an envelope). Files are not an example of that.
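
To make the reader-side behaviour concrete, here is a rough Python sketch of the policy described above: trust the UTF-8 signature when it is present, otherwise decode in a fallback encoding, and only attempt the "try UTF-8 first" heuristic when the caller explicitly asks for it. The function name `decode_text`, the `guess` flag, and the choice of `cp1252` as a stand-in for the Windows "language for non-Unicode programs" are all assumptions made for illustration, not anything taken from the library under discussion.

```python
import codecs


def decode_text(data: bytes, fallback: str = "cp1252", guess: bool = False) -> str:
    """Decode file bytes, treating a leading UTF-8 BOM as a signature.

    fallback stands in for the system default encoding (an assumption here;
    a real program would query the locale rather than hard-code cp1252).
    guess enables the opt-in "try UTF-8, then fall back" recovery heuristic.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Signature present: the content is UTF-8; drop the three BOM bytes.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if guess:
        try:
            # Explicit recovery mode only: accept the bytes if they are valid UTF-8.
            return data.decode("utf-8")
        except UnicodeDecodeError:
            pass
    # No signature: interpret the bytes in the fallback encoding.
    return data.decode(fallback)


with_bom = codecs.BOM_UTF8 + "é".encode("utf-8")
without_bom = "é".encode("utf-8")

print(decode_text(with_bom))                 # decoded as UTF-8 thanks to the signature
print(decode_text(without_bom))              # same bytes minus the BOM, read as cp1252
print(decode_text(without_bom, guess=True))  # heuristic recovery, only on request
```

The second call shows the failure mode: without the signature, the two UTF-8 bytes of "é" are read as two separate cp1252 characters. Note that the fallback decode can itself raise `UnicodeDecodeError` for bytes that the fallback encoding does not define; the sketch leaves that case to the caller.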