Change to guidelines for characters in C++ source files
Since the very early days of Boost, the guideline for acceptable characters in C++ source files has been the 96 characters of the C++ standard's basic source character set, encoded in 7-bit ASCII. The inspect program also allowed several additional 7-bit ASCII characters that sometimes appear in comments. The rationale was to ensure that Boost code was portable to all compilers available at that time. We had gotten complaints that even a character as innocuous as a copyright sign (U+00A9) was causing compiles to fail on some compiler releases targeting Asian languages. UTF-8 support was far from universal.

Times have changed:

* Source files encoded in UTF-8 with a leading byte order mark (BOM) of the byte sequence 0xEF,0xBB,0xBF are supported by all C++ compilers that we are aware of, and this has been true for many years now.

* As of C++11, the C++ language includes types and literals directly supporting UTF-8, UTF-16, and UTF-32, and creating code points above 7-bit ASCII in such literals is much easier if UTF-8 source encoding is used. Even editors as dumb as Windows Notepad have supported UTF-8 with BOM for some time now.

* As Boost libraries start to incorporate C++11 Unicode-related features, it becomes difficult to write test programs if limited to 7-bit ASCII. For example, incorporating the Filesystem TS into Boost.Filesystem requires test cases with UTF-8, UTF-16, and UTF-32, and that's painful under the current 7-bit ASCII guidelines.

So... it looks to me like it is high time to change the Boost guideline for C++ source file encoding to 7-bit ASCII without BOM or UTF-8 with BOM, and to change the inspect program accordingly.

Comments?

--Beman
On 6/25/2015 7:12 AM, Beman Dawes wrote:
It looks to me like it is high time to change the Boost guideline for C++ source file encoding to 7-bit ASCII without BOM or UTF-8 with BOM, and to change the inspect program accordingly.
Comments?
BOM is evil.

Regards,
Paul Mensonides
On 6/25/2015 7:12 AM, Beman Dawes wrote:
It looks to me like it is high time to change the Boost guideline for C++ source file encoding to 7-bit ASCII without BOM or UTF-8 with BOM, and to change the inspect program accordingly.
Comments?
On 26.06.2015 00:15, Paul Mensonides wrote:
BOM is evil.

The Microsoft compiler will treat files without a BOM as encoded in its local codepage, with no way to override. If you want MSVC to read the source as UTF-8, you need a BOM.

They have no plans to change this either; see https://connect.microsoft.com/VisualStudio/Feedback/Details/888437 . The bug is closed as wontfix: "Unfortunately, we currently have no plans to implement the support of UTF-8 files without byte order marks."

Thus, we need a BOM in our source files if they contain UTF-8. That's just a sad fact. And yes, it makes me angry.

Sebastian
On 26/06/2015 10:24, Sebastian Redl wrote:
On 6/25/2015 7:12 AM, Beman Dawes wrote:
It looks to me like it is high time to change the Boost guideline for C++ source file encoding to 7-bit ASCII without BOM or UTF-8 with BOM, and to change the inspect program accordingly.
Comments?
On 26.06.2015 00:15, Paul Mensonides wrote:
BOM is evil.

The Microsoft compiler will treat files without a BOM as encoded in its local codepage, with no way to override. If you want MSVC to read the source as UTF-8, you need a BOM.
They have no plans to change this either, see https://connect.microsoft.com/VisualStudio/Feedback/Details/888437 . The bug is closed as wontfix.
"Unfortunately, we currently have no plans to implement the support of UTF-8 files without byte order marks."
Thus, we need a BOM in our source files if they contain UTF-8. That's just a sad fact.
This is a real issue too - I've had bug reports in the past from non-English users who were unable to compile Boost source that had accidentally acquired something other than a 7-bit character.

John.
On 26 Jun 2015 at 11:24, Sebastian Redl wrote:
Thus, we need a BOM in our source files if they contain UTF-8. That's just a sad fact.
This suggests, I suppose, that the Boost lint script will need to error out on any source files without a BOM, right? I suppose at least this makes life simple. Everything gets the BOM.

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
On 26.06.2015 13:04, Niall Douglas wrote:
On 26 Jun 2015 at 11:24, Sebastian Redl wrote:
Thus, we need a BOM in our source files if they contain UTF-8. That's just a sad fact.

This suggests, I suppose, that the Boost lint script will need to error out on any source files without a BOM, right?
Only if they contain characters outside the basic source set, I suppose.

Sebastian
On 26 Jun 2015 at 13:25, Sebastian Redl wrote:
Thus, we need a BOM in our source files if they contain UTF-8. That's just a sad fact. This suggests, I suppose, that the Boost lint script will need to error out on any source files without a BOM right?
Only if they contain characters outside the basic source set, I suppose.
It'll have to be at the per-git-repo level, so either all the source in a git repo is UTF-8 BOMed or it isn't. This is because .gitattributes selects UTF-8 encoding based on file extension, and that applies for an entire git repo.

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
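For reference, the per-extension encoding hint Niall refers to looks roughly like this in a .gitattributes file. A sketch only: `encoding` is the attribute gitk and git-gui document for display purposes; which attributes other tools honour varies, so consult gitattributes(5) for your setup:

```
# Tell gitk / git-gui (and other tools that honour the attribute)
# that C++ sources in this repo are UTF-8 encoded.
*.hpp encoding=UTF-8
*.cpp encoding=UTF-8
*.ipp encoding=UTF-8
```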
On Fri, Jun 26, 2015 at 5:24 AM, Sebastian Redl < sebastian.redl@getdesigned.at> wrote:
On 26.06.2015 00:15, Paul Mensonides wrote:
On 6/25/2015 7:12 AM, Beman Dawes wrote:
It looks to me like it is high time to change the Boost guideline for C++
source file encoding to 7-bit ASCII without BOM or UTF-8 with BOM, and to change the inspect program accordingly.
Comments?
BOM is evil.
The Microsoft compiler will treat files without a BOM as encoded in its local codepage, with no way to override. If you want MSVC to read the source as UTF-8, you need a BOM.
They have no plans to change this either, see https://connect.microsoft.com/VisualStudio/Feedback/Details/888437 . The bug is closed as wontfix.
"Unfortunately, we currently have no plans to implement the support of UTF-8 files without byte order marks."
Thus, we need a BOM in our source files if they contain UTF-8. That's just a sad fact.
It isn't just Microsoft. My first draft of N3463, Portable Program Source Files (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3463.html), specified UTF-8 without BOM, but when a draft was circulated, several non-Microsoft compiler writers from the committee's core working group explained that to them the BOM was essential.

The scenario they were concerned with was environments in Asia where the default encoding is commonly not UTF-8 and most files that go into a translation unit are encoded in that default encoding, but one file is encoded in UTF-8 without a BOM. The compiler needs to be able to identify that file as UTF-8 without all files being UTF-8 encoded, and the compiler writers believe that is not possible 100% of the time without a BOM. Some compilers or IDEs, including Visual Studio, do have an opt-in option, "Auto-detect UTF-8 encoding without signature", but Boost can't count on such an option being turned on.

I wasn't present in core when N3463 was discussed, but the unofficial feedback I got was that CWG saw no need to explicitly state as a requirement an environmental feature that users had already forced all compilers to support anyhow.

--Beman
On 25.06.2015 17:12, Beman Dawes wrote:
It looks to me like it is high time to change the Boost guideline for C++ source file encoding to 7-bit ASCII without BOM or UTF-8 with BOM, and to change the inspect program accordingly.
Comments?
Why not just always assume UTF-8, whether there is BOM or not? I don't think UTF-8 BOM makes much sense, and I don't think editors commonly insert one. Also, while we're at it, are tabs still banned?
On 26 June 2015 at 01:15, Andrey Semashev
On 25.06.2015 17:12, Beman Dawes wrote:
It looks to me like it is high time to change the Boost guideline for C++ source file encoding to 7-bit ASCII without BOM or UTF-8 with BOM, and to change the inspect program accordingly.
Comments?
Why not just always assume UTF-8, whether there is BOM or not? I don't think UTF-8 BOM makes much sense, and I don't think editors commonly insert one.
Also, http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf says: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net
On 26/06/2015 11:33, Mateusz Loskot wrote:
On 26 June 2015 at 01:15, Andrey Semashev
wrote: Why not just always assume UTF-8, whether there is BOM or not? I don't think UTF-8 BOM makes much sense, and I don't think editors commonly insert one.
Also, http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf says:
"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."
That last part is relevant here -- a UTF-8 BOM at the start of the file is used as a signature that the file contains UTF-8 content. In the absence of that signature, any reader of the file must guess at the content encoding.

The problem is that various parties are inconsistent about what a file without a BOM actually means, but most commonly it means that the file is assumed to be in some default system locale. On modern Linux, that usually means UTF-8 anyway, but that is not universal, and it is never the case on Windows (it means that the file will be interpreted in whatever the user's chosen "language for non-Unicode programs" is, which will vary depending on the user's country, preferred languages, and whether they've been playing Japanese novel games recently or not). As such it is vastly safer to include the BOM than to omit it. (One exception might be for shell scripts and other text-like files that care about their first few bytes and aren't expecting BOMs.)

In some cases the reader is expected to try to parse the file as UTF-8 and then fall back to some other encoding if an invalid UTF-8 character sequence is encountered. This is quite aggravating both for the people expected to write such software and also for the users who get their text misinterpreted by such heuristics, and whoever suggests that was a sensible choice for a default action should get thwapped upside the head. (As an explicit "try to recover unknown format document" option, sure. But not a default.)

If you're looking for authority, you might want to read http://unicode.org/faq/utf_bom.html#BOM as well. The key point being that the recommendation to not use BOMs is for situations in which the encoding is already known in advance (such as databases, or protocols that explicitly transmit an encoding in an envelope). Files are not an example of that.
On 06/25/2015 04:15 PM, Andrey Semashev wrote:
On 25.06.2015 17:12, Beman Dawes wrote:
It looks to me like it is high time to change the Boost guideline for C++ source file encoding to 7-bit ASCII without BOM or UTF-8 with BOM, and to change the inspect program accordingly.
Comments?
Why not just always assume UTF-8, whether there is BOM or not? I don't think UTF-8 BOM makes much sense, and I don't think editors commonly insert one.
Also, while we're at it, are tabs still banned?
Tabs are still banned. -- Michael Caisse ciere consulting ciere.com
On 25 Jun 2015 at 10:12, Beman Dawes wrote:
It looks to me like it is high time to change the Boost guideline for C++ source file encoding to 7-bit ASCII without BOM or UTF-8 with BOM, and to change the inspect program accordingly.
Comments?
+10. I'd suggest UTF-8 with or without BOM. Just assume it is; if it's 7-bit clean now, it's automatically UTF-8 clean. Don't forget to adjust all the .gitattributes in all Boost git repos to say UTF-8 encoding for source files; some git tools won't render diffs right without that.

As an aside, I recently had merry hell getting AFIO sent per commit to the wandbox online compiler and wasted a full day trying. Louis' Python script finally worked, while Krzysztof's shell script did not. I had been trying to get the latter working. I figured out the problem: I suspect AFIO has a Unicode char somewhere in its source code which only appears when you try to pipe it through scripting in order to upload it to a JSON REST API. Python errored out on the bad string, whilst the shell script just magically failed silently. That alerted me to the need to tell Python the source code has a codec, and then it all worked. Bear these sorts of problems in mind if you switch on UTF-8, as debugging that stuff was not obvious.

BTW wandbox scripting instructions are now at https://svn.boost.org/trac/boost/wiki/BestPracticeHandbook#a14.USERFRIENDLINESS:Considerlettingpotentialuserstryyourlibrarywithasinglemouseclick for anyone interested.

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
participants (9)
- Andrey Semashev
- Beman Dawes
- Gavin Lambert
- John Maddock
- Mateusz Loskot
- Michael Caisse
- Niall Douglas
- Paul Mensonides
- Sebastian Redl