BOM should be ignored by Python

Neil Hodgson neilh at scintilla.org
Tue May 2 09:08:57 EDT 2000


> Further I thought that (3) it was pointless
> having a BOM in UTF-8 which is an 8-bit-unit encoding and endian-ness
> is not a question and

   This is what I thought as well until quite recently. Then I encountered
Notepad's behaviour on W2K of putting a BOM at the start of files saved as
UTF-8, and reread the Unicode FAQ at
http://www.unicode.org/unicode/faq/
which says """
Q: When a BOM is used, is it only in 16-bit Unicode text?
No, a BOM can be used as a signature no matter how the Unicode text is
transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising the BOM
will be whatever the Unicode character FEFF is converted into by that
transformation format. In that form, the BOM serves to indicate both that it
is a Unicode file, and which of the formats it is in.
"""
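
   As a rough illustration of the FAQ's point, the sketch below encodes
U+FEFF under a few transformation formats and prints the resulting bytes.
It assumes a modern Python 3 interpreter, and the variable names are only
illustrative.

```python
import codecs

# U+FEFF is the character used as the byte order mark / signature.
bom = "\ufeff"

# The bytes of the BOM depend on the transformation format used.
utf8_sig = bom.encode("utf-8")      # three bytes: EF BB BF
utf16_be = bom.encode("utf-16-be")  # two bytes: FE FF
utf16_le = bom.encode("utf-16-le")  # two bytes: FF FE

for name, raw in [("UTF-8", utf8_sig), ("UTF-16BE", utf16_be), ("UTF-16LE", utf16_le)]:
    print(name, raw.hex())
```

   The same byte sequences are available as the constants codecs.BOM_UTF8,
codecs.BOM_UTF16_BE and codecs.BOM_UTF16_LE.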

> (5) a
> reader of UTF-8 data should be prepared to regard a BOM as legal, not
> a "syntax error".

   This is what I want to see changed. The Python interpreter is currently
not defined to be a UTF-8 reader when reading scripts. I'd like to see it
accept scripts that start with a UTF-8 BOM.
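
   A minimal sketch of what accepting such a script could look like, assuming
the source is available as bytes. The helper name strip_utf8_bom is
hypothetical and this is not how the interpreter itself is implemented; it
only shows the signature being skipped rather than treated as a syntax error.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# A script as Notepad might save it: UTF-8 signature followed by the source.
source = codecs.BOM_UTF8 + b"x = 1 + 1\n"

# Compile and run the script with the signature removed.
namespace = {}
exec(compile(strip_utf8_bom(source), "<script>", "exec"), namespace)
print(namespace["x"])
```

   Data without a signature passes through unchanged, so plain ASCII scripts
are unaffected by the check.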

> I also thought (6) that by careful design of UTF-8,
> ASCII data when "converted" to UTF-8 was unchanged so I don't see the
> point (for an application that is going to use Unicode internally) in
> knowing/caring whether an input file is in ASCII or UTF-8.

   I'm more concerned with how to display it in an editor. There are other
issues too, such as which encoding the contents of doc-strings are in.

   Neil




