[Python-3000] Pre-PEP: Easy Text File Decoding

Jason Orendorff jason.orendorff at gmail.com
Wed Sep 13 20:23:33 CEST 2006


On 9/13/06, John S. Yates, Jr. <john at yates-sheets.org> wrote:
> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8.

John, you're mistaken about the reason this BOM is here.

In Notepad at least, the BOM is intentionally generated when writing
the file.  It's not a "mistake" or "laziness".  It's metadata.  (I
admit the BOM was not originally invented for this purpose.)
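For what it's worth, writing a file the way Notepad's UTF-8 mode does
boils down to prepending the signature.  A sketch of the idea, not
Notepad's actual code (the filename is made up):

 >>> import codecs
 >>> f = open('notes.txt', 'wb')
 # Prepend the UTF-8 signature, then the encoded text
 >>> f.write(codecs.BOM_UTF8 + u'some text'.encode('utf-8'))
 >>> f.close()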

> There is no MEANINGFUL definition of BOM in a UTF-8
> string.

This thread is about files, not strings.  At the start of a file, a
UTF-8 BOM is meaningful.  It means the file is UTF-8.

On Windows, there's a system default encoding, and it's never UTF-8.
Notepad writes the BOM so that later, when you open the file in
Notepad again, it can identify the file as UTF-8.
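A reader can then sniff for that signature and only fall back to the
system default when it's absent.  Here's a minimal sketch of that kind
of detection (open_text is a hypothetical helper of mine; I'm assuming
Windows, where 'mbcs' names the system "ANSI" code page):

 import codecs

 def open_text(path):
     # Sniff the UTF-8 signature, the way Notepad does, and treat it
     # as metadata rather than text.
     data = open(path, 'rb').read()
     if data.startswith(codecs.BOM_UTF8):
         return data[len(codecs.BOM_UTF8):].decode('utf-8')
     return data.decode('mbcs')  # assumed fallback: the Windows default

(Python 2.5's new 'utf-8-sig' codec does the stripping half of this
for you.)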

> You can see the logical fallacy if you imagine emitting UTF-16
> text in an environment of one byte sex, reducing that text to
> UTF-8, carrying it to an environment of the other byte sex and
> raising it back to UTF-16.

It sounds as if you think this will corrupt the BOM, but it works fine:

 >>> import codecs
 # "Emitting UTF-16 text" in little-endian environment
 >>> s1 = codecs.BOM_UTF16_LE + u'hello world'.encode('utf-16-le')
 # "Reducing that text to UTF-8"
 >>> s2 = s1.decode('utf-16-le').encode('utf-8')
 >>> s2
 '\xef\xbb\xbfhello world'
 # "Raising it back to UTF-16" in big-endian environment
 >>> s3 = s2.decode('utf-8').encode('utf-16-be')
 >>> s3[:2] == codecs.BOM_UTF16_BE
 True

The BOM is still correct: the data is UTF-16-BE, and the BOM agrees.

A UTF-8 string or file will contain exactly the same bytes (including
the BOM, if any) whether it is generated from UTF-16-BE or -LE.  All
three encodings are lossless byte representations of the same abstract
ideal: a sequence of Unicode codepoints.
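
A quick check of that claim, in the same vein as the session above:

 >>> import codecs
 >>> be = codecs.BOM_UTF16_BE + u'hello world'.encode('utf-16-be')
 >>> le = codecs.BOM_UTF16_LE + u'hello world'.encode('utf-16-le')
 # Both decode to the same codepoints, so the UTF-8 bytes agree
 >>> be.decode('utf-16-be').encode('utf-8') == le.decode('utf-16-le').encode('utf-8')
 True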

-j

