Is this a bug? BOM decoded with UTF8

pekka niiranen pekka.niiranen at wlanmail.com
Thu Feb 10 11:58:50 EST 2005


Hi there,

I have two files, "my.utf8" and "my.utf16", which
both contain a BOM followed by the bytes 61 61 ("aa" in ASCII).

Contents of "my.utf8" in HEX:
	EFBBBF6161

Contents of "my.utf16" in HEX:
	FEFF6161


For some reason Python 2.4 keeps the BOM in the decoded
text for UTF-8 but not for UTF-16. See below:

 >>> import codecs
 >>> fh = codecs.open("my.utf8", "rb", "utf8")
 >>> fh.readlines()
[u'\ufeffaa']	# BOM is decoded, why?
 >>> fh.close()
 >>> fh = codecs.open("my.utf16", "rb", "utf16")
 >>> fh.readlines()
[u'\u6161']	# No BOM here
 >>> fh.close()
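
The same result can be reproduced without the files. This is only a
minimal sketch, assuming the files hold exactly the bytes listed earlier;
the variable names are made up for illustration. As far as I can tell, the
utf16 codec consumes FE FF as a byte-order signal and then reads 61 61 as
a single big-endian code unit, while the utf8 codec treats EF BB BF as an
ordinary encoded U+FEFF character.

```python
import codecs

# Rebuild the byte contents of the two files in memory
# (assumed to match the hex dumps: EFBBBF6161 and FEFF6161).
utf8_bytes = codecs.BOM_UTF8 + "aa".encode("ascii")
utf16_bytes = codecs.BOM_UTF16_BE + "aa".encode("ascii")

# The utf8 codec decodes the BOM bytes as the character U+FEFF.
utf8_text = utf8_bytes.decode("utf8")

# The utf16 codec uses the BOM to pick the byte order and drops it;
# the remaining two bytes form one code unit, U+6161.
utf16_text = utf16_bytes.decode("utf16")

print(repr(utf8_text))
print(repr(utf16_text))
```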

Is there a trick to read a UTF-8 encoded file so that the BOM is not
included in the decoded text?
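
One possible workaround is to drop the three BOM bytes before decoding.
The following is a rough sketch, not a definitive answer: the helper name
decode_utf8_without_bom is hypothetical, and it assumes the whole file
content has already been read as raw bytes.

```python
import codecs

def decode_utf8_without_bom(raw):
    """Decode UTF-8 bytes, skipping a leading BOM if one is present.

    Hypothetical helper for illustration; 'raw' is a byte string.
    """
    if raw.startswith(codecs.BOM_UTF8):
        # Cut off the three BOM bytes (EF BB BF) before decoding.
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf8")

# Same bytes as in "my.utf8": BOM followed by two "a" characters.
sample = codecs.BOM_UTF8 + "aa".encode("ascii")
text = decode_utf8_without_bom(sample)
print(repr(text))
```

Input without a BOM passes through unchanged, so the helper can be used on
either kind of file. Later Python releases (2.5 and up) include a
"utf-8-sig" codec that does this skipping itself; the sketch does not rely
on it.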

-pekka-


