Read file that starts with '\xff\xfe'

Wed Sep 10 06:24:30 EDT 2003

>>>>> Bob Gailer <bgailer at alum.rpi.edu> (BG) wrote:

BG> On Win 2K the Task Scheduler writes a log file that appears to be encoded.
BG> The first line is:

BG> '\xff\xfe"\x00T\x00a\x00s\x00k\x00
BG> \x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00
BG> \x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n'

BG> My goal is to read this file and process it using Python string
BG> processing.

BG> I am disappointed in the codecs module documentation. I had hoped to find
BG> the answer there, but can't.

BG> I presume this is an encoding, and that '\xff\xfe' defines the encoding.
BG> How does one map '\xff\xfe' to an "encoding"?

It's Unicode, specifically little-endian UTF-16, which is the standard
encoding on Win2K. The '\xff\xfe' is the Byte Order Mark (BOM), which
identifies the byte order as little-endian.

>>> import codecs
>>> codecs.BOM_UTF16_LE
'\xff\xfe'
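
To answer the "how does one map" part more generally, here is a minimal
sketch that compares the start of the data against the BOM constants in the
codecs module. The helper name sniff_encoding and its table are my own
invention, not something codecs provides, and the 'utf_32' and 'utf_8_sig'
codec names assume a Python recent enough to ship them.

```python
import codecs

# Hypothetical lookup table (not part of the codecs module).
# The UTF-32 BOMs come first because BOM_UTF32_LE begins with the same
# two bytes as BOM_UTF16_LE and would otherwise be misdetected.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf_32'),
    (codecs.BOM_UTF32_BE, 'utf_32'),
    (codecs.BOM_UTF8, 'utf_8_sig'),
    (codecs.BOM_UTF16_LE, 'utf_16'),
    (codecs.BOM_UTF16_BE, 'utf_16'),
]

def sniff_encoding(data, default=None):
    """Return a codec name that consumes the BOM found at the start of
    data, or default when no known BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default
```

The names returned are the BOM-aware codecs, so decoding with them also
removes the BOM from the result.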

But there is a trailing 0 byte missing: the data should have an even number
of bytes, as each character here occupies two bytes. This happens because
you assume a line ends with '\n', whereas in UTF-16LE it ends with '\n\x00'.
It also means you cannot read the lines with byte-oriented methods like
readline() on a file opened without an encoding.
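
The effect can be reproduced without the real log file. This is only a
sketch: the second line of text is made up, and io.BytesIO stands in for a
file opened in binary mode.

```python
import codecs
import io

# Made-up two-line sample; only the first line comes from the log above.
text = u'"Task Scheduler Service"\r\nsecond line\r\n'
raw = codecs.BOM_UTF16_LE + text.encode('utf_16_le')

# A byte-oriented readline() stops right after the '\n' byte, in the
# middle of the two-byte newline character.
f = io.BytesIO(raw)
line1 = f.readline()   # ends with '\r\x00\n', odd number of bytes
line2 = f.readline()   # starts with the orphaned '\x00'

try:
    line1.decode('utf_16')
    decoded_ok = True
except UnicodeDecodeError:
    # The odd-length chunk is rejected as truncated data.
    decoded_ok = False

# Decoding the whole buffer first and splitting afterwards works.
lines = raw.decode('utf_16').splitlines()
```

So the order matters: decode first, then split into lines.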

>>> st='\xff\xfe"\x00T\x00a\x00s\x00k\x00 \x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00 \x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n\x00'
>>> # "utf_16" reads the BOM and drops it; "utf_16le" would keep it as u'\ufeff'
>>> stu=unicode(st,"utf_16")
>>> stu
u'"Task Scheduler Service"\r\n'
>>> stu.encode('iso-8859-1')
'"Task Scheduler Service"\r\n'
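
For the actual goal of processing the file line by line, one option is to
let the file object do the decoding. A minimal sketch, assuming a Python
that has io.open (2.6 or later); the temporary file and the "Started" line
are stand-ins I made up for the real Task Scheduler log.

```python
import codecs
import io
import os
import tempfile

# Made-up sample content; the real log would already exist on disk.
sample = u'"Task Scheduler Service"\r\nStarted\r\n'

fd, path = tempfile.mkstemp(suffix='.log')  # hypothetical stand-in file
os.close(fd)
try:
    # Write the bytes the way the log appears: BOM, then UTF-16LE data.
    with io.open(path, 'wb') as f:
        f.write(codecs.BOM_UTF16_LE + sample.encode('utf_16_le'))

    # Opening with encoding='utf_16' consumes the BOM and yields text,
    # so iterating over lines works as usual.
    with io.open(path, 'r', encoding='utf_16') as f:
        lines = [line.rstrip(u'\r\n') for line in f]
finally:
    os.remove(path)
```

After this, lines holds ordinary Unicode strings, and the usual string
methods apply.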

-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van.Oostrum at hccnet.nl