fileinput not Unicode compatible? / UTF16 codec problems

Jurie Horneman jhorneman at pobox.com
Fri Mar 1 08:07:00 EST 2002


Is it possible that the fileinput module is not Unicode compatible?

Because I have a little endian 16-bit Unicode file and have trouble
reading it in. Decoding it with the UTF16 LE decoder gives me a
'truncated data' error. This is because the string ends with '0x0a'
and not '0x0a00'. The string is read in using the fileinput module, which
apparently calls C stdio getc(), which reads a byte and not a 16-bit
Unicode character.

Oddly, this problem doesn't occur for every line.
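
A minimal sketch of the symptom, assuming a current Python 3 interpreter
and made-up sample text (the names data, raw_lines and try_decode are
illustrative only, not part of fileinput). It suggests why only some
lines fail: a chunk with an odd number of bytes raises an error, while a
chunk with an even number of bytes decodes without complaint but into
the wrong characters.

```python
# Hypothetical sample text, encoded as UTF-16 little endian without a BOM.
data = "ab\ncd\n".encode("utf-16-le")

# Byte-oriented line splitting cuts right after the 0x0a byte, so the 0x00
# that completes the newline code unit lands at the start of the next chunk.
raw_lines = data.splitlines(keepends=True)


def try_decode(chunk):
    """Return the decoded text, or None if the chunk is not valid UTF-16 LE."""
    try:
        return chunk.decode("utf-16-le")
    except UnicodeDecodeError:
        return None


# One entry per chunk: None marks a chunk that failed to decode.
results = [try_decode(chunk) for chunk in raw_lines]

# Decoding the whole buffer first and splitting afterwards keeps every
# 16-bit code unit intact.
text_lines = data.decode("utf-16-le").splitlines(keepends=True)
```

Here the first chunk has five bytes and fails, whereas the second has six
bytes and decodes silently into unrelated characters.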

Is there a solution for this, apart from rewriting a number of modules
myself?

Is there any documentation on which Python modules are Unicode-aware
or not?

Oh, and how does one handle big endian / little endian Unicode when
the UTF16 decoders look for BOMs at the start of each string, but I
only have one at the start of the file? There seems to be no way for me
to tell it which endianness I have, apart from circumventing the codec
and calling the right version myself.
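
One possible approach, sketched under the assumption of a current
Python 3 interpreter: decode the file as a single stream rather than
string by string, so the BOM is consumed once at the start and the
detected byte order carries over to every later line. The helper name
read_utf16_lines is hypothetical; io.TextIOWrapper and the "utf-16",
"utf-16-le" and "utf-16-be" codec names are standard.

```python
import codecs
import io


def read_utf16_lines(binary_stream, encoding="utf-16"):
    """Decode a binary stream as one UTF-16 text stream and return its lines.

    With encoding="utf-16" the BOM at the start of the stream selects the
    byte order. For data without a BOM, pass "utf-16-le" or "utf-16-be"
    to state the byte order explicitly.
    """
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding)
    return list(text_stream)


# Made-up sample: one BOM at the start of the data, none before later lines.
sample = codecs.BOM_UTF16_LE + "ab\ncd\n".encode("utf-16-le")
lines = read_utf16_lines(io.BytesIO(sample))
```

In current Python versions, fileinput.input() also accepts
openhook=fileinput.hook_encoded("utf-16"), which should give the same
stream-level decoding while keeping the fileinput interface.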

Thanks,

Jurie Horneman


