fileinput not Unicode compatible? / UTF16 codec problems
Jurie Horneman
jhorneman at pobox.com
Fri Mar 1 08:07:00 EST 2002
Is it possible that the fileinput module is not Unicode compatible?
Because I have a little endian 16-bit Unicode file and have trouble
reading it in. Decoding it with the UTF16 LE decoder gives me a
'truncated data' error. This is because the string ends with '0x0a'
and not '0x0a00'. The string is read in using the fileinput module, which
apparently calls C stdio getc(), which reads a byte and not a 16-bit
Unicode character.
Oddly, this problem doesn't occur for every line.
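The splitting described above can be reproduced with a small sketch. It assumes a modern Python 3 interpreter (not the Python of 2002) and uses made-up sample text; io.BytesIO stands in for any reader that ends a line at the first 0x0a byte. It does not use fileinput itself.

```python
import io

# Hypothetical sample: two short lines encoded as UTF-16 little endian.
# Each "\n" becomes the byte pair 0x0a 0x00.
data = "ab\ncd\n".encode("utf-16-le")

# Byte-oriented line reading ends each chunk right after a 0x0a byte,
# so the 0x00 half of the newline lands at the start of the next chunk.
chunks = io.BytesIO(data).readlines()


def try_decode(chunk):
    """Return the decoded text, or None if the chunk cannot be decoded."""
    try:
        return chunk.decode("utf-16-le")
    except UnicodeDecodeError:
        return None


results = [try_decode(chunk) for chunk in chunks]

for chunk, result in zip(chunks, results):
    print(len(chunk), repr(chunk), repr(result))
```

In this sample the first chunk has an odd number of bytes and fails to decode. The second chunk starts with the stray 0x00 and ends with 0x0a, so its length is even and it decodes without an error, but to the wrong characters. That may be one reason the error does not show up on every line.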
Is there a solution for this, apart from rewriting a number of modules
myself?
Is there any documentation on which Python modules are Unicode-aware
or not?
Oh, and how does one handle big endian / little endian Unicode when
the UTF16 decoders look for BOMs at the start of each string, but I
only have one at the start of the file? There seems to be no way for me
to tell it which endianness I have, apart from circumventing the codec
and calling the right version myself.
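One way to sidestep the per-string BOM question is to decode the file as a single stream and split it into lines only afterwards. The following is a minimal sketch, assuming Python 3; the sample text and the temporary file are hypothetical. It shows both the BOM-aware "utf-16" codec and the explicit "utf-16-le" codec.

```python
import codecs
import os
import tempfile

# Hypothetical sample file: a single BOM followed by little endian text.
payload = codecs.BOM_UTF16_LE + "first\nsecond\n".encode("utf-16-le")
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

# Option 1: "utf-16" consumes the BOM once, at the start of the stream,
# and uses it to pick the byte order for everything that follows.
with open(path, encoding="utf-16", newline="") as f:
    lines_bom = f.readlines()

# Option 2: name the byte order explicitly. The BOM is then decoded as
# the character U+FEFF and has to be stripped by hand.
with open(path, encoding="utf-16-le", newline="") as f:
    text = f.read()
lines_explicit = text.lstrip("\ufeff").splitlines(keepends=True)

os.remove(path)

print(lines_bom)
print(lines_explicit)
```

Either way the lines are split after decoding, so a newline is never cut in half.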
Thanks,
Jurie Horneman