fileinput not Unicode compatible? / UTF16 codec problems

Fri Mar 1 08:50:12 EST 2002

jhorneman at pobox.com (Jurie Horneman) writes:

> Is it possible that the fileinput module is not Unicode compatible?

That is certainly possible. You'd need to tell it the encoding for
opening files; that is currently not supported.

> Because I have a little endian 16-bit Unicode file and have trouble
> reading it in. Decoding it with the UTF16 LE decoder gives me a
> 'truncated data' error. 

I assume you first split the input into lines, then try the decoding?
That does not work with UTF-16; you first need to decode, then split
into lines.
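
As a minimal sketch of "decode first, then split" (written in current
Python 3 syntax; the sample text and variable names are made up for
illustration, and a real program would read the bytes from a file
opened in binary mode):

```python
# Sample data: two lines of text encoded as little-endian UTF-16, no BOM.
raw = "first\nsecond\n".encode("utf-16-le")

# Decode the complete byte string first ...
text = raw.decode("utf-16-le")

# ... and only then split the decoded text into lines.
lines = text.splitlines()
```

Here the split happens on characters, so the two bytes of each
newline are never torn apart.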

> Oddly, this problem doesn't occur for every line.

No, but for every second line. The UTF-16 decoder will complain if you
don't give it an even number of bytes. After the first line is read,
the second will (incorrectly) start with a NUL byte, which fills this
line to an even number of bytes, again. Decoding as UTF-16 will
succeed, but will give you garbage: the wrong bytes will get grouped
to form a character.
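
The effect can be reproduced with a short sketch (Python 3 syntax; the
sample data is invented, and io.BytesIO merely stands in for a file
read line by line in binary mode):

```python
import io

# "ab\ncd\n" as UTF-16 LE: 61 00 62 00 0a 00 63 00 64 00 0a 00
raw = "ab\ncd\n".encode("utf-16-le")

# Byte-oriented line splitting cuts after each 0x0a byte, so the
# 0x00 half of every newline lands at the start of the next chunk.
chunks = io.BytesIO(raw).readlines()
first, second = chunks[0], chunks[1]

# The first chunk has an odd number of bytes: the decoder refuses it.
try:
    first.decode("utf-16-le")
    failed = False
except UnicodeDecodeError:
    failed = True

# The second chunk has an even number of bytes, so decoding succeeds,
# but the bytes are paired up wrongly and the result is not "cd\n".
garbage = second.decode("utf-16-le")
```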

> Is there a solution for this, apart from rewriting a number of modules
> myself?

As long as it is fileinput only, I recommend rewriting your code so
that it does not use that module; this is probably simpler than
rewriting the module to support encodings in full generality.

Of course, patches will be welcome; if you do change fileinput, please
submit a patch to sf.net/projects/python.

> Is there any documentation on which Python modules are Unicode-aware
> or not?

Not that I'm aware of. In most cases, if issues become known, the
problems will be corrected instead of being documented.

> Oh, and how does one handle big endian / little endian Unicode when
> the UTF16 decoders look for BOMs at the start of each string, but I
> only have one at the start of the file? There seems to be no way for me
> to tell it which endianness I have, apart from circumventing the codec
> and calling the right version myself.

You cannot decode UTF-16 on a line-by-line basis. Instead, you need to
use a stream reader, which will remember the right byte order across
.read or .readline invocations (only since Python 2.2, AFAIR). The
most convenient way to open a Unicode stream is to use codecs.open,
passing the encoding. In the case of UTF-16, the endianness will be
determined from the BOM on the first .read* invocation.
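
A minimal sketch of the stream-reader approach, in Python 3 syntax.
It builds the reader with codecs.getreader on an in-memory byte stream
so that it runs without a file on disk; for a real file,
codecs.open(filename, "r", "utf-16") gives you a comparable object.
The sample data is invented for illustration:

```python
import codecs
import io

# Big-endian UTF-16 data with a single BOM at the very start.
data = codecs.BOM_UTF16_BE + "first\nsecond\n".encode("utf-16-be")

# Wrap the byte stream in a UTF-16 stream reader. The generic
# "utf-16" codec is enough; the byte order is taken from the BOM.
reader = codecs.getreader("utf-16")(io.BytesIO(data))

# The reader keeps the detected byte order for every later line.
lines = list(reader)
```

The BOM is consumed by the reader and does not show up in the first
line.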

HTH,
Martin