Python 3.0 automatic decoding of UTF16

John Machin sjmachin at lexicon.net
Sun Dec 7 17:20:03 EST 2008


On Dec 8, 2:05 am, Johannes Bauer <dfnsonfsdu... at gmx.de> wrote:
> John Machin schrieb:
>
> > He did. Ugly stuff using readline() :-) Should still work, though.
>
> Well, well, I'm a C kinda guy used to while (fgets(b, sizeof(b), f))
> kinda loops :-)
>
> But, seriously - I find that whole "while True:" and "if line == """
> construct ugly as hell, too. How can reading a file line by line be
> achieved in a more pythonic kind of way?

By using
   for line in open(.....)
as mentioned in (1) my message that you were replying to and (2) the
tutorial:
http://docs.python.org/3.0/tutorial/inputoutput.html#reading-and-writing-files
... skip the stuff on readline() and readlines() this time :-)
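
For the record, a minimal sketch of that idiom. The function name
read_lines and the default encoding are made up for illustration; pick
whatever encoding your file really uses.

```python
def read_lines(fname, encoding='utf-8'):
    """Return the lines of a text file without their line endings."""
    lines = []
    with open(fname, encoding=encoding) as f:
        # Iterating over the file object yields one line at a time,
        # so no "while True:" / "if line == ''" dance is needed.
        for line in f:
            lines.append(line.rstrip('\n'))  # drop the trailing '\n', if any
    return lines
```

The with statement closes the file even if the loop raises, which is
the other thing the readline() version tends to forget.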

While waiting for the bug to be fixed, you'll need something like the
following:

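def utf16_getlines(fname, newline_terminated=True):
    # Read the whole file as bytes in one go.
    with open(fname, 'rb') as f:
        raw_bytes = f.read()
    # The 'utf16' codec uses the BOM (if any) to pick the byte order
    # and strips it from the result.
    decoded = raw_bytes.decode('utf16')
    if newline_terminated:
        # Turn '\r\n' into '\n' and keep the line endings.
        normalised = decoded.replace('\r\n', '\n')
        lines = normalised.splitlines(True)
    else:
        # Drop the line endings altogether.
        lines = decoded.splitlines()
    return lines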

That avoids the chunk-reading problem by reading the whole file in one
go. In fact, given the way I've written it, there can be 4 copies of
the file contents in memory at once (raw_bytes, decoded, normalised
and lines). Fortunately your files are tiny.
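
If the files ever stop being tiny, something along the following
lines might keep memory use down. It is only a sketch and has not been
tried on 3.0 itself: the name utf16_iterlines and the chunk_size
parameter are invented here, and unlike splitlines() it treats only
'\n' (optionally preceded by '\r') as a line ending.

```python
import codecs

def utf16_iterlines(fname, chunk_size=4096):
    """Yield '\\n'-terminated lines from a UTF-16 file, chunk by chunk."""
    # The incremental decoder buffers incomplete code units (and handles
    # the BOM), so chunk boundaries may fall anywhere in the byte stream.
    decoder = codecs.getincrementaldecoder('utf-16')()
    pending = ''
    with open(fname, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            # An empty chunk means end of file; final=True flushes the decoder.
            pending += decoder.decode(chunk, final=not chunk)
            # Everything before the last '\n' is a complete line; the
            # remainder waits for more data.
            *complete, pending = pending.split('\n')
            for line in complete:
                if line.endswith('\r'):
                    line = line[:-1]  # normalise '\r\n' to '\n'
                yield line + '\n'
            if not chunk:
                break
    if pending:
        yield pending  # last line had no terminating newline
```

Splitting on '\n' alone means a '\r\n' pair cut in half by a chunk
boundary is still handled, because the '\r' simply sits in pending
until its '\n' arrives.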

HTH,
John



