_csv.Error: string with NUL bytes

Thu May 3 15:00:15 EDT 2007

dustin at v.igoro.us wrote:

> I'm guessing that your file is in UTF-16, then -- Windows seems to do
> that a lot.  It kind of makes it *not* a CSV file, but oh well.  Try
> 
>   print open("test.csv", "rb").read().decode('utf-16').replace("\0",
>   ">>>NUL<<<")
> 
> I'm not terribly unicode-savvy, so I'll leave it to others to suggest a
> way to get the CSV reader to handle such encoding without reading in the
> whole file, decoding it, and setting up a StringIO file.

Not pretty, but seems to work:

from __future__ import with_statement

import csv
import codecs

def recoding_reader(stream, from_encoding, args=(), kw={}):
    # csv here cannot read unicode input, so rows pass through UTF-8 bytes.
    intermediate_encoding = "utf8"
    efrom = codecs.lookup(from_encoding)
    einter = codecs.lookup(intermediate_encoding)
    # Reading from rstream decodes with from_encoding, then re-encodes as UTF-8.
    rstream = codecs.StreamRecoder(stream, einter.encode, efrom.decode,
        efrom.streamreader, einter.streamwriter)

    for row in csv.reader(rstream, *args, **kw):
        # Decode each UTF-8 column back to unicode.
        yield [unicode(column, intermediate_encoding) for column in row]

def main():
    file_encoding = "utf16"

    # generate sample data:
    data = u"\xe4hnlich,\xfcblich\r\nalpha,beta\r\ngamma,delta\r\n"
    with open("tmp.txt", "wb") as f:
        f.write(data.encode(file_encoding))

    # read it
    with open("tmp.txt", "rb") as f:
        for row in recoding_reader(f, file_encoding):
            print u" | ".join(row)

if __name__ == "__main__":
    main()

Data from the file is recoded to UTF-8, then passed to a csv.reader() whose
output is decoded to unicode.
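
For comparison, here is a rough sketch of the same idea on Python 3, where
the csv module works on text rather than bytes, so the intermediate UTF-8
step should not be needed. The names decoding_reader and demo are made up
for illustration, and the sketch assumes the input is a binary file object,
as in the code above.

```python
import csv
import io


def decoding_reader(stream, from_encoding, *args, **kw):
    """Yield rows of str from a binary stream (hypothetical helper, Python 3)."""
    # newline="" leaves line endings untouched so that csv itself can
    # deal with newlines embedded in quoted fields.
    text = io.TextIOWrapper(stream, encoding=from_encoding, newline="")
    for row in csv.reader(text, *args, **kw):
        yield row


def demo():
    # Same sample data as above, held in memory instead of tmp.txt.
    data = "\xe4hnlich,\xfcblich\r\nalpha,beta\r\ngamma,delta\r\n"
    raw = io.BytesIO(data.encode("utf16"))
    return list(decoding_reader(raw, "utf16"))


if __name__ == "__main__":
    for row in demo():
        print(" | ".join(row))
```

With a real file, passing open(path, newline="", encoding="utf16") straight
to csv.reader() should amount to the same thing.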

Peter