From skip at pobox.com Sat Mar 12 06:25:11 2005
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 11 Mar 2005 23:25:11 -0600
Subject: [Csv] Re: csv module and unicode, when or workaround?
In-Reply-To:
References:
Message-ID: <16946.32055.285417.303195@montanaro.dyndns.org>

    Chris> the current csv module cannot handle unicode the docs say, is
    Chris> there any workaround or is unicode support planned for the near
    Chris> future?

    Skip> True, it can't.

Hmmm... I think the following should be a reasonable workaround in most
situations:

    #!/usr/bin/env python

    import csv

    class UnicodeReader:
        def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
            self.reader = csv.reader(f, dialect=dialect, **kwds)
            self.encoding = encoding

        def next(self):
            # Decode each byte-string cell after the csv module has parsed it.
            row = self.reader.next()
            return [unicode(s, self.encoding) for s in row]

        def __iter__(self):
            return self

    class UnicodeWriter:
        def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
            self.writer = csv.writer(f, dialect=dialect, **kwds)
            self.encoding = encoding

        def writerow(self, row):
            # Encode with the configured encoding, not a hard-coded "utf-8".
            self.writer.writerow([s.encode(self.encoding) for s in row])

        def writerows(self, rows):
            for row in rows:
                self.writerow(row)

    if __name__ == "__main__":
        try:
            oldurow = [u'\u65E5\u672C\u8A9E',
                       u'Hi Mom -\u263a-!',
                       u'A\u2262\u0391.']
            writer = UnicodeWriter(open("uni.csv", "wb"))
            writer.writerow(oldurow)
            del writer
            reader = UnicodeReader(open("uni.csv", "rb"))
            newurow = reader.next()
            print "trivial test", newurow == oldurow and "passed" or "failed"
        finally:
            import os
            os.unlink("uni.csv")

If people don't find any egregious flaws with the concept I'll at least add
it as an example to the csv module docs. Maybe they would even work as
additions to the csv.py module, assuming the api is palatable.

Skip

From skip at pobox.com Fri Mar 18 18:06:53 2005
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 18 Mar 2005 11:06:53 -0600
Subject: [Csv] Example workaround classes for using Unicode with csv module...
Message-ID: <16955.2733.160226.850562@montanaro.dyndns.org>

I added UnicodeReader and UnicodeWriter example classes to the csv module
docs just now. They mention problems with ASCII NUL characters (which I
vaguely remember - NUL-terminated strings are used internally, right?). Do
NULs still present a problem? I saw nothing in the log messages that
mentioned "ascii" or "nul" so I presume it is.

Here's what I added. Let me know if you think it needs any corrections,
especially if there's a better way to word "as long as you avoid encodings
like utf-16 that use NULs". Can that just be "as long as you avoid
multi-byte encodings other than utf-8"?

I'd like to have something like this in the docs to demonstrate a reasonable
workaround for the current no-Unicode code without casting it in stone by
adding it to csv.py.

--------------------------------------------------------------------------

The \module{csv} module doesn't directly support reading and writing
Unicode, but it is 8-bit clean save for some problems with \ASCII{} NUL
characters, so you can write classes that handle the encoding and decoding
for you as long as you avoid encodings like utf-16 that use NULs.

\begin{verbatim}
import csv

class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.reader = csv.reader(f, dialect=dialect, **kwds)
        self.encoding = encoding

    def next(self):
        # Decode each byte-string cell after the csv module has parsed it.
        row = self.reader.next()
        return [unicode(s, self.encoding) for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.writer = csv.writer(f, dialect=dialect, **kwds)
        self.encoding = encoding

    def writerow(self, row):
        # Encode with the configured encoding, not a hard-coded "utf-8".
        self.writer.writerow([s.encode(self.encoding) for s in row])

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
\end{verbatim}

They should work just like the \class{csv.reader} and \class{csv.writer}
classes but add an \var{encoding} parameter.
--------------------------------------------------------------------------

Thx,

Skip

From andrewm at object-craft.com.au Mon Mar 21 00:28:02 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 21 Mar 2005 10:28:02 +1100
Subject: [Csv] Example workaround classes for using Unicode with csv module...
In-Reply-To: <16955.2733.160226.850562@montanaro.dyndns.org>
References: <16955.2733.160226.850562@montanaro.dyndns.org>
Message-ID: <20050320232802.6C7333C091@coffee.object-craft.com.au>

>I added UnicodeReader and UnicodeWriter example classes to the csv module
>docs just now. They mention problems with ASCII NUL characters (which I
>vaguely remember - NUL-terminated strings are used internally, right?). Do
>NULs still present a problem? I saw nothing in the log messages that
>mentioned "ascii" or "nul" so I presume it is.

That's right - it still uses null terminated strings internally, and the
various special characters (quotechar, escapechar, etc) use null to mean
"not specified". Fixing this would cause much upheaval.

>Here's what I added. Let me know if you think it needs any corrections,
>especially if there's a better way to word "as long as you avoid encodings
>like utf-16 that use NULs". Can that just be "as long as you avoid
>multi-byte encodings other than utf-8"?

I think only utf-8 provides the guarantees needed for this to work -
specifically, multi-byte characters need to have the high bit set
(otherwise a delimiter or other special character appearing within a
multi-byte character would upset the parsing), while at the same time
having single byte characters for the characters with special meaning to
the parser: note also that none of the special characters (quotechar,
delimiter, escapechar, etc) can be a multi-byte sequence.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/
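
[Editorial sketch, not part of the original thread.] The byte-level
guarantee Andrew describes can be checked directly. This is a minimal
illustration written for Python 3, whereas the code in the thread is
Python 2. The helper name special_bytes_in and the SPECIAL_BYTES set are
hypothetical, chosen here to cover the excel dialect's delimiter and
quotechar, the line-ending bytes, and NUL. It reports which of those bytes
show up in the encoded form of a string that contains none of those
characters.

```python
# Illustrative sketch only; names are assumptions, not part of the csv module.
# Bytes with special meaning to a byte-oriented CSV parser:
# comma, double quote, CR, LF, and NUL.
SPECIAL_BYTES = b',"\r\n\x00'


def special_bytes_in(text, encoding):
    """Return the sorted byte values from SPECIAL_BYTES found in text's encoding."""
    encoded = text.encode(encoding)
    return sorted(set(encoded) & set(SPECIAL_BYTES))


# UTF-8: every byte of a multi-byte character has the high bit set,
# so non-ASCII text never produces a special byte.
print(special_bytes_in("\u65e5\u672c\u8a9e", "utf-8"))   # []

# UTF-16: plain ASCII text already contains NUL bytes.
print(special_bytes_in("Hi", "utf-16-le"))               # [0]

# UTF-16: U+012C encodes as the bytes 0x2C 0x01, and 0x2C is a comma.
print(special_bytes_in("\u012c", "utf-16-le"))           # [44]
```

An empty result for an encoding on a few samples is only evidence, not
proof. For UTF-8 the property follows from the encoding's design, which is
why the workaround classes in this thread are safe with it.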