[ python-Bugs-1606092 ] csv module broken for unicode

Wed Dec 6 16:56:48 CET 2006

Bugs item #1606092, was opened at 2006-11-30 08:46
Message generated for change (Comment added) made by montanaro
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1606092&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: JettLogic (jettlogic)
Assigned to: Nobody/Anonymous (nobody)
Summary: csv module broken for unicode

Initial Comment:
The csv module reads and writes only ASCII or UTF-8 str data, and the do-it-yourself example in the Python 2.5 manual for writing in another encoding is extremely clunky:

1) convert unicode to utf-8
2) use csv on utf-8 with cStringIO output
3) convert utf-8 to unicode
4) convert unicode to target encoding (may be utf-8...)

So clunky as to be a bug - csv clearly can't handle unicode at all.  The module functions are in dire need of either accepting unicode objects (letting the output stream worry about the encoding, like codecs.StreamWriter), or at the very least accepting data directly in a target encoding instead of roundabout utf-8.
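
For illustration only, here is a minimal sketch of the requested behaviour: the caller hands text to the csv writer and the output stream handles the encoding. It assumes Python 3, where csv.writer accepts text directly; it does not work on Python 2.5. The helper name write_rows is hypothetical.

```python
import csv
import io


def write_rows(rows, encoding):
    """Write rows of text through csv and return the encoded bytes.

    Hypothetical helper: the text wrapper around the byte stream does
    the encoding, so csv itself never sees the target encoding.
    """
    raw = io.BytesIO()
    # newline="" stops the wrapper from translating the "\r\n" that
    # csv.writer emits by default.
    text = io.TextIOWrapper(raw, encoding=encoding, newline="")
    csv.writer(text).writerows(rows)
    text.flush()
    data = raw.getvalue()
    text.detach()  # keep the wrapper from closing raw behind our back
    return data


latin1_bytes = write_rows([["é", "b"]], "latin-1")
utf16_bytes = write_rows([["é", "b"]], "utf-16")
```

The same call produces any target encoding the codec machinery supports, with no intermediate utf-8 step.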

To read another encoding is a bit less onerous than writing:

1) wrap file to return utf-8
2) use csv, getting utf-8 output
3) convert utf-8 to unicode object
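
For comparison, a minimal sketch of what reading looks like when csv works on text and the input stream does the decoding. It assumes Python 3, where csv.reader consumes text; the three steps above are still what Python 2.5 requires. The helper name read_rows is hypothetical.

```python
import csv
import io


def read_rows(data, encoding):
    """Decode encoded CSV bytes and return the rows as lists of text.

    Hypothetical helper: the text wrapper decodes from the source
    encoding, so no intermediate utf-8 conversion is needed.
    """
    # newline="" lets csv handle line endings inside quoted fields itself.
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return list(csv.reader(text))


rows = read_rows("é,b\r\nc,d\r\n".encode("latin-1"), "latin-1")
```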

Anyone willing to fix the csv module?

----------------------------------------------------------------------

>Comment By: Skip Montanaro (montanaro)
Date: 2006-12-06 09:56

Message:
Logged In: YES 
user_id=44345
Originator: NO

> Anyone know why it uses a C extension?

Performance.  A number of people (among them the authors of the _csv
extension and me, a contributor to the Python csv module that fronts
it) routinely read and write large (several megabytes) CSV files.  We
had all had experience with earlier Python-only CSV readers and
writers.  Their performance was just too poor.

If you wrote a new module in Python that's compatible with the
existing module -- and performed acceptably -- I see no reason it
couldn't replace the current module.  There are already a number
of test cases.  You'd certainly have to embellish them, but if the
current set passed that would be a good indication your code was at
least on the right track compatibility-wise.
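
As a rough illustration of that kind of compatibility check, the sketch below formats one row with ordinary string operations and compares the result against the existing csv module. It assumes Python 3, where str is Unicode text. The names format_row and csv_module_row are hypothetical, and only the default comma and double-quote dialect with minimal quoting is covered, not the full feature set.

```python
import csv
import io


def format_row(fields, delimiter=",", quotechar='"', lineterminator="\r\n"):
    """Format one row using minimal quoting (hypothetical pure-Python helper)."""
    # A field is quoted only if it contains one of these characters.
    special = (delimiter, quotechar, "\r", "\n")
    out = []
    for field in fields:
        text = str(field)
        if any(ch in text for ch in special):
            # Double any embedded quote characters, then wrap in quotes.
            text = quotechar + text.replace(quotechar, quotechar * 2) + quotechar
        out.append(text)
    return delimiter.join(out) + lineterminator


def csv_module_row(fields):
    """Format the same row with the existing csv module, for comparison."""
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    return buf.getvalue()


sample_rows = [
    ["a", "b"],
    ["x,y", 'say "hi"'],
    ["line\nbreak", "é"],
]
mismatches = [r for r in sample_rows if format_row(r) != csv_module_row(r)]
```

An empty mismatches list means the two writers agree on these sample rows only; a real replacement would need the existing test suite and benchmarks as well.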

There are reasons besides Unicode support to want a Python-based
solution.  Such a module would be much more likely to work with
other Python implementations (e.g., PyPy, IronPython and Jython).

Skip

----------------------------------------------------------------------

Comment By: JettLogic (jettlogic)
Date: 2006-12-05 05:53

Message:
Logged In: YES 
user_id=1345991
Originator: YES

Anyone know why it uses a C extension?  The C code apparently appends
fields to a writable byte buffer (so patching for unicode is impossible),
reallocated as it grows.  How much efficiency is gained by doing that,
with its many lines of logic overhead, versus careful use of python
strings?  For montanaro: the UnicodeWriter, with its three encoding
conversions and a StringIO, shows how much efficiency is already being lost.

Perhaps lemburg's suggestion of a pure-python re-implementation of _csv is
the way to go.  It does not look like a fun task, after adding in
back-compatibility, benchmarks and tests, and I couldn't commit to it just
yet.  Are C->Py patches typically accepted?  (assume quality code and
comparable benchmarks)

I'll have to leave it at that.  If you leave this open, someone might take
it up at some point.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2006-12-03 12:22

Message:
Logged In: YES 
user_id=44345
Originator: NO

I must admit I don't understand the criticism of the UnicodeReader and
UnicodeWriter example classes in the module documentation.  Sure, their
implementations jump through some hoops, but that's so you don't have to. 
If you use them as written I believe their APIs should be about the same
as the csv.reader and csv.writer classes with the added improvement that
the reader returns Unicode and the writer accepts Unicode.  If your desire
is to read and write Unicode why do you care that those objects are encoded
using utf-8 in the file?

Like Martin asked, are you willing to come up with better examples? 
Better yet, are you willing to provide a patch for the underlying
extension module so it handles Unicode?  Hint: I'm fairly certain that if
it was trivial it would have been done by now.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-12-03 06:54

Message:
Logged In: YES 
user_id=38388
Originator: NO

It should be easy to provide a wrapper class which implements the above in
plain Python.

However, if no one volunteers to write such code, it's not going to happen.

I've found that the builtin csv module is not flexible enough to deal with
the often broken CSV data you typically find in practice, so perhaps adding
a pure Python implementation which works with Unicode might prove to be a
better approach.

Unassigning the report, since I don't have time for this.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2006-12-03 04:34

Message:
Logged In: YES 
user_id=21627
Originator: NO

Are you willing to fix it?

----------------------------------------------------------------------
