[Csv] Re: [Python-checkins] python/nondist/sandbox/csv/test unicode_test.py,NONE,1.1

Skip Montanaro skip at pobox.com
Sat Feb 8 19:48:17 CET 2003


(redirecting to the csv mailing list so this stuff gets archived.)

    >> http://mail.python.org/pipermail/python-list/2003-February/145151.html

    mal> Why not convert the input data to UTF-8 and take it from there ?

Good suggestion, thanks.  The only issue is the variable-width nature of
UTF-8.  If we are going to convert to a concrete encoding, wouldn't it be
easier to convert to one which has constant-width characters?  Of course, if
I can convince the guys in Australia writing the actual code to deal with a
variable-width encoding, it can't be far from there to allowing
multi-character delimiters. ;-)
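
To make the width concern concrete, here is a small sketch in present-day
Python (not code from the sandbox).  The sample characters and the helper
name encoded_widths are my own choices for illustration.  It counts how many
bytes a single character occupies under three encodings.

```python
# Sample characters, written as escapes so the file itself stays ASCII:
# LATIN CAPITAL A, e-acute, EURO SIGN, MUSICAL SYMBOL G CLEF (outside the BMP).
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

def encoded_widths(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding."""
    # The explicit "-le" variants are used so that no BOM is counted.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

widths = {ch: encoded_widths(ch) for ch in samples}

for ch, w in widths.items():
    print("U+%04X" % ord(ch), w)
```

UTF-8 takes one to four bytes per character here, and UTF-32 always takes
four.  UTF-16 takes two bytes for the first three samples but four for the
last one, so it is constant-width only for characters inside the Basic
Multilingual Plane.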

    mal> Are you sure that Unicode objects will be slower in processing ?

Operating on Python string or unicode objects without converting them to
some sort of C string will almost certainly be slower than the current code,
which is a relatively modest finite state machine operating on individual
bytes.
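
For readers who have not seen that style of parser, the following is a
minimal sketch of a byte-at-a-time state machine, written in Python rather
than C.  It is not the module's actual code: the function name split_record,
the state names, and the quoting rules (a doubled quote inside a quoted field
stands for one quote) are assumptions made for illustration.

```python
def split_record(line: bytes, delimiter: bytes = b",", quote: bytes = b'"') -> list:
    """Split one record into fields, looking at a single byte at a time."""
    START, IN_FIELD, IN_QUOTED, QUOTE_IN_QUOTED = range(4)
    delim, q = delimiter[0], quote[0]
    fields, field, state = [], bytearray(), START

    for byte in line:  # iterating over bytes yields ints
        if state in (START, IN_FIELD):
            if byte == delim:
                fields.append(bytes(field))
                field = bytearray()
                state = START
            elif byte == q and state == START:
                state = IN_QUOTED          # opening quote, not part of the data
            else:
                field.append(byte)
                state = IN_FIELD
        elif state == IN_QUOTED:
            if byte == q:
                state = QUOTE_IN_QUOTED    # closing quote or first half of ""
            else:
                field.append(byte)
        else:  # QUOTE_IN_QUOTED
            if byte == q:
                field.append(byte)         # "" inside quotes is a literal quote
                state = IN_QUOTED
            elif byte == delim:
                fields.append(bytes(field))
                field = bytearray()
                state = START
            else:
                field.append(byte)
                state = IN_FIELD

    fields.append(bytes(field))            # last field has no trailing delimiter
    return fields


plain = split_record(b'a,"b,c",d')
quoted = split_record(b'"say ""hi""",x')
# The same machine fed UTF-16 data: the NUL bytes end up inside the fields.
wide = split_record("a,b".encode("utf-16-le"))
```

The last call shows why the encoding question matters for this design.  A
machine that compares single bytes against the delimiter splits UTF-16 input
in the middle of a character, so the input has to be recoded first or the
machine has to work on wider units.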

    mal> (Is there a standard for encodings in CSV files ?)

No, there is none, hence the use of codecs.EncodedFile to allow the
programmer to specify the encoding.  Excel can export to two formats it
calls "Unicode CSV" and "Unicode Text".  Exporting a spreadsheet containing
nothing but ASCII as Unicode CSV produced exactly the same comma-separated
file as the usual CSV export format would have.  Exporting the same
spreadsheet as Unicode Text produced a tab-separated file which I guessed to
be UTF-16.  It started with a little-endian UTF-16 BOM, and every character
was two bytes wide, one of those bytes being an ASCII NUL.
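
As a rough illustration of the recoding approach, here is a sketch in
present-day Python, where the csv module reads text rather than bytes.  The
sample bytes are made up to match the description above (little-endian BOM,
two bytes per character); they are not taken from a real Excel export, and
the function name rows_from_utf16_tsv is hypothetical.

```python
import codecs
import csv
import io

def rows_from_utf16_tsv(raw: bytes) -> list:
    """Parse tab-separated UTF-16 bytes (starting with a BOM) into rows."""
    # EncodedFile recodes on the fly: the wrapped stream holds UTF-16, and
    # reads from the wrapper return UTF-8.  The "utf-16" codec uses the BOM
    # to pick the byte order and then drops it.
    recoded = codecs.EncodedFile(io.BytesIO(raw), "utf-8", "utf-16")
    text = recoded.read().decode("utf-8")
    # newline="" leaves line endings alone so the csv reader can handle them.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))

# Made-up sample: BOM first, then each ASCII character followed by a NUL.
sample = codecs.BOM_UTF16_LE + "a\tb\r\n1\t2\r\n".encode("utf-16-le")
rows = rows_from_utf16_tsv(sample)
print(sample[:4])   # BOM, then "a" and its NUL byte
print(rows)
```

Reading everything into memory keeps the sketch short.  For a large file one
would more likely open it with the encoding specified and hand the file
object to csv.reader directly.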

Thanks for the feedback,

Skip

