Problems with csv module

John Machin sjmachin at lexicon.net
Wed May 11 20:55:40 EDT 2005


On Wed, 11 May 2005 20:02:25 +0200, "Fredrik Lundh"
<fredrik at pythonware.com> wrote:

>Skip Montanaro wrote:
>
>>     Fredrik> does the CSV format even support Unicode-encoded data streams?
>>
>> Based on the requests I've seen here and on the csv at mojam.com mailing list,
>> it appears people are certainly generating CSV files which contain Unicode-
>> encoded data.
>
>in what encodings?
>
>is the encoding specified inside the file?  if so, how?
>
>(it should be noted that the phrase "Unicode-encoded data" that I
>used doesn't make much sense, even in the original context.  what
>I meant to say was that CSV, as far as I know, isn't defined as a
>stream of Unicode characters, but rather as a stream of bytes in an
>ASCII-compatible encoding.  this means that you can use e.g. ISO-
>8859-1 or UTF-8 for string values, but not that you can encode the
>whole thing as, say UTF-16 or UCS-4).

The CSV format is not defined at all, AFAIK.

Empirically, writing CSV works more or less like this, for each row:

# sketch, lightly tested; the default values are only common choices
def format_row(fields, delimiter=',', quote_char='"',
               control_chars='\r\n'):  # or maybe more or maybe just '\n'
    out_list = []
    for field in fields:
        if quote_char in field:
            # double each embedded quote, then wrap the whole field
            out_field = (quote_char
                         + field.replace(quote_char, quote_char + quote_char)
                         + quote_char)
        elif delimiter in field or any(c in field for c in control_chars):
            out_field = quote_char + field + quote_char
        else:
            out_field = field
        out_list.append(out_field)
    return delimiter.join(out_list)

Then you write format_row(fields) followed by "\r\n".

So there is no reason at all why a writer and a reader couldn't use
the above quoting mechanism to transfer columnar data containing
Unicode -- they just have to agree on the encoding, control
characters, quote_char, delimiter, and line terminator.
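As a rough illustration of that agreement, here is a minimal sketch using the standard csv module in a current Python 3 (not the version under discussion in this thread). The helper names write_rows and read_rows are hypothetical, and the defaults are just one possible set of agreed values. The quoting is done on text; the encoding is applied to the whole stream afterwards, so the same code works for UTF-8 or UTF-16.

```python
import csv
import io


def write_rows(rows, encoding="utf-8", delimiter=",", quotechar='"'):
    """Quote the rows as text, then encode the whole stream to bytes."""
    buf = io.StringIO(newline="")  # newline="" keeps "\r\n" untranslated
    writer = csv.writer(buf, delimiter=delimiter, quotechar=quotechar,
                        lineterminator="\r\n")
    writer.writerows(rows)
    return buf.getvalue().encode(encoding)


def read_rows(data, encoding="utf-8", delimiter=",", quotechar='"'):
    """Decode the bytes with the agreed encoding, then undo the quoting."""
    text = data.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=delimiter, quotechar=quotechar)
    return list(reader)


rows = [["naïve", 'say "hi"', "a,b", "line1\nline2"],
        ["日本語", "", "x"]]

# Both sides must pass the same encoding, delimiter and quotechar.
for enc in ("utf-8", "utf-16"):
    assert read_rows(write_rows(rows, encoding=enc), encoding=enc) == rows
```

If the reader guesses a different encoding or delimiter than the writer used, the round trip fails, which is the whole point: the quoting mechanism itself is indifferent to Unicode.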

Excel (see my other post in this thread) provides a writing ("save as
Unicode text") and reading mechanism which uses u'\t' as the
delimiter, u'\r\n' as the line terminator, u'\"' as the quote_char,
and utf-16 as the encoding. I haven't done an exhaustive check to see
what its definition of control_chars would be.
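A reader for a file with those parameters could be sketched as follows, again with the csv module in a current Python 3. The function name read_unicode_text is hypothetical, and the sample bytes are built by hand rather than saved from Excel, so they only approximate what Excel actually writes. The sketch assumes the data starts with a byte order mark, which the "utf-16" codec uses to pick the byte order.

```python
import csv
import io


def read_unicode_text(data):
    """Parse tab-delimited, UTF-16 encoded bytes into a list of rows."""
    text = data.decode("utf-16")  # BOM selects endianness and is dropped
    # "excel-tab" is the csv module's built-in tab-delimited dialect:
    # delimiter "\t", quotechar '"', doubled quotes inside quoted fields.
    reader = csv.reader(io.StringIO(text, newline=""), dialect="excel-tab")
    return list(reader)


# Hand-made sample, not real Excel output.
sample = 'name\tnote\r\nZoë\t"has a\ttab"\r\n'.encode("utf-16")
parsed = read_unicode_text(sample)
```

When reading from disk instead of from bytes, opening the file with open(path, encoding="utf-16", newline="") and passing the file object to csv.reader should behave the same way.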




