Problems with csv module
John Machin
sjmachin at lexicon.net
Wed May 11 20:55:40 EDT 2005
On Wed, 11 May 2005 20:02:25 +0200, "Fredrik Lundh"
<fredrik at pythonware.com> wrote:
>Skip Montanaro wrote:
>
>> Fredrik> does the CSV format even support Unicode-encoded data streams?
>>
>> Based on the requests I've seen here and on the csv at mojam.com mailing list,
>> it appears people are certainly generating CSV files which contain Unicode-
>> encoded data.
>
>in what encodings?
>
>is the encoding specified inside the file? if so, how?
>
>(it should be noted that the phrase "Unicode-encoded data" that I
>used doesn't make much sense, even in the original context. what
>I meant to say was that CSV, as far as I know, isn't defined as a
>stream of Unicode characters, but rather as a stream of bytes in an
>ASCII-compatible encoding. this means that you can use e.g. ISO-
>8859-1 or UTF-8 for string values, but not that you can encode the
>whole thing as, say UTF-16 or UCS-4).
The CSV format is not defined at all, AFAIK.
Empirically, writing CSV works more-or-less like this, for each row:
# pseudocode, untested
control_chars = '\r\n' # or maybe more or maybe just '\n'
out_list = []
for each field:
    if field contains quote_char:
        out_field = quote_char + \
            field.replace(quote_char, quote_char + quote_char) + \
            quote_char
    elif field contains any one of delimiter or control_chars:
        out_field = quote_char + field + quote_char
    else:
        out_field = field
    out_list.append(out_field)
then you write delimiter.join(out_list) followed by "\r\n"
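
For anyone who wants to try the steps above, here is a minimal runnable sketch of the same procedure in Python. The function name format_row, its parameter names, and the default values are my assumptions for illustration; this is not what the csv module does internally.

```python
def format_row(fields, delimiter=",", quote_char='"',
               control_chars="\r\n", line_terminator="\r\n"):
    """Format one row using the empirical quoting rules described above.

    All names and defaults here are illustrative assumptions.
    """
    out_list = []
    for field in fields:
        if quote_char in field:
            # Double every embedded quote, then wrap the whole field in quotes.
            out_field = (quote_char
                         + field.replace(quote_char, quote_char + quote_char)
                         + quote_char)
        elif delimiter in field or any(ch in field for ch in control_chars):
            # No embedded quotes, but the field would otherwise be ambiguous.
            out_field = quote_char + field + quote_char
        else:
            out_field = field
        out_list.append(out_field)
    return delimiter.join(out_list) + line_terminator


if __name__ == "__main__":
    print(repr(format_row(["plain", 'say "hi"', "a,b", "two\nlines"])))
```

Note that the function works on text and says nothing about bytes; choosing an encoding is a separate step applied to the returned string.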
So there is no reason at all why a writer and a reader couldn't use
the above quoting mechanism to transfer columnar data containing
Unicode -- they just have to agree on the encoding, control
characters, quote_char, delimiter, and line terminator.
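
As a rough illustration of that agreement, the sketch below round-trips non-ASCII data through the csv module of a current Python 3, where the module handles text and the encoding is applied separately. The SETTINGS and ENCODING values and the helper names write_rows and read_rows are hypothetical; any values would do as long as both sides use the same ones.

```python
import csv
import io

# Hypothetical settings that the writer and the reader agree on in advance.
SETTINGS = {"delimiter": ";", "quotechar": "'", "lineterminator": "\r\n"}
ENCODING = "utf-8"


def write_rows(rows):
    """Format rows as text with the agreed settings, then encode to bytes."""
    buf = io.StringIO(newline="")  # newline="" leaves line endings untouched
    csv.writer(buf, **SETTINGS).writerows(rows)
    return buf.getvalue().encode(ENCODING)


def read_rows(data):
    """Decode bytes with the agreed encoding, then parse with the same settings."""
    buf = io.StringIO(data.decode(ENCODING), newline="")
    return list(csv.reader(buf, **SETTINGS))


if __name__ == "__main__":
    rows = [["naïve", "a;b"], ["it's", "two\nlines"]]
    print(read_rows(write_rows(rows)) == rows)
```

If the reader guessed a different encoding or quote character, the round trip would no longer return the original rows, which is the whole point of agreeing up front.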
Excel (see my other post in this thread) provides a writing ("save as
Unicode text") and reading mechanism which uses u'\t' as the
delimiter, u'\r\n' as the line terminator, u'\"' as the quote_char,
and utf-16 as the encoding. I haven't done an exhaustive check to see
what its definition of control_chars would be.
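
A file with those properties could be read roughly as follows in a current Python 3. This is a sketch under assumptions: the sample bytes are built by hand rather than saved from Excel, the helper name read_unicode_text is made up, and I assume the data starts with a byte-order mark so that the "utf-16" codec can detect the byte order.

```python
import csv
import io


def read_unicode_text(data):
    """Parse tab-delimited, UTF-16 encoded bytes into a list of rows."""
    # newline="" lets the csv module see the "\r\n" terminators unchanged.
    text = io.TextIOWrapper(io.BytesIO(data), encoding="utf-16", newline="")
    # csv.excel_tab uses "\t" as delimiter, '"' as quote_char, "\r\n" as terminator.
    return list(csv.reader(text, dialect=csv.excel_tab))


# Hand-built sample; encoding with "utf-16" prepends a byte-order mark.
sample = 'name\tnote\r\n"Zoë"\t"said ""hi"""\r\n'.encode("utf-16")

if __name__ == "__main__":
    for row in read_unicode_text(sample):
        print(row)
```

For a real file, opening it with open(path, encoding="utf-16", newline="") and passing the file object to csv.reader should behave the same way.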