csv Parser Question - Handling of Double Quotes

John Machin sjmachin at lexicon.net
Thu Mar 27 18:00:01 EDT 2008


On Mar 28, 7:37 am, Aaron Watters <aaron.watt... at gmail.com> wrote:
>
> If you want fame and admiration you could fix
> the arguably bug in the csv module and send
> the patch to the python bugs mailing list.
> However, I just had a perusal of csv.py....
> good luck :).

It is *NOT* a bug in the Python CSV module. The data is the problem.
The admittedly arcane way that the admittedly informal CSV writing
protocol works for each field is (simplified by ignoring \n and other
quotables):

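QUOTE = '"'
DELIM = ','

def write_field(field, emit):
    # emit is any callable that accepts the serialised field text
    if QUOTE in field:
        # wrap in quotes and double every embedded quote
        emit(QUOTE + field.replace(QUOTE, QUOTE + QUOTE) + QUOTE)
    elif DELIM in field:
        # no embedded quotes, so wrapping alone is enough
        emit(QUOTE + field + QUOTE)
    else:
        emit(field)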

Example: database query, customer's surname recorded as O"Brien

This should be written as ...,"O""Brien",...
and read back as ['...', 'O"Brien', '...']
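
To check this round trip with the csv module itself, here is a minimal sketch. The sample row and the variable names are made up for illustration, and the default "excel" dialect is assumed:

```python
import csv
import io

# Illustrative row: one field with an embedded quote, one with a delimiter.
row = ['Smith', 'O"Brien', 'Jones, Jr.']

# Serialise with the default dialect (minimal quoting, quote-doubling on).
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
written = buffer.getvalue()
print(written)

# Parse the serialised text and compare with the original row.
read_back = next(csv.reader(io.StringIO(written)))
print(read_back == row)
```

The embedded quote comes out doubled inside a quoted field, and the reader undoes the doubling on the way back in.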

Aside: this quote-doubling caper is not restricted to CSV and not
exactly an uncommon occurrence:
SELECT * FROM cust WHERE surname = 'O''Brien';
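
A small sketch of the same doubling rule in SQL, using Python's sqlite3 module against an in-memory database (the choice of SQLite and the variable names are my assumptions; the original query names no particular database):

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# A doubled single quote inside an SQL string literal denotes one quote.
literal_value = conn.execute("SELECT 'O''Brien'").fetchone()[0]

# A bound parameter avoids hand-quoting altogether.
param_value = conn.execute("SELECT ?", ("O'Brien",)).fetchone()[0]

conn.close()
print(literal_value == param_value)
```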

A common mistake in CSV writing is to omit the quote-doubling step
above. If that is done, it is impossible to recover the original
contents unambiguously in all cases without further knowledge,
assumptions, heuristics, or look-ahead e.g. (1) the original field had
an even number of quotes or (2) the intended number of fields is known
or (3) there is only one quote in the line and there are no embedded
newlines ...
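
One way to see the ambiguity is the sketch below. It assumes a hypothetical faulty writer, naive_row, that wraps every field in quotes but never doubles embedded quotes; doubled_row is the same writer with the doubling step restored. Both names and the sample rows are invented for illustration:

```python
import csv

QUOTE = '"'
DELIM = ','

def naive_row(fields):
    # Hypothetical faulty writer: quotes every field, never doubles quotes.
    return DELIM.join(QUOTE + f + QUOTE for f in fields)

def doubled_row(fields):
    # Same writer with the quote-doubling step restored.
    return DELIM.join(
        QUOTE + f.replace(QUOTE, QUOTE + QUOTE) + QUOTE for f in fields
    )

two_fields = ['x', 'y']
one_field = ['x","y']   # a single field containing quote, comma, quote

# Without doubling, two different rows serialise to the same text.
print(naive_row(two_fields) == naive_row(one_field))

# With doubling, each row reads back as itself.
print(next(csv.reader([doubled_row(one_field)])) == one_field)
print(next(csv.reader([doubled_row(two_fields)])) == two_fields)
```

Once the two rows have collapsed into identical text, only outside knowledge such as the expected field count can tell them apart.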

The Python csv module emulates Excel in delivering garbage silently in
cases when the expected serialisation protocol has (detectably) not
been followed. Proffering fame and admiration might be better directed
towards introducing a "strict" option than patching a non-existent bug
(which would introduce new ones).
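
For readers on current Python versions: the csv module's dialect parameters include a strict flag, which defaults to False. The sketch below feeds the reader a line written without quote-doubling, first with the defaults and then with strict=True. The sample line and variable names are illustrative:

```python
import csv

# The O"Brien field written WITHOUT the quote-doubling step.
bad_line = 'a,"O"Brien",b'
intended = ['a', 'O"Brien', 'b']

# Default behaviour: a row comes back with no error raised.
lenient = next(csv.reader([bad_line]))
print(lenient == intended)

# strict=True: the malformed quoting is reported as csv.Error.
try:
    next(csv.reader([bad_line], strict=True))
    strict_error = None
except csv.Error as exc:
    strict_error = exc
print(strict_error)
```

In the default mode the reader returns a row without complaint, but the middle field is not the one that was intended; in strict mode the same input is rejected.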

Cheers,
John


