[Python-ideas] csv dialect enhancement (repost)

rurpy at yahoo.com rurpy at yahoo.com
Fri Jan 11 18:51:09 CET 2013


[Sorry for the duplicated text in the previous post, please ignore 
that one in favor of this one]

There is a common dialect of CSV, often used in database
applications [*1], that distinguishes between an empty
(quoted) string,
 
  e.g., the second field in  "abc","",3
 
and an empty field,
 
  e.g., the second field in "abc",,3
 
This distinction is needed to specify or tell the
difference between 0-length strings and NULLs, when sending
csv data to or receiving it from a database application.

AFAICT, Python's csv module does not distinguish between
empty fields and empty quoted strings.  Both of the examples
above, when parsed by csv.Reader, will return ['abc', '', 3]
(or possibly '3' for the last item depending on options). 
Similarly, csv.Writer produces the same output csv text
(nothing or a quoted empty string depending on Dialect.quoting)
for row items '' or None.

csv.Reader could distinguish between the above cases by
using an empty string ('') to report an empty (quoted) string
field, and None to report an empty field.  Thus the second
example would produce ['abc', None, 3] (or ...,'3').  Similarly,
csv.Writer could produce alternate text (nothing or a quoted
empty string) depending on whether a row item was None or
an empty string.
 
I propose that a new dialect attribute be added, "nulls" [*2],
which when false (default) will cause csv to behave as it currently
does.  When true it will have the following effect:
 
Reader:
  When two adjacent delimiters occur, or two white-space
  separated delimiters when Dialect.skipinitialspace is true,
  a value of None will be returned for that field.
 
Writer:
  When a None is present in the the list of items being
  formatted, it will result in an empty output field
  (two adjacent delimiters) regardless of other options
  (eg a QUOTE_ALL setting.)

Sniffer:
  Will set "nulls" to True when both adjacent delimiters and
  quoted empty strings are seen in the input text.
  (Perhaps this behaviour needs to be optional for backward
  compatibility reasons?)

I think this will allow the csv module to generate the csv
dialect(s) commonly used by databases applications.

A specific use case:

I am migrating data from a MS Access database to Postgresql.
I run a tool that extracts table data from Access and correctly
produces CSV files in the dialect used by Postgresql with some
(nullable) column values having empty fields and other non-
nullable column values having empty string fields.

But I need to modify some values before import.  So I write a
Python program that parses the csv data, modifies some of it
and writes it back out, using the csv module.  But the result
is that all empty fields and empty strings are written out
identically as one or the other (the distinction is not preserved). 
Result is that information is lost and the output cannot be
used.  I would be able to do this if the csv module provide a
"nulls" option as proposed above.

----
[*1] One of the two most important open-source databases,
Postgresql, uses this dialect.  See:
  http://www.postgresql.org/docs/9.2/interactive/sql-copy.html#AEN66692
I don't know about the other.

[*2] I don't really care what the attribute name is; I chose
"nulls" as a trial balloon because I wanted to avoid something
with "none" in it to avoid confusion with QUOTE_NONE.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20130111/7c83e73e/attachment.html>


More information about the Python-ideas mailing list