From dpeastman at gmail.com  Sat Aug 20 02:38:47 2011
From: dpeastman at gmail.com (Donovan Eastman)
Date: Fri, 19 Aug 2011 17:38:47 -0700
Subject: [Csv] Quoting non-existent values
Message-ID: <CACXNDU--1WhS2DP629M6u=4GVPMunfFThLHcFa9mvMbT8OUCdQ@mail.gmail.com>

I have a feature request, and I hope this is the appropriate place to
propose it.

Currently there is no way (that I can figure out) for csv to create a
file that looks like this:

"name","age_if_known","city_if_known"
"Bob",,"Boston"
"Alice",34,""

The rub is the blank unquoted entry for Bob's age -- otherwise
QUOTE_NONNUMERIC does exactly what I need it to.

I know that there has been some discussion in the past about allowing
explicit type specification for each column.  That would be cool, and
would presumably solve my problem, but I have a simpler suggestion:
Add a new quoting mode that is exactly the same as QUOTE_NONNUMERIC,
except that it wouldn't quote None.  So to get the output described
above, I would pass:
[["name", "age_if_known", "city_if_known"],
["Bob", None, "Boston"],
["Alice", 34, ""]]

I don't have any great suggestions on a name for this new mode...
QUOTE_NONNUMERIC_NOTNONE? QUOTE_NONNUMERIC2?  Hopefully somebody else
has a better idea.
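
For illustration, the behavior of the proposed mode can be approximated
in pure Python without the csv module.  This is only a minimal sketch:
format_row is a hypothetical helper, not part of csv, and it assumes
that fields are None, numbers, or strings without embedded newlines.

```python
def format_row(row, delimiter=",", quotechar='"'):
    """Quote strings, leave numbers bare, write None as an empty unquoted field."""
    fields = []
    for value in row:
        if value is None:
            # The one difference from QUOTE_NONNUMERIC: None is not quoted.
            fields.append("")
        elif isinstance(value, (int, float)):
            fields.append(str(value))
        else:
            # Double any embedded quote characters, then wrap in quotes.
            escaped = str(value).replace(quotechar, quotechar * 2)
            fields.append(quotechar + escaped + quotechar)
    return delimiter.join(fields)


rows = [["name", "age_if_known", "city_if_known"],
        ["Bob", None, "Boston"],
        ["Alice", 34, ""]]

for row in rows:
    print(format_row(row))
```

With the rows above, this prints the three lines of the desired file,
including the unquoted blank for Bob's age.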

For my own purposes, I really only care about the writer, but it seems
like it should be implemented in the reader as well, if only to
provide the symmetry.  In addition, this would make it possible to
extract potentially important information out of certain source files
(Example: a file originating from a database table where NULL and ""
have distinctly different meanings).
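
A rough sketch of what the reader side could do, again outside the csv
module: parse_line is a hypothetical helper that maps an unquoted empty
field to None, a quoted field to a string, and any other unquoted field
to a float (as QUOTE_NONNUMERIC does when reading).  It assumes one
record per line, with no newlines inside quoted fields.

```python
def parse_line(line, delimiter=",", quotechar='"'):
    """Split one CSV line, keeping unquoted-empty (None) distinct from ""."""
    fields = []
    i, n = 0, len(line)
    while True:
        if i < n and line[i] == quotechar:
            # Quoted field: collect characters up to the closing quote.
            i += 1
            chars = []
            while True:
                if i >= n:
                    raise ValueError("unterminated quoted field")
                ch = line[i]
                if ch == quotechar:
                    if i + 1 < n and line[i + 1] == quotechar:
                        # A doubled quote is a literal quote character.
                        chars.append(quotechar)
                        i += 2
                        continue
                    i += 1
                    break
                chars.append(ch)
                i += 1
            if i < n and line[i] != delimiter:
                raise ValueError("unexpected text after closing quote")
            fields.append("".join(chars))
        else:
            # Unquoted field: empty means "no value", anything else is numeric.
            j = line.find(delimiter, i)
            if j == -1:
                j = n
            raw = line[i:j]
            fields.append(None if raw == "" else float(raw))
            i = j
        if i >= n:
            break
        i += 1  # step over the delimiter
    return fields


print(parse_line('"Bob",,"Boston"'))
print(parse_line('"Alice",34,""'))
```

Bob's age comes back as None while Alice's city comes back as an empty
string, which is the NULL versus "" distinction described above.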

Here's my use case:
I have some relatively large csv files that are processed by several
3rd party programs as well as my own scripts.  Each round of
processing tends to make subtle changes to the data.  I want to be
able to use standard diff tools to spot and analyze the changes
(mostly for debugging purposes).  Unfortunately, I can't find a common
denominator quoting style among these tools.  Every time the file
moves between Python and another program, the change in quoting styles
generates a lot of noise in the diff, making it hard to read and hard to
spot problems.

Admittedly, this is a fairly narrow use case, but the changes required
to implement it seem fairly minor.  It would add functionality that is
otherwise impossible in Python short of completely re-implementing the
csv module.

I'd write a patch myself, but my knowledge of C is limited at best.

Thanks,
Donovan

From skip at pobox.com  Sat Aug 20 05:15:54 2011
From: skip at pobox.com (skip at pobox.com)
Date: Fri, 19 Aug 2011 22:15:54 -0500
Subject: [Csv] Quoting non-existent values
In-Reply-To: <CACXNDU--1WhS2DP629M6u=4GVPMunfFThLHcFa9mvMbT8OUCdQ@mail.gmail.com>
References: <CACXNDU--1WhS2DP629M6u=4GVPMunfFThLHcFa9mvMbT8OUCdQ@mail.gmail.com>
Message-ID: <20047.9962.493086.346708@montanaro.dyndns.org>


    Donovan> I want to be able to use standard diff tools to spot and
    Donovan> analyze the changes (mostly for debugging purposes).
    Donovan> Unfortunately, I can't find a common denominator quoting style
    Donovan> among these tools.  Every time the file moves between Python
    Donovan> and another program, the change in quoting styles generates a
    Donovan> lot of noise in the diff, making it hard to read and hard to spot
    Donovan> problems.

I imagine we could come up with something, but I think you might find it
easier (and it would help you sooner) to pass your csv files through a
normalization process and compare only the normalized files.  Suppose you
create a file, donovan.csv, then send it to Bob.  He uses Excel on a Mac,
and when he's done with it and sends back bob.csv, the quoting is different,
the line endings have changed, he has reordered the columns, etc., so
comparing donovan.csv and bob.csv directly is an exercise in futility.
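
As a minimal sketch of such a normalization pass (this is not the
csv2csv script mentioned below; normalize_csv is a hypothetical name),
the standard csv module can rewrite any input with sorted columns,
uniform quoting, and uniform line endings.  It assumes the first row is
a header and that no row has more fields than the header.

```python
import csv
import io


def normalize_csv(infile, outfile):
    """Rewrite a CSV stream with sorted columns, full quoting, and \\n endings."""
    reader = csv.DictReader(infile)
    # Sorting the column names makes the output independent of column order.
    fieldnames = sorted(reader.fieldnames or [])
    writer = csv.DictWriter(outfile, fieldnames=fieldnames,
                            quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


def normalized(text):
    """Convenience wrapper for strings; real files should be opened with newline=''."""
    out = io.StringIO()
    normalize_csv(io.StringIO(text), out)
    return out.getvalue()


mine = 'name,age\r\nBob,34\r\n'
theirs = '"age","name"\n"34","Bob"\n'
print(normalized(mine) == normalized(theirs))
```

Once both files have been through the same pass, a standard diff shows
only the changes in the data itself.  Note that QUOTE_ALL discards the
distinction between quoted and unquoted empty fields.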

I wrote a script some time ago called csv2csv.  I use it all the time to
look at a subset of a large file, though it will also work as a normalizer.
The docstring is:

    Transform a CSV file into another form, adjusting the fields displayed,
    field quoting, field separators, etc.

    Usage: csv2csv -f f1,f2,f3,... [ options ] [ infile [ outfile ] ]
        -f lists field names to dump (quote if names contain spaces)
        -o sep - alternate output field separator (default is a comma)
        -i sep - alternate input field separator (default is a comma)
        -n - don't quote fields
        -D - don't use DOS/Windows line endings
        -H - do not emit the header line
      if given, infile specifies the input CSV file
      if given, outfile specifies the output CSV file

One feature I would add to use it as a full file normalizer is a -s flag
that makes the -f flag optional, in which case it would display all
fields, but in sorted order.

I'll have to check at work to see if I can release it, but I doubt that will
be a problem.  If you think it would be useful to you, let me know.

-- 
Skip Montanaro - skip at pobox.com - http://www.smontanaro.net/

From skip at pobox.com  Mon Aug 22 21:16:08 2011
From: skip at pobox.com (skip at pobox.com)
Date: Mon, 22 Aug 2011 14:16:08 -0500
Subject: [Csv] Quoting non-existent values
In-Reply-To: <20047.9962.493086.346708@montanaro.dyndns.org>
References: <CACXNDU--1WhS2DP629M6u=4GVPMunfFThLHcFa9mvMbT8OUCdQ@mail.gmail.com>
	<20047.9962.493086.346708@montanaro.dyndns.org>
Message-ID: <20050.43768.556876.516477@montanaro.dyndns.org>


    skip> I'll have to check at work to see if I can release it, but I doubt
    skip> that will be a problem.  If you think it would be useful to you,
    skip> let me know.

Now available here:

    http://www.smontanaro.net/python/

It's the first item in the More or Less Current Stuff section.

Skip