From dpeastman at gmail.com Sat Aug 20 02:38:47 2011
From: dpeastman at gmail.com (Donovan Eastman)
Date: Fri, 19 Aug 2011 17:38:47 -0700
Subject: [Csv] Quoting non-existent values
Message-ID:

I have a feature request, and I hope this is the appropriate place to
propose it.

Currently there is no way (that I can figure out) for csv to create a
file that looks like this:

"name","age_if_known","city_if_known"
"Bob",,"Boston"
"Alice",34,""

The rub is the blank unquoted entry for Bob's age -- otherwise
QUOTE_NONNUMERIC does exactly what I need it to.

I know that there has been some discussion in the past about allowing
explicit type specification for each column. That would be cool, and
would presumably solve my problem, but I have a simpler suggestion:

Add a new quoting mode that is exactly the same as QUOTE_NONNUMERIC,
except that it wouldn't quote None. So to get the output described
above, I would pass:

[["name", "age_if_known", "city_if_known"],
 ["Bob", None, "Boston"],
 ["Alice", 34, ""]]

I don't have any great suggestions on a name for this new mode...
QUOTE_NONNUMERIC_NOTNONE? QUOTE_NONNUMERIC2? Hopefully somebody else
has a better idea.

For my own purposes, I really only care about the writer, but it seems
like it should be implemented in the reader as well, if only to provide
the symmetry. In addition, this would make it possible to extract
potentially important information out of certain source files (Example:
a file originating from a database table where NULL and "" have
distinctly different meanings).

Here's my use case:

I have some relatively large csv files that are processed by several
3rd party programs as well as my own scripts. Each round of processing
tends to make subtle changes to the data. I want to be able to use
standard diff tools to spot and analyze the changes (mostly for
debugging purposes). Unfortunately, I can't find a common denominator
quoting style among these tools.
Every time the file moves between Python and another program, the
change in quoting styles generates a lot of noise in the diff, making it
hard to read and hard to spot problems.

Admittedly, this is a fairly narrow use case, but the changes required
to implement it seem fairly minor. It would add functionality that is
otherwise impossible in Python short of completely re-implementing the
csv module.

I'd write a patch myself, but my knowledge of C is limited at best.

Thanks,
Donovan

From skip at pobox.com Sat Aug 20 05:15:54 2011
From: skip at pobox.com (skip at pobox.com)
Date: Fri, 19 Aug 2011 22:15:54 -0500
Subject: [Csv] Quoting non-existent values
In-Reply-To:
References:
Message-ID: <20047.9962.493086.346708@montanaro.dyndns.org>

    Donovan> I want to be able to use standard diff tools to spot and
    Donovan> analyze the changes (mostly for debugging purposes).
    Donovan> Unfortunately, I can't find a common denominator quoting style
    Donovan> among these tools. Every time the file moves between Python
    Donovan> and another program, the change in quoting styles generates a
    Donovan> lot of noise in the diff, making it hard to read and hard to
    Donovan> spot problems.

I imagine we could come up with something, but I think you might find it
easier (and it would help you sooner) to pass your csv files through a
normalization process, and only compare normalized files. So, you
create a file, donovan.csv, then send it to Bob. He uses Excel on a Mac
and when he's done with it and sends back bob.csv, the quoting is
different, the line endings have changed, he reordered the columns, etc,
so comparing donovan.csv and bob.csv is an exercise in futility.

I wrote a script some time ago called csv2csv. I use it all the time to
look at a subset of a large file, though it will also work as a
normalizer. The docstring is:

    Transform a CSV file into another form, adjusting the fields
    displayed, field quoting, field separators, etc.

    Usage: csv2csv -f f1,f2,f3,...
        [ options ] [ infile [ outfile ] ]
    -f lists field names to dump (quote if names contain spaces)
    -o sep - alternate output field separator (default is a comma)
    -i sep - alternate input field separator (default is a comma)
    -n - don't quote fields
    -D - don't use DOS/Windows line endings
    -H - do not emit the header line

    if given, infile specifies the input CSV file
    if given, outfile specifies the output CSV file

One feature I would add to use it as a full file normalizer is to make
the -f flag optional if you gave a -s flag, in which case it would
display all fields, but in sorted order.

I'll have to check at work to see if I can release it, but I doubt that
will be a problem. If you think it would be useful to you, let me know.

--
Skip Montanaro - skip at pobox.com - http://www.smontanaro.net/

From skip at pobox.com Mon Aug 22 21:16:08 2011
From: skip at pobox.com (skip at pobox.com)
Date: Mon, 22 Aug 2011 14:16:08 -0500
Subject: [Csv] Quoting non-existent values
In-Reply-To: <20047.9962.493086.346708@montanaro.dyndns.org>
References: <20047.9962.493086.346708@montanaro.dyndns.org>
Message-ID: <20050.43768.556876.516477@montanaro.dyndns.org>

    skip> I'll have to check at work to see if I can release it, but I doubt
    skip> that will be a problem. If you think it would be useful to you,
    skip> let me know.

Now available here:

    http://www.smontanaro.net/python/

It's the first item in the More or Less Current Stuff section.

Skip
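
For reference, the writer behaviour requested at the top of this thread
(quote strings, leave numbers bare, write None as an empty unquoted
field) can be approximated in pure Python without touching the C
module. What follows is only a minimal sketch under stated assumptions:
format_field and write_rows are hypothetical names, not part of the csv
module, and the sketch covers the writer side only. It is not how
csv2csv works, and it does not address reading.

```python
import io


def format_field(value, quotechar='"'):
    """Format one field (hypothetical helper, not a csv module API).

    None   -> empty, unquoted field
    number -> written bare, as QUOTE_NONNUMERIC would
    other  -> converted to str and quoted, with embedded quotes doubled
    """
    if value is None:
        return ""
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value)
    return quotechar + text.replace(quotechar, quotechar * 2) + quotechar


def write_rows(rows, out, delimiter=",", lineterminator="\r\n"):
    """Write rows to a file-like object using format_field.

    The default lineterminator mirrors csv.writer's default dialect;
    pass "\n" if the diff tools in use prefer Unix line endings.
    """
    for row in rows:
        out.write(delimiter.join(format_field(v) for v in row))
        out.write(lineterminator)


# The rows from the original request.
rows = [
    ["name", "age_if_known", "city_if_known"],
    ["Bob", None, "Boston"],
    ["Alice", 34, ""],
]

buf = io.StringIO()
write_rows(rows, buf, lineterminator="\n")
print(buf.getvalue(), end="")
# "name","age_if_known","city_if_known"
# "Bob",,"Boston"
# "Alice",34,""
```

Because the formatting decision lives in one small function, the same
helper could also serve as the output stage of a normalizer of the kind
suggested above, assuming every tool's output is first read back into
Python values.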