From skip at pobox.com Mon Jan 27 01:33:11 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 26 Jan 2003 18:33:11 -0600 Subject: DSVWizard.py In-Reply-To: <1043622397.25146.2910.camel@software1.logiplex.internal> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> Message-ID: <15924.32327.631412.57615@montanaro.dyndns.org> I'm adding Dave Cole to the distribution list on this note. Dave, Kevin Altis, Cliff Wells (author of DSV) and I have exchanged a few messages about trying to develop a CSV API for Python. >> I suspect most of the differences I see between the DSV and csv >> modules are due to interpretation differences between Cliff and Dave. Cliff> Or a bug in an older version of DSV. If you have anything that Cliff> differs using 1.4, please pass it on so I can take a look at it. I downloaded 1.4 just now. The sfsample.csv file is now processed identically by the two modules. The nastiness.csv file generates three differences though:

% python shootout.py nastiness.csv
DSV: 0.01 seconds, 13 rows
csv: 0.00 seconds, 13 rows
2
DSV: ['Test 1', 'Fred said "hey!", and left the room', '']
csv: ['Test 1', ' "Fred said ""hey!""', ' and left the room"', ' ""']
10
DSV: ['Test 9', 'no spaces around this', ' but single spaces around this ']
csv: ['Test 9', ' "no spaces around this" ', ' but single spaces around this ']
12
DSV: ['Test 11', 'has no spaces around anything', 'because the data is quoted']
csv: [' "Test 11" ', ' "has no spaces around anything" ', ' "because the data is quoted" ']

All three lines have whitespace immediately following the separating commas. DSV appears to skip over this whitespace, while csv treats it as part of the field contents. Skip PS, Just so Dave has the same "test harness", I've attached shootout.py and nastiness.csv. The shootout.py script now assumes DSV is installed with the package structure of DSV 1.4.0.
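For comparison, the csv module that later landed in Python's standard library exposes exactly this choice as the skipinitialspace flag. A minimal sketch of the two interpretations (the sample line here is invented, not taken from nastiness.csv):

```python
import csv
import io

line = "Test 9, no spaces around this, but single spaces around this \n"

# Default: whitespace after the delimiter is part of the field
# (the csv-module behaviour described above).
keep = next(csv.reader(io.StringIO(line)))
print(keep)  # ['Test 9', ' no spaces around this', ' but single spaces around this ']

# skipinitialspace=True drops it (the DSV-like behaviour).
skip = next(csv.reader(io.StringIO(line), skipinitialspace=True))
print(skip)  # ['Test 9', 'no spaces around this', 'but single spaces around this ']
```

Note that only whitespace immediately following the delimiter is affected; trailing whitespace is kept either way.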
-------------- next part -------------- A non-text attachment was scrubbed... Name: shootout.py Type: application/octet-stream Size: 730 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030126/a4de7492/attachment.obj -------------- next part -------------- A non-text attachment was scrubbed... Name: nastiness.csv Type: application/octet-stream Size: 600 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030126/a4de7492/attachment-0001.obj From skip at pobox.com Mon Jan 27 01:37:24 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 26 Jan 2003 18:37:24 -0600 Subject: DSVWizard.py In-Reply-To: <1043622397.25146.2910.camel@software1.logiplex.internal> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> Message-ID: <15924.32580.130562.578623@montanaro.dyndns.org> Cliff> I think even Excel has the option to import files using "/'/none Cliff> for text qualifiers. This was the only shortcoming I saw in csv Cliff> (only " is used for quoting). Actually, csv's parser objects have a writable quote_char attribute:

>>> import csv
>>> p = csv.parser()
>>> p.quote_char
'"'
>>> p.quote_char = "'"
>>> p.quote_char
"'"

Skip From LogiplexSoftware at earthlink.net Mon Jan 27 02:47:46 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 26 Jan 2003 17:47:46 -0800 Subject: DSVWizard.py In-Reply-To: <15924.32327.631412.57615@montanaro.dyndns.org> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> Message-ID: <1043632066.25146.2950.camel@software1.logiplex.internal> On Sun, 2003-01-26 at 16:33, Skip Montanaro wrote: > I'm adding Dave Cole to the distribution list on this note.
Dave, Kevin > Altis, Cliff Wells (author of DSV) and I have exchanged a few messages about > trying to develop a CSV API for Python. > > >> I suspect most of the differences I see between the DSV and csv > >> modules are due to interpretation differences between Cliff and Dave. > > Cliff> Or a bug in an older version of DSV. If you have anything that > Cliff> differs using 1.4, please pass it on so I can take a look at it. > > I downloaded 1.4 just now. The sfsample.csv file is now processed > identically by the two modules. The nastiness.csv file generates three > differences though: > > % python shootout.py nastiness.csv > DSV: 0.01 seconds, 13 rows > csv: 0.00 seconds, 13 rows > 2 > DSV: ['Test 1', 'Fred said "hey!", and left the room', ''] > csv: ['Test 1', ' "Fred said ""hey!""', ' and left the room"', ' ""'] IMO, Dave's is incorrect in this one (unless he has specific reasons otherwise). The original line (from the csv file) is: Test 1, "Fred said ""hey!"", and left the room", "" The "" at the end is an empty, quoted field. Maybe someone should run this through Excel to see what it claims (I'd be willing to accept Dave's interpretation if Excel does it this way, although I'd still feel it was incorrect). I handled this case specifically at a user's request. > 10 > DSV: ['Test 9', 'no spaces around this', ' but single spaces around this '] > csv: ['Test 9', ' "no spaces around this" ', ' but single spaces around this '] > 12 > DSV: ['Test 11', 'has no spaces around anything', 'because the data is quoted'] > csv: [' "Test 11" ', ' "has no spaces around anything" ', ' "because the data is quoted" '] > > All the three lines have white space immediately following separating > commas. DSV appears to skip over this white space, while csv treats it as > part of the field contents. Again, this was at a user's request, and is special-case code in DSV that can easily be removed. 
The user noted, and I concurred, that given a quoted field with whitespace around it, the whitespace should be ignored. However, once again I'd be willing to follow Excel's lead in this because I'd also consider this to be malformed or at least ambiguous data. > > Skip > > PS, Just so Dave has the same "test harness", I've attached shootout.py and > nastiness.csv. The shootout.py script now assumes DSV is installed with the > package structure of DSV 1.4.0. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From djc at object-craft.com.au Mon Jan 27 06:08:21 2003 From: djc at object-craft.com.au (Dave Cole) Date: 27 Jan 2003 16:08:21 +1100 Subject: DSVWizard.py In-Reply-To: <1043632066.25146.2950.camel@software1.logiplex.internal> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043632066.25146.2950.camel@software1.logiplex.internal> Message-ID: > On Sun, 2003-01-26 at 16:33, Skip Montanaro wrote: > > I'm adding Dave Cole to the distribution list on this note. Dave, > > Kevin Altis, Cliff Wells (author of DSV) and I have exchanged a > > few messages about trying to develop a CSV API for Python. > > > > >> I suspect most of the differences I see between the DSV and csv > > >> modules are due to interpretation differences between Cliff and Dave. > > > > Cliff> Or a bug in an older version of DSV. If you have anything that > > Cliff> differs using 1.4, please pass it on so I can take a look at it. > > > > I downloaded 1.4 just now. The sfsample.csv file is now processed > > identically by the two modules. 
The nastiness.csv file generates > > three differences though: > > > > % python shootout.py nastiness.csv > > DSV: 0.01 seconds, 13 rows > > csv: 0.00 seconds, 13 rows > > 2 > > DSV: ['Test 1', 'Fred said "hey!", and left the room', ''] > > csv: ['Test 1', ' "Fred said ""hey!""', ' and left the room"', ' ""'] > > IMO, Dave's is incorrect in this one (unless he has specific reasons > otherwise). Andrew (who has been included on the Cc) has tested the behaviour of Excel (such as it is) and we do the same thing as Excel. As to whether Excel is doing the right thing, that is a different question entirely. One of the people we have done work for has some very nasty "CSV" data to parse. We have been trying to work out what to do to the CSV module to handle some of the silly things he sees without breaking the Excel compatibility. > The original line (from the csv file) is: > > Test 1, "Fred said ""hey!"", and left the room", "" > > The "" at the end is an empty, quoted field. Maybe someone should > run this through Excel to see what it claims (I'd be willing to > accept Dave's interpretation if Excel does it this way, although I'd > still feel it was incorrect). I handled this case specifically at a > user's request. Andrew, can you run that exact line through Excel? > > 10 > > DSV: ['Test 9', 'no spaces around this', ' but single spaces around this '] > > csv: ['Test 9', ' "no spaces around this" ', ' but single spaces around this '] > > 12 > > DSV: ['Test 11', 'has no spaces around anything', 'because the data is quoted'] > > csv: [' "Test 11" ', ' "has no spaces around anything" ', ' "because the data is quoted" '] > > > > All the three lines have white space immediately following > > separating commas. DSV appears to skip over this white space, > > while csv treats it as part of the field contents. I am fairly sure that is what Excel does. > Again, this was at a user's request, and is special-case code in DSV > that can easily be removed.
The user noted, and I concurred, that > given a quoted field with whitespace around it, the whitespace > should be ignored. However, once again I'd be willing to follow > Excel's lead in this because I'd also consider this to be malformed > or at least ambiguous data. Pity there is no real specification for CSV. > > PS, Just so Dave has the same "test harness", I've attached > > shootout.py and nastiness.csv. The shootout.py script now assumes > > DSV is installed with the package structure of DSV 1.4.0. -- http://www.object-craft.com.au From djc at object-craft.com.au Mon Jan 27 06:13:34 2003 From: djc at object-craft.com.au (Dave Cole) Date: 27 Jan 2003 16:13:34 +1100 Subject: DSVWizard.py In-Reply-To: <15924.32580.130562.578623@montanaro.dyndns.org> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32580.130562.578623@montanaro.dyndns.org> Message-ID: > Cliff> I think even Excel has the option to import files using "/'/none > Cliff> for text qualifiers. This was the only shortcoming I saw in csv > Cliff> (only " is used for quoting). > > Actually, csv's parser objects have a writable quote_char attribute: > > >>> import csv > >>> p = csv.parser() > >>> p.quote_char > '"' > >>> p.quote_char = "'" > >>> p.quote_char > "'" For all sorts of fun and games you can even turn off quoting. 
>>> import csv
>>> p = csv.parser()
>>> p.join(['1','2,3','4'])
'1,"2,3",4'
>>> p.escape_char = '\\'
>>> p.join(['1','2,3','4'])
'1,"2,3",4'
>>> p.quote_char = None
>>> p.join(['1','2,3','4'])
'1,2\\,3,4'

- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Mon Jan 27 06:18:53 2003 From: djc at object-craft.com.au (Dave Cole) Date: 27 Jan 2003 16:18:53 +1100 Subject: DSVWizard.py In-Reply-To: <15924.32327.631412.57615@montanaro.dyndns.org> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> Message-ID: > I'm adding Dave Cole to the distribution list on this note. Dave, > Kevin Altis, Cliff Wells (author of DSV) and I have exchanged a few > messages about trying to develop a CSV API for Python. Python having a CSV API would be an excellent thing. The most difficult problem to solve is how to expose all of the CSV variations so that users can work out how to drive the module. I suppose the first step would be to catalogue all of the common CSV variations and give them names. Naming variations after the applications which produce them could be the best way. - Dave -- http://www.object-craft.com.au From LogiplexSoftware at earthlink.net Mon Jan 27 18:02:04 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 27 Jan 2003 09:02:04 -0800 Subject: DSVWizard.py In-Reply-To: References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> Message-ID: <1043686924.25146.2997.camel@software1.logiplex.internal> On Sun, 2003-01-26 at 21:18, Dave Cole wrote: > > I'm adding Dave Cole to the distribution list on this note.
Dave, > > Kevin Altis, Cliff Wells (author of DSV) and I have exchanged a few > > messages about trying to develop a CSV API for Python. > > Python having a CSV API would be an excellent thing. The most > difficult problem to solve is how to expose all of the CSV variations > so that users can work out how to drive the module. > > I suppose the first step would be to catalogue all of common the CSV > variations and give them names. Naming variations after the > applications which produce them could be the best way. That doesn't sound like a bad idea, but the task of cataloging all those applications seems a bit daunting, especially since I suspect between all of us we can probably only account for a handful of them. I suppose we could have a place for users to submit csv samples from applications they want supported. The fact of the matter is, despite there being no real standard, there seems to be only minor differences between each format: delimiter, quote style, allowed spaces around quotes. A programmer who knows the specific style of the data he's importing could specify via attributes or flags how to process the file. For the general case, DSV already has heuristics for determining the first two, and adding code to test for the third case shouldn't be too difficult. Another problem with specifying styles by application name is that many apps allow the user to specify portions of the style (usually the delimiter), so that's not set in stone either. I think what I'm leaning towards at this time, if everyone is in agreement, is for Dave or myself to reimplement Dave's code (and API) in Python so that there is a pure Python implementation, and then provide Dave's C module as a faster alternative (much like Pickle and cPickle). The heuristics of DSV would be an optional feature, along with the GUI. Someone is already doing work on porting the wxPython GUI code to Qt, but it would be useful for a Tk port to appear as well (I'm *not* volunteering for that). 
I also have serious doubts about the GUI getting added to the core (even a Tk version), so that would have to be spun off and maintained separately on SF. I also expect that if a csv module were added to the Python library, I could get Robin Dunn to add the GUI for it to the wxPython libraries. As far as DSV's current API, I'm not too attached to it, and I think that it could be mimicked sufficiently by adding a parser.parseall() method to Dave's API so the programmer would have the option of getting the entire file as a list without having to write a loop. Something I'd also like to see, and I think Kevin mentioned this, is a generator interface for retrieving the data line by line. I think that this would provide the most complete set of features and best performance options. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Mon Jan 27 18:36:26 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 11:36:26 -0600 Subject: DSVWizard.py In-Reply-To: <1043686924.25146.2997.camel@software1.logiplex.internal> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> Message-ID: <15925.28186.949208.952742@montanaro.dyndns.org> (Dave, should we continue to use the csv at object-craft address for you or your djc email?) >> I suppose the first step would be to catalogue all of common the CSV >> variations and give them names. Naming variations after the >> applications which produce them could be the best way. Cliff> That doesn't sound like a bad idea, but the task of cataloging Cliff> all those applications seems a bit daunting, especially since I Cliff> suspect between all of us we can probably only account for a Cliff> handful of them. 
I think we should aim for Excel2000 compatibility as a bare minimum, and at least document any supported extensions and try to tie them to specific other applications. It is indeed unfortunate that the CSV file format is only operationally defined. Wild-ass idea: Maybe the API should include a query function or a data attribute which lists (as strings) the variants of CSV supported by a module (which should be supported by test cases)? The default variant would be listed first, and the constructor would take any of the listed variants as an optional argument. Something like: variants = csv.get_variants() csvl = csv.parser(variant="lotus123") csve = csv.parser(variant="excel2000") We could create an informal "registry" of valid variant names. If support for an existing variant is added, you use that name. If support for an unknown variant is added, you register a string. Cliff> ... despite there being no real standard, there seems to be only Cliff> minor differences between each format: delimiter, quote style, Cliff> allowed spaces around quotes. That's true. Perhaps selecting by variant name would do nothing more than set those specific values behind the scenes, much the same way that when you choose a particular C coding style in Emacs a number of low-level variable values are set. Cliff> Another problem with specifying styles by application name is Cliff> that many apps allow the user to specify portions of the style Cliff> (usually the delimiter), so that's not set in stone either. Yes, but there's still usually a default. Some of the stuff (like space after delimiters, newlines inside fields or CRLF/LF/CR line endings) isn't user-settable and isn't obvious without inspecting the CSV file. You might have csve2 = csv.parser(variant="excel2000", delimiter=';') to specify user-settable parameters or use "sniffing" code like DSV does to figure out what the best choice is. 
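A sketch of how such a variant registry plus per-call overrides might look (the variant names here are from the discussion, but the settings attached to them are illustrative placeholders, not a proposed final list):

```python
# Hypothetical variant registry: each name maps to a set of parser settings.
# Default variant listed first.
VARIANTS = {
    "excel2000": {"delimiter": ",", "quote_char": '"', "skip_space": False},
    "lotus123":  {"delimiter": ",", "quote_char": '"', "skip_space": True},
}

def get_variants():
    """Return the supported variant names, default first."""
    return list(VARIANTS)

def parser(variant="excel2000", **overrides):
    """Combine a named variant with user-settable overrides."""
    settings = dict(VARIANTS[variant])
    settings.update(overrides)
    # A real module would configure and return a parser object here;
    # returning the settings dict keeps the sketch self-contained.
    return settings

csve2 = parser(variant="excel2000", delimiter=";")
print(csve2["delimiter"])  # ;
```

Selecting a variant then does nothing more than set the low-level values behind the scenes, and any explicitly passed parameter wins over the variant's default.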
Cliff> I think what I'm leaning towards at this time, if everyone is in Cliff> agreement, is for Dave or myself to reimplement Dave's code (and Cliff> API) in Python so that there is a pure Python implementation, and Cliff> then provide Dave's C module as a faster alternative (much like Cliff> Pickle and cPickle). The heuristics of DSV would be an optional Cliff> feature, along with the GUI. This sounds like a reasonable idea. I also agree the GUI stuff will probably not make it into the core. Cliff> As far as DSV's current API, I'm not too attached to it, and I Cliff> think that it could be mimicked sufficiently by adding a Cliff> parser.parseall() method to Dave's API so the programmer would Cliff> have the option of getting the entire file as a list without Cliff> having to write a loop. Skip From skip at pobox.com Mon Jan 27 19:13:02 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 12:13:02 -0600 Subject: ASV Message-ID: <15925.30382.921990.566934@montanaro.dyndns.org> I downloaded and installed Laurie Tratt's ASV module today and extended my shootout script to try it. It's considerably slower than DSV (by about 15x on my sfsample.csv file, which makes it something like 75-150x slower than csv) and doesn't appear to handle newlines within fields, generating 17 rows instead of 13 on nastiness.csv. It also seems to ignore all whitespace at the beginning of fields, regardless of field quoting, so for the first line of nastiness.csv it returns

['Column1', 'Column2', 'Column3']

instead of

['Column1', 'Column2', ' Column3']

It does generate the same results as DSV and csv for my sfsample.csv file, though that file is very well-behaved (fully quoted, no whitespace surrounding delimiters). I'm not aware that it has any interesting properties not available in either DSV or csv, so I'm inclined to not consider it further.
Skip From skip at pobox.com Mon Jan 27 19:17:06 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 12:17:06 -0600 Subject: delimiters... Message-ID: <15925.30626.901414.610449@montanaro.dyndns.org> I modified shootout.py to allow specification of alternate delimiters on the command line and manually converted nastiness.csv to nastytabs.csv. Processing nastytabs.csv with TAB as the delimiter generates identical results as processing nastiness.csv with comma as the delimiter. (This is a good thing. ;-) Nastytabs.csv and modified shootout.py attached. Skip -------------- next part -------------- A non-text attachment was scrubbed... Name: nastytabs.csv Type: application/octet-stream Size: 600 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030127/1d2e6d32/attachment.obj -------------- next part -------------- A non-text attachment was scrubbed... Name: shootout.py Type: application/octet-stream Size: 1083 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030127/1d2e6d32/attachment-0001.obj From LogiplexSoftware at earthlink.net Mon Jan 27 20:42:22 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 27 Jan 2003 11:42:22 -0800 Subject: DSVWizard.py In-Reply-To: <15925.28186.949208.952742@montanaro.dyndns.org> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> <15925.28186.949208.952742@montanaro.dyndns.org> Message-ID: <1043696542.25139.3027.camel@software1.logiplex.internal> On Mon, 2003-01-27 at 09:36, Skip Montanaro wrote: > (Dave, should we continue to use the csv at object-craft address for you or > your djc email?) > > >> I suppose the first step would be to catalogue all of common the CSV > >> variations and give them names. 
Naming variations after the > >> applications which produce them could be the best way. > > Cliff> That doesn't sound like a bad idea, but the task of cataloging > Cliff> all those applications seems a bit daunting, especially since I > Cliff> suspect between all of us we can probably only account for a > Cliff> handful of them. > > I think we should aim for Excel2000 compatibility as a bare minimum, and at > least document any supported extensions and try to tie them to specific > other applications. It is indeed unfortunate that the CSV file format is > only operationally defined. > > Wild-ass idea: Maybe the API should include a query function or a data > attribute which lists (as strings) the variants of CSV supported by a module > (which should be supported by test cases)? The default variant would be > listed first, and the constructor would take any of the listed variants as > an optional argument. Something like: > > variants = csv.get_variants() > > csvl = csv.parser(variant="lotus123") > csve = csv.parser(variant="excel2000") > > We could create an informal "registry" of valid variant names. If support > for an existing variant is added, you use that name. If support for an > unknown variant is added, you register a string. Sounds reasonable, but I think the variant should be customizable in the method call: csvl = csv.parser(variant = "lotus123", delimiter = '\t') So assuming "lotus123" was defined to use commas by default, it would follow all the rules of the lotus variant except for the delimiter. This would allow for some flexibility in case the user saved the csv file from Lotus but changed an option or two. > Cliff> ... despite there being no real standard, there seems to be only > Cliff> minor differences between each format: delimiter, quote style, > Cliff> allowed spaces around quotes. > > That's true. 
Perhaps selecting by variant name would do nothing more than > set those specific values behind the scenes, much the same way that when you > choose a particular C coding style in Emacs a number of low-level variable > values are set. That's what I was thinking. In this case the "variant" could just be a dictionary or simple class with a few attributes. > Cliff> Another problem with specifying styles by application name is > Cliff> that many apps allow the user to specify portions of the style > Cliff> (usually the delimiter), so that's not set in stone either. > > Yes, but there's still usually a default. Some of the stuff (like space > after delimiters, newlines inside fields or CRLF/LF/CR line endings) isn't > user-settable and isn't obvious without inspecting the CSV file. You might > have > > csve2 = csv.parser(variant="excel2000", delimiter=';') Oh. Guess I should have read the entire message before replying ;) At least it looks like we are on the same page =) > to specify user-settable parameters or use "sniffing" code like DSV does to > figure out what the best choice is. The "sniffing" code in DSV is best used in conjunction with some sort of confirmation from the user. I've seen it guess incorrectly on some files (although not very often). Mostly stuff that has repeating patterns of other characters (colons and slashes in dates and times). However, given these types of files, it defaults to the more common delimiter (i.e. given a file that has both repeating colons and commas, the comma will be chosen) which weeds out the majority of false positives. Nevertheless, it would seem foolhardy for a programmer to rely on it without some sort of user intervention. It could be perhaps made a little smarter, but it's a difficult problem and I'd be reluctant to use it alone. This is why the GUI code is rather part-and-parcel with the heuristics. 
Nevertheless, having a separate project for maintaining the GUI solves this and the programmer can always roll his own if need be. > Cliff> I think what I'm leaning towards at this time, if everyone is in > Cliff> agreement, is for Dave or myself to reimplement Dave's code (and > Cliff> API) in Python so that there is a pure Python implementation, and > Cliff> then provide Dave's C module as a faster alternative (much like > Cliff> Pickle and cPickle). The heuristics of DSV would be an optional > Cliff> feature, along with the GUI. > > This sounds like a reasonable idea. I also agree the GUI stuff will > probably not make it into the core. Anyone else? BTW, where are we planning on hosting this project? Under one of the existing projects or somewhere else? -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Mon Jan 27 20:48:13 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 27 Jan 2003 11:48:13 -0800 Subject: ASV In-Reply-To: <15925.30382.921990.566934@montanaro.dyndns.org> References: <15925.30382.921990.566934@montanaro.dyndns.org> Message-ID: <1043696893.25146.3034.camel@software1.logiplex.internal> On Mon, 2003-01-27 at 10:13, Skip Montanaro wrote: > I downloaded and installed Laurie Tratt's ASV module today and extended my > shootout script to try it. It's considerably slower than DSV (by about 15x > on my sfsample.csv file, which makes it something like 75-150x slower than > csv) and doesn't appear to handle newlines within fields, generating 17 rows > instead of 13 on nastiness.csv. 
It also seems to ignore all whitespace at > the beginning of fields, irregardless of field quoting, so for the first > line of nastiness.csv it returns > > ['Column1', 'Column2', 'Column3'] > > instead of > > ['Column1', 'Column2', ' Column3'] > > It does generate the same results as DSV and csv for my sfsample.csv script, > though that file is very well-behaved (fully quoted, no whitespace > surrounding delimiters). > > I'm not aware that it has any interesting properties not available in either > DSV or csv, so I'm inclined to not consider it further. Agreed. I assume the API didn't provide any interesting approaches either? -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Mon Jan 27 21:05:49 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 27 Jan 2003 12:05:49 -0800 Subject: DSVWizard.py In-Reply-To: References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043632066.25146.2950.camel@software1.logiplex.internal> Message-ID: <1043697949.25139.3051.camel@software1.logiplex.internal> On Sun, 2003-01-26 at 21:08, Dave Cole wrote: > > > DSV: ['Test 1', 'Fred said "hey!", and left the room', ''] > > > csv: ['Test 1', ' "Fred said ""hey!""', ' and left the room"', ' ""'] > > > > IMO, Dave's is incorrect in this one (unless he has specific reasons > > otherwise). > > Andrew (who has been included on th Cc) has tested the behaviour of > Excel (such as it is) and we do the same thing as Excel. As to > whether Excel is doing the right thing, that is a different question > entirely. Okay. So the default behavior would be to *not* treat the quotes as text qualifiers in the following: data, "data", data unless the user specifies otherwise. 
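That default can be sketched with the csv module that later shipped in Python's standard library, which keeps the same Excel-style rule: a quote that does not open a field is ordinary data, unless whitespace after the delimiter is skipped first (the sample line is the data, "data", data case under discussion):

```python
import csv
import io

line = 'data, "data", data'

# Default: the quote appears after a space, so it does not open the
# field and is kept as literal field data.
literal = next(csv.reader(io.StringIO(line)))
print(literal)    # ['data', ' "data"', ' data']

# With the leading space skipped, the quote becomes a text qualifier.
qualified = next(csv.reader(io.StringIO(line), skipinitialspace=True))
print(qualified)  # ['data', 'data', 'data']
```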
> One of the people we have done work for has some very nasty "CSV" data > to parse. We have been trying to work out what to do to the CSV > module to handle some of the silly things he sees without breaking the > Excel compatibility. Having "variants" as Skip mentioned (and I think you did as well) would solve this. I'm also a bit curious as to the "Treat consecutive delimiters as one" option in Excel. I had planned to add support for that in DSV but never got around to it. Does csv have such an option? Is this really ever useful? I've never had anyone request that I enable that option in DSV, despite the fact that there's even a checkbox (disabled) for it in the GUI. > > > The original line (from the csv file) is: > > > > Test 1, "Fred said ""hey!"", and left the room", "" > > > > The "" at the end is an empty, quoted field. Maybe someone should > > run this through Excel to see what it claims (I'd be willing to > > accept Dave's interpretation if Excel does it this way, although I'd > > still feel it was incorrect). I handled this case specifically at a > > user's request. > > Andrew, can you run that exact line through Excel? > > > > 10 > > > DSV: ['Test 9', 'no spaces around this', ' but single spaces around this '] > > > csv: ['Test 9', ' "no spaces around this" ', ' but single spaces around this '] > > > 12 > > > DSV: ['Test 11', 'has no spaces around anything', 'because the data is quoted'] > > > csv: [' "Test 11" ', ' "has no spaces around anything" ', ' "because the data is quoted" '] > > > > > > All the three lines have white space immediately following > > > separating commas. DSV appears to skip over this white space, > > > while csv treats it as part of the field contents. > > I am fairly sure that is what Excel does. You're probably correct, but I'd like to be 100% certain on this. > Pity there is no real specification for CSV. Actually, it's only the V part of CSV that's poorly defined. Maybe CSV should stand for "comma separated vagueness".
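As for what "treat consecutive delimiters as one" would mean in practice, a toy sketch of the semantics (this is not how Excel or either module implements it, and it ignores quoting entirely, which a real parser could not):

```python
import re

def split_collapsed(line, delim=","):
    """Split a line, treating any run of consecutive delimiters as one.

    Toy sketch only: quoted fields containing the delimiter would be
    mangled, so a real implementation must honour quoting first.
    """
    return re.split(re.escape(delim) + "+", line.rstrip("\n"))

print(split_collapsed("a,,b,,,c"))      # ['a', 'b', 'c']
print(split_collapsed("a\t\tb", "\t"))  # ['a', 'b']
```

The option mainly makes sense for whitespace-aligned exports, where runs of tabs or spaces pad columns; for genuine CSV it silently drops empty fields.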
Speaking of names, since Kevin is correct in that people will look for CSV since that's the common term, we could just define C to stand for "character" rather than "comma", since this will be a general-purpose importer. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Mon Jan 27 22:02:23 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 15:02:23 -0600 Subject: ASV In-Reply-To: <1043696893.25146.3034.camel@software1.logiplex.internal> References: <15925.30382.921990.566934@montanaro.dyndns.org> <1043696893.25146.3034.camel@software1.logiplex.internal> Message-ID: <15925.40543.186264.281135@montanaro.dyndns.org> >> I'm not aware that it has any interesting properties not available in >> either DSV or csv, so I'm inclined to not consider it further. Cliff> Agreed. I assume the API didn't provide any interesting Cliff> approaches either? Not really. In fact, I found it a bit confusing. I couldn't figure out how to specify an alternate delimiter either. For some reason it appears Emacs didn't save any intermediate backups of my shootout script, so I can't cut-n-paste what I did use and am not going to fumble around to reproduce it at this point. Skip From djc at object-craft.com.au Tue Jan 28 00:22:28 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 10:22:28 +1100 Subject: DSVWizard.py In-Reply-To: <15925.28186.949208.952742@montanaro.dyndns.org> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> <15925.28186.949208.952742@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> (Dave, should we continue to use the csv at object-craft address Skip> for you or your djc email?) 
Use the csv at object-craft.com.au address as it will ensure that Andrew gets messages as well. Andrew has spent considerable effort making the CSV module conform to Excel behaviour. Skip> I think we should aim for Excel2000 compatibility as a bare Skip> minimum, and at least document any supported extensions and try Skip> to tie them to specific other applications. It is indeed Skip> unfortunate that the CSV file format is only operationally Skip> defined. Skip> Wild-ass idea: Maybe the API should include a query function or Skip> a data attribute which lists (as strings) the variants of CSV Skip> supported by a module (which should be supported by test cases)? Skip> The default variant would be listed first, and the constructor Skip> would take any of the listed variants as an optional argument. Skip> Something like: Skip> variants = csv.get_variants() Skip> csvl = csv.parser(variant="lotus123") Skip> csve = csv.parser(variant="excel2000") What I think we should do is implement two layers; a Python layer and an extension module. The extension module should contain only the functions which are necessary to implement a fast parser. The Python layer would be the registry of variants and would configure and tweak the parser. This would allow all tweaking intelligence to be hidden from the user while keeping implementation details out of the parser. Skip> We could create an informal "registry" of valid variant names. Skip> If support for an existing variant is added, you use that name. Skip> If support for an unknown variant is added, you register a Skip> string. I suppose a torture test is the first step in defining the variants. Instead of trying to formally specify the variants up front we could define them by the way they process the torture test. Skip> That's true. 
Perhaps selecting by variant name would do nothing Skip> more than set those specific values behind the scenes, much the Skip> same way that when you choose a particular C coding style in Skip> Emacs a number of low-level variable values are set. My thoughts exactly. Cliff> Another problem with specifying styles by application name is Cliff> that many apps allow the user to specify portions of the style Cliff> (usually the delimiter), so that's not set in stone either. In the first instance we have to assume that people are going to choose styles which are not ambiguous. This is a big assumption - I have seen applications (database bulkcopy tools) which happily allow you to export data which cannot be unambiguously parsed back into the original fields/columns. Cliff> I think what I'm leaning towards at this time, if everyone is Cliff> in agreement, is for Dave or myself to reimplement Dave's code Cliff> (and API) in Python so that there is a pure Python Cliff> implementation, and then provide Dave's C module as a faster Cliff> alternative (much like Pickle and cPickle). The heuristics of Cliff> DSV would be an optional feature, along with the GUI. Shouldn't we first come up with a project plan. If the eventual goal is to get this into Python we are going to have to write a PEP. Rather than trying to do everything ourselves we should try to think of a method whereby we will get people to run a torture test against the applications they need to interact with. The steps would include (not sure about the order): * Develop CSV torture test. * Develop format by which people can submit results of torture test which will allow us to eventually regression test the parser against those results. * Define Python API for CSV parser. * Define extension module API. * Write PEP. * Develop CSV module. Skip> This sounds like a reasonable idea. I also agree the GUI stuff Skip> will probably not make it into the core. I agree. 
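Skip's idea that a variant name would "do nothing more than set those specific values behind the scenes" could be sketched as a preset table in the Python layer. All names and preset values below are hypothetical, chosen only to show the shape of the registry:

```python
# Hypothetical variant registry: a name is just a bundle of low-level
# parser settings, which the caller may override individually.
VARIANTS = {
    'excel2000': {'delimiter': ',', 'quotechar': '"', 'skipinitialspace': False},
    'lotus123':  {'delimiter': ',', 'quotechar': '"', 'skipinitialspace': True},
}

def parser_settings(variant='excel2000', **overrides):
    settings = dict(VARIANTS[variant])  # copy the preset
    settings.update(overrides)          # per-call tweaks win
    return settings

parser_settings('excel2000', delimiter='/')
# -> {'delimiter': '/', 'quotechar': '"', 'skipinitialspace': False}
```

Because the registry lives in pure Python, new variants (or new tweakables with sensible defaults) can be added without touching the extension module.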
Cliff> As far as DSV's current API, I'm not too attached to it, and I Cliff> think that it could be mimicked sufficiently by adding a Cliff> parser.parseall() method to Dave's API so the programmer would Cliff> have the option of getting the entire file as a list without Cliff> having to write a loop. I think that we should be prepared to go back to the drawing board on the API if necessary. Once we have enough variants registered we will be in a better position to come up with the "right" API. - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Tue Jan 28 00:25:00 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Tue, 28 Jan 2003 10:25:00 +1100 Subject: DSVWizard.py In-Reply-To: Message from Dave Cole of "27 Jan 2003 16:08:21 +1100." References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043632066.25146.2950.camel@software1.logiplex.internal> Message-ID: >> > DSV: ['Test 1', 'Fred said "hey!", and left the room', ''] >> > csv: ['Test 1', ' "Fred said ""hey!""', ' and left the room"', ' ""'] >> >> IMO, Dave's is incorrect in this one (unless he has specific reasons >> otherwise). > >Andrew (who has been included on the Cc) has tested the behaviour of >Excel (such as it is) and we do the same thing as Excel. As to >whether Excel is doing the right thing, that is a different question >entirely. [...] >> The original line (from the csv file) is: >> >> Test 1, "Fred said ""hey!"", and left the room", "" Excel (at least, Excel 97) only gives the quote character a special meaning when it appears directly after the field separator.
In this example, you have a space between the comma and the quote - removing the space, CSV gives you: ['Test 1', 'Fred said "hey!", and left the room', ''] Older versions of CSV, in fact, behaved as DSV does (since that makes more sense), but in the name of Excel compatibility... >> The "" at the end is an empty, quoted field. Maybe someone should >> run this through Excel to see what it claims (I'd be willing to >> accept Dave's interpretation if Excel does it this way, although I'd >> still feel it was incorrect). I handled this case specifically at a >> user's request. > >Andrew, can you run that exact line through Excel? Excel and CSV are behaving the same way on this line. As I mention above, the space after the field separator is the problem. I probably should add a "gobble leading space option" (sigh). >> > All the three lines have white space immediately following >> > separating commas. DSV appears to skip over this white space, >> > while csv treats it as part of the field contents. > >I am fairly sure that is what Excel does. Indeed. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Tue Jan 28 00:25:22 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 10:25:22 +1100 Subject: DSVWizard.py In-Reply-To: <1043696542.25139.3027.camel@software1.logiplex.internal> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> <15925.28186.949208.952742@montanaro.dyndns.org> <1043696542.25139.3027.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: >> This sounds like a reasonable idea. I also agree the GUI stuff >> will probably not make it into the core. Cliff> Anyone else? BTW, where are we planning on hosting this Cliff> project? 
Under one of the existing projects or somewhere else? If we are trying to get this into Python shouldn't we use something like sourceforge. Has anyone been through the process of getting code into Python before? - Dave -- http://www.object-craft.com.au From altis at semi-retired.com Tue Jan 28 00:35:45 2003 From: altis at semi-retired.com (Kevin Altis) Date: Mon, 27 Jan 2003 15:35:45 -0800 Subject: DSVWizard.py In-Reply-To: Message-ID: > From: Dave Cole > > >>>>> "Cliff" == Cliff Wells writes: > > >> This sounds like a reasonable idea. I also agree the GUI stuff > >> will probably not make it into the core. > > Cliff> Anyone else? BTW, where are we planning on hosting this > Cliff> project? Under one of the existing projects or somewhere else? > > If we are trying to get this into Python shouldn't we use something > like sourceforge. Has anyone been through the process of getting code > into Python before? Either just use the Python DSV project Cliff already has setup http://sourceforge.net/projects/python-dsv or create a new one python-csv Either way, everyone should have write privs. and a new cvs dir needs to be created to hold the working code. Originally, I thought the task of making a standard module was going to be relatively trivial, but I'm guessing now that there will be enough effort required in deciding on the API, test cases, a PEP, etc. that it won't be appropriate to try and make it part of Python 2.3, but will have to wait for Python 2.4 instead. So, in the meantime, the project will just follow the lead of other projects prior to being incorporated in the Python core. Skip has the most experience in this area, do you agree with the assessment above Skip? 
Public discussions can take place on the db-sig and/or c.l.py ka From djc at object-craft.com.au Tue Jan 28 00:35:55 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 10:35:55 +1100 Subject: DSVWizard.py In-Reply-To: <1043697949.25139.3051.camel@software1.logiplex.internal> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043632066.25146.2950.camel@software1.logiplex.internal> <1043697949.25139.3051.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> Okay. So the default behavior would be to *not* treat the Cliff> quotes as text qualifiers in the following: Cliff> data, "data", data Cliff> unless the user specifies otherwise. I believe that is how Excel works. >> One of the people we have done work for has some very nasty "CSV" >> data to parse. We have been trying to work out what to do to the >> CSV module to handle some of the silly things he sees without >> breaking the Excel compatibility. Cliff> Having "variants" as Skip mentioned (and I think you did as Cliff> well) would solve this. Cliff> I'm also a bit curious as to the "Treat consecutive delimiters Cliff> as one" option in Excel. I had planned to add support for that Cliff> in DSV but never got around to it. Does csv have such an Cliff> option? Is this really ever useful? I've never had anyone Cliff> request that I enable that option in DSV, despite the fact that Cliff> there's even a checkbox (disabled) for it in the GUI. I suppose there is no reason why we could not allow people to invoke variants like this; p = csv.parser(app='Excel', consecutive_delimiters=1) The API could be as simple as def parser(**kwargs): app = kwargs.get('app', 'Excel') Cliff> Actually, it's only the V part of CSV that's poorly defined Cliff> . Maybe CSV should stand for "comma separated Cliff> vagueness". 
LOL. Cliff> Speaking of names, since Kevin is correct in that people will Cliff> look for CSV since that's the common term, we could just define Cliff> C to stand for "character" rather than "comma", since this will Cliff> be a general-purpose importer. Or use both. As long as you use include "comma separated values" and "character separated values" google will find it. - Dave -- http://www.object-craft.com.au From skip at pobox.com Tue Jan 28 00:59:42 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 17:59:42 -0600 Subject: DSVWizard.py In-Reply-To: References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> <15925.28186.949208.952742@montanaro.dyndns.org> Message-ID: <15925.51182.467769.765511@montanaro.dyndns.org> Dave> Shouldn't we first come up with a project plan. If the eventual Dave> goal is to get this into Python we are going to have to write a Dave> PEP. I'm working on a PEP... ;-) This thread is all good grist for the mill. I'll try to get something minimal you can throw tomatoes at tonight or tomorrow. Dave> * Define Python API for CSV parser. Dave> * Define extension module API. I'm not sure you need to define an extension module API. I view the extension module is essentially an implementation detail. Cliff> As far as DSV's current API, I'm not too attached to it, and I Cliff> think that it could be mimicked sufficiently by adding a Cliff> parser.parseall() method to Dave's API so the programmer would Cliff> have the option of getting the entire file as a list without Cliff> having to write a loop. Dave> I think that we should be prepared to go back to the drawing board Dave> on the API if necessary. Once we have enough variants registered Dave> we will be in a better position to come up with the "right" API. Hmmm... 
I'd like to get something into 2.3 without a wholesale rewrite if possible. I see two basic operations: * suck the contents of a file-like object opened for reading into a list of lists (or iterable returning lists) * write a list of lists to to a file-like object opened for writing I view the rest of the API as essentially just tweaks to the formatting parameters. I think Dave's csv module (should I be calling it Object Craft's csv module? I don't mean to slight other contributors) is fairly close to this already, though it would be nice to be able to read a CSV file like so: import csv csvreader = csv.parser(file("nastiness.csv")) # csvreader.setparams(dialect="excel2000", quote='"', delimiter='/') for row in csvreader: process(row) and write it like so: import csv csvwriter = csv.writer(file("newnastiness.csv", "w")) # csvwriter.setparams(dialect="lotus123", quote='"', delimiter='/') for row in someiterable: csvwriter.write(row) The .setparams() method can obviously be collapsed into the constructors. I could thus implement a CSV dialect converter (do others like "dialect" better than "variant"?) thus: import csv csvreader = csv.parser(file("nastiness.csv"), dialect="excel2000") csvwriter = csv.writer(file("newnastiness.csv", "w"), dialect="lotus123", delimiter='/') for row in csvreader: csvwriter.write(row) Skip From skip at pobox.com Tue Jan 28 01:03:21 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 18:03:21 -0600 Subject: DSVWizard.py In-Reply-To: References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> <15925.28186.949208.952742@montanaro.dyndns.org> <1043696542.25139.3027.camel@software1.logiplex.internal> Message-ID: <15925.51401.651798.820598@montanaro.dyndns.org> Cliff> BTW, where are we planning on hosting this project? 
Under one of Cliff> the existing projects or somewhere else? Dave> If we are trying to get this into Python shouldn't we use Dave> something like sourceforge. Has anyone been through the process Dave> of getting code into Python before? I have checkin privileges on the Python repository. I doubt it will be difficult to get all of you set up similarly. The Python CVS sandbox would then make a logical place to host it: http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/ I can just create a "csv" subdirectory there to get us started. Skip From djc at object-craft.com.au Tue Jan 28 01:23:37 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 11:23:37 +1100 Subject: DSVWizard.py In-Reply-To: <15925.51182.467769.765511@montanaro.dyndns.org> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> <15925.28186.949208.952742@montanaro.dyndns.org> <15925.51182.467769.765511@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> I'm working on a PEP... ;-) This thread is all good grist for Skip> the mill. I'll try to get something minimal you can throw Skip> tomatoes at tonight or tomorrow. Excellent. Dave> * Define Python API for CSV parser. Dave> * Define extension module API. Skip> I'm not sure you need to define an extension module API. I view Skip> the extension module is essentially an implementation detail. True. Skip> Hmmm... I'd like to get something into 2.3 without a wholesale Skip> rewrite if possible. 
I see two basic operations: Skip> * suck the contents of a file-like object opened for reading Skip> into a list of lists (or iterable returning lists) Skip> * write a list of lists to to a file-like object opened for Skip> writing Skip> I view the rest of the API as essentially just tweaks to the Skip> formatting parameters. Sounds easy :-) Skip> I think Dave's csv module (should I be calling it Object Craft's Skip> csv module? I don't mean to slight other contributors) Call it Object Craft's. I did the initial work but Andrew has his fingerprints all over it now. Skip> import csv Skip> Skip> csvreader = csv.parser(file("nastiness.csv")) Skip> # csvreader.setparams(dialect="excel2000", quote='"', delimiter='/') Skip> Skip> for row in csvreader: Skip> process(row) That is a really nice interface. I like it a lot. Skip> import csv Skip> Skip> csvwriter = csv.writer(file("newnastiness.csv", "w")) Skip> # csvwriter.setparams(dialect="lotus123", quote='"', delimiter='/') Skip> Skip> for row in someiterable: Skip> csvwriter.write(row) Very nice. Skip> The .setparams() method can obviously be collapsed into the Skip> constructors. Skip> Skip> I could thus implement a CSV dialect converter (do others like Skip> "dialect" better than "variant"?) thus: Skip> Skip> import csv Skip> Skip> csvreader = csv.parser(file("nastiness.csv"), dialect="excel2000") Skip> csvwriter = csv.writer(file("newnastiness.csv", "w"), Skip> dialect="lotus123", delimiter='/') Skip> Skip> for row in csvreader: Skip> csvwriter.write(row) This is excellent stuff. I am not very good at naming, but "dialect" looks good to me. 
- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Tue Jan 28 01:43:33 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 11:43:33 +1100 Subject: DSVWizard.py In-Reply-To: <15925.51401.651798.820598@montanaro.dyndns.org> References: <15921.59181.892148.382610@montanaro.dyndns.org> <15922.5903.628119.997022@montanaro.dyndns.org> <1043622397.25146.2910.camel@software1.logiplex.internal> <15924.32327.631412.57615@montanaro.dyndns.org> <1043686924.25146.2997.camel@software1.logiplex.internal> <15925.28186.949208.952742@montanaro.dyndns.org> <1043696542.25139.3027.camel@software1.logiplex.internal> <15925.51401.651798.820598@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Cliff> BTW, where are we planning on hosting this project? Under one Cliff> of the existing projects or somewhere else? Dave> If we are trying to get this into Python shouldn't we use Dave> something like sourceforge. Has anyone been through the process Dave> of getting code into Python before? Skip> I have checkin privileges on the Python repository. I doubt it Skip> will be difficult to get all of you set up similarly. The Skip> Python CVS sandbox would then make a logical place to host it: Skip> Skip> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/ Skip> I can just create a "csv" subdirectory there to get us started. I like that plan. I would be more than happy to have our code moved into the sandbox with the goal of having this go into Python 2.3. Unless I am missing the point, I assume you plan to have something like the following as a starting point: * A new csv.py Python module which exports the interface defined in the PEP. * Our current CSV parser renamed to something like _csvparser. * The torture test. 
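Dave's proposed layering (a public csv.py wrapping a renamed extension module, in the Pickle/cPickle style Cliff suggested) might look roughly like this. The name `_csvparser` follows Dave's suggestion above; neither module exists yet, so this is only a structural sketch:

```python
# Hypothetical layering sketch: csv.py tries the fast extension module
# and falls back to pure Python when it is unavailable.
try:
    import _csvparser as _engine   # fast C extension (hypothetical name)
except ImportError:
    _engine = None                 # pure-Python fallback path

def parse_line(line, delimiter=','):
    if _engine is not None:
        return _engine.parse(line, delimiter)
    # Minimal pure-Python fallback; real code would handle quoting too.
    return line.split(delimiter)
```

The point of the split is that variant presets, argument checking, and any future tweakables stay in the Python layer, while the extension stays a dumb, fast parser.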
- Dave -- http://www.object-craft.com.au From skip at pobox.com Tue Jan 28 02:57:05 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 19:57:05 -0600 Subject: SF ids please Message-ID: <15925.58225.712028.494438@montanaro.dyndns.org> Please confirm your Sourceforge usernames for me: Dave Cole davecole Cliff Wells cliffwells18 Kevin Altis kasplat I will see about getting you checkin privileges for Python CVS. Dave, what about Andrew? Skip From djc at object-craft.com.au Tue Jan 28 03:06:18 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 13:06:18 +1100 Subject: SF ids please In-Reply-To: <15925.58225.712028.494438@montanaro.dyndns.org> References: <15925.58225.712028.494438@montanaro.dyndns.org> Message-ID: > Please confirm your Sourceforge usernames for me: > Dave Cole davecole That is me. Andrew is getting an account set up now. - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Tue Jan 28 03:12:07 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Tue, 28 Jan 2003 13:12:07 +1100 Subject: SF ids please In-Reply-To: Message from Dave Cole of "28 Jan 2003 13:06:18 +1100." References: <15925.58225.712028.494438@montanaro.dyndns.org> Message-ID: <20030128021207.5D8AA3C1F4@coffee.object-craft.com.au> >> Please confirm your Sourceforge usernames for me: > >Andrew is getting an account set up now.
Done: "andrewmcnamara" -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From skip at pobox.com Tue Jan 28 03:23:43 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 20:23:43 -0600 Subject: sandbox created Message-ID: <15925.59823.804408.159618@montanaro.dyndns.org> I created the sandbox with a handful of stub files. You can browse them at http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv I've also asked the Python admins for checkin privileges for each of you. Skip From skip at pobox.com Tue Jan 28 04:15:32 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 21:15:32 -0600 Subject: Checkin privileges for a few other people please? (fwd) Message-ID: <15925.62932.544759.35012@montanaro.dyndns.org> Hey folks, Guido says it's a go if you're cool with the PSF license. This will likely affect your current code. Let me know, yea or nay. Skip -------------- next part -------------- An embedded message was scrubbed... From: Guido van Rossum Subject: Re: Checkin privileges for a few other people please? Date: Mon, 27 Jan 2003 21:24:55 -0500 Size: 5804 Url: http://mail.python.org/pipermail/csv/attachments/20030127/b0896ab6/attachment.mht From djc at object-craft.com.au Tue Jan 28 04:57:07 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 14:57:07 +1100 Subject: Checkin privileges for a few other people please? (fwd) In-Reply-To: <15925.62932.544759.35012@montanaro.dyndns.org> References: <15925.62932.544759.35012@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> Guido says it's a go if you're cool with the PSF license. This Skip> will likely affect your current code. Let me know, yea or nay. I have skimmed through the psf-contributor-agreement. It looks like we lose nothing by contributing - we just grant PSF equal copyright. That is fine by us (that is a yea). I suppose we should fax some signed copies of the various agreements. 
- Dave Guido> I'd like to make sure that they will assign the copyright to the Guido> PSF. This is especially important since two of these are Guido> already authors of 3rd party code with possibly different Guido> licenses. All new code in the Python CVS *must* be under the Guido> standard PSF license. Guido> If they all agree with the drafts at Guido> http://www.python.org/psf/psf-contributor-agreement.html Guido> it's a deal, as far as I'm concerned. (Oh, and the usual Guido> caution for checking in outside the area for which they are Guido> responsible.) -- http://www.object-craft.com.au From djc at object-craft.com.au Tue Jan 28 05:08:00 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 15:08:00 +1100 Subject: sandbox created In-Reply-To: <15925.59823.804408.159618@montanaro.dyndns.org> References: <15925.59823.804408.159618@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> I created the sandbox with a handful of stub files. You can Skip> browse them at Skip> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv Skip> I've also asked the Python admins for checkin privileges for Skip> each of you. If Skip is prepared to do it, I think he should act as project leader. I think that it is important to have someone who does not have a personal attachment to any existing code. I have my own ideas about how we should proceed. I suspect I am not alone :-) - Dave -- http://www.object-craft.com.au From skip at pobox.com Tue Jan 28 05:20:23 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 27 Jan 2003 22:20:23 -0600 Subject: First Cut at CSV PEP Message-ID: <15926.1287.36487.12649@montanaro.dyndns.org> I'm ready to toddle off to bed, so I'm stopping here for tonight. Attached is what I've come up with so far in the way of a PEP. Feel free to flesh out, rewrite or add new sections. After a brief amount of cycling, I'll check it into CVS. 
Skip -------------- next part -------------- PEP: NNN Title: CSV file API Version: $Revision: $ Last-Modified: $Date: $ Author: Skip Montanaro , Kevin Altis , Cliff Wells Status: Active Type: Draft Content-Type: text/x-rst Created: 26-Jan-2003 Python-Version: 2.3 Post-History: Abstract ======== The Comma Separated Values (CSV) file format is the most common import and export format for spreadsheets and databases. Although many CSV files are simple to parse, the format is not formally defined by a stable specification and is subtle enough that parsing lines of a CSV file with something like ``line.split(",")`` is bound to fail. This PEP defines an API for reading and writing CSV files which should make it possible for programmers to select a CSV module which meets their requirements. Existing Modules ================ Three widely available modules enable programmers to read and write CSV files: - Dave Cole's csv module [1]_ - Cliff Wells's Python-DSV module [2]_ - Laurence Tratt's ASV module [3]_ They have different APIs, making it somewhat difficult for programmers to switch between them. More of a problem may be that they interpret some of the CSV corner cases differently, so even after surmounting the differences in the module APIs, the programmer has to also deal with semantic differences between the packages. Rationale ========= By defining common APIs for reading and writing CSV files, we make it easier for programmers to choose an appropriate module to suit their needs, and make it easier to switch between modules if their needs change. This PEP also forms a set of requirements for creation of a module which will hopefully be incorporated into the Python distribution. Module Interface ================ The module supports two basic APIs, one for reading and one for writing. 
The reading interface is:: reader(fileobj [, dialect='excel2000'] [, quotechar='"'] [, delimiter=','] [, skipinitialspace=False]) A reader object is an iterable which takes a file-like object opened for reading as the sole required parameter. It also accepts four optional parameters (discussed below). Readers are typically used as follows:: csvreader = csv.parser(file("some.csv")) for row in csvreader: process(row) The writing interface is similar:: writer(fileobj [, dialect='excel2000'] [, quotechar='"'] [, delimiter=','] [, skipinitialspace=False]) A writer object is a wrapper around a file-like object opened for writing. It accepts the same four optional parameters as the reader constructor. Writers are typically used as follows:: csvwriter = csv.writer(file("some.csv", "w")) for row in someiterable: csvwriter.write(row) Optional Parameters ------------------- Both the reader and writer constructors take four optional keyword parameters: - dialect is an easy way of specifying a complete set of format constraints for a reader or writer. Most people will know what application generated a CSV file or what application will process the CSV file they are generating, but not the precise settings necessary. The only dialect defined initially is "excel2000". The dialect parameter is interpreted in a case-insensitive manner. - quotechar specifies a one-character string to use as the quoting character. It defaults to '"'. - delimiter specifies a one-character string to use as the field separator. It defaults to ','. - skipinitialspace specifies how to interpret whitespace which immediately follows a delimiter. It defaults to False, which means that whitespace immediately following a delimiter is part of the following field. When processing a dialect setting and one or more of the other optional parameters, the dialect parameter is processed first, then the others are processed. This makes it easy to choose a dialect, then override one or more of the settings.
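The skipinitialspace behaviour described above can be sketched with a simplified splitter that ignores quoting entirely. The parameter name comes from the draft; the function itself is illustrative, not the proposed reader:

```python
# Illustrative only: skipinitialspace says whitespace immediately after
# a delimiter is not part of the field data.
def split_simple(line, delimiter=',', skipinitialspace=False):
    fields = line.split(delimiter)
    if skipinitialspace:
        # Strip leading spaces from every field that follows a delimiter.
        fields = fields[:1] + [f.lstrip(' ') for f in fields[1:]]
    return fields

split_simple('a, b, c')                         # ['a', ' b', ' c']
split_simple('a, b, c', skipinitialspace=True)  # ['a', 'b', 'c']
```

This is exactly the DSV-versus-csv difference seen with nastiness.csv: DSV behaves as if skipinitialspace were True, the Object Craft csv module as if it were False.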
For example, if a CSV file was generated by Excel 2000 using single quotes as the quote character, you could create a reader like:: csvreader = csv.parser(file("some.csv"), dialect="excel2000", quotechar="'") Testing ======= TBD. Issues ====== - Should a parameter control how consecutive delimiters are interpreted? (My thought is "no".) References ========== .. [1] csv module, Object Craft (http://www.object-craft.com.au/projects/csv) .. [2] Python-DSV module, Wells (http://sourceforge.net/projects/python-dsv/) .. [3] ASV module, Tratt (http://tratt.net/laurie/python/asv/) Copyright ========= This document has been placed in the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 End: From djc at object-craft.com.au Tue Jan 28 05:56:39 2003 From: djc at object-craft.com.au (Dave Cole) Date: 28 Jan 2003 15:56:39 +1100 Subject: First Cut at CSV PEP In-Reply-To: <15926.1287.36487.12649@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> I'm ready to toddle off to bed, so I'm stopping here for Skip> tonight. Attached is what I've come up with so far in the way Skip> of a PEP. Feel free to flesh out, rewrite or add new sections. Skip> After a brief amount of cycling, I'll check it into CVS. I only have one issue with the PEP as it stands. It is still aiming too low. One of the things that we support in our parser is the ability to handle CSV without quote characters. field1,field2,field3\, field3,field4 One of our customers has data like the above. To handle this we would need something like the following: # Use the 'raw' dialect to get access to all tweakables. writer(fileobj, dialect='raw', quotechar=None, delimiter=',', escapechar='\\') I think that we need some way to handle a potentially different set of options on each dialect. 
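Dave's escapechar idea (quoting disabled, the delimiter escaped by a backslash) could behave like this toy splitter. It is an illustration of the desired semantics, not the csv module's actual parser:

```python
# Toy escape-aware splitter (hypothetical): quotechar=None, and the
# escapechar makes the following character literal.
def split_escaped(line, delimiter=',', escapechar='\\'):
    fields, field, i = [], '', 0
    while i < len(line):
        c = line[i]
        if c == escapechar and i + 1 < len(line):
            field += line[i + 1]   # take the next character literally
            i += 2
        elif c == delimiter:
            fields.append(field)
            field = ''
            i += 1
        else:
            field += c
            i += 1
    fields.append(field)
    return fields

split_escaped(r'field1,field2,field3\, field3,field4')
# -> ['field1', 'field2', 'field3, field3', 'field4']
```

A "raw" dialect exposing all such tweakables, with named dialects supplying safe defaults, would cover data like this without complicating the common Excel case.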
When you CSV export from Excel, do you have the ability to use a delimiter other than comma? Do you have the ability to change the quotechar? Should the wrapper protect you from yourself so that when you select the Excel dialect you are limited to the options available within Excel? Maybe the dialect should not limit you, it should just provide the correct defaults. Since we are going to have one parsing engine in an extension module below the Python layer, we are probably going to evolve more tweakable settings in the parser over time. It would be nice if we could hide new tweakables from application code by associating defaults values with dialect names in the Python layer. We should not be exposing the low level parser interface to user code if it can be avoided. - Dave -- http://www.object-craft.com.au From altis at semi-retired.com Tue Jan 28 06:50:20 2003 From: altis at semi-retired.com (Kevin Altis) Date: Mon, 27 Jan 2003 21:50:20 -0800 Subject: First Cut at CSV PEP In-Reply-To: Message-ID: > From: Dave Cole > > >>>>> "Skip" == Skip Montanaro writes: > > I only have one issue with the PEP as it stands. It is still aiming > too low. One of the things that we support in our parser is the > ability to handle CSV without quote characters. > > field1,field2,field3\, field3,field4 Excel certainly can't handle that, nor do I think Access can. If a field contains a comma, then the field must be quoted. Now, that isn't to say that we shouldn't be able to support the idea of escaped characters, but when exporting if you do want something that a tool like Excel could read, you would need to generate an exception if quoting wasn't specified. The same would probably apply for embedded newlines in a field without quoting. Being able to generate exceptions on import and export operations could be one of the big benefits of this module. 
You won't accidentally export something that someone on the other end won't be able to use and you'll know on import that you have garbage before you try and use it. For example, when I first started trying to import Access data that was tab-separated, I didn't realize there were embedded newlines until much later, at which point I was able to go back and export as CSV with quote delimiters and the data became usable. > I think that we need some way to handle a potentially different set of > options on each dialect. I'm not real comfortable with the dialect idea, it doesn't seem to add any value over simply specifying a separator and delimiter. We aren't dealing with encodings, so anything other than 7-bit ASCII unless specified as a delimiter or separator would be undefined, yes? The only thing that really matters is the delimiter and separator and then how quoting of either of those characters and of embedded returns and newlines within a field is handled. Correct me if I'm wrong, but I don't think the MS CSV formats can deal with embedded CR or LF unless fields are quoted and that will be done with a " character. Now with Access, you are actually given more control. See the attached screenshot. Ignoring everything except the top File format section you have: Delimited or Fixed Width. If Delimited you have a Field Delimiter choice of comma, semi-colon, tab and space or a user-specified character and the text qualifier can be double-quote, apostrophe, or None. > When you CSV export from Excel, do you have the ability to use a > delimiter other than comma? Do you have the ability to change the > quotechar? No, but there are a variety of text formats supported.
The Excel 2000 help file for Text file formats: "Text (Tab-delimited) (*.txt) (Windows) Text (Macintosh) Text (OS/2 or MS-DOS) CSV (comma delimited) (*.csv) (Windows) CSV (Macintosh) CSV (OS/2 or MS-DOS) If you are saving a workbook as a tab-delimited or comma-delimited text file for use on another operating system, select the appropriate converter to ensure that tab characters, line breaks, and other characters are interpreted correctly." The Excel 2000 help file for CSV: "CSV (Comma delimited) format The CSV (Comma delimited) file format saves only the text and values as they are displayed in cells of the active worksheet. All rows and all characters in each cell are saved. Columns of data are separated by commas, and each row of data ends in a carriage return. If a cell contains a comma, the cell contents are enclosed in double quotation marks. If cells display formulas instead of formula values, the formulas are converted as text. All formatting, graphics, objects, and other worksheet contents are lost. Note If your workbook contains special font characters such as a copyright symbol (C), and you will be using the converted text file on a computer with a different operating system, save the workbook in the text file format appropriate for that system. For example, if you are using Windows and want to use the text file on a Macintosh computer, save the file in the CSV (Macintosh) format. If you are using a Macintosh computer and want to use the text file on a system running Windows or Windows NT, save the file in the CSV (Windows) format." The CR, CR/LF, and LF line endings probably have something to do with saving in Mac format, but it may also do some 8-bit character translation. The universal readlines support in Python 2.3 may impact the use of a file reader/writer when processing different text files, but would returns or newlines within a field be impacted? 
Should the PEP and API specify that the record delimiter can be either CR, LF, or CR/LF, but use of those characters inside a field requires the field to be quoted or an exception will be thrown? ka -------------- next part -------------- A non-text attachment was scrubbed... Name: access_export.png Type: image/png Size: 9504 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030127/7594f034/attachment.png From altis at semi-retired.com Tue Jan 28 07:39:28 2003 From: altis at semi-retired.com (Kevin Altis) Date: Mon, 27 Jan 2003 22:39:28 -0800 Subject: various CVS references In-Reply-To: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: Just for reference some Google searches of "cvs spec" "comma separated values" and some other variants produced Java http://ostermiller.org/utils/CSVLexer.html Perl http://rath.ca/Misc/Perl_CSV/ http://rath.ca/Misc/Perl_CSV/CSV-2.0.html#csv%20specification A search on CPAN for csv yields a lot of different modules, some with test data. http://theoryx5.uwinnipeg.ca/mod_perl/cpan-search?request=search The TCL standard libs (whatever those are ;-) has a module http://tcllib.sourceforge.net/doc/csv.html MSDN references http://msdn.microsoft.com/library/default.asp?url=/library/en-us/netdir/ad/comma-separated_value_csv_scripts.asp There are a variety of other things on MSDN, none of which seem particularly helpful. Apparently, the MS Commerce server actually contains ImportCSV and ExportCSV methods. I'm still searching to see if I can find further MS qualifications of CSV and/or tab-delimited formats as supported by various tools. ka From altis at semi-retired.com Tue Jan 28 07:43:22 2003 From: altis at semi-retired.com (Kevin Altis) Date: Mon, 27 Jan 2003 22:43:22 -0800 Subject: First Cut at CSV PEP In-Reply-To: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: > I'm ready to toddle off to bed, so I'm stopping here for tonight.
> Attached > is what I've come up with so far in the way of a PEP. Feel free to flesh > out, rewrite or add new sections. After a brief amount of cycling, I'll > check it into CVS. Probably need to specify that input and output deals with string representations, but there are some differences: [[5,'Bob',None,1.0]] DSV.exportCSV produces '5,Bob,None,1.0' Data that doesn't need quoting isn't quoted. Assuming those were spreadsheet values with the third item just an empty cell, then using Excel export rules would result in a default CSV of 5,Bob,,1\r\n None is just an empty field. In Excel, the number 1.0 is just 1 in the exported file, but that may not matter, we can export 1.0 for the field. This reminds me that the boundary case of the last record just having EOF with no line ending should be tested. Importing this line with importDSV for example yields a list of lists. [['5', 'Bob', '', '1']] It's debatable whether the third field should be None or an empty string. Further interpretation of each field becomes application-specific. The API makes it easy to do further processing as each row is read. I'm still not sure about some of the database CSV handling issues, often it seems they want a string field to be quoted regardless of whether it contains a comma or newlines, but number and empty field should not be quoted. It is certainly nice to be able to import a file that contains 5,"Bob",,1.0\r\n and not need to do any further translation. Excel appears to interpret quoted numbers and unquoted numbers as numeric fields when importing.
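The Excel-style export rules described above fit in a few lines of Python. This is only an illustration of those rules (the function names are made up, not the proposed module API); note that, unlike Excel, it writes 1.0 as "1.0" rather than "1":

```python
def excel_field(value, delimiter=",", quotechar='"'):
    """Render one value Excel-style: None becomes an empty field, and a
    field is quoted only if it contains the delimiter, the quote
    character, or a line break; embedded quotes are doubled."""
    if value is None:
        return ""
    text = str(value)
    if any(c in text for c in (delimiter, quotechar, "\r", "\n")):
        return quotechar + text.replace(quotechar, quotechar * 2) + quotechar
    return text

def excel_record(row):
    """Join the rendered fields and terminate the record with CRLF."""
    return ",".join(excel_field(value) for value in row) + "\r\n"

print(repr(excel_record([5, "Bob", None, 1.0])))  # '5,Bob,,1.0\r\n'
```

Reading the record back gives strings only (['5', 'Bob', '', '1.0']), so the None/empty-string question and any number conversion are left to the application, as discussed above.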
Just trying to be anal-retentive here to make sure all the issues are covered ;-) ka From altis at semi-retired.com Tue Jan 28 16:20:21 2003 From: altis at semi-retired.com (Kevin Altis) Date: Tue, 28 Jan 2003 07:20:21 -0800 Subject: First Cut at CSV PEP In-Reply-To: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: The big issue with the MS/Excel CSV format is that MS doesn't appear to escape any characters or support import of escaped characters. A field that contains characters that you might normally escape (including a comma if that is the separator) are instead enclosed in double quotes by default and then any double quotes in the field are doubled. I found this MySQL article where the dialogs show the emphasis on escape characters. http://www.databasejournal.com/features/mysql/article.php/10897_1558731_5 It doesn't seem like you would run into a case where a file would use the MS CSV format and have escaped characters too, but perhaps these exist in the wild. On the export, I think you would want the option of specifying whether to use field qualifiers (quotes) on all fields and then only optionally enclose a field if qualifiers are needed. If you aren't generating MS CSV format and are using escape sequences, the field "quotes" aren't needed. See the Export Data as CSV dialog at the URL above. I guess MySQL could be one of the dialects and that would be closer to what everyone expects except MS? Ugh, I shouldn't try and think about this stuff before morning coffee ;-) ka > -----Original Message----- > From: Skip Montanaro [mailto:skip at pobox.com] > Sent: Monday, January 27, 2003 8:20 PM > To: LogiplexSoftware at earthlink.net; altis at semi-retired.com; > csv at object-craft.com.au > Subject: First Cut at CSV PEP > > > > I'm ready to toddle off to bed, so I'm stopping here for tonight. > Attached > is what I've come up with so far in the way of a PEP. Feel free to flesh > out, rewrite or add new sections. 
After a brief amount of cycling, I'll > check it into CVS. > > Skip > > From altis at semi-retired.com Tue Jan 28 16:50:53 2003 From: altis at semi-retired.com (Kevin Altis) Date: Tue, 28 Jan 2003 07:50:53 -0800 Subject: more Perl CSV - http://tit.irk.ru/perlbookshelf/cookbook/ch01_16.htm In-Reply-To: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: From skip at pobox.com Tue Jan 28 17:56:26 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 10:56:26 -0600 Subject: various CVS references In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: <15926.46650.579273.539803@montanaro.dyndns.org> Kevin> Just for reference some Google searches of "cvs spec" "comma Kevin> separated values" and some other variants produced Much appreciated. I will incorporate some of them into the PEP. Kevin> Java Kevin> http://ostermiller.org/utils/CSVLexer.html Interestingly enough, the author considers Excel's format not conformant with "the generally accepted standards" and requires the programmer to use special Excel readers and writers. I wonder who he's been talking to about standards. ;-) Kevin> Perl Kevin> http://rath.ca/Misc/Perl_CSV/ Kevin> http://rath.ca/Misc/Perl_CSV/CSV-2.0.html#csv%20specification I like that this guy has a BNF diagram for CSV files. He treats delimiters and quote characters as static, which we would probably make dynamic. Perhaps I can come up with something similar for the PEP. Kind of a Gory Details appendix. Kevin> A search on CPAN for csv yields a lot of different modules, some Kevin> with test data. Kevin> http://theoryx5.uwinnipeg.ca/mod_perl/cpan-search?request=search CPAN is great if you know what you're looking for but is a morass otherwise. It gives you lots of choices, but not enough information to decide which packages are high quality. The Vaults of Parnassus has the same problem, but fewer choices. 
Kevin> The TCL standard libs (whatever those are ;-) has a module Kevin> http://tcllib.sourceforge.net/doc/csv.html Looks a bit low level. Kevin> MSDN references Kevin> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/netdir/ad/comma-separated_value_csv_scripts.asp Doesn't look all that useful. Kevin> http://tit.irk.ru/perlbookshelf/cookbook/ch01_16.htm Interesting cookbook recipe, but nothing Dave and Cliff don't already know how to do. ;-) Besides, it uses regular expressions to parse fields. As Jamie Zawinski says: Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. Skip From altis at semi-retired.com Tue Jan 28 18:02:06 2003 From: altis at semi-retired.com (Kevin Altis) Date: Tue, 28 Jan 2003 09:02:06 -0800 Subject: various CVS references In-Reply-To: <15926.46650.579273.539803@montanaro.dyndns.org> Message-ID: > From: Skip Montanaro [mailto:skip at pobox.com] > > Kevin> Just for reference some Google searches of "cvs spec" "comma > Kevin> separated values" and some other variants produced > > Much appreciated. I will incorporate some of them into the PEP. All this was just for reference sake so we have a better idea of current practice in other languages. I have an email out to a .NET guru friend just to see if MS has documented any better CSV as it relates to .NET methods in various products. I think we already understand the problem domain better than most and realize that handling the MS format for both import and export out of the gate is crucial for a standard lib. 
ka From LogiplexSoftware at earthlink.net Tue Jan 28 22:17:32 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 13:17:32 -0800 Subject: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: <1043788652.25139.3222.camel@software1.logiplex.internal> On Mon, 2003-01-27 at 20:56, Dave Cole wrote: > I only have one issue with the PEP as it stands. It is still aiming > too low. One of the things that we support in our parser is the > ability to handle CSV without quote characters. > > field1,field2,field3\, field3,field4 > > One of our customers has data like the above. To handle this we would > need something like the following: > > # Use the 'raw' dialect to get access to all tweakables. > writer(fileobj, > dialect='raw', quotechar=None, delimiter=',', escapechar='\\') +1 on escapechar, -1 on 'raw' dialect. Why would a 'raw' dialect be needed? It isn't clear to me why escapechar would be mutually exclusive with any particular dialect. Further, not specifying a dialect (dialect=None) should be the default which would seem the same as 'raw'. > I think that we need some way to handle a potentially different set of > options on each dialect. I'm not understanding how this is different from Skip's suggestion to use reader(fileobj, dialect="excel2000", delimiter='\t') Or are you suggesting that not all options would be available on all dialects? Can you suggest an example? > When you CSV export from Excel, do you have the ability to use a > delimiter other than comma? Do you have the ability to change the > quotechar? I think it is an option to save as a TSV file (IIRC), which is the same as a CSV file, but with tabs. > Should the wrapper protect you from yourself so that when you select > the Excel dialect you are limited to the options available within > Excel? No. I think this would be unnecessarily limiting. > Maybe the dialect should not limit you, it should just provide the > correct defaults. 
This is what I'm thinking. > Since we are going to have one parsing engine in an extension module > below the Python layer, we are probably going to evolve more tweakable > settings in the parser over time. It would be nice if we could hide > new tweakables from application code by associating defaults values > with dialect names in the Python layer. We should not be exposing the > low level parser interface to user code if it can be avoided. +1 -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Tue Jan 28 22:25:17 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 13:25:17 -0800 Subject: Checkin privileges for a few other people please? (fwd) In-Reply-To: <15925.62932.544759.35012@montanaro.dyndns.org> References: <15925.62932.544759.35012@montanaro.dyndns.org> Message-ID: <1043789116.25146.3230.camel@software1.logiplex.internal> On Mon, 2003-01-27 at 19:15, Skip Montanaro wrote: > Hey folks, > > Guido says it's a go if you're cool with the PSF license. This will likely > affect your current code. Let me know, yea or nay. > > Skip > DSV is already listed under the Python license on SF, and even if it weren't, I'd have no problem with this. > > ______________________________________________________________________ > > From: Guido van Rossum > To: skip at pobox.com > Cc: Barry Warsaw , Fred Drake , Jeremy Hylton , Tim Peters > Subject: Re: Checkin privileges for a few other people please? > Date: 27 Jan 2003 21:24:55 -0500 > > > I'm writing to see if you can give four people Python checkin privileges: > > > > who SF username > > --- ----------- > > Kevin Altis kasplat > > Dave Cole davecole > > Andrew McNamara andrewmcnamara > > Cliff Wells cliffwells18 > > > > We are launching on a PEP and a module to support reading and writing CSV > > files. 
Dave Cole, Andrew McNamara and Cliff Wells are authors of currently > > available CSV packages (csv and Python-DSV - see Parnassus for pointers). > > Kevin Altis is the author of PythonCard, and a user of CSV formats. (I also > > use CSV files a lot.) All four have contributed substantially to the Python > > community. > > > > We're currently working on a PEP to define the API. The current plan is to > > build heavily on the Object Craft (Dave and Andrew) and Cliff's modules with > > a more Pythonic API than either currently has. I created a directory in the > > sandbox just now to support this little mini-project. The goal is to have > > something which can be included in Python 2.3, though this may be a bit > > optimistic, even with a substantial body of code already written. > > I'd like to make sure that they will assign the copyright to the PSF. > This is especially important since two of these are already authors of > 3rd party code with possibly different licenses. All new code in the > Python CVS *must* be under the standard PSF license. > > If they all agree with the drafts at > > http://www.python.org/psf/psf-contributor-agreement.html > > it's a deal, as far as I'm concerned. (Oh, and the usual caution for > checking in outside the area for which they are responsible.) 
> > --Guido van Rossum (home page: http://www.python.org/~guido/) -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Tue Jan 28 22:26:28 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 13:26:28 -0800 Subject: SF ids please In-Reply-To: <15925.58225.712028.494438@montanaro.dyndns.org> References: <15925.58225.712028.494438@montanaro.dyndns.org> Message-ID: <1043789188.25146.3232.camel@software1.logiplex.internal> On Mon, 2003-01-27 at 17:57, Skip Montanaro wrote: > Please confirm your Sourceforge usernames for me: > > Dave Cole davecole > Cliff Wells cliffwells18 > Kevin Altis kasplat > > I will see about getting you checkin privileges for Python CVS. Dave, what > about Andrew? cliffwells18 confirmed =) -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Tue Jan 28 22:45:21 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 13:45:21 -0800 Subject: First Cut at CSV PEP In-Reply-To: References: Message-ID: <1043790321.25139.3251.camel@software1.logiplex.internal> On Mon, 2003-01-27 at 21:50, Kevin Altis wrote: > > From: Dave Cole > > > > >>>>> "Skip" == Skip Montanaro writes: > > > > I only have one issue with the PEP as it stands. It is still aiming > > too low. One of the things that we support in our parser is the > > ability to handle CSV without quote characters. > > > > field1,field2,field3\, field3,field4 > > Excel certainly can't handle that, nor do I think Access can. If a field > contains a comma, then the field must be quoted. Now, that isn't to say that > we shouldn't be able to support the idea of escaped characters, but when > exporting if you do want something that a tool like Excel could read, you > would need to generate an exception if quoting wasn't specified. 
The same > would probably apply for embedded newlines in a field without quoting. > > Being able to generate exceptions on import and export operations could be > one of the big benefits of this module. You won't accidentally export > something that someone on the other end won't be able to use and you'll know > on import that you have garbage before you try and use it. For example, when > I first started trying to import Access data that was tab-separated, I > didn't realize there were embedded newlines until much later, at which point > I was able to go back and export as CSV with quote delimiters and the data > became usable. Perhaps a "strict" option? I'm not sure this is necessary though. It seems that if a *programmer* specifies dialect="excel2000" and then changes some other default, that's his problem. There's a danger that too much hand-holding leads to added complexity and arbitrary limitations. > > I think that we need some way to handle a potentially different set of > options on each dialect. > > I'm not real comfortable with the dialect idea, it doesn't seem to add any > value over simply specifying a separator and delimiter. Except that it gives a programmer a way to be certain that, if he does nothing else, the file will be compatible with the specified dialect. > We aren't dealing with encodings, so anything other than 7-bit ASCII unless > specified as a delimiter or separator would be undefined, yes? The only > thing that really matters is the delimiter and separator and then how > quoting is handled of either of those characters and embedded returns and > newlines within a field. Correct me if I'm wrong, but I don't think the MS > CSV formats can deal with embedded CR or LF unless fields are quoted and > that will be done with a " character. But then MS isn't the only potential target, just our initial (and primary) target.
foobar87 may allow export of escaped newlines and put an extraneous space after every delimiter and we don't want someone to have to write another csv importer to deal with it. > Now with Access, you are actually given more control. See the attached > screenshot. Ignoring everything except the top File format section you > have: > Delimited or Fixed Width. If Delimited you have a Field Delimiter choice of > comma, semi-colon, tab and space or a user-specified character and the text > qualifier can be double-quote, apostrophe, or None. And this only deals with the variations the *user* is allowed to make. Applications themselves may introduce variations that we need to have the flexibility to deal with. > The universal readlines support in Python 2.3 may impact the use of a file > reader/writer when processing different text files, but would returns or > newlines within a field be impacted? Should the PEP and API specify that the > record delimiter can be either CR, LF, or CR/LF, but use of those characters > inside a field requires the field to be quoted or an exception will be > thrown? The idea of raising an exception brings up an interesting problem that I had to deal with in DSV. I've run across files that were missing fields and just had a callback so the programmer could decide how to deal with it. This can be the result of corrupted data, but it's also possible for an application to only export fields that actually contain data, for instance: 1,2,3,4,5 1,2,3 1,2,3,4 This could very well be a valid csv file. I'm not aware of any requirement that rows all be the same length. We'll need to have some fairly flexible error-handling to allow for this type of thing when required or raise an exception when it indicates corrupt/invalid data. In DSV I allowed custom error-handlers so the programmer could indicate whether to process the line as normal, discard it, etc.
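A DSV-style error handler for ragged rows might look roughly like the following sketch. The names and the callback signature here are hypothetical, not DSV's actual API; the point is that the caller decides whether a short row is repaired, discarded, or fatal:

```python
def filter_rows(rows, expected_len, on_bad_row=None):
    """Yield rows, passing any row whose length differs from
    expected_len to a callback.  The callback may return a repaired
    row, or None to discard it; with no callback, ragged rows raise."""
    for row in rows:
        if len(row) != expected_len:
            if on_bad_row is None:
                raise ValueError(
                    "expected %d fields, got %d" % (expected_len, len(row)))
            row = on_bad_row(row)
            if row is None:
                continue
        yield row

def pad(row, width=5, filler=""):
    """One possible repair: pad short rows with empty fields."""
    return row + [filler] * (width - len(row))

rows = [["1", "2", "3", "4", "5"], ["1", "2", "3"], ["1", "2", "3", "4"]]
print(list(filter_rows(rows, 5, pad)))
```

Run on the three example records above, the pad handler turns them all into five-field rows; dropping the handler instead raises on the first three-field row.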
> ka -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Tue Jan 28 22:55:12 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 15:55:12 -0600 Subject: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: <15926.64576.481489.373053@montanaro.dyndns.org> Kevin> Probably need to specify that input and output deals with string Kevin> representations, but there are some differences: Kevin> [[5,'Bob',None,1.0]] Kevin> DSV.exportCSV produces Kevin> '5,Bob,None,1.0' I'm not so sure this mapping None to "None" on output is such a good idea because it's not reversible in all situations and hurts portability to other systems (e.g., does Excel have a concept of None? what happens if you have a text field which just happens to contain "None"?). I think we need to limit the data which can be output to strings, Unicode strings (if we use an encoded stream), floats and ints. Anything else should raise TypeError. Kevin> I'm still not sure about some of the database CSV handling Kevin> issues, often it seems they want a string field to be quoted Kevin> regardless of whether it contains a comma or newlines, but number Kevin> and empty field should not be quoted. It is certainly nice to be Kevin> able to import a file that contains Kevin> 5,"Bob",,1.0\r\n Kevin> and not need to do any further translation. Excel appears to Kevin> interpret quoted numbers and unquoted numbers as numeric fields Kevin> when importing. I like my CSV files to be fully quoted (even fields which may contain numbers), largely because it makes later (dangerous) matching using regular expressions simpler. Otherwise I wind up having to make all the quotes in the regular expressions optional. It just complicates things. Kevin> Just trying to be anal-retentive here to make sure all the issues Kevin> are covered ;-) I hear ya. 
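The reversibility problem with writing None as the string "None" is easy to demonstrate (a plain-Python illustration, not any module's behavior):

```python
# Writing None as the literal string "None" is not reversible.
row = [5, "Bob", None, 1.0]
exported = ",".join(str(value) for value in row)
print(exported)  # 5,Bob,None,1.0

# On re-import every field is a string, and the third field is
# indistinguishable from a cell that really contained the text "None".
reimported = exported.split(",")
assert reimported[2] == "None"

row_with_text = [5, "Bob", "None", 1.0]
assert ",".join(str(v) for v in row_with_text) == exported  # identical
```

Writing None as an empty field (or raising TypeError for unsupported types) avoids the ambiguity entirely.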
I just did a little fiddling in Excel 2000 with some simple values. When I save as CSV, it doesn't give me the option to change the delimiter or quote character. Nor could I figure out how to embed a newline in a cell. It certainly doesn't seem as flexible as Gnumeric in this regard. Can someone provide me with some hints? Attached is a slight modification of the proto-PEP. Really all that's changed is the list of issues has grown. Thx, Skip -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/octet-stream Size: 7138 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030128/ce8a1d53/attachment.obj From skip at pobox.com Tue Jan 28 23:02:37 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 16:02:37 -0600 Subject: First Cut at CSV PEP In-Reply-To: References: Message-ID: <15926.65021.926324.438352@montanaro.dyndns.org> Kevin> I'm not real comfortable with the dialect idea, it doesn't seem Kevin> to add any value over simply specifying a separator and Kevin> delimiter. I look at it as a simple way to specify a group of characteristics specific to the way a vendor reads and writes CSV files. It frees the programmer from having to know all the characteristics of their chosen vendor's file format. Think of it as the difference between Larry Wall's configure script for Perl and the GNU configure script. When I configure Perl I have to know enough about my system to know the alignment boundary of malloc, whether the system is big- or little-endian, etc, even though I know damn well it can figure that stuff out reliably. GNU configure almost never prompts you. It reliably figures out all the low-level stuff for you. 
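The dialect-as-preconfigured-defaults idea could be modeled as a plain mapping in the Python layer, with keyword arguments overriding the named dialect's settings. This is only a sketch; the dialect names and option names are placeholders, not the PEP's actual API:

```python
DIALECT_DEFAULTS = {
    "excel": {"delimiter": ",", "quotechar": '"', "escapechar": None},
    "excel-tsv": {"delimiter": "\t", "quotechar": '"', "escapechar": None},
    "raw": {"delimiter": ",", "quotechar": None, "escapechar": "\\"},
}

def resolve_settings(dialect="raw", **overrides):
    """Start from the named dialect's defaults, then apply explicit
    overrides.  A new low-level tweakable only needs a default added
    to each dialect entry; callers that just name a dialect never see it."""
    if dialect not in DIALECT_DEFAULTS:
        raise ValueError("unknown dialect: %r" % (dialect,))
    settings = dict(DIALECT_DEFAULTS[dialect])
    settings.update(overrides)
    return settings

print(resolve_settings("excel", delimiter="\t"))
```

This is the GNU-configure spirit above: the programmer names the vendor, the table supplies the low-level details, and overriding one option does not require knowing the rest.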
Skip From LogiplexSoftware at earthlink.net Tue Jan 28 23:14:04 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 14:14:04 -0800 Subject: First Cut at CSV PEP In-Reply-To: References: Message-ID: <1043792044.14244.3280.camel@software1.logiplex.internal> On Mon, 2003-01-27 at 22:43, Kevin Altis wrote: > > I'm ready to toddle off to bed, so I'm stopping here for tonight. > > Attached > > is what I've come up with so far in the way of a PEP. Feel free to flesh > > out, rewrite or add new sections. After a brief amount of cycling, I'll > > check it into CVS. > > Probably need to specify that input and output deals with string > representations, but there are some differences: > > [[5,'Bob',None,1.0]] > > DSV.exportCSV produces > > '5,Bob,None,1.0' Hm, that would be a bug in DSV =). The None should not have been exported (it doesn't have any meaning outside of Python). However, only quoting when necessary was lifted straight from Excel. DSV also allows a "quoteAll" option on export to change this behavior. > Data that doesn't need quoting isn't quoted. Assuming those were spreadsheet > values with the third item just an empty cell, then using Excel export rules > would result in a default CSV of > > 5,Bob,,1\r\n This is the correct behavior. > None is just an empty field. In Excel, the number 1.0 is just 1 in the > exported file, but that may not matter, we can export 1.0 for the field. > This reminds me that the boundary case of the last record just having EOF > with no line ending should be tested. Is this not handled correctly by all the existing implementations? > Importing this line with importDSV for example yields a list of lists. > > [['5', 'Bob', '', '1']] > > It's debatable whether the third field should be None or an empty string. > Further interpretation of each field becomes application-specific. The API > makes it easy to do further processing as each row is read.
It's also debatable whether the numbers should have been returned as strings or numbers. I lean towards the former, as csv is a text format and can't convey this sort of information by itself, which is why I chose to return only strings, including the empty string for an empty field rather than None. I agree with Kevin that this is best left to application logic rather than the module. > I'm still not sure about some of the database CSV handling issues, often it > seems they want a string field to be quoted regardless of whether it > contains a comma or newlines, but number and empty field should not be > quoted. It is certainly nice to be able to import a file that contains > 5,"Bob",,1.0\r\n > > and not need to do any further translation. Excel appears to interpret > quoted numbers and unquoted numbers as numeric fields when importing. It treats them as if the user had typed them into a cell, which is not necessarily the behavior we want. To get a number as a string in Excel, I imagine you'd have to have the following: """5""","Bob",,1.0\r\n > > Just trying to be anal-retentive here to make sure all the issues are > covered ;-) And I thought it came naturally =) > ka -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Tue Jan 28 23:21:29 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 14:21:29 -0800 Subject: First Cut at CSV PEP In-Reply-To: <15926.64576.481489.373053@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> Message-ID: <1043792488.25146.3288.camel@software1.logiplex.internal> On Tue, 2003-01-28 at 13:55, Skip Montanaro wrote: > Kevin> Probably need to specify that input and output deals with string > Kevin> representations, but there are some differences: > > Kevin> [[5,'Bob',None,1.0]] > > Kevin> DSV.exportCSV produces > > Kevin> '5,Bob,None,1.0' 
> > I'm not so sure this mapping None to "None" on output is such a good idea Not unless bugs are good ideas ;) Apparently the export stuff in DSV isn't widely used, as this went unnoticed. It is incorrect behavior. > because it's not reversible in all situations and hurts portability to other > systems (e.g., does Excel have a concept of None? what happens if you have a > text field which just happens to contain "None"?). I think we need to limit > the data which can be output to strings, Unicode strings (if we use an > encoded stream), floats and ints. Anything else should raise TypeError. Or be converted to a reasonable string alternative, i.e. None -> '' > Kevin> I'm still not sure about some of the database CSV handling > Kevin> issues, often it seems they want a string field to be quoted > Kevin> regardless of whether it contains a comma or newlines, but number > Kevin> and empty field should not be quoted. It is certainly nice to be > Kevin> able to import a file that contains > > Kevin> 5,"Bob",,1.0\r\n > > Kevin> and not need to do any further translation. Excel appears to > Kevin> interpret quoted numbers and unquoted numbers as numeric fields > Kevin> when importing. > > I like my CSV files to be fully quoted (even fields which may contain > numbers), largely because it makes later (dangerous) matching using regular > expressions simpler. Otherwise I wind up having to make all the quotes in > the regular expressions optional. It just complicates things. Excel only quotes when necessary during export. However, it doesn't care on import which style is used. Allowing the programmer to specify the style in this regard would be a good thing. > Kevin> Just trying to be anal-retentive here to make sure all the issues > Kevin> are covered ;-) > > I hear ya. > > I just did a little fiddling in Excel 2000 with some simple values. When I > save as CSV, it doesn't give me the option to change the delimiter or quote > character.
Nor could I figure out how to embed a newline in a cell. It > certainly doesn't seem as flexible as Gnumeric in this regard. Can someone > provide me with some hints? Don't save as CSV, save as TSV, which is the same, but with tabs rather than commas. I don't know that it allows specifying the quote character. IIRC, you can embed a newline in a cell by entering " in a cell to mark it as a string value, then I think you can just hit enter (or perhaps ctrl+enter). -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Tue Jan 28 23:48:28 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 16:48:28 -0600 Subject: First Cut at CSV PEP In-Reply-To: <1043790321.25139.3251.camel@software1.logiplex.internal> References: <1043790321.25139.3251.camel@software1.logiplex.internal> Message-ID: <15927.2236.560883.798099@montanaro.dyndns.org> Cliff> The idea of raising an exception brings up an interesting problem Cliff> that I had to deal with in DSV. I've run across files that were Cliff> missing fields and just had a callback so the programmer could Cliff> decide how to deal with it. This can be the result of corrupted Cliff> data, but it's also possible for an application to only export Cliff> fields that actually contain data, for instance:

Cliff>     1,2,3,4,5
Cliff>     1,2,3
Cliff>     1,2,3,4

Cliff> This could very well be a valid csv file. I'm not aware of any Cliff> requirement that rows all be the same length. In fact, I think Excel itself will generate such files. As I write this, XEmacs on the Windows machine is displaying a CSV file I dumped in Excel from an XLS file I got from someone (having nothing to do with the task at hand). It has seven rows of actual data, then 147 rows of commas. The comma-only rows have 13, 15 or 255 commas, nothing else. The header line of the CSV file has 15 fields with data and is terminated by a comma (empty 16th field).
In short, I don't think it's an error for CSV files to have rows of differing lengths. We just have to return what we are given and expect the application is prepared to handle short rows. We could add more flags, but I think we should pause before we get too carried away. I've added another issue to the proto-PEP:

- How should rows of different lengths be handled? The options seem to be::

  * raise an exception when a row is encountered whose length differs from the previous row
  * silently return short rows
  * allow the caller to specify the desired row length and what to do when rows of a different length are encountered: ignore, truncate, pad, raise exception, etc.

I don't think we have to address each and every issue before a first release is made, BTW. Skip From LogiplexSoftware at earthlink.net Tue Jan 28 23:50:49 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 14:50:49 -0800 Subject: First Cut at CSV PEP In-Reply-To: <15926.1287.36487.12649@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> Message-ID: <1043794249.14244.3330.camel@software1.logiplex.internal> As an aside, does anyone have any objection to prepending [CSV] to the subject line of our emails on this topic? Right now Kevin's mails are going into the folder I have set aside for him and everyone else's is going into my inbox, which is making it somewhat tedious to follow.
Prepending [CSV] would allow me to set up a filter and would make my life just that much better =) -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Tue Jan 28 23:54:10 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 16:54:10 -0600 Subject: First Cut at CSV PEP In-Reply-To: <1043792044.14244.3280.camel@software1.logiplex.internal> References: <1043792044.14244.3280.camel@software1.logiplex.internal> Message-ID: <15927.2578.699647.710265@montanaro.dyndns.org> Cliff> It's also debatable whether the numbers should have been returned Cliff> as strings or numbers. I lean towards the former, as csv is a Cliff> text format and can't convey this sort of information by itself, Cliff> which is why I chose to return only strings, including the empty Cliff> string for an empty field rather than None. I agree with Kevin Cliff> that this is best left to application logic rather than the Cliff> module. I think returning strings is more Pythonic (explicit is better than implicit), while returning numbers is more Perlish. There's no particular reason the user couldn't specify a set of type converters to filter the input rows, e.g.:

    [int, int, str, mxDateTime.DateTimeFromString, ...]

but she could do that just as easily herself:

    reader = csv.reader(open("some.csv"))
    for row in reader:
        for i in range(min(len(rowtypes), len(row))):
            row[i] = rowtypes[i](row[i])

or something similar. Here again we get into the sticky issue of row length, suggesting we should just pass the buck to the caller.
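Skip's converter loop above can be made runnable with the csv module API that eventually shipped; the four-column schema and the sample data here are invented for illustration, and mxDateTime is swapped for plain float to stay within the standard library:

```python
import csv
import io

# Hypothetical per-column converters; every field comes off the wire as a
# string, so the caller applies whatever types it knows about.
rowtypes = [int, int, str, float]

data = io.StringIO("1,2,Bob,3.5\n10,20,Ann,0.25\n")
rows = []
for row in csv.reader(data):
    # min() guards against short rows -- the "pass the buck" policy:
    # trailing columns with no converter are simply left as strings.
    for i in range(min(len(rowtypes), len(row))):
        row[i] = rowtypes[i](row[i])
    rows.append(row)

print(rows)  # [[1, 2, 'Bob', 3.5], [10, 20, 'Ann', 0.25]]
```

Note that the row-length question stays with the caller: a short row just gets fewer conversions applied.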
Skip From djc at object-craft.com.au Tue Jan 28 23:59:29 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 09:59:29 +1100 Subject: First Cut at CSV PEP In-Reply-To: References: Message-ID: >>>>> "Kevin" == Kevin Altis writes: Kevin> The big issue with the MS/Excel CSV format is that MS doesn't Kevin> appear to escape any characters or support import of escaped Kevin> characters. A field that contains characters that you might Kevin> normally escape (including a comma if that is the separator) Kevin> is instead enclosed in double quotes by default and then any Kevin> double quotes in the field are doubled. I thought that we were trying to build a CSV parser which would deal with different dialects, not just what Excel does. Am I wrong making that assumption? If we were to only target Excel our task would be much easier. I think that we should be trying to come up with an engine wrapped by a friendly API which can be made more powerful over time in order to parse more and more dialects. Kevin> I found this MySQL article where the dialogs show the emphasis Kevin> on escape characters. Kevin> http://www.databasejournal.com/features/mysql/article.php/10897_1558731_5 Kevin> It doesn't seem like you would run into a case where a file Kevin> would use the MS CSV format and have escaped characters too, Kevin> but perhaps these exist in the wild. There are CSV formats which do not use quote characters; instead they escape the delimiters. - Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Wed Jan 29 00:08:17 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 10:08:17 +1100 Subject: First Cut at CSV PEP In-Reply-To: <1043790321.25139.3251.camel@software1.logiplex.internal> References: <1043790321.25139.3251.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> But then MS isn't the only potential target, just our initial Cliff> (and primary) target.
foobar87 may allow export of escaped Cliff> newlines and put an extraneous space after every delimiter and Cliff> we don't want someone to have to write another csv importer to Cliff> deal with it. I agree. Excel compatibility is very important, but it is not the only format we should be supporting. >> The universal readlines support in Python 2.3 may impact the use of >> a file reader/writer when processing different text files, but >> would returns or newlines within a field be impacted? Should the >> PEP and API specify that the record delimiter can be either CR, LF, >> or CR/LF, but use of those characters inside a field requires the >> field to be quoted or an exception will be thrown? Interesting point. I think that newlines inside records are going to be the same as those separating records. Anything else would be very bizarre. Cliff> The idea of raising an exception brings up an interesting Cliff> problem that I had to deal with in DSV. I've run across files Cliff> that were missing fields and just had a callback so the Cliff> programmer could decide how to deal with it. This can be the Cliff> result of corrupted data, but it's also possible for an Cliff> application to only export fields that actually contain data, Cliff> for instance:

Cliff>     1,2,3,4,5
Cliff>     1,2,3
Cliff>     1,2,3,4

I think that this is something which should be a layer above the CSV parser. The technique for reading a CSV (from the PEP) looks like this:

    csvreader = csv.parser(file("some.csv"))
    for row in csvreader:
        process(row)

Then any constraints on the content and structure of the records sit logically in the process() function. Cliff> This could very well be a valid csv file. I'm not aware of any Cliff> requirement that rows all be the same length. We'll need to Cliff> have some fairly flexible error-handling to allow for this type Cliff> of thing when required or raise an exception when it indicates Cliff> corrupt/invalid data.
In DSV I allowed custom error-handlers Cliff> so the programmer could indicate whether to process the line as Cliff> normal, discard it, etc. I am convinced that this does not belong in the parser. We can always keep going up in layers and build a csvutils module on top of the parser. - Dave -- http://www.object-craft.com.au From skip at pobox.com Wed Jan 29 00:09:58 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 17:09:58 -0600 Subject: First Cut at CSV PEP In-Reply-To: <1043792488.25146.3288.camel@software1.logiplex.internal> References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> <1043792488.25146.3288.camel@software1.logiplex.internal> Message-ID: <15927.3526.657543.26339@montanaro.dyndns.org> Cliff> Don't save as CSV, save as TSV, which is the same, but with tabs Cliff> rather than commas. I don't know that it allows specifying the Cliff> quote character. Looking at the choices more closely, I see Excel has multiple tabular save formats. I just saved a simple sheet in each of the formats and scp'd it to my laptop. I'll check 'em out later. Cliff> IIRC, you can embed a newline in a cell by entering " in a cell Cliff> to mark it as a string value, then I think you can then just hit Cliff> enter (or perhaps ctrl+enter). That didn't work, but I eventually figured out that ALT+ENTER allows you to enter a "hard carriage return". Skip From LogiplexSoftware at earthlink.net Wed Jan 29 00:11:16 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 15:11:16 -0800 Subject: [CSV] Number of lines in CSV files In-Reply-To: <15925.58225.712028.494438@montanaro.dyndns.org> References: <15925.58225.712028.494438@montanaro.dyndns.org> Message-ID: <1043795476.25146.3351.camel@software1.logiplex.internal> Another thing that just occurred to me is that Excel has historically been limited in the number of rows and columns that it can import. 
This number has increased with recent versions (I think it was 32K lines in Excel 97, Kevin informs me it's 64K in Excel 2000). Since export will be a feature of the CSV module, should we have some sort of warning or raise an exception when exporting data larger than the target application can handle, or should we just punt on this? -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From djc at object-craft.com.au Wed Jan 29 00:11:19 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 10:11:19 +1100 Subject: First Cut at CSV PEP In-Reply-To: <15926.65021.926324.438352@montanaro.dyndns.org> References: <15926.65021.926324.438352@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Kevin> I'm not real comfortable with the dialect idea, it doesn't seem Kevin> to add any value over simply specifying a separator and Kevin> delimiter. Skip> I look at it as a simple way to specify a group of Skip> characteristics specific to the way a vendor reads and writes Skip> CSV files. It frees the programmer from having to know all the Skip> characteristics of their chosen vendor's file format. Think of Skip> it as the difference between Larry Wall's configure script for Skip> Perl and the GNU configure script. When I configure Perl I have Skip> to know enough about my system to know the alignment boundary of Skip> malloc, whether the system is big- or little-endian, etc, even Skip> though I know damn well it can figure that stuff out reliably. Skip> GNU configure almost never prompts you. It reliably figures out Skip> all the low-level stuff for you. Yes, I agree. Users of the module will probably want to be able to handle files from specific applications without necessarily wanting to go through the pain of learning the hard way about exactly how dialects differ. It is as Skip says, just like the autoconf stuff.
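Backing up to Cliff's export-size question: one hedged option is a wrapper a layer above the writer rather than a module feature. The class name and warn-once policy below are invented for illustration; 65536 rows is the 64K figure cited for Excel 2000:

```python
import csv
import io
import warnings

class LimitWarningWriter:
    """Hypothetical wrapper: counts rows and warns (rather than raising)
    once output grows past what the target application can load."""
    def __init__(self, writer, max_rows=65536):  # 64K rows, per Excel 2000
        self.writer = writer
        self.max_rows = max_rows
        self.count = 0

    def writerow(self, row):
        self.count += 1
        if self.count == self.max_rows + 1:  # warn once, on the first excess row
            warnings.warn("row count exceeds target application's limit")
        self.writer.writerow(row)

# Demonstrate with a tiny limit so the warning fires quickly.
buf = io.StringIO()
w = LimitWarningWriter(csv.writer(buf), max_rows=2)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    for i in range(3):
        w.writerow([i])
print(len(caught))  # warned exactly once, when row 3 was written
```

A warning rather than an exception keeps the punt option open: the data is still written in full, and the caller decides whether the limit matters.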
- Dave -- http://www.object-craft.com.au From skip at pobox.com Wed Jan 29 00:12:52 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 17:12:52 -0600 Subject: First Cut at CSV PEP In-Reply-To: <1043794249.14244.3330.camel@software1.logiplex.internal> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043794249.14244.3330.camel@software1.logiplex.internal> Message-ID: <15927.3700.803751.757376@montanaro.dyndns.org> Cliff> As an aside, does anyone have any objection to prepending [CSV] Cliff> to the subject line of our emails on this topic? Nope. I could set up a Mailman list on the Mojam server if you don't think that's too much overkill. Skip From skip at pobox.com Wed Jan 29 00:21:06 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 17:21:06 -0600 Subject: Checkin privileges Message-ID: <15927.4194.943233.439762@montanaro.dyndns.org> I sent a second note to Guido about checkin privilege to the Python repository. All except Kevin (who said anon cvs was good enough for his needs) should get access soon enough. Don't forget, use caution if you decide you need to make changes outside the csv sandbox. (I doubt any of you need reminding but figured I ought to be anal about it.) Also, if you're not already subscribed, I urge you to subscribe to python-dev. The signup page is on the Python website. It will let you know generally what's going on with the Python developer community. You'll know when releases are impending, etc. Skip From andrewm at object-craft.com.au Wed Jan 29 00:28:03 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 10:28:03 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: Message from Dave Cole of "29 Jan 2003 10:08:17 +1100." 
References: <1043790321.25139.3251.camel@software1.logiplex.internal> Message-ID: <20030128232803.C6A943C1F4@coffee.object-craft.com.au> >>> The universal readlines support in Python 2.3 may impact the use of >>> a file reader/writer when processing different text files, but >>> would returns or newlines within a field be impacted? Should the >>> PEP and API specify that the record delimiter can be either CR, LF, >>> or CR/LF, but use of those characters inside a field requires the >>> field to be quoted or an exception will be thrown? > >Interesting point. I think that newlines inside records are going to >be the same as those separating records. Anything else would be very >bizarre. You should know better than to make a statement like that where Microsoft is concerned. Excel uses a single LF within fields, but CRLF at the end of lines. If you import a field containing CRLF, the CR appears within the field as a box (the "unprintable character" symbol). -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Wed Jan 29 00:28:49 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 10:28:49 +1100 Subject: First Cut at CSV PEP In-Reply-To: References: Message-ID: >>>>> "Kevin" == Kevin Altis writes: >> From: Dave Cole >> >> >>>>> "Skip" == Skip Montanaro writes: >> >> I only have one issue with the PEP as it stands. It is still >> aiming too low. One of the things that we support in our parser is >> the ability to handle CSV without quote characters. >> >> field1,field2,field3\, field3,field4 Kevin> Excel certainly can't handle that, nor do I think Access Kevin> can. If a field contains a comma, then the field must be Kevin> quoted. Now, that isn't to say that we shouldn't be able to Kevin> support the idea of escaped characters, but when exporting if Kevin> you do want something that a tool like Excel could read, you Kevin> would need to generate an exception if quoting wasn't Kevin> specified. 
The same would probably apply for embedded newlines Kevin> in a field without quoting. Kevin> Being able to generate exceptions on import and export Kevin> operations could be one of the big benefits of this module. You Kevin> won't accidentally export something that someone on the other Kevin> end won't be able to use and you'll know on import that you Kevin> have garbage before you try and use it. For example, when I Kevin> first started trying to import Access data that was Kevin> tab-separated, I didn't realize there were embedded newlines Kevin> until much later, at which point I was able to go back and Kevin> export as CSV with quote delimiters and the data became Kevin> usable. I suppose that exporting should raise an exception if you specify any variation on the dialect in the writer function.

    csvwriter = csv.writer(file("newnastiness.csv", "w"),
                           dialect='excel2000', delimiter='"')

That should raise an exception. This probably shouldn't raise an exception though:

    csvwriter = csv.writer(file("newnastiness.csv", "w"),
                           dialect='excel2000')
    csvwriter.setparams(delimiter='"')

>> I think that we need some way to handle a potentially different set >> of options on each dialect. Kevin> I'm not real comfortable with the dialect idea, it doesn't seem Kevin> to add any value over simply specifying a separator and Kevin> delimiter. It makes things *a lot* easier for module users who are not fully conversant in the vagaries of CSV. Kevin> We aren't dealing with encodings, so anything other than 7-bit Kevin> ASCII unless specified as a delimiter or separator would be Kevin> undefined, yes? The only thing that really matters is the Kevin> delimiter and separator and then how quoting is handled of Kevin> either of those characters and embedded returns and newlines Kevin> within a field. Correct me if I'm wrong, but I don't think the Kevin> MS CSV formats can deal with embedded CR or LF unless fields Kevin> are quoted and that will be done with a " character.
We are not just trying to deal with MS CSV formats though. Kevin> Note If your workbook contains special font characters such as Kevin> a copyright symbol (C), and you will be using the converted Kevin> text file on a computer with a different operating system, save Kevin> the workbook in the text file format appropriate for that Kevin> system. For example, if you are using Windows and want to use Kevin> the text file on a Macintosh computer, save the file in the CSV Kevin> (Macintosh) format. If you are using a Macintosh computer and Kevin> want to use the text file on a system running Windows or Kevin> Windows NT, save the file in the CSV (Windows) format." Kevin> The CR, CR/LF, and LF line endings probably have something to Kevin> do with saving in Mac format, but it may also do some 8-bit Kevin> character translation. Should we be trying to handle Unicode? I think we should, since Python is now Unicode capable. Kevin> The universal readlines support in Python 2.3 may impact the Kevin> use of a file reader/writer when processing different text Kevin> files, but would returns or newlines within a field be Kevin> impacted? Should the PEP and API specify that the record Kevin> delimiter can be either CR, LF, or CR/LF, but use of those Kevin> characters inside a field requires the field to be quoted or an Kevin> exception will be thrown? Should we raise an exception or just pass the data through? If it is not a newline, then it is not a newline. - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Wed Jan 29 00:39:47 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 10:39:47 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: Message from Dave Cole of "29 Jan 2003 10:28:49 +1100." References: Message-ID: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> >I suppose that exporting should raise an exception if you specify any >variation on the dialect in the writer function.
> > csvwriter = csv.writer(file("newnastiness.csv", "w"), > dialect='excel2000', delimiter='"') > >That should raise an exception. You mean "raise an exception because the result would be ambiguous", or "raise an exception because it's not excel2000"? BTW, I don't have access to Excel 2000, only 97. I'm going to assume they're the same until proven otherwise (bad assumption, I know). -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Wed Jan 29 00:43:33 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 10:43:33 +1100 Subject: First Cut at CSV PEP In-Reply-To: <1043788652.25139.3222.camel@software1.logiplex.internal> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> On Mon, 2003-01-27 at 20:56, Dave Cole wrote: >> I only have one issue with the PEP as it stands. It is still >> aiming too low. One of the things that we support in our parser is >> the ability to handle CSV without quote characters. >> >> field1,field2,field3\, field3,field4 >> >> One of our customers has data like the above. To handle this we >> would need something like the following: >> >> # Use the 'raw' dialect to get access to all tweakables. >> writer(fileobj, dialect='raw', quotechar=None, delimiter=',', >> escapechar='\\') Cliff> +1 on escapechar, -1 on 'raw' dialect. See below. Cliff> Why would a 'raw' dialect be needed? It isn't clear to me why Cliff> escapechar would be mutually exclusive with any particular Cliff> dialect. Further, not specifying a dialect (dialect=None) Cliff> should be the default which would seem the same as 'raw'. >> I think that we need some way to handle a potentially different set >> of options on each dialect. 
Cliff> I'm not understanding how this is different from Skip's Cliff> suggestion to use Cliff> reader(fileobj, dialect="excel2000", delimiter='\t') Cliff> Or are you suggesting that not all options would be available Cliff> on all dialects? Can you suggest an example? I think it is important to keep in mind the users of the module who are not expert in the various dialects of CSV. If presented with a flat list of all options supported they are going to engage in a fair amount of head scratching. If we try to make things easier for users by mirroring the options that their application presents then they are going to have a much easier time working out how to use the module for their specific problem. By limiting the available options based upon the dialect specified by the user we will be doing them a favour. The point of the 'raw' dialect is to expose the full capabilities of the raw parser. Maybe we should use None rather than 'raw'. >> When you CSV export from Excel, do you have the ability to use a >> delimiter other than comma? Do you have the ability to change the >> quotechar? Cliff> I think it is an option to save as a TSV file (IIRC), which is Cliff> the same as a CSV file, but with tabs. Hmm... What would be the best way to handle Excel TSV? Maybe a new dialect 'excel-tsv'? >> Should the wrapper protect you from yourself so that when you >> select the Excel dialect you are limited to the options available >> within Excel? Cliff> No. I think this would be unnecessarily limiting. I am not saying that the wrapper should absolutely prevent someone from using options not available in the application. If you want to break the dialect then maybe it should be a two step process.
    csvwriter = csv.writer(file("newnastiness.csv", "w"),
                           dialect='excel2000')
    csvwriter.setparams(delimiter='"')

- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Wed Jan 29 00:59:44 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 10:59:44 +1100 Subject: First Cut at CSV PEP In-Reply-To: <15926.64576.481489.373053@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Kevin> Probably need to specify that input and output deals with Kevin> string representations, but there are some differences: Kevin> [[5,'Bob',None,1.0]] Kevin> DSV.exportCSV produces Kevin> '5,Bob,None,1.0' Skip> I'm not so sure this mapping None to "None" on output is such a Skip> good idea because it's not reversible in all situations and Skip> hurts portability to other systems (e.g., does Excel have a Skip> concept of None? what happens if you have a text field which Skip> just happens to contain "None"?). I think that None should always be written as a zero-length field, and always read as the field value 'None'. Skip> I think we need to limit the data which can be output to Skip> strings, Unicode strings (if we use an encoded stream), floats Skip> and ints. Anything else should raise TypeError. Is there any merit in having the writer handle non-string data by producing an empty field for None, and the result of PyObject_Str() for all other values? Skip> I like my CSV files to be fully quoted (even fields which may Skip> contain numbers), largely because it makes later (dangerous) Skip> matching using regular expressions simpler. Otherwise I wind up Skip> having to make all the quotes in the regular expressions Skip> optional. It just complicates things. That raises another implementation issue. If you export from Excel, does it always quote fields? If not then the default dialect behaviour should not unconditionally quote fields.
We could/should support mandatoryquote as a writer option. I am going to spend some time tonight seeing if I can fold all of my ideas into the PEP so you can all poke holes in it. - Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Wed Jan 29 01:02:15 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:02:15 +1100 Subject: First Cut at CSV PEP In-Reply-To: <1043792044.14244.3280.camel@software1.logiplex.internal> References: <1043792044.14244.3280.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> It's also debatable whether the numbers should have been Cliff> returned as strings or numbers. I lean towards the former, as Cliff> csv is a text format and can't convey this sort of information Cliff> by itself, which is why I chose to return only strings, Cliff> including the empty string for an empty field rather than None. Cliff> I agree with Kevin that this is best left to application logic Cliff> rather than the module. Yes. - Dave -- http://www.object-craft.com.au From LogiplexSoftware at earthlink.net Wed Jan 29 01:03:45 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 16:03:45 -0800 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: References: Message-ID: <1043798625.25139.3395.camel@software1.logiplex.internal> On Tue, 2003-01-28 at 15:28, Dave Cole wrote: > I suppose that exporting should raise an exception if you specify any > variation on the dialect in the writer function. > > csvwriter = csv.writer(file("newnastiness.csv", "w"), > dialect='excel2000', delimiter='"') > > That should raise an exception. I still don't see a good reason for this. The programmer asked for it, let her do it. I don't see a problem with letting the programmer shoot herself in the foot, as long as the gun doesn't start out pointing at it. 
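For what it's worth, both quoting behaviours being debated ended up as writer options in the csv module that eventually shipped, and that writer also adopts the None-as-empty-field idea; a sketch with that later API, not the one being designed in this thread:

```python
import csv
import io

row = [5, 'Bob', None, 1.0]

# Default (QUOTE_MINIMAL) matches Excel's quote-only-when-needed export.
minimal = io.StringIO()
csv.writer(minimal).writerow(row)

# QUOTE_ALL is the "mandatory quote" style Skip prefers: every field,
# numbers included, comes out quoted.
full = io.StringIO()
csv.writer(full, quoting=csv.QUOTE_ALL).writerow(row)

print(repr(minimal.getvalue()))  # '5,Bob,,1.0\r\n' -- None becomes an empty field
print(repr(full.getvalue()))     # '"5","Bob","","1.0"\r\n'
```

Making the style a per-writer option, as suggested above, is exactly how the programmer gets to choose between the two.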
> This probably shouldn't raise an exception though: > > csvwriter = csv.writer(file("newnastiness.csv", "w"), > dialect='excel2000') > csvwriter.setparams(delimiter='"') While this provides a workaround, it also seems a bit non-obvious why this should work when passing delimiter as an argument raises an exception. I'm not dead-set against it, it's JMHO. > >> I think that we need some way to handle a potentially different set > >> of options on each dialect. > > Kevin> I'm not real comfortable with the dialect idea, it doesn't seem > Kevin> to add any value over simply specifying a separator and > Kevin> delimiter. > > It makes things *a lot* easier for module users who are not fully > conversant in the vagaries of CSV. I agree. > Kevin> The CR, CR/LF, and LF line endings probably have something to > Kevin> do with saving in Mac format, but it may also do some 8-bit > Kevin> character translation. > > Should we be trying to handle Unicode? I think we should, since Python > is now Unicode capable. What issues is Unicode support going to raise? > Kevin> The universal readlines support in Python 2.3 may impact the > Kevin> use of a file reader/writer when processing different text > Kevin> files, but would returns or newlines within a field be > Kevin> impacted? Should the PEP and API specify that the record > Kevin> delimiter can be either CR, LF, or CR/LF, but use of those > Kevin> characters inside a field requires the field to be quoted or an > Kevin> exception will be thrown? > > Should we raise an exception or just pass the data through? > > If it is not a newline, then it is not a newline. This seems like a particularly intractable problem. If a file can't decide what sort of newlines it is going to use, then I'm not convinced it's the parser's problem. So the question becomes whether to raise an exception or pass the data through.
The two things to consider in this case are:

1) The data might be correct, in which case it should be passed through

2) The target for the data might be someone's mission-critical SQL server and we don't want to help them mung up their data. An exception would seem appropriate.

Frankly, I think I lean towards an exception on this one. There are enough text-processing tools available (dos2unix and kin) that someone should be able to pre-process a CSV file that is raising exceptions and get it into a form acceptable to the parser. A little work up front is far more acceptable than putting out a fire on someone's database. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From djc at object-craft.com.au Wed Jan 29 01:03:52 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:03:52 +1100 Subject: First Cut at CSV PEP In-Reply-To: <15927.3700.803751.757376@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043794249.14244.3330.camel@software1.logiplex.internal> <15927.3700.803751.757376@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Cliff> As an aside, does anyone have any objection to prepending [CSV] Cliff> to the subject line of our emails on this topic? Skip> Nope. I could set up a Mailman list on the Mojam server if you Skip> don't think that's too much overkill. Do it. We can then use URLs to old messages.
- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Wed Jan 29 01:04:32 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:04:32 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <20030128232803.C6A943C1F4@coffee.object-craft.com.au> References: <1043790321.25139.3251.camel@software1.logiplex.internal> <20030128232803.C6A943C1F4@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >>>> The universal readlines support in Python 2.3 may impact the use >>>> of a file reader/writer when processing different text files, but >>>> would returns or newlines within a field be impacted? Should the >>>> PEP and API specify that the record delimiter can be either CR, >>>> LF, or CR/LF, but use of those characters inside a field requires >>>> the field to be quoted or an exception will be thrown? >> Interesting point. I think that newlines inside records are going >> to be the same as those separating records. Anything else would be >> very bizarre. Andrew> You should know better than to make a statement like that Andrew> where Microsoft is concerned. Excel uses a single LF within Andrew> fields, but CRLF at the end of lines. If you import a field Andrew> containing CRLF, the CR appears within the field as a box (the Andrew> "unprintable character" symbol). Touche :-) - Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Wed Jan 29 01:07:18 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:07:18 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> References: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >> I suppose that exporting should raise an exception if you specify >> any variation on the dialect in the writer function. 
>> >> csvwriter = csv.writer(file("newnastiness.csv", "w"), >> dialect='excel2000', delimiter='"') >> >> That should raise an exception. Andrew> You mean "raise an exception because the result would be Andrew> ambiguous", or "raise an exception because it's not Andrew> excel2000"? Because it is not 'excel2000'. Andrew> BTW, I don't have access to Excel 2000, only 97. I'm going to Andrew> assume they're the same until proven otherwise (bad Andrew> assumption, I know). This is a prime example of why we should support dialects. - Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Wed Jan 29 01:08:16 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:08:16 +1100 Subject: [CSV] Number of lines in CSV files In-Reply-To: <1043795476.25146.3351.camel@software1.logiplex.internal> References: <15925.58225.712028.494438@montanaro.dyndns.org> <1043795476.25146.3351.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> Another thing that just occurred to me is that Excel has Cliff> historically been limited in the number of rows and columns Cliff> that it can import. This number has increased with recent Cliff> versions (I think it was 32K lines in Excel 97, Kevin informs Cliff> me it's 64K in Excel 2000). Cliff> Since export will be a feature of the CSV module, should we Cliff> have some sort of warning or raise an exception when exporting Cliff> data larger than the target application can handle, or should Cliff> we just punt on this? Arrrgggg. My brain just dribbled out of my ears... - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Wed Jan 29 01:08:16 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 11:08:16 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: Message from Dave Cole of "29 Jan 2003 10:43:33 +1100." 
References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> Message-ID: <20030129000816.2C9153C1F4@coffee.object-craft.com.au> >I think it is important to keep in mind the users of the module who >are not expert in the various dialects of CSV. If presented with a >flat list of all options supported they are going to engage in a fair >amount of head scratching. > >If we try to make things easier for users by mirroring the options >that their application presents then they are going to have a much >easier time working out how to use the module for their specific >problem. By limiting the available options based upon the dialect >specified by the user we will be doing them a favour. > >The point of the 'raw' dialect is to expose the full capabilities of >the raw parser. Maybe we should use None rather than 'raw'. My feeling is that this simply changes the shape of the complexity without really helping. I think we should just stick with the "a dialect is a set of defaults" idea. >Hmm... What would be the best way to handle Excel TSV. Maybe a new >dialect 'excel-tsv'? When saving, Excel97 calls this "Text (Tab delimited)", so maybe "excel-tab" would be clear enough. CSV is "CSV (Comma delimited)". On import, it seems to just guess what the file is - I couldn't see a way under Excel97 to specify. >I am not saying that the wrapper should absolutely prevent someone >from using options not available in the application. If you want to >break the dialect then maybe it should be a two step process. > > csvwriter = csv.writer(file("newnastiness.csv", "w"), > dialect='excel2000') > csvwriter.setparams(delimiter='"') This strikes me as B&D. 
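Andrew's "a dialect is a set of defaults" idea could be sketched roughly as follows. This is a hypothetical illustration only: the `DIALECTS` table, `make_params`, and the option names are made-up stand-ins, not the proposed API.

```python
# Hypothetical sketch of "a dialect is a set of defaults": each dialect
# names a bundle of default options, and per-call keyword arguments
# simply override them. All names here are illustrative.
DIALECTS = {
    "excel": {"delimiter": ",", "quotechar": '"', "lineterminator": "\r\n"},
    "excel-tab": {"delimiter": "\t", "quotechar": '"', "lineterminator": "\r\n"},
}

def make_params(dialect="excel", **overrides):
    """Merge caller overrides on top of the dialect's defaults."""
    params = dict(DIALECTS[dialect])
    params.update(overrides)
    return params
```

Under this reading, passing dialect='excel-tab' together with delimiter='|' would simply mean "excel-tab defaults, but with a pipe delimiter", rather than being an error.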
-- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Wed Jan 29 01:15:44 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 11:15:44 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: Message from Dave Cole of "29 Jan 2003 11:07:18 +1100." References: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> Message-ID: <20030129001544.0FD133C1F4@coffee.object-craft.com.au> >>> That should raise an exception. > >Andrew> You mean "raise an exception because the result would be >Andrew> ambiguous", or "raise an exception because it's not >Andrew> excel2000"? 
> >Because it is not 'excel2000'. I don't like it, as I mentioned in my previous e-mail. Excel (97, at least) doesn't let you tweak and tune, so *any* non-default settings are "not excel". A better idea would be to have the dialect turn on "strict_blah" if it's thought necessary. But we still need to raise exceptions on nonsense formats (like using quote as a field separator while also using it as the quote character). >Andrew> BTW, I don't have access to Excel 2000, only 97. I'm going to >Andrew> assume they're the same until proven otherwise (bad >Andrew> assumption, I know). > >This is a prime example of why we should support dialects. And every dialect should be supported by a wad of tests... 8-) -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From LogiplexSoftware at earthlink.net Wed Jan 29 01:15:49 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 16:15:49 -0800 Subject: [CSV] Number of lines in CSV files In-Reply-To: References: <15925.58225.712028.494438@montanaro.dyndns.org> <1043795476.25146.3351.camel@software1.logiplex.internal> Message-ID: <1043799349.25146.3400.camel@software1.logiplex.internal> On Tue, 2003-01-28 at 16:08, Dave Cole wrote: > >>>>> "Cliff" == Cliff Wells writes: > > Cliff> Another thing that just occurred to me is that Excel has > Cliff> historically been limited in the number of rows and columns > Cliff> that it can import. This number has increased with recent > Cliff> versions (I think it was 32K lines in Excel 97, Kevin informs > Cliff> me it's 64K in Excel 2000). > > Cliff> Since export will be a feature of the CSV module, should we > Cliff> have some sort of warning or raise an exception when exporting > Cliff> data larger than the target application can handle, or should > Cliff> we just punt on this? > > Arrrgggg. My brain just dribbled out of my ears... So, +1 on punt? Actually I think this particular aspect would be fairly simple to handle. 
Another attribute of a dialect could be sizelimits = (maxrows, maxcols) and set to (None, None) if the programmer doesn't care or just wants to bypass that check. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From djc at object-craft.com.au Wed Jan 29 01:15:56 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:15:56 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <20030129000816.2C9153C1F4@coffee.object-craft.com.au> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >> I think it is important to keep in mind the users of the module who >> are not expert in the various dialects of CSV. If presented with a >> flat list of all options supported they are going to engage in a >> fair amount of head scratching. >> >> If we try to make things easier for users by mirroring the options >> that their application presents then they are going to have a much >> easier time working out how to use the module for their specific >> problem. By limiting the available options based upon the dialect >> specified by the user we will be doing them a favour. >> >> The point of the 'raw' dialect is to expose the full capabilities >> of the raw parser. Maybe we should use None rather than 'raw'. Andrew> My feeling is that this simply changes the shape of the Andrew> complexity without really helping. Andrew> I think we should just stick with the "a dialect is a set of Andrew> defaults" idea. Fair enough. Instead of limiting the tweakable options by raising an exception we could have an interface which allowed the user to query the options normally associated with a dialect. >> Hmm... What would be the best way to handle Excel TSV. Maybe a >> new dialect 'excel-tsv'? 
Andrew> When saving, Excel97 calls this "Text (Tab delimited)", so Andrew> maybe "excel-tab" would be clear enough. CSV is "CSV (Comma Andrew> delimited)". Yup. Andrew> On import, it seems to just guess what the file is - I Andrew> couldn't see a way under Excel97 to specify. Some kind of sniffing going on. Should we have a sniffer in the module? >> I am not saying that the wrapper should absolutely prevent someone >> from using options not available in the application. If you want >> to break the dialect then maybe it should be a two step process. >> >> csvwriter = csv.writer(file("newnastiness.csv", "w"), dialect='excel2000') >> csvwriter.setparams(delimiter='"') Andrew> This strikes me as B&D. Just what are you trying to imply by that? :-) - Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Wed Jan 29 01:24:17 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:24:17 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <1043798625.25139.3395.camel@software1.logiplex.internal> References: <1043798625.25139.3395.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> On Tue, 2003-01-28 at 15:28, Dave Cole wrote: >> I suppose that exporting should raise an exception if you specify >> any variation on the dialect in the writer function. >> >> csvwriter = csv.writer(file("newnastiness.csv", "w"), dialect='excel2000', delimiter='"') >> >> That should raise an exception. Cliff> I still don't see a good reason for this. The programmer asked Cliff> for it, let her do it. I don't see a problem with letting the Cliff> programmer shoot herself in the foot, as long as the gun Cliff> doesn't start out pointing at it. 
>> This probably shouldn't raise an exception though: >> >> csvwriter = csv.writer(file("newnastiness.csv", "w"), dialect='excel2000') >> csvwriter.setparams(delimiter='"') Cliff> While this provides a workaround, it also seems a bit Cliff> non-obvious why this should work when passing delimiter as an Cliff> argument raises an exception. I'm not dead-set against it, it's Cliff> JMHO. I think you are right - it is a bad idea in retrospect. Kevin> The CR, CR/LF, and LF line endings probably have something to Kevin> do with saving in Mac format, but it may also do some 8-bit Kevin> character translation. >> Should we be trying to handle unicode? I think we should since >> Python is now unicode capable. Cliff> What issues is unicode support going to raise? The low level parser (C code) is probably going to need to handle unicode. >> If it is not a newline, then it is not a newline. Cliff> This seems like a particularly intractable problem. If a file Cliff> can't decide what sort of newlines it is going to use, then I'm Cliff> not convinced it's the parser's problem. Cliff> So the question becomes whether to except or pass through. The Cliff> two things to consider in this case are: Cliff> 1) The data might be correct, in which case it should be passed Cliff> through 2) The target for the data might be someone's Cliff> mission-critical SQL server and we don't want to help them mung Cliff> up their data. An exception would seem appropriate. Cliff> Frankly, I think I lean towards an exception on this one. Cliff> There are enough text-processing tools available (dos2unix and Cliff> kin) that someone should be able to pre-process a CSV file that Cliff> is raising exceptions and get it into a form acceptable to the Cliff> parser. A little work up front is far more acceptable than Cliff> putting out a fire on someone's database. Should the reader have an option which turns on universal newline mode? 
This would allow for both behaviours - if a non-conforming newline is encountered while not in universal newline mode then an exception would be raised. According to Andrew's previous message the default setting for Excel97 would be universal newline mode turned on. - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Wed Jan 29 01:25:46 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 11:25:46 +1100 Subject: [CSV] Number of lines in CSV files In-Reply-To: Message from Cliff Wells of "28 Jan 2003 16:15:49 -0800." <1043799349.25146.3400.camel@software1.logiplex.internal> References: <15925.58225.712028.494438@montanaro.dyndns.org> <1043795476.25146.3351.camel@software1.logiplex.internal> <1043799349.25146.3400.camel@software1.logiplex.internal> Message-ID: <20030129002546.493EB3C1F4@coffee.object-craft.com.au> >So, +1 on punt? +1 on punt from me. >Actually I think this particular aspect would be fairly simple to >handle. Another attribute of a dialect could be sizelimits = (maxrows, >maxcols) and set to (None, None) if the programmer doesn't care or just >wants to bypass that check. Kitchen sink - we'll end up making the dialects too specific for the user to be able to choose ("do I have Excel2000 with SP2 applied, or..."). I bet it even varies by region of the world (for example, the Chinese edition probably has different limits). I have a sneaking suspicion that Excel's CSV parsing code is reasonably stable - they're probably not game to make changes now that it mostly works. We might find that dialect="excel" is good enough. 
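The universal-newline reader option Dave floats above might behave like this sketch (illustrative names only, not the PEP's API): in universal mode any of CR, LF or CRLF ends a record, while in strict mode a newline that doesn't match the configured terminator raises an exception.

```python
import re

class CSVError(Exception):
    """Raised on a non-conforming line terminator in strict mode."""

def split_records(data, universal_newlines=True, lineterminator="\r\n"):
    # Universal mode: accept CR, LF or CRLF as a record terminator.
    if universal_newlines:
        return re.split(r"\r\n|\r|\n", data)
    # Strict mode: any stray CR/LF outside the expected terminator is an error.
    stripped = data.replace(lineterminator, "\x00")
    if "\r" in stripped or "\n" in stripped:
        raise CSVError("non-conforming line terminator in input")
    return stripped.split("\x00")
```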
-- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Wed Jan 29 01:28:45 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:28:45 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <20030129001544.0FD133C1F4@coffee.object-craft.com.au> References: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> <20030129001544.0FD133C1F4@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >>>> That should raise an exception. >> Andrew> You mean "raise an exception because the result would be Andrew> ambiguous", or "raise an exception because it's not Andrew> excel2000"? >> Because it is not 'excel2000'. Andrew> I don't like it, as I mentioned in my previous e-mail. Excel Andrew> (97, at least) doesn't let you tweak and tune, so *any* Andrew> non-default settings are "not excel". Andrew> A better idea would be to have the dialect turn on Andrew> "strict_blah" if it's thought necessary. Probably not. I now think that my original idea was a bad one. Andrew> But we still need to raise exceptions on nonsense formats Andrew> (like using quote as a field separator while also using it as Andrew> the quote character). Yup. Andrew> BTW, I don't have access to Excel 2000, only 97. I'm going to Andrew> assume they're the same until proven otherwise (bad Andrew> assumption, I know). >> This is a prime example of why we should support dialects. Andrew> And every dialect should be supported by a wad of tests... 8-) We need to have a torture test suite (which is manually run against an application) with which to expose the options which apply to a dialect. The results of the torture test then are set in stone as a regression test for that dialect. 
- Dave -- http://www.object-craft.com.au From LogiplexSoftware at earthlink.net Wed Jan 29 01:28:46 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 16:28:46 -0800 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> Message-ID: <1043800126.25139.3411.camel@software1.logiplex.internal> On Tue, 2003-01-28 at 16:15, Dave Cole wrote: > >>>>> "Andrew" == Andrew McNamara writes: > > >> I think it is important to keep in mind the users of the module who > >> are not expert in the various dialects of CSV. If presented with a > >> flat list of all options supported they are going to engage in a > >> fair amount of head scratching. > >> > >> If we try to make things easier for users by mirroring the options > >> that their application presents then they are going to have a much > >> easier time working out how to use the module for their specific > >> problem. By limiting the available options based upon the dialect > >> specified by the user we will be doing them a favour. > >> > >> The point of the 'raw' dialect is to expose the full capabilities > >> of the raw parser. Maybe we should use None rather than 'raw'. > > Andrew> My feeling is that this simply changes the shape of the > Andrew> complexity without really helping. > > Andrew> I think we should just stick with the "a dialect is a set of > Andrew> defaults" idea. > > Fair enough. Whew. > > Instead of limiting the tweakable options by raising an exception we > could have an interface which allowed the user to query the options > normally associated with a dialect. > > >> Hmm... What would be the best way to handle Excel TSV. Maybe a > >> new dialect 'excel-tsv'? So are we leaning towards dialects being done as simple classes? 
Will 'excel-tsv' simply be defined as

class excel_tsv(excel_2000):
    delimiter = '\t'

with a dictionary for lookup:

settings = { 'excel-tsv': excel_tsv,
             'excel-2000': excel_2000, }

? > Andrew> When saving, Excel97 calls this "Text (Tab delimited)", so > Andrew> maybe "excel-tab" would be clear enough. CSV is "CSV (Comma > Andrew> delimited)". > > Yup. > > Andrew> On import, it seems to just guess what the file is - I > Andrew> couldn't see a way under Excel97 to specify. > > Some kind of sniffing going on. > > Should we have a sniffer in the module? This hasn't been brought up, but of course one of the major selling points of DSV is the "sniffing" code. However, I think I'm with Dave on having another layer (CSVutils) that would contain this sort of thing. > >> I am not saying that the wrapper should absolutely prevent someone > >> from using options not available in the application. If you want > >> to break the dialect then maybe it should be a two step process. > >> > >> csvwriter = csv.writer(file("newnastiness.csv", "w"), dialect='excel2000') > >> csvwriter.setparams(delimiter='"') > > Andrew> This strikes me as B&D. > > Just what are you trying to imply by that? :-) We should probably leave people's personal issues out of this ;) -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From altis at semi-retired.com Wed Jan 29 01:31:56 2003 From: altis at semi-retired.com (Kevin Altis) Date: Tue, 28 Jan 2003 16:31:56 -0800 Subject: [CSV] RE: Number of lines in CSV files In-Reply-To: <1043795476.25146.3351.camel@software1.logiplex.internal> Message-ID: > From: Cliff Wells > > Another thing that just occurred to me is that Excel has historically > been limited in the number of rows and columns that it can import. This > number has increased with recent versions (I think it was 32K lines in > Excel 97, Kevin informs me it's 64K in Excel 2000). 
> > Since export will be a feature of the CSV module, should we have some > sort of warning or raise an exception when exporting data larger than > the target application can handle, or should we just punt on this? +1 on punt The user may not actually be trying to import into Excel, they may be using Access, later versions of Excel might support more rows, whatever. Plus, Excel still imports the data, it just can't deal with more than 64K rows in Excel 2000. Now we could very well have some stats generated, maybe as a separate function if someone wanted to know all the gritty details of which columns of which rows contained embedded newlines, escaped characters, which rows had an odd number of columns, total number of rows, whatever. Sort of a CSV verifier if you will. ka From andrewm at object-craft.com.au Wed Jan 29 01:38:07 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 11:38:07 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: Message from Cliff Wells of "28 Jan 2003 16:28:46 -0800." <1043800126.25139.3411.camel@software1.logiplex.internal> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> <1043800126.25139.3411.camel@software1.logiplex.internal> Message-ID: <20030129003807.7185E3C1F4@coffee.object-craft.com.au> >So are we leaning towards dialects being done as simple classes? Will >'excel-tsv' simply be defined as > >class excel_tsv(excel_2000): > delimiter = '\t' > >with a dictionary for lookup: > >settings = { 'excel-tsv': excel_tsv, > 'excel-2000': excel_2000, } That seems reasonable. +1 The classes should be exposed by the module, however, so the application can subclass if need be (or just refer to the classes directly, rather than going via the str->class mapping). >This hasn't been brought up, but of course one of the major selling >points of DSV is the "sniffing" code. 
However, I think I'm with Dave on >having another layer (CSVutils) that would contain this sort of thing. Yep, +1 from me. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From LogiplexSoftware at earthlink.net Wed Jan 29 01:38:15 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 16:38:15 -0800 Subject: [CSV] Number of lines in CSV files In-Reply-To: <20030129002546.493EB3C1F4@coffee.object-craft.com.au> References: <15925.58225.712028.494438@montanaro.dyndns.org> <1043795476.25146.3351.camel@software1.logiplex.internal> <1043799349.25146.3400.camel@software1.logiplex.internal> <20030129002546.493EB3C1F4@coffee.object-craft.com.au> Message-ID: <1043800695.14244.3420.camel@software1.logiplex.internal> On Tue, 2003-01-28 at 16:25, Andrew McNamara wrote: > >So, +1 on punt? > > +1 on punt from me. > > >Actually I think this particular aspect would be fairly simple to > >handle. Another attribute of a dialect could be sizelimits = (maxrows, > >maxcols) and set to (None, None) if the programmer doesn't care or just > >wants to bypass that check. > > Kitchen sink - we'll end up making the dialects too specific for the user > to be able to choose ("do I have Excel2000 with SP2 applied, or..."). > I bet it even varies by region of the world (for example, the Chinese > edition probably has different limits). What do you mean by "kitchen sink"? Are you saying that CSV shouldn't have an option to play tetris while the file is loading? This is going to disappoint a lot of emacs users. Okay, +1 on punting file size. Unless anyone else cares to argue it I suppose we'll leave it out. > I have a sneaking suspicion that Excel's CSV parsing code is reasonably > stable - they're probably not game to make changes now that it mostly > works. We might find that dialect="excel" is good enough. Probably. This can be fixed via bug reports (and dialects added) if that changes. 
-- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From altis at semi-retired.com Wed Jan 29 01:39:34 2003 From: altis at semi-retired.com (Kevin Altis) Date: Tue, 28 Jan 2003 16:39:34 -0800 Subject: [CSV] RE: First Cut at CSV PEP In-Reply-To: Message-ID: > From: Dave Cole > > >>>>> "Kevin" == Kevin Altis writes: > > Kevin> The big issue with the MS/Excel CSV format is that MS doesn't > Kevin> appear to escape any characters or support import of escaped > Kevin> characters. A field that contains characters that you might > Kevin> normally escape (including a comma if that is the separator) > Kevin> are instead enclosed in double quotes by default and then any > Kevin> double quotes in the field are doubled. > > I thought that we were trying to build a CSV parser which would deal > with different dialects, not just what Excel does. Am I wrong making > that assumption? > > If we were to only target Excel our task would be much easier. > > I think that we should be trying to come up with an engine wrapped by > a friendly API which can be made more powerful over time in order to > parse more and more dialects. Agreed, certainly support more than just Excel. I think I understand the dialects thing now. Last night I was getting rubbed the wrong way by specifying the dialect and then also allowing the specification of delimiter, quote character, etc. in the same line. I like the idea of using a dialect and then changing the properties in separate calls. I suppose there is a good reason that each dialect isn't just a subclass, if so, the reasoning for using dialects instead of subclasses of a parser might be called out in the PEP. I can go with it either way. I would be tempted to call what is currently Excel2000, MSCSV or ExcelCSV. 
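The subclass approach Kevin wonders about is essentially what Cliff sketched earlier. Spelled out in full (the attribute values for Excel here are assumptions for illustration, not verified behaviour), it might look like:

```python
# Sketch of dialects as simple attribute-holding classes: each dialect is a
# bag of parser settings, and a variant just subclasses and tweaks one value.
class excel:
    delimiter = ","
    quotechar = '"'
    lineterminator = "\r\n"

class excel_tab(excel):
    # Excel97's "Text (Tab delimited)" save format: CSV rules, tab-separated.
    delimiter = "\t"

# String -> class lookup so a call like writer(f, dialect="excel-tab") can
# resolve the name; exposing the classes lets applications subclass further.
dialects = {
    "excel": excel,
    "excel-tab": excel_tab,
}
```

One attraction of this layout is the point Andrew raised: applications can bypass the string mapping entirely and subclass a dialect class directly when they need a small variation.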
ka From djc at object-craft.com.au Wed Jan 29 01:47:01 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:47:01 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <1043800126.25139.3411.camel@software1.logiplex.internal> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> <1043800126.25139.3411.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: >> Instead of limiting the tweakable options by raising an exception >> we could have an interface which allowed the user to query the >> options normally associated with a dialect. >> >> >> Hmm... What would be the best way to handle Excel TSV. Maybe a >> >> new dialect 'excel-tsv'? Cliff> So are we leaning towards dialects being done as simple Cliff> classes? Will 'excel-tsv' simply be defined as Cliff> class excel_tsv(excel_2000): Cliff> delimiter = '\t' Cliff> with a dictionary for lookup: Cliff> settings = { 'excel-tsv': excel_tsv, Cliff> 'excel-2000': excel_2000, Cliff> } Dunno yet. Here we go again with a potentially bad idea... I think that there are two things we need to have for each dialect; a set of low level parser configuration, and a set of user tweakables (which correspond to options presented by the application). The set of user tweakables may not necessarily map one-to-one with low level parser configuration items. How would we do this in Python? >> Should we have a sniffer in the module? Cliff> This hasn't been brought up, but of course one of the major Cliff> selling points of DSV is the "sniffing" code. However, I think Cliff> I'm with Dave on having another layer (CSVutils) that would Cliff> contain this sort of thing. Any sniffer would have to be able to traverse the set of dialects implemented in the CSV module and look inside them to understand which options are available to a dialect. 
- Dave -- http://www.object-craft.com.au From LogiplexSoftware at earthlink.net Wed Jan 29 01:47:20 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 16:47:20 -0800 Subject: [CSV] RE: Number of lines in CSV files In-Reply-To: References: Message-ID: <1043801240.25139.3429.camel@software1.logiplex.internal> On Tue, 2003-01-28 at 16:31, Kevin Altis wrote: > > From: Cliff Wells > > > > Another thing that just occurred to me is that Excel has historically > > been limited in the number of rows and columns that it can import. This > > number has increased with recent versions (I think it was 32K lines in > > Excel 97, Kevin informs me it's 64K in Excel 2000). > > > > Since export will be a feature of the CSV module, should we have some > > sort of warning or raise an exception when exporting data larger than > > the target application can handle, or should we just punt on this? > > +1 on punt > > The user may not actually be trying to import into Excel, they may be using > Access, later versions of Excel might support more rows, whatever. Plus, > Excel still imports the data, it just can't deal with more than 64K rows in > Excel 2000. I guess we need to decide what we mean by "dialect": do we mean "this data _will_ import into this application" or do we mean "this data will be written in a format this application can understand, but might not necessarily be able to use"? 
-- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From djc at object-craft.com.au Wed Jan 29 01:48:46 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 11:48:46 +1100 Subject: [CSV] RE: Number of lines in CSV files In-Reply-To: <1043801240.25139.3429.camel@software1.logiplex.internal> References: <1043801240.25139.3429.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> On Tue, 2003-01-28 at 16:31, Kevin Altis wrote: >> > From: Cliff Wells >> > >> > Another thing that just occurred to me is that Excel has >> historically > been limited in the number of rows and columns that >> it can import. This > number has increased with recent versions (I >> think it was 32K lines in > Excel 97, Kevin informs me it's 64K in >> Excel 2000). >> > >> > Since export will be a feature of the CSV module, should we have >> some > sort of warning or raise an exception when exporting data >> larger than > the target application can handle, or should we just >> punt on this? >> >> +1 on punt >> >> The user may not actually be trying to import into Excel, they may >> be using Access, later versions of Excel might support more rows, >> whatever. Plus, Excel still imports the data, it just can't deal >> with more than 64K rows in Excel 2000. Cliff> I guess we need to decide what we mean by "dialect": do we mean Cliff> "this data _will_ import into this application" or do we mean Cliff> "this data will be written in a format this application can Cliff> understand, but might not necessarily be able to use"? I vote for the "this data will be written in a format this application can understand, but might not necessarily be able to use". We can always supplement the code with documentation. 
- Dave -- http://www.object-craft.com.au From LogiplexSoftware at earthlink.net Wed Jan 29 02:11:33 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 28 Jan 2003 17:11:33 -0800 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> <1043800126.25139.3411.camel@software1.logiplex.internal> Message-ID: <1043802693.25139.3445.camel@software1.logiplex.internal> On Tue, 2003-01-28 at 16:47, Dave Cole wrote: > >>>>> "Cliff" == Cliff Wells writes: > > >> Instead of limiting the tweakable options by raising an exception > >> we could have an interface which allowed the user to query the > >> options normally associated with a dialect. > >> > >> >> Hmm... What would be the best way to handle Excel TSV. Maybe a > >> >> new dialect 'excel-tsv'? > > Cliff> So are we leaning towards dialects being done as simple > Cliff> classes? Will 'excel-tsv' simply be defined as > > Cliff> class excel_tsv(excel_2000): > Cliff> delimiter = '\t' > > Cliff> with a dictionary for lookup: > > Cliff> settings = { 'excel-tsv': excel_tsv, > Cliff> 'excel-2000': excel_2000, > Cliff> } > > Dunno yet. > > Here we go again with a potentially bad idea... > > I think that there are two things we need to have for each dialect; a > set of low level parser configuration, and a set of user tweakables > (which correspond to options presented by the application). The set > of user tweakables may not necessarily map one-to-one with low level > parser configuration items. Can you give examples? I suppose you are referring to things like CR/LF translation and spaces around quotes as being low-level parser configurations and things like delimiters being user-tweakable? > > How would we do this in Python? > > >> Should we have a sniffer in the module? 
> > Cliff> This hasn't been brought up, but of course one of the major > Cliff> selling points of DSV is the "sniffing" code. However, I think > Cliff> I'm with Dave on having another layer (CSVutils) that would > Cliff> contain this sort of thing. > > Any sniffer would have to be able to traverse the set of dialects > implemented in the CSV module and look inside them to understand > which options are available to a dialect. Maybe. Currently the sniffing code in DSV just makes a best guess regarding delimiters, text qualifiers and headers. Certainly the dialects could be used to improve its guess (most likely when the sniffed results are ambiguous or fail). Using dialects on import is of less importance if sniffing code is used. They are two different approaches to the same problem. If the user specifies the file as Excel compatible, then sniffing seems rather redundant, further, if the file is sniffed and the format discovered, it doesn't seem important which dialect it matches, as long as we are able to use the sniffed parameters to parse it. 
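[Cliff's "best guess" sniffing, and the per-line statistics Andrew suggests elsewhere in this thread, can be sketched in a few lines of Python. This is an illustrative toy only, not the DSV sniffer; `guess_delimiter` and the candidate set are invented for the example:

```python
from collections import Counter

def guess_delimiter(sample_lines, candidates=",\t;|"):
    """Guess the delimiter whose per-line count is most consistent."""
    best, best_score = None, -1.0
    for delim in candidates:
        counts = [line.count(delim) for line in sample_lines]
        if not counts or min(counts) == 0:
            continue  # candidate missing from some line: unlikely delimiter
        # consistency = fraction of lines sharing the most common count
        score = Counter(counts).most_common(1)[0][1] / len(counts)
        if score > best_score:
            best, best_score = delim, score
    return best

print(guess_delimiter(["a,b,c", "1,2,3", "4,5,6"]))  # ,
```

A real sniffer would also have to weigh quote characters and header rows, as DSV does.]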
-- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From djc at object-craft.com.au Wed Jan 29 02:21:42 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 12:21:42 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <1043802693.25139.3445.camel@software1.logiplex.internal> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> <1043800126.25139.3411.camel@software1.logiplex.internal> <1043802693.25139.3445.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> On Tue, 2003-01-28 at 16:47, Dave Cole wrote: >> >>>>> "Cliff" == Cliff Wells >> writes: >> >> >> Instead of limiting the tweakable options by raising an >> exception >> we could have an interface which allowed the user to >> query the >> options normally associated with a dialect. >> >> >> >> >> Hmm... What would be the best way to handle Excel TSV. >> Maybe a >> >> new dialect 'excel-tsv'? >> Cliff> So are we leaning towards dialects being done as simple Cliff> classes? Will 'excel-tsv' simply be defined as >> Cliff> class excel_tsv(excel_2000): delimiter = '\t' >> Cliff> with a dictionary for lookup: >> Cliff> settings = { 'excel-tsv': excel_tsv, 'excel-2000': excel_2000, Cliff> } >> Dunno yet. >> >> Here we go again with a potentially bad idea... >> >> I think that there are two things we need to have for each dialect; >> a set of low level parser configuration, and a set of user >> tweakables (which correspond to options presented by the >> application). The set of user tweakables may not necessarily map >> one-to-one with low level parser configuration items. Cliff> Can you give examples? 
I suppose you are referring to things Cliff> like CR/LF translation and spaces around quotes as being Cliff> low-level parser configurations and things like delimiters Cliff> being user-tweakable? I do not have access to the software at the moment, but not long ago I used a program called TOAD which was a GUI for fiddling around with Oracle as a client. One of the things you could do after executing a query was export the results to a file. I seem to recall that the export dialog had a number of options which do not cleanly map onto just one of the settings we would place in our writer/reader. I will see if I can get a screen shot of the dialog... Cliff> Maybe. Currently the sniffing code in DSV just makes a best Cliff> guess regarding delimiters, text qualifiers and headers. Cliff> Certainly the dialects could be used to improve its guess (most Cliff> likely when the sniffed results are ambiguous or fail). Cliff> Using dialects on import is of less importance if sniffing code Cliff> is used. They are two different approaches to the same Cliff> problem. If the user specifies the file as Excel compatible, Cliff> then sniffing seems rather redundant, further, if the file is Cliff> sniffed and the format discovered, it doesn't seem important Cliff> which dialect it matches, as long as we are able to use the Cliff> sniffed parameters to parse it. The sniffer is definitely your area of expertise. I am just making stuff up as I go :-) - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Wed Jan 29 02:36:01 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 12:36:01 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: Message from Dave Cole of "29 Jan 2003 11:47:01 +1100."
References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> <1043800126.25139.3411.camel@software1.logiplex.internal> Message-ID: <20030129013601.E49083C1F4@coffee.object-craft.com.au> >Here we go again with a potentially bad idea... *-) >I think that there are two things we need to have for each dialect; a >set of low level parser configuration, and a set of user tweakables >(which correspond to options presented by the application). The set >of user tweakables may not necessarily map one-to-one with low level >parser configuration items. This seems to add a fair bit of complexity to the implementation, without simplifying the interface much. In particular, it makes it difficult for the user to move to an alternate dialect (because they'll need to change all the config options). It also makes it harder for third parties to implement their own dialects (or maintain the base ones). And it makes the documentation and tests harder. KISS. >Any sniffer would have to be able to traverse the set of dialects >implemented in the CSV module and look inside them to understand >which options are available to a dialect. It might be enough to look at the first N lines of the file, and do some basic stats (tabs per line, commas per line, etc). Whether it guesses a dialect, or just tries to set individual options is another question. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Wed Jan 29 02:41:00 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 12:41:00 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: Message from Cliff Wells of "28 Jan 2003 17:11:33 -0800."
<1043802693.25139.3445.camel@software1.logiplex.internal> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> <1043800126.25139.3411.camel@software1.logiplex.internal> <1043802693.25139.3445.camel@software1.logiplex.internal> Message-ID: <20030129014100.D008A3C1F4@coffee.object-craft.com.au> >Using dialects on import is of less importance if sniffing code is >used. They are two different approaches to the same problem. If the >user specifies the file as Excel compatible, then sniffing seems rather >redundant, further, if the file is sniffed and the format discovered, it >doesn't seem important which dialect it matches, as long as we are able >to use the sniffed parameters to parse it. A client of ours has CSV files being sent to him by many different sources - a sniffer would be more valuable to him. I'd like to assume the rules are consistent within any given file, but I'm not sure this is even certain in his application. I think the multiple sources are merged into one file before he gets his hands on them - it's a pathological situation - he has a diabolical pile of python that iteratively attempts to produce something useful. Madness lies this way. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From skip at pobox.com Wed Jan 29 03:01:01 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:01:01 -0600 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <20030128232803.C6A943C1F4@coffee.object-craft.com.au> References: <1043790321.25139.3251.camel@software1.logiplex.internal> <20030128232803.C6A943C1F4@coffee.object-craft.com.au> Message-ID: <15927.13789.344190.312001@montanaro.dyndns.org> >> Interesting point. I think that newlines inside records are going to >> be the same as those separating records. Anything else would be very >> bizarre. 
Andrew> You should know better than to make a statement like that where Andrew> Microsoft is concerned. Excel uses a single LF within fields, Andrew> but CRLF at the end of lines. If you import a field containing Andrew> CRLF, the CR appears within the field as a box (the "unprintable Andrew> character" symbol). Here's what I can figure out from the samples I saved in Excel today. I'm away from the Windows machine now, so I can only infer the titles in the save menu from the file names, so I may be a bit off in the associations. Still, here goes:

    File Type     delimiter  hard return  line terminator
    CSV           comma      LF           CRLF
    DOS Text      TAB        LF           CRLF
    DOS CSV       comma      LF           CRLF
    Mac Text      TAB        LF           CR
    Mac CSV       comma      LF           CR
    Space         yow, this seems all screwed up!
    TSV           TAB        LF           CRLF
    Unicode CSV   comma      LF           CRLF
    Unicode Text  TAB        LF           CRLF

The Space-separated file looked pretty much like garbage. I'll have to check it out more closely tomorrow. The Unicode CSV file was the same as the DOS CSV and CSV files (same checksum). I was thus fairly surprised to see that the Unicode Text file looked like it had been saved as UTF-16 - each character is followed by an ASCII NUL and there is a little-endian UTF-16 BOM at the start of the file. The table suggests that Excel cares about Windows and Mac line endings, so we should allow that to be a user-specified option. Unfortunately, that means we have to tell people to open files in binary mode, since they will be passing open file objects. Doesn't seem very clean to me. Any ideas?
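[For what it's worth, the csv module as it eventually shipped kept exactly this wart: the caller suppresses Python's newline translation (today via newline='' rather than binary mode) and sets the terminator on the writer. A minimal sketch:

```python
import csv

# newline='' disables Python's newline translation -- the modern
# equivalent of "open the file in binary mode"; lineterminator then
# controls the record ending exactly (CR here, Mac-style).
with open("mac.csv", "w", newline="") as f:
    csv.writer(f, lineterminator="\r").writerow(["a", "b"])

with open("mac.csv", "rb") as f:
    print(f.read())  # b'a,b\r'
```
]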
Skip From djc at object-craft.com.au Wed Jan 29 03:01:27 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 13:01:27 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <20030129013601.E49083C1F4@coffee.object-craft.com.au> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <20030129000816.2C9153C1F4@coffee.object-craft.com.au> <1043800126.25139.3411.camel@software1.logiplex.internal> <20030129013601.E49083C1F4@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >> Here we go again with a potentially bad idea... Andrew> *-) >> I think that there are two things we need to have for each dialect; >> a set of low level parser configuration, and a set of user >> tweakables (which correspond to options presented by the >> application). The set of user tweakables may not necessarily map >> one-to-one with low level parser configuration items. Andrew> This seems to add a fair bit of complexity to the Andrew> implementation, without simplifying the interface much. In Andrew> particular, it makes it difficult for the user to move to an Andrew> alternate dialect (because they'll need to change all the Andrew> config options). It also makes it harder for third parties to Andrew> implement their own dialects (or maintain the base ones). And Andrew> it makes the documentation and tests harder. KISS. OK. Yes, it was a bad idea which achieved full potential. >> Any sniffer would have to be able to traverse the set of dialects >> implemented in the CSV module and look inside them to understand >> which options are available to a dialect. Andrew> It might be enough to look at the first N lines of the file, Andrew> and do some basic stats (tabs per line, commas per line, Andrew> etc). Whether it guesses a dialect, or just tries to set Andrew> individual options is another question. Just to make your heads hurt a bit more...
In a previous job (at a stock broker) I had to read some CSV data which had been exported by the MS SQL Server BCP program. The excellent BCP program happily exported comma separated data without quoting fields which contained commas. Nasty! I ended up writing some code which post-processed the parsed records based upon the number of fields. The post-processing had high level knowledge of the type of each column so applied heuristics to join fields back together to get the correct field count. I remember that the code knew which columns were text, numeric, dates, times and bit. The code worked from left to right and tried joining text columns with trailing fields then asserted that the remaining fields were consistent with their respective columns. This continued until the field count matched the table column count. All of this was complicated further by the fact that it had to handle archived data and the table definition changed over time... - Dave -- http://www.object-craft.com.au From skip at pobox.com Wed Jan 29 03:06:55 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:06:55 -0600 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> References: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> Message-ID: <15927.14143.403232.338340@montanaro.dyndns.org> Dunno who said this now, but I disagree with this statement: >> I suppose that exporting should raise an exception if you specify any >> variation on the dialect in the writer function. In the proto-PEP I tried to address this issue: When processing a dialect setting and one or more of the other optional parameters, the dialect parameter is processed first, then the others are processed. This makes it easy to choose a dialect, then override one or more of the settings. 
For example, if a CSV file was generated by Excel 2000 using single quotes as the quote character and TAB as the delimiter, you could create a reader like:: csvreader = csv.reader(file("some.csv"), dialect="excel2000", quotechar="'", delimiter='\t') Other details of how Excel generates CSV files would be handled automatically. I think we should try our damndest to not raise exceptions. The example is just to show that we will allow people to start from a known state and tweak it. "This file has all the properties of an Excel 2000 file except an apostrophe was used as the quote character and a TAB was used as the delimiter." Skip From skip at pobox.com Wed Jan 29 03:21:20 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:21:20 -0600 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> Message-ID: <15927.15008.879314.896465@montanaro.dyndns.org> Dave> The point of the 'raw' dialect is to expose the full capabilities Dave> of the raw parser. Maybe we should use None rather than 'raw'. Nah, "raw" won't mean anything to anyone. Make "excel2000" the default. The point of the dialect names is that they should mean something to someone. That generally means application names, not something like "raw". I think it also means you only have variants associated with applications which normally provide few choices. We can probably all come close to specifying what the parameter settings are for "excel2000", but what about "gnumeric"? As I write this I'm looking at a Gnumeric "Save As" wizard. The user can choose line termination (LF is the default), delimiter (comma is the default), quoting style (automatic (default), always, never), and the quote character (" is the default). Even though the wizard presents sensible defaults, I'm less enthusiastic about creating a "gnumeric" variant, precisely because it won't necessarily mean much.
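[As it turned out, this is the interface the stdlib csv module shipped with -- a dialect name plus keyword overrides -- though the built-in dialect ended up being called "excel" rather than "excel2000". A sketch against the modern module:

```python
import csv
import io

# Excel-style data, but tab-delimited and single-quoted.
data = io.StringIO("'a'\t'b,c'\t'd'\r\n")

# Start from a known dialect, then override individual settings,
# just as the proto-PEP example does.
reader = csv.reader(data, dialect="excel", quotechar="'", delimiter="\t")
print(next(reader))  # ['a', 'b,c', 'd']
```
]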
Cliff> I think it is an option to save as a TSV file (IIRC), which is Cliff> the same as a CSV file, but with tabs. Dave> Hmm... What would be the best way to handle Excel TSV. Maybe a Dave> new dialect 'excel-tsv'? Any of: reader = csv.reader(file("some.csv"), variant="excel2000-tsv") or reader = csv.reader(file("some.csv"), variant="excel2000", delimiter='\t') or (assuming "excel2000" is the default), just: reader = csv.reader(file("some.csv"), delimiter='\t') Dave> I am not saying that the wrapper should absolutely prevent someone Dave> from using options not available in the application. If you want to Dave> break the dialect then maybe it should be a two step process. Dave> csvwriter = csv.writer(file("newnastiness.csv", "w"), Dave> dialect='excel2000') Dave> csvwriter.setparams(delimiter='"') That seems cumbersome. I think we have to give our users both some credit (for brains) and some flexibility. It seems gratuitous (and unPythonic) to specify some parameters in the constructor and some in a later method. All this dialect stuff will be handled at the Python level, right? Skip From skip at pobox.com Wed Jan 29 03:30:23 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:30:23 -0600 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> Message-ID: <15927.15551.93504.635849@montanaro.dyndns.org> Skip> I'm not so sure this mapping None to "None" on output is such a Skip> good idea because it's not reversible in all situations and hurts Skip> portability to other systems (e.g., does Excel have a concept of Skip> None? what happens if you have a text field which just happens to Skip> contain "None"?). Dave> I think that None should always be written as a zero length field, Dave> and always read as the field value 'None' I'm really skeptical of this. There is just no equivalence between None and ''. 
Right now using the Object Craft csv module, a blank field comes through as an empty string. I think that's the correct behavior. Skip> I think we need to limit the data which can be output to strings, Skip> Unicode strings (if we use an encoded stream), floats and ints. Skip> Anything else should raise TypeError. Dave> Is there any merit having the writer handling non-string data by Dave> producing an empty field for None, and the result of Dave> PyObject_Str() for all other values? I'm not sure. I'm inclined to not allow anything other than what I said above. Certainly, compound objects should raise exceptions. I think of CSV more like XML-RPC than Pyro. We're trying to exchange data with as many other languages and applications as possible, not create a new protocol for exchanging data with other Python programs. CSV is designed to represent the numeric and string values in spreadsheets and databases. Going too far beyond that seems like out-of-scope to me, especially if this is to get into 2.3. Remember, 2.3a1 is already out there! Dave> That raises another implementation issue. If you export from Dave> Excel, does it always quote fields? If not then the default Dave> dialect behaviour should not unconditionally quote fields. Not in my limited experience. It quotes only where necessary (fields containing delimiters or starting with the quote character). Dave> We could/should support mandatoryquote as a writer option. This is something Laurence Tratt's original CSV module did (his ASV module probably does as well). I used it all the time. Gnumeric provides "always", "as needed" and "never". I don't know how you'd do "never" without specifying an escape character. I just tried "never" while saving CSV data from Gnumeric. It didn't escape embedded commas, so it effectively toasted the data. 
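[Gnumeric's three choices ("always", "as needed", "never") map directly onto the quoting constants the csv module eventually grew. Notably, the shipped module answers Skip's question about "never": without an escape character it raises an error rather than silently toasting the data:

```python
import csv
import io

row = ["a,b", "plain"]

for quoting in (csv.QUOTE_ALL, csv.QUOTE_MINIMAL):
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting).writerow(row)
    print(repr(buf.getvalue()))
# '"a,b","plain"\r\n'  -- always
# '"a,b",plain\r\n'    -- as needed

# "never" with no escape character fails instead of corrupting data
try:
    csv.writer(io.StringIO(), quoting=csv.QUOTE_NONE).writerow(row)
except csv.Error as exc:
    print(exc)  # need to escape, but no escapechar set

# with an escapechar, "never" works
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_NONE, escapechar="\\").writerow(row)
print(repr(buf.getvalue()))
```
]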
Skip From skip at pobox.com Wed Jan 29 03:36:04 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:36:04 -0600 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <1043798625.25139.3395.camel@software1.logiplex.internal> References: <1043798625.25139.3395.camel@software1.logiplex.internal> Message-ID: <15927.15892.377982.750393@montanaro.dyndns.org> Cliff> Frankly, I think I lean towards an exception on this one. There Cliff> are enough text-processing tools available (dos2unix and kin) Cliff> that someone should be able to pre-process a CSV file that is Cliff> raising exceptions and get it into a form acceptable to the Cliff> parser. A little work up front is far more acceptable than Cliff> putting out a fire on someone's database. How would you handle this example? You saved a file in Excel which contained "hard returns". Line termination is thus CRLF and hard returns are LF. Bring it over to your Unix system, run dos2unix on it, read it into Python, fiddle with it and write it out. Now run unix2dos and push it back to the Windows machine for viewing with Excel. Guess what just happened to those "hard returns"? :-( Like you said, this may indeed be a very hard, or intractable problem. I propose we not spend any more time on it now, but add it as an issue and get some feedback from the broader community when an initial version of the PEP is released (which I'd like to do in the next couple of days). Skip From csv-request at manatee.mojam.com Wed Jan 29 03:39:31 2003 From: csv-request at manatee.mojam.com (csv-request at manatee.mojam.com) Date: Tue, 28 Jan 2003 20:39:31 -0600 Subject: Welcome to the "Csv" mailing list Message-ID: <200301290239.h0T2dVPL007061@manatee.mojam.com> Welcome to the Csv at manatee.mojam.com mailing list!
To post to this list, send your email to: csv at manatee.mojam.com General information about the mailing list is at: http://manatee.mojam.com/mailman/listinfo/csv If you ever want to unsubscribe or change your options (eg, switch to or from digest mode, change your password, etc.), visit your subscription page at: http://manatee.mojam.com/mailman/options/csv/andrewm%40object-craft.com.au You can also make such adjustments via email by sending a message to: Csv-request at manatee.mojam.com with the word `help' in the subject or body (don't include the quotes), and you will get back a message with instructions. You must know your password to change your options (including changing the password, itself) or to unsubscribe. It is: uhzuug If you forget your password, don't worry, you will receive a monthly reminder telling you what all your manatee.mojam.com mailing list passwords are, and how to unsubscribe or change your options. There is also a button on your options page that will email your current password to you. You may also have your password mailed to you automatically off of the Web page noted above. From skip at pobox.com Wed Jan 29 03:45:45 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:45:45 -0600 Subject: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043794249.14244.3330.camel@software1.logiplex.internal> <15927.3700.803751.757376@montanaro.dyndns.org> Message-ID: <15927.16473.998484.348688@montanaro.dyndns.org> Skip> Nope. I could set up a Mailman list on the Mojam server if you Skip> don't think that's too much overkill. Dave> Do it. We can then use URL's to old messages. You got it. We've all been subscribed and you should each have received a welcome message by now. I will make sure list messages are archived, and once the PEP is published, use that as the response address for comments. All five of us have been subscribed. The posting address is csv at mail.mojam.com. 
I'll run spambayes in front of Mailman so I can leave open posting enabled yet not drown in a sea of spam (which will almost certainly begin shortly after the address is published). If you use procmail or other mail filtering tools, you can key on this header: X-Spambayes-Classification: ham; 0.00 where "ham" (good mail) may be replaced by "spam" or "unsure". The number will range from 0.00 ("certain ham") to 1.00 ("certain spam"). Skip From skip at pobox.com Wed Jan 29 03:49:53 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:49:53 -0600 Subject: [Csv] test message Message-ID: <15927.16721.284748.270083@montanaro.dyndns.org> Just a test - did I screw anything up? Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 03:55:28 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 20:55:28 -0600 Subject: [Csv] 'Nuther test Message-ID: <15927.17056.525801.505496@montanaro.dyndns.org> test 2 _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Wed Jan 29 04:20:44 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 14:20:44 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <15927.13789.344190.312001@montanaro.dyndns.org> References: <1043790321.25139.3251.camel@software1.logiplex.internal> <20030128232803.C6A943C1F4@coffee.object-craft.com.au> <15927.13789.344190.312001@montanaro.dyndns.org> Message-ID: skip> The table suggests that Excel cares about Windows and Mac line skip> endings, so we should allow that to be a user-specified option. skip> Unfortunately, that means we have to tell people to open files skip> in binary mode, since they will be passing open file objects. skip> Doesn't seem very clean to me. Any ideas? 
Failing to open a file in binary mode is already a gotcha in Python. If someone wants to force a particular end of line in the writer then they must be prepared to open the file in binary mode. - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Wed Jan 29 04:25:16 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 14:25:16 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <15927.14143.403232.338340@montanaro.dyndns.org> References: <20030128233947.6C5593C1F4@coffee.object-craft.com.au> <15927.14143.403232.338340@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> Dunno who said this now, but I disagree with this statement: >>> I suppose that exporting should raise an exception if you specify >>> any variation on the dialect in the writer function. That was me. I now agree that it is a bad idea. Andrew suggested that we apply the KISS principle. I agree with his suggestion that a dialect just defines a collection of settings in the parser. You are then free to redefine any or all of those settings as additional keyword arguments to the csv.reader() or csv.writer() functions. Skip> In the proto-PEP I tried to address this issue: Skip> When processing a dialect setting and one or more of the Skip> other optional parameters, the dialect parameter is processed Skip> first, then the others are processed. This makes it easy to Skip> choose a dialect, then override one or more of the settings. Skip> For example, if a CSV file was generated by Excel 2000 using Skip> single quotes as the quote character and TAB as the delimiter, Skip> you could create a reader like:: Skip> csvreader = csv.reader(file("some.csv"), Skip> dialect="excel2000", quotechar="'", Skip> delimiter='\t') I think that we are now in violent agreement. A good thing.
- Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 04:25:49 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 21:25:49 -0600 Subject: [Csv] List up and running - mostly Message-ID: <15927.18877.657911.91142@montanaro.dyndns.org> Posting to the list seems to be working okay. Nothing seems to be archived though. I'll try and get that resolved by midday tomorrow. I'm kinda pooped though and need to knock off for the evening. I will ask David Goodger, the PEP editor, for a number for the PEP so we can check it in and share the writing. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 04:27:37 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 21:27:37 -0600 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: References: <1043790321.25139.3251.camel@software1.logiplex.internal> <20030128232803.C6A943C1F4@coffee.object-craft.com.au> <15927.13789.344190.312001@montanaro.dyndns.org> Message-ID: <15927.18985.513079.628267@montanaro.dyndns.org> Dave> Failing to open a file in binary mode is already a gotcha in Dave> Python. If someone wants to force a particular end of line in the Dave> writer then they must be prepared to open the file in binary mode. Then I guess we just document the wart. ;-) Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 04:30:49 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 21:30:49 -0600 Subject: [Csv] CVS checkin privileges Message-ID: <15927.19177.471122.352771@montanaro.dyndns.org> Dave, Andrew & Cliff have been added as developers to the Python project. At Kevin's request he wasn't added. 
G'night... Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 04:34:03 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue, 28 Jan 2003 21:34:03 -0600 Subject: [Csv] More mailing lists ;-) Message-ID: <15927.19371.648045.184281@montanaro.dyndns.org> Barry Warsaw suggested you also subscribe to python-checkins. I'm less certain you'll find that interesting, but it's the only way you'll see the checkins others make to our little sandbox. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Wed Jan 29 04:36:50 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 14:36:50 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <15927.15008.879314.896465@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <15927.15008.879314.896465@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Dave> The point of the 'raw' dialect is to expose the full Dave> capabilities of the raw parser. Maybe we should use None rather Dave> than 'raw'. Skip> Nah, "raw" won't mean anything to anyone. Make "excel2000" the Skip> default. The point of the dialect names is that they should Skip> mean something to someone. That generally means application Skip> names, not something lile "raw". I think it also means you only Skip> have variants associated with applications which normally Skip> provide few choices. We can probably all come close to Skip> specifying what the parameter settings are for "excel2000", but Skip> what about "gnumeric"? As I write this I'm looking at a Skip> Gnumeric "Save As" wizard. 
The user can choose line termination Skip> (LF is the default), delimiter (comma is the default), quoting Skip> style (automatic (default), always, never), and the quote Skip> character (" is the default). Even though the wizard presents Skip> sensible defaults, I'm less enthusiastic about creating a Skip> "gnumeric" variant, precisely because it won't necessarily mean Skip> much. Before we get too excited about setting dialect names in stone, we might want to start on the torture test. It seems logical (to me) that the first step in cataloguing dialects is to define the classification tool. We may find that many applications are faithful clones of 'excel' (rather than 'excel2000', 'excel97', 'excel.net'). Cliff> I think it is an option to save as a TSV file (IIRC), which is Cliff> the same as a CSV file, but with tabs. Dave> Hmm... What would be the best way to handle Excel TSV? Maybe a Dave> new dialect 'excel-tsv'? Skip> Any of: Skip> reader = csv.reader(file("some.csv"), Skip> variant="excel2000-tsv") Are you suggesting that each dialect have a collection of variants? This would mean you would have two layers of settings (is this a good thing?). The variant could just be a way of layering a set of options over the options defined by a dialect. I can see Andrew telling us to KISS. Dave> I am not saying that the wrapper should absolutely prevent Dave> someone from using options not available in the application. If Dave> you want to break the dialect then maybe it should be a two-step Dave> process. Dave> csvwriter = csv.writer(file("newnastiness.csv", "w"), Dave> dialect='excel2000') Dave> csvwriter.setparams(delimiter='"') Skip> That seems cumbersome. I think we have to give our users both Skip> some credit (for brains) and some flexibility. It seems Skip> gratuitous (and unPythonic) to specify some parameters in the Skip> constructor and some in a later method. I have been convinced now that this is a bad idea.
Skip> All this dialect stuff will be handled at the Python level, Skip> right? Yes. In my mind all that the extension module would be is an engine with a set of configurable items. No knowledge of dialects (or variants) would be in the C code. - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Wed Jan 29 04:45:10 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 14:45:10 +1100 Subject: [CSV] Re: First Cut at CSV PEP In-Reply-To: <15927.15551.93504.635849@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> <15927.15551.93504.635849@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> I'm not so sure this mapping None to "None" on output is such a Skip> good idea because it's not reversible in all situations and Skip> hurts portability to other systems (e.g., does Excel have a Skip> concept of None? what happens if you have a text field which Skip> just happens to contain "None"?). Dave> I think that None should always be written as a zero length Dave> field, and always read as the field value 'None'. Skip> I'm really skeptical of this. There is just no equivalence Skip> between None and ''. Right now using the Object Craft csv Skip> module, a blank field comes through as an empty string. I think Skip> that's the correct behavior. I think I was unnecessarily clumsy in my explanation. This is what I was trying to say: >>> w = csv.writer(sys.stdout) >>> w.write(['','hello',None]) ',hello,\n' >>> r = csv.reader(StringIO('None,hello,')) >>> for l in r: print l ['None','hello',''] Skip> I think we need to limit the data which can be output to Skip> strings, Unicode strings (if we use an encoded stream), floats Skip> and ints. Anything else should raise TypeError.
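[Dave's round-trip can be checked against the csv module that eventually shipped in the Python standard library. A quick sketch with today's API — not the 2003 prototypes under discussion here:]

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['', 'hello', None])   # None goes out as an empty field
assert buf.getvalue() == ',hello,\r\n'

row = next(csv.reader(io.StringIO('None,hello,')))
assert row == ['None', 'hello', '']             # empty field reads back as '', not None
```

As Skip notes, the mapping is not reversible: a written None and a written '' are indistinguishable on the way back in.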
Dave> Is there any merit in having the writer handle non-string data by Dave> producing an empty field for None, and the result of Dave> PyObject_Str() for all other values? Skip> I'm not sure. I'm inclined to not allow anything other than Skip> what I said above. Certainly, compound objects should raise Skip> exceptions. I think of CSV more like XML-RPC than Pyro. We're Skip> trying to exchange data with as many other languages and Skip> applications as possible, not create a new protocol for Skip> exchanging data with other Python programs. CSV is designed to Skip> represent the numeric and string values in spreadsheets and Skip> databases. Going too far beyond that seems out of scope to Skip> me, especially if this is to get into 2.3. Remember, 2.3a1 is Skip> already out there! OK. The current version of the CSV module does what I was suggesting. We will just have to remove that code. Dave> That raises another implementation issue. If you export from Dave> Excel, does it always quote fields? If not then the default Dave> dialect behaviour should not unconditionally quote fields. Skip> Not in my limited experience. It quotes only where necessary Skip> (fields containing delimiters or starting with the quote Skip> character). Dave> We could/should support mandatoryquote as a writer option. Skip> This is something Laurence Tratt's original CSV module did (his Skip> ASV module probably does as well). I used it all the time. Skip> Gnumeric provides "always", "as needed" and "never". I don't Skip> know how you'd do "never" without specifying an escape Skip> character. I just tried "never" while saving CSV data from Skip> Gnumeric. It didn't escape embedded commas, so it effectively Skip> toasted the data. I have seen that happen in other applications.
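[Gnumeric's "always" and "as needed" correspond to what became the QUOTE_ALL and QUOTE_MINIMAL constants in the stdlib csv module. A sketch with today's API:]

```python
import csv
import io

def dump(quoting):
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting).writerow(['plain', 'has,comma'])
    return buf.getvalue()

assert dump(csv.QUOTE_MINIMAL) == 'plain,"has,comma"\r\n'    # "as needed"
assert dump(csv.QUOTE_ALL) == '"plain","has,comma"\r\n'      # "always" (mandatory quote)
```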
- Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From andrewm at object-craft.com.au Wed Jan 29 07:05:01 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 17:05:01 +1100 Subject: [Csv] CSV interface question Message-ID: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> In the proposed PEP, we have separate instances for reading and writing. In the Object Craft csv module, a single instance is shared by the parse and join methods - the only virtue of this is config is shared (so the same options are used to write the file as were used to read the file). Maybe we should consider a "container of options" class (of which the dialects would be subclasses). The sniffing code could then return an instance of this class (which wouldn't necessarily be a dialect). With this, you might do things like: options = csv.sniffer(open("foobar.csv")) for fields in csv.reader(open("foobar.csv"), options) ... do stuff csvwriter = csv.writer(open("newfoovar.csv", "w"), options) try: for fields in whatever: csvwriter.write(fields) finally: csvwriter.close() The idea being you'd then re-write the file with the same sniffed options. Another idea occurs - looping over an iterable is going to be common - we could probably supply a convenience function, say "writelines(iterable)"? -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From andrewm at object-craft.com.au Wed Jan 29 11:16:58 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 29 Jan 2003 21:16:58 +1100 Subject: [Csv] CSV interface question In-Reply-To: Message from Andrew McNamara of "Wed, 29 Jan 2003 17:05:01 +1100."
<20030129060501.DB9193C1F4@coffee.object-craft.com.au> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> Message-ID: <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> >Maybe we should consider a "container of options" class (of which the >dialects would be subclasses). The sniffing code could then return an >instance of this class (which wouldn't necessarily be a dialect). With >this, you might do things like: Another thought - rather than specify the dialect name as a string, it could be specified as a class or instance - something like: csv.reader(fileobj, csv.dialect.excel) Thoughts? -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Wed Jan 29 11:50:30 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 21:50:30 +1100 Subject: [Csv] CSV interface question In-Reply-To: <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >> Maybe we should consider a "container of options" class (of which >> the dialects would be subclasses). The sniffing code could then >> return an instance of this class (which wouldn't necessarily be a >> dialect). With this, you might do things like: Andrew> Another thought - rather than specify the dialect name as a Andrew> string, it could be specified as a class or instance - Andrew> something like: Andrew> csv.reader(fileobj, csv.dialect.excel) Andrew> Thoughts? Is there a downside to this? I can't see one immediately. 
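[For the record, this is how the stdlib module ended up working: a dialect can be passed either as a registered name or as a Dialect class, interchangeably. A sketch with today's API:]

```python
import csv
import io

by_name = list(csv.reader(io.StringIO('a,b,"c,d"'), dialect='excel'))
by_class = list(csv.reader(io.StringIO('a,b,"c,d"'), dialect=csv.excel))
assert by_name == by_class == [['a', 'b', 'c,d']]
```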
- Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Wed Jan 29 11:53:11 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 21:53:11 +1100 Subject: [Csv] Getting some files in place Message-ID: I am currently converting the CSV module to something which at least looks like it is native Python C code. I will commit to the sandbox soon. This is a chance to bring my Python guts knowledge up to date. Probably going to take a few goes though. - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Wed Jan 29 12:55:08 2003 From: djc at object-craft.com.au (Dave Cole) Date: 29 Jan 2003 22:55:08 +1100 Subject: [Csv] My version of the PEP Message-ID: I had all sorts of grand plans for the PEP during the day which involved dialects and validation of options used on dialects. I was also going to write it up tonight. In retrospect there is very little of what I was proposing which I still think is worthwhile. Andrew has sent me a small Python module which almost completely implements the current PEP - I have asked him to commit it to the sandbox. If you look at the sandbox now you will notice that I have committed a reformatted version of our csv parser. We are fairly close to having something concrete to play with. - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 14:33:04 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 07:33:04 -0600 Subject: [Csv] PEP checked in Message-ID: <15927.55312.962767.436646@montanaro.dyndns.org> I asked David Goodger for a number for the CSV PEP. 
He checked it in as PEP 305. You can edit it via cvs from the .../python/nondist/peps directory. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 15:10:44 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 08:10:44 -0600 Subject: [Csv] some PEP reorg Message-ID: <15927.57572.299432.893613@montanaro.dyndns.org> I reorganized the parameter descriptions and added set_dialect and get_dialect functions. The job is incomplete, but I have to get to work. Feel free to flesh things out more. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From andrewm at object-craft.com.au Wed Jan 29 15:20:13 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 30 Jan 2003 01:20:13 +1100 Subject: [Csv] My version of the PEP In-Reply-To: Message from Dave Cole of "29 Jan 2003 22:55:08 +1100." References: Message-ID: <20030129142013.B80303C1F4@coffee.object-craft.com.au> >Andrew has sent me a small Python module which almost completely >implements the current PEP - I have asked him to commit it to the sandbox. Okay - I've commited it. It's pretty crude, and contains no docstrings yet. Time for bed. 
-- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 16:56:56 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 09:56:56 -0600 Subject: [CSV] Number of lines in CSV files In-Reply-To: <1043800695.14244.3420.camel@software1.logiplex.internal> References: <15925.58225.712028.494438@montanaro.dyndns.org> <1043795476.25146.3351.camel@software1.logiplex.internal> <1043799349.25146.3400.camel@software1.logiplex.internal> <20030129002546.493EB3C1F4@coffee.object-craft.com.au> <1043800695.14244.3420.camel@software1.logiplex.internal> Message-ID: <15927.63944.112239.481587@montanaro.dyndns.org> Cliff> Okay, +1 on punting file size. Unless anyone else cares to argue Cliff> it I suppose we'll leave it out. I don't know how you could support it if a csv reader is an iterable. You wouldn't know until you encountered a row with more than max columns or read the row which exceeded the max rows. Similarly, just because I want my CSV file to be formatted the same way Excel does things doesn't mean I am going to load the file into Excel. >> I have a sneaking suspicion that Excel's CSV parsing code is reasonably >> stable - they're probably not game to make changes now that it mostly >> works. We might find that dialect="excel" is good enough. Cliff> Probably. This can be fixed via bug reports (and dialects added) Cliff> if that changes. "excel" it is. I believe that should be fine for Excel 97 and Excel 2000 (ISTR that Excel 2000 is just Excel 97 bundled in Office 2000). Any distinctions with older versions can be tagged, e.g., "excel95", "excel4", though I suspect they may also be the same.
Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 16:58:45 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 09:58:45 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: References: Message-ID: <15927.64053.523071.753657@montanaro.dyndns.org> Kevin> I suppose there is a good reason that each dialect isn't just a Kevin> subclass; if so, the reasoning for using dialects instead of Kevin> subclasses of a parser might be called out in the PEP. I can go Kevin> with it either way. Overkill, I think. The engine never changes. All we are doing is making it easy to set a bunch of parameters. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 17:02:05 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 10:02:05 -0600 Subject: [Csv] RE: Number of lines in CSV files In-Reply-To: <1043801240.25139.3429.camel@software1.logiplex.internal> References: <1043801240.25139.3429.camel@software1.logiplex.internal> Message-ID: <15927.64253.371647.216288@montanaro.dyndns.org> Cliff> I guess we need to decide what we mean by "dialect": do we mean Cliff> "this data _will_ import into this application" or do we mean Cliff> "this data will be written in a format this application can Cliff> understand, but might not necessarily be able to use"? When I proposed "variant" and later "dialect" I was only referring to the format of the file. I wasn't concerned directly with whether a specific application would be able to process it. For example, it appears that Gnumeric can import Excel-generated CSV files just fine. Accordingly, if I know I'm going to read the file into Gnumeric, I might just as well specify "excel" as the dialect for the writer.
Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 17:11:34 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 10:11:34 -0600 Subject: [Csv] Coding dialects In-Reply-To: <1043800126.25139.3411.camel@software1.logiplex.internal> Message-ID: <15927.64822.167086.284052@montanaro.dyndns.org> (Changing the subject to suit the topic a bit better...) Cliff> So are we leaning towards dialects being done as simple classes? Cliff> Will 'excel-tsv' simply be defined as Cliff> class excel_tsv(excel_2000): Cliff> delimiter = '\t' Cliff> with a dictionary for lookup: Cliff> settings = { 'excel-tsv': excel_tsv, Cliff> 'excel-2000': excel_2000, } Cliff> ? I was thinking of dialects as dicts. You'd have excel_dialect = { "quotechar": '"', "delimiter": ',', "lineterminator": '\r\n', ... } with a corresponding mapping as you suggested: settings = { 'excel': excel_dialect, 'excel-tsv': excel_tabs_dialect, } then in the factory functions do something like: def reader(fileobj, dialect="excel", **kwds): kwargs = copy.copy(settings[dialect]) kwargs.update(kwds) # possible sanity check on kwargs here ... return _csv.reader(fileobj, **kwargs) Perhaps we could distribute a dialects.csv file ;-) with the module which defines the supported dialects. That file would be loaded upon initial import to define the various dialect dicts. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 17:16:44 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 10:16:44 -0600 Subject: [Csv] Sniffing dialects In-Reply-To: <1043802693.25139.3445.camel@software1.logiplex.internal> Message-ID: <15927.65132.432457.594501@montanaro.dyndns.org> If my notion of dialects as dicts isn't too far off-base, the sniffing code could just return a dict.
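[Skip's dict-based factory runs as-is against the stdlib csv module. A self-contained sketch — the `settings` table and `reader` wrapper follow his example and are not a real API:]

```python
import csv

# Dialects as plain dicts of format parameters, looked up by name.
excel_dialect = {'quotechar': '"', 'delimiter': ',', 'lineterminator': '\r\n'}
excel_tabs_dialect = dict(excel_dialect, delimiter='\t')
settings = {'excel': excel_dialect, 'excel-tsv': excel_tabs_dialect}

def reader(fileobj, dialect='excel', **kwds):
    kwargs = dict(settings[dialect])   # copy, so callers can't mutate the table
    kwargs.update(kwds)                # explicit keyword args override the dialect
    return csv.reader(fileobj, **kwargs)

assert next(reader(['x\ty\tz'], dialect='excel-tsv')) == ['x', 'y', 'z']
```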
That would be a good way to define new dialects. Someone could send us a CSV file from a particular application. We'd turn the sniffer loose on it then append the result to our dialects.csv file. (A different version of) the sniffer could take an optional dialect string as an arg and either use it as the starting point (for stuff it can't discern, like hard returns in CSV files which don't contain any) or tell you if the input file is compatible with that dialect. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 17:21:42 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 10:21:42 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <1043788652.25139.3222.camel@software1.logiplex.internal> <15927.15008.879314.896465@montanaro.dyndns.org> Message-ID: <15927.65430.961029.406378@montanaro.dyndns.org> Skip> Any of: Skip> reader = csv.reader(file("some.csv"), Skip> variant="excel2000-tsv") Dave> Are you suggesting that each dialect have a collection of Dave> variants? Nope. "variant" was a mistake there. Should have been "dialect". Dialect names are just strings which map to either classes or dicts. 
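[The sniff-then-reuse idea in this thread is, more or less, what csv.Sniffer does in the module that shipped: sniff() returns a Dialect subclass you can hand straight back to reader() or writer(). A sketch with today's API:]

```python
import csv
import io

sample = 'a;b;c\r\n1;2;3\r\n'
dialect = csv.Sniffer().sniff(sample)    # returns a Dialect subclass
assert dialect.delimiter == ';'

# Re-read (or re-write) the data with the sniffed settings.
rows = list(csv.reader(io.StringIO(sample), dialect))
assert rows == [['a', 'b', 'c'], ['1', '2', '3']]
```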
Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 17:27:59 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 10:27:59 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> <15927.15551.93504.635849@montanaro.dyndns.org> Message-ID: <15928.271.283784.851985@montanaro.dyndns.org> Dave> This is what I was trying to say: >>> w = csv.writer(sys.stdout) >>> w.write(['','hello',None]) ',hello,\n' >>> r = csv.reader(StringIO('None,hello,')) >>> for l in r: print l ['None','hello',''] Skip> I think we need to limit the data which can be output to strings, Skip> Unicode strings (if we use an encoded stream), floats and ints. Skip> Anything else should raise TypeError. Dave> Is there any merit in having the writer handle non-string data by Dave> producing an empty field for None, and the result of Dave> PyObject_Str() for all other values? We could do what some of the DB API modules do and provide mappings which take the types of objects and see if a function exists to handle that type. If so, whatever that function returns would be what was written. This could handle the case of None (allowing the user to specify how it was mapped), but could also be used to massage data of known type (for example, to round all floats to two decimal places). I think this sort of capability should wait until the second generation though. Skip> I just tried "never" while saving CSV data from Gnumeric. It Skip> didn't escape embedded commas, so it effectively toasted the data. Dave> I have seen that happen in other applications. Needless to say, our csv module should *not* do that. Fried data, when accompanied by angry mobs, doesn't taste too good.
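[The DB-API-style mapping Skip floats above can be sketched as a per-type serializer table consulted before each value reaches the writer. The `serializers` table and `writerow` wrapper are hypothetical, not part of any csv module:]

```python
import csv
import io

# Hypothetical per-type serializer table; unknown types fall back to str().
serializers = {
    type(None): lambda v: '',          # user decides how None is mapped
    float: lambda v: f'{v:.2f}',       # e.g. round all floats to two places
}

def writerow(writer, row):
    writer.writerow([serializers.get(type(v), str)(v) for v in row])

buf = io.StringIO()
writerow(csv.writer(buf), [None, 3.14159, 'text', 7])
assert buf.getvalue() == ',3.14,text,7\r\n'
```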
If the user specifies "never", I think an exception should be raised if no escape character is defined and fields containing the delimiter are encountered. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 17:54:55 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 10:54:55 -0600 Subject: [Csv] Re: python/nondist/peps pep-0305.txt,1.2,1.3 In-Reply-To: References: Message-ID: <15928.1887.707385.352015@montanaro.dyndns.org> >> Changed Type to Standards Track. David> I believe this PEP is Informational, not Standards Track. Yes, but it's also the working document for the csv module currently gestating in the sandbox, and which we hope to get into Python 2.3. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From LogiplexSoftware at earthlink.net Wed Jan 29 17:58:37 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 29 Jan 2003 08:58:37 -0800 Subject: [Csv] Devil in the details, including the small one between delimiters and quotechars Message-ID: <1043859517.16012.14.camel@software1.logiplex.internal> Okay, despite claims to the contrary, Pure Evil can in fact be broken down into little bits and stored in ASCII files. This spaces around quoted data bit is starting to bother me. Consider the following: 1, "not quoted","quoted" It seems reasonable to parse this as: ['1', ' "not quoted"', 'quoted'] which is the described Excel behavior. Now consider 1,"not quoted" ,"quoted" Is the second field quoted or not? If it is, do we discard the extraneous whitespace following it or raise an exception? Worse, consider this "quoted", "not quoted, but this ""field"" has delimiters and quotes" How should this parse? I say free exceptions for everyone.
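[Skip's "never"-quoting proposal is what the shipped writer ended up doing: with QUOTE_NONE and no escape character, a field containing the delimiter raises csv.Error instead of silently toasting the data. A sketch with today's API:]

```python
import csv
import io

w = csv.writer(io.StringIO(), quoting=csv.QUOTE_NONE)
try:
    w.writerow(['fine', 'has,comma'])   # delimiter in data, nowhere to hide it
except csv.Error:
    pass                                # "need to escape, but no escapechar set"
else:
    raise AssertionError('expected csv.Error')
```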
While we're on the topic, I heard back from my DSV user who had mentioned this corner case of spaces between delimiters and quotes and he admitted that the files were created by hand, by him (figures); he seems to recall some now-forgotten application that may have done this but wasn't sure. His memory was vague on whether he saw it on a PC or in a barn eating hay. I propose that a space between delimiters and quotes raise an exception and let's be done with it. I don't think this really affects Excel compatibility since Excel will never generate this type of file and doesn't require it for import. It's true that some files that Excel would import (probably incorrectly) won't import in CSV, but I think that's outside the scope of Excel compatibility. Anyway, I know no one has said "On your mark, get set" yet, but I can't think without code sitting in front of me, breaking worse with every keystroke, so in addition to creating some test cases, I've hacked up a very preliminary CSV module so we have something to play with. I was up until 6am so if there's anything odd, I blame it on lack of sleep and the feverish optimism and glossing of detail that comes with it. Note that while the entire test.csv gets imported without exception, the last few lines aren't parsed correctly. At least, I don't think they are. I can't remember now. Also, this code is based upon what was discussed up until yesterday when I went home, so recent conversations may not be reflected. Mercilessly dissect away. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 -------------- next part -------------- A non-text attachment was scrubbed... Name: CSV.py Type: text/x-python Size: 5570 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030129/d80b9ba0/attachment.py -------------- next part -------------- A non-text attachment was scrubbed...
Name: test.csv Type: text/x-comma-separated-values Size: 720 bytes Desc: not available Url : http://mail.python.org/pipermail/csv/attachments/20030129/d80b9ba0/attachment.bin From skip at pobox.com Wed Jan 29 18:17:53 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 11:17:53 -0600 Subject: [Csv] CSV interface question In-Reply-To: References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> Message-ID: <15928.3265.630020.528438@montanaro.dyndns.org> Andrew> csv.reader(fileobj, csv.dialect.excel) Andrew> Thoughts? Dave> Is there a downside to this? I can't see one immediately. With the dialect concept all we are talking about is a collection of parameter settings. Encapsulating that as subclasses seems like it hides the data-oriented nature behind the facade of source code. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From LogiplexSoftware at earthlink.net Wed Jan 29 18:31:02 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 29 Jan 2003 09:31:02 -0800 Subject: [Csv] CSV interface question In-Reply-To: References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> Message-ID: <1043861462.16012.46.camel@software1.logiplex.internal> On Wed, 2003-01-29 at 02:50, Dave Cole wrote: > >>>>> "Andrew" == Andrew McNamara writes: > > >> Maybe we should consider a "container of options" class (of which > >> the dialects would be subclasses). The sniffing code could then > >> return an instance of this class (which wouldn't necessarily be a > >> dialect). With this, you might do things like: > > Andrew> Another thought - rather than specify the dialect name as a > Andrew> string, it could be specified as a class or instance - > Andrew> something like: > > Andrew> csv.reader(fileobj, csv.dialect.excel) > > Andrew> Thoughts? 
> > Is there a downside to this? I can't see one immediately. Actually, there is a downside to using strings, as you will see if you look at the code I posted a little while ago. By taking dialect as a string, it basically precludes the user rolling their own dialect except as keyword arguments. After working on this, I'm inclined to have the programmer pass a class or other structure. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 18:31:31 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 11:31:31 -0600 Subject: [Csv] Devil in the details, including the small one between delimiters and quotechars In-Reply-To: <1043859517.16012.14.camel@software1.logiplex.internal> References: <1043859517.16012.14.camel@software1.logiplex.internal> Message-ID: <15928.4083.834299.369381@montanaro.dyndns.org> Cliff> Now consider Cliff> 1,"not quoted" ,"quoted" Cliff> Is the second field quoted or not? If it is, do we discard the Cliff> extraneous whitespace following it or raise an exception? Well, there's always the "be flexible in what you accept, strict in what you generate" school of thought. In the above, that would suggest the list returned would be ['1', 'not quoted', 'quoted'] It seems like a minor formatting glitch. How about a warning? Or a "strict" flag for the parser? Cliff> Worse, consider this Cliff> "quoted", "not quoted, but this ""field"" has delimiters and quotes" Depends on the setting of skipinitialspaces. If false, you get ['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"'] if True, I think you get ['quoted', 'not quoted, but this "field" has delimiters and quotes'] Cliff> How should this parse? I say free exceptions for everyone.
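[The flag discussed here shipped as `skipinitialspace`, and it decides Cliff's first case exactly as Skip describes: whether a quote following ", " still opens a quoted field. A sketch with today's API:]

```python
import csv

line = ['1, "not quoted","quoted"']
assert next(csv.reader(line)) == ['1', ' "not quoted"', 'quoted']
assert next(csv.reader(line, skipinitialspace=True)) == ['1', 'not quoted', 'quoted']
```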
Cliff> While we're on the topic, I heard back from my DSV user who had Cliff> mentioned this corner case of spaces between delimiters and Cliff> quotes and he admitted that the files were created by hand, by Cliff> him (figures), he seems to recall some now forgotten application Cliff> that may have done this but wasn't sure. His memory was vague on Cliff> whether he saw it on a PC or in a barn eating hay. Don't you just love customers with concrete requirements? ;-) Cliff> I propose space between delimiters and quotes raise an exception Cliff> and let's be done with it. I don't think this really affects Cliff> Excel compatibility since Excel will never generate this type of Cliff> file and doesn't require it for import. It's true that some Cliff> files that Excel would import (probably incorrectly) won't import Cliff> in CSV, but I think that's outside the scope of Excel Cliff> compatibility. Sounds good to me. Cliff> Anyway, I know no one has said "On your mark, get set" yet, but I Cliff> can't think without code sitting in front of me, breaking worse Cliff> with every keystroke, so in addition to creating some test cases, Cliff> I've hacked up a very preliminary CSV module so we have something Cliff> to play with. I was up til 6am so if there's anything odd, I Cliff> blame it on lack of sleep and the feverish optimism and glossing Cliff> of detail that comes with it. Perhaps you and Dave were in a race but didn't know it? 
;-) Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 18:41:07 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 11:41:07 -0600 Subject: [Csv] CSV interface question In-Reply-To: <1043861462.16012.46.camel@software1.logiplex.internal> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> Message-ID: <15928.4659.449989.410123@montanaro.dyndns.org> Cliff> Actually, there is a downside to using strings, as you will see Cliff> if you look at the code I posted a little while ago. By taking Cliff> dialect as a string, it basically precludes the user rolling Cliff> their own dialect except as keyword arguments. After working on Cliff> this, I'm inclined to have the programmer pass a class or other Cliff> structure. Don't forget we have the speedy Object Craft _csv engine sitting underneath the covers. Under the assumption that all the actual processing goes on at that level, I see no particular reason dialect info needs to be anything other than a collection of keyword arguments. I view csv.reader and csv.writer as factory functions which return functional readers and writers defined in _csv.c. The Python level serves simply to paper over the low-level extension module. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 18:46:01 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 11:46:01 -0600 Subject: [CSV] Number of lines in CSV files Message-ID: <15928.4953.211377.214912@montanaro.dyndns.org> An oldish message which got snared by my laptop's mobility... 
From: Skip Montanaro To: Cliff Wells Cc: Kevin Altis , csv at object-craft.com.au Subject: Re: [CSV] Number of lines in CSV files Date: Tue, 28 Jan 2003 18:48:56 -0600 Reply-To: skip at pobox.com Cliff> Since export will be a feature of the CSV module, should we have Cliff> some sort of warning or raise an exception when exporting data Cliff> larger than the target application can handle, or should we just Cliff> punt on this? Punt. At most I would put it in a separate csvutils module such as Dave suggested. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From LogiplexSoftware at earthlink.net Wed Jan 29 19:08:24 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 29 Jan 2003 10:08:24 -0800 Subject: [Csv] CSV interface question In-Reply-To: <15928.4659.449989.410123@montanaro.dyndns.org> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> Message-ID: <1043863704.16012.64.camel@software1.logiplex.internal> On Wed, 2003-01-29 at 09:41, Skip Montanaro wrote: > Cliff> Actually, there is a downside to using strings, as you will see > Cliff> if you look at the code I posted a little while ago. By taking > Cliff> dialect as a string, it basically precludes the user rolling > Cliff> their own dialect except as keyword arguments. After working on > Cliff> this, I'm inclined to have the programmer pass a class or other > Cliff> structure. > > Don't forget we have the speedy Object Craft _csv engine sitting underneath > the covers. Under the assumption that all the actual processing goes on at > that level, I see no particular reason dialect info needs to be anything > other than a collection of keyword arguments. You've lost me, I'm afraid. 
What I'm saying is that: csvreader = reader(file("test_data/sfsample.csv", 'r'), dialect='excel') isn't as flexible as csvreader = reader(file("test_data/sfsample.csv", 'r'), dialect=excel) where excel is either a pre-defined dictionary/class or a user-created dictionary/class. As an aside, I prefer using a class as it allows for validating the dialect settings from the dialect object itself (see the CSV.py I posted earlier). > I view csv.reader and > csv.writer as factory functions which return functional readers and writers > defined in _csv.c. The Python level serves simply to paper over the > low-level extension module. That's what I see also (even though the CSV.py I posted earlier doesn't exactly follow that convention). I do think we need a pure Python alternative to the C module, but both of them should be exposed via a higher-level interface. Unfortunately, I'm still not mentally linking this with my earlier point =) -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Wed Jan 29 19:17:39 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 12:17:39 -0600 Subject: [Csv] CSV interface question In-Reply-To: <1043863704.16012.64.camel@software1.logiplex.internal> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> Message-ID: <15928.6851.934680.995625@montanaro.dyndns.org> Cliff> You've lost me, I'm afraid. 
What I'm saying is that: Cliff> csvreader = reader(file("test_data/sfsample.csv", 'r'), Cliff> dialect='excel') Cliff> isn't as flexible as Cliff> csvreader = reader(file("test_data/sfsample.csv", 'r'), Cliff> dialect=excel) Cliff> where excel is either a pre-defined dictionary/class or a Cliff> user-created dictionary/class. Yes, but my string just indexes into a mapping to get to the real dict which stores the parameter settings, as I indicated in an earlier post: I was thinking of dialects as dicts. You'd have excel_dialect = { "quotechar": '"', "delimiter": ',', "linetermintor": '\r\n', ... } with a corresponding mapping as you suggested: settings = { 'excel': excel_dialect, 'excel-tsv': excel_tabs_dialect, } then in the factory functions do something like: def reader(fileobj, dialect="excel", **kwds): kwargs = copy.copy(settings[dialect]) kwargs.update(kwds) # possible sanity check on kwargs here ... return _csv.reader(fileobj, **kwargs) Did that not make it out? I also think it's cleaner if we have a data file which is loaded at import time to define the various dialects. That way we aren't mixing too much data into our code. It also opens up the opportunity for users to later specify their own dialect data files. Where I indicated "possible sanity check" above would be a call to a validation function on the settings.
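[Editor's note: Skip's factory sketch above can be made runnable. This is a hedged sketch — the `resolve_dialect` helper name is illustrative, the option set is trimmed, and the real `_csv` engine is stubbed out by simply returning the merged options.]

```python
import copy

# Dialect settings as plain dicts; the keys follow the thread, but the
# set shown here is trimmed for illustration.
excel_dialect = {"quotechar": '"', "delimiter": ",", "lineterminator": "\r\n"}
excel_tsv_dialect = dict(excel_dialect, delimiter="\t")

settings = {"excel": excel_dialect, "excel-tsv": excel_tsv_dialect}

def resolve_dialect(dialect="excel", **kwds):
    """Look up a named dialect and apply per-call keyword overrides,
    as the reader/writer factory functions would before calling _csv."""
    kwargs = copy.copy(settings[dialect])
    kwargs.update(kwds)  # explicit keyword args win over dialect defaults
    return kwargs
```

A per-call override then behaves as the factory intends: `resolve_dialect("excel-tsv", quotechar="'")` yields the tab dialect with a single-quote quotechar while leaving the registered dict untouched.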
Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From LogiplexSoftware at earthlink.net Wed Jan 29 20:18:16 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 29 Jan 2003 11:18:16 -0800 Subject: [Csv] CSV interface question In-Reply-To: <15928.6851.934680.995625@montanaro.dyndns.org> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> Message-ID: <1043867895.16012.87.camel@software1.logiplex.internal> On Wed, 2003-01-29 at 10:17, Skip Montanaro wrote: > Cliff> You've lost me, I'm afraid. What I'm saying is that: > > Cliff> csvreader = reader(file("test_data/sfsample.csv", 'r'), > Cliff> dialect='excel') > > Cliff> isn't as flexible as > > Cliff> csvreader = reader(file("test_data/sfsample.csv", 'r'), > Cliff> dialect=excel) > > Cliff> where excel is either a pre-defined dictionary/class or a > Cliff> user-created dictionary/class. > > Yes, but my string just indexes into a mapping to get to the real dict which > stores the parameter settings, as I indicated in an earlier post: > > I was thinking of dialects as dicts. You'd have > excel_dialect = { "quotechar": '"', > "delimiter": ',', > "linetermintor": '\r\n', > ... > } > > with a corresponding mapping as you suggested: > > settings = { 'excel': excel_dialect, > 'excel-tsv: excel_tabs_dialect, } > > then in the factory functions do something like: > > def reader(fileobj, dialect="excel", **kwds): > kwargs = copy.copy(settings[dialect]) > kwargs.update(kwds) > # possible sanity check on kwargs here ... 
> return _csv.reader(fileobj, **kwargs) I understand this, but I think you miss my point (or I missed you with it ;) Consider now the programmer actually defining a new dialect: Passing a class or other structure (a dict is fine), they can create this on the fly with minimal work. Using a *string*, they must first "register" that string somewhere (probably in the mapping we agree upon) before they can actually make the function call. Granted, it's only an extra step, but it requires a bit more knowledge (of the mapping) and doesn't seem to provide a real benefit. If you prefer a mapping to a class, that is fine, but let's pass the mapping rather than a string referring to it: excel_dialect = { "quotechar": '"', "delimiter": ',', "linetermintor": '\r\n', ... } settings = { 'excel': excel, 'excel-tsv': excel_tabs, } def reader(fileobj, dialect=excel, **kwds): kwargs = copy.copy(dialect) kwargs.update(kwds) # possible sanity check on kwargs here ... return _csv.reader(fileobj, **kwargs) This allows the user to do such things as: mydialect = { ... } reader(fileobj, mydialect, ...) rather than mydialect = { ... } settings['mydialect'] = mydialect reader(fileobj, 'mydialect', ...) To use the settings table for getting a default, they can still use reader(fileobj, settings['excel-tsv'], ...) or just use the excel settings directly: reader(fileobj, excel_tsv, ...) (BTW, I prefer 'dialects' to 'settings' for the mapping name, just for consistency). I'll grant that the difference is small, but it still requires one extra line and one extra piece of knowledge with no real benefit to the programmer, AFAICT. If you don't agree I'll let it pass as it *is* a relatively minor difference. > Did that not make it out? I also think it's cleaner if we have a data file > which is loaded at import time to define the various dialects. That way we > aren't mixing too much data into our code. It also opens up the opportunity > for users to later specify their own dialect data files.
Where I indicated > "possible sanity check" above would be a call to a validation function on > the settings. +1 on this, but only if you cave on the other one -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Thu Jan 30 00:15:23 2003 From: djc at object-craft.com.au (Dave Cole) Date: 30 Jan 2003 10:15:23 +1100 Subject: [Csv] CSV interface question In-Reply-To: <15928.6851.934680.995625@montanaro.dyndns.org> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> Message-ID: Cliff> You've lost me, I'm afraid. What I'm saying is that: Cliff> csvreader = reader(file("test_data/sfsample.csv", 'r'), Cliff> dialect='excel') Cliff> isn't as flexible as Cliff> csvreader = reader(file("test_data/sfsample.csv", 'r'), Cliff> dialect=excel) Cliff> where excel is either a pre-defined dictionary/class or a Cliff> user-created dictionary/class. Skip> Yes, but my string just indexes into a mapping to get to the Skip> real dict which stores the parameter settings, as I indicated in Skip> an earlier post: Skip> Skip> I was thinking of dialects as dicts. You'd have Skip> Skip> excel_dialect = { "quotechar": '"', Skip> "delimiter": ',', Skip> "linetermintor": '\r\n', Skip> ... Skip> } Note the spelling error in "linetermintor" - user constructed dictionaries are not good. Whenever I find myself using dictionaries for storing values as opposed to indexing data I can't escape the feeling that my past as a Perl programmer is coming back to haunt me. 
At least with Perl there is some syntactic sugar to make this type of thing less ugly: excel_dialect = { quotechar => '"', delimiter => ',', linetermintor => '\r\n' } In the absence of that sugar I would prefer something like the following: class excel: quotechar = '"' delimiter = ',' linetermintor = '\r\n' settings = {} for dialect in (excel, exceltsv): settings[dialect.__name__] = dialect Maybe we could include a name attribute which allowed us to use 'excel-tsv' as a dialect identifier. Skip> with a corresponding mapping as you suggested: Skip> Skip> settings = { 'excel': excel_dialect, Skip> 'excel-tsv: excel_tabs_dialect, } Skip> Skip> then in the factory functions do something like: Skip> Skip> def reader(fileobj, dialect="excel", **kwds): Skip> kwargs = copy.copy(settings[dialect]) Skip> kwargs.update(kwds) Skip> # possible sanity check on kwargs here ... Skip> return _csv.reader(fileobj, **kwargs) With the class technique this would become: def reader(fileobj, dialect=excel, **kwds): kwargs = {} for key, value in dialect.__dict__.iteritems(): if not key.startswith('_'): kwargs[key] = value kwargs.update(kwds) return _csv.reader(fileobj, **kwargs) Skip> Did that not make it out? I also think it's cleaner if we have Skip> a data file which is loaded at import time to define the various Skip> dialects. That way we aren't mixing too much data into our Skip> code. It also opens up the opportunity for users to later Skip> specify their own dialect data files. Where I indicated Skip> "possible sanity check" above would be a call to a validation Skip> function on the settings. Hmmm... Hard and messy to define classes on the fly. Then we are back to some kind of dialect object. 
class dialect: def __init__(self, quotechar='"', delimiter=',', lineterminator='\r\n'): self.quotechar = quotechar self.delimiter = delimiter self.lineterminator = lineterminator settings = { 'excel': dialect(), 'excel-tsv': dialect(delimiter='\t') } def add_dialect(name, dialect): settings[name] = dialect def reader(fileobj, args='excel', **kwds): kwargs = {} if not isinstance(args, dialect): args = settings[args] kwargs.update(args.__dict__) kwargs.update(kwds) return _csv.reader(fileobj, **kwargs) This would then allow you to extend the settings dictionary on the fly, or simply pass your own dialect object. >>> import csv >>> my_dialect = csv.dialect(lineterminator = '\f') >>> rdr = csv.reader(file('blah.csv'), my_dialect) - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Thu Jan 30 00:16:57 2003 From: djc at object-craft.com.au (Dave Cole) Date: 30 Jan 2003 10:16:57 +1100 Subject: [Csv] CSV interface question In-Reply-To: <1043867895.16012.87.camel@software1.logiplex.internal> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> <1043867895.16012.87.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: >> Did that not make it out? I also think it's cleaner if we have a >> data file which is loaded at import time to define the various >> dialects. That way we aren't mixing too much data into our code. >> It also opens up the opportunity for users to later specify their >> own dialect data files. Where I indicated "possible sanity check" >> above would be a call to a validation function on the settings.
Cliff> +1 on this, but only if you cave on the other one LOL. Have you considered a career as a politician? - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Thu Jan 30 00:19:32 2003 From: djc at object-craft.com.au (Dave Cole) Date: 30 Jan 2003 10:19:32 +1100 Subject: [Csv] Sniffing dialects In-Reply-To: <15927.65132.432457.594501@montanaro.dyndns.org> References: <15927.65132.432457.594501@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> If my notion of dialects as dicts isn't too far off-base, the Skip> sniffing code could just return a dict. That would be a good Skip> way to define new dialects. Someone could send us a CSV file Skip> from a particular application. We'd turn the sniffer loose on Skip> it then append the result to our dialects.csv file. I am all for dialects as attribute only objects. You get the same effect as a dict but with less Perlish syntax. Skip> (A different version of) the sniffer could take an optional Skip> dialect string as an arg and either use it as the starting point Skip> (for stuff it can't discern, like hard returns in CSV files Skip> which don't contain any) or tell you if the input file is Skip> compatible with that dialect. 
+1 - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Thu Jan 30 00:25:23 2003 From: djc at object-craft.com.au (Dave Cole) Date: 30 Jan 2003 10:25:23 +1100 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <15928.271.283784.851985@montanaro.dyndns.org> References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> <15927.15551.93504.635849@montanaro.dyndns.org> <15928.271.283784.851985@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Dave> This is what I was trying to say: >>>> w = csv.writer(sys.stdio) w.write(['','hello',None]) Skip> ',hello,\n' >>>> r = csv.reader(StringIO('None,hello,')) for l in csv: print r Skip> ['None','hello',''] Skip> I think we need to limit the data which can be output to Skip> strings, Unicode strings (if we use an encoded stream), floats Skip> and ints. Anything else should raise TypeError. Dave> Is there any merit having the writer handling non-string data by Dave> producing an empty field for None, and the result of Dave> PyObject_Str() for all other values? Skip> We could do like some of the DB API modules do and provide Skip> mappings which take the types of objects and see if a function Skip> exists to handle that type. If so, whatever that function Skip> returns would be what was written. This could handle the case Skip> of None (allowing the user to specify how it was mapped), but Skip> could also be used to massage data of known type (for example, Skip> to round all floats to two decimal places). Skip> I think this sort of capability should wait until the second Skip> generation though. I think this would make things too slow. The Python core already has a convenience function for doing the necessary conversion; PyObject_Str(). 
If we are in a hurry we could document the existing low level writer behaviour which is to invoke PyObject_Str() for all non-string values except None. None is translated to ''. Skip> I just tried "never" while saving CSV data from Gnumeric. It Skip> didn't escape embedded commas, so it effectively toasted the Skip> data. Dave> I have seen that happen in other applications. Skip> Needless to say, our csv module should *not* do that. Fried Skip> data, when accompanied by angry mobs, doesn't taste too good. Skip> If the user specifies "never", I think an exception should be Skip> raised if no escape character is defined and fields containing Skip> the delimiter are encountered. Should the _csv parser should sanity check the combination of options in the constructor, or when told to write data which is broken? It is possible to define no quote or escape character but still write valid data. 1,2,3,4 - Dave -- http://www.object-craft.com.au _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From andrewm at object-craft.com.au Thu Jan 30 00:43:45 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 30 Jan 2003 10:43:45 +1100 Subject: [Csv] CSV interface question In-Reply-To: Message from Skip Montanaro <15928.3265.630020.528438@montanaro.dyndns.org> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <15928.3265.630020.528438@montanaro.dyndns.org> Message-ID: <20030129234345.3CE6D3C32B@coffee.object-craft.com.au> > Andrew> csv.reader(fileobj, csv.dialect.excel) > > Andrew> Thoughts? > > Dave> Is there a downside to this? I can't see one immediately. > >With the dialect concept all we are talking about is a collection of >parameter settings. Encapsulating that as subclasses seems like it hides >the data-oriented nature behind the facade of source code. 
It has the virtue that sub-classing can be used to represent related variants. So, excel-tab might be: class excel-tab(excel): delimiter = '\t' This could also be useful for users of the module: class funky(excel): quotes = "'" Essentially we'd be using classes as glorified dictionaries with cascading values. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From andrewm at object-craft.com.au Thu Jan 30 00:45:26 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 30 Jan 2003 10:45:26 +1100 Subject: [Csv] CSV interface question In-Reply-To: Message from Dave Cole References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> Message-ID: <20030129234526.2B1943C32B@coffee.object-craft.com.au> >With the class technique this would become: > >def reader(fileobj, dialect=excel, **kwds): > kwargs = {} > for key, value in dialect.__dict__.iteritems(): > if not key.startswith('_'): > kwargs[key] = value > kwargs.update(kwds) > return _csv.reader(fileobj, **kwargs) BTW, your method of extracting directly from the instance's __dict__ doesn't pick up class attributes. In my prototype implementation, I used getattr instead. 
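[Editor's note: Andrew's point about `__dict__` versus `getattr` can be shown in a few lines. A hedged sketch with illustrative class names:]

```python
# Dialects as classes with cascading values via inheritance.
class excel:
    delimiter = ","
    quotechar = '"'

class excel_tab(excel):
    delimiter = "\t"

d = excel_tab()
# Class attributes never appear in an instance's __dict__, so iterating
# it directly yields no options at all:
empty = d.__dict__

# ... but dir() plus getattr walks the class and all its bases:
opts = {name: getattr(d, name) for name in dir(d) if not name.startswith("_")}
```

Here `opts` ends up with both the inherited `quotechar` and the overriding `delimiter`, which is the behavior the keyword-argument extraction needs.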
-- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From djc at object-craft.com.au Thu Jan 30 00:49:44 2003 From: djc at object-craft.com.au (Dave Cole) Date: 30 Jan 2003 10:49:44 +1100 Subject: [Csv] CSV interface question In-Reply-To: <20030129234345.3CE6D3C32B@coffee.object-craft.com.au> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <15928.3265.630020.528438@montanaro.dyndns.org> <20030129234345.3CE6D3C32B@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: Andrew> csv.reader(fileobj, csv.dialect.excel) >> Andrew> Thoughts? >> Dave> Is there a downside to this? I can't see one immediately. >> With the dialect concept all we are talking about is a collection >> of parameter settings. Encapsulating that as subclasses seems like >> it hides the data-oriented nature behind the facade of source code. Andrew> It has the virtue that sub-classing can be used to represent Andrew> related variants. So, excel-tab might be: Andrew> class excel-tab(excel): Andrew> delimiter = '\t' Not sure the python interpreter will like that class name :-) Andrew> This could also be useful for users of the module: Andrew> class funky(excel): Andrew> quotes = "'" Andrew> Essentially we'd be using classes as glorified dictionaries Andrew> with cascading values. I would prefer attribute only flat objects. The alternative would have us traversing inheritance trees to extract class dictionaries. class dialect: def __init__(self, delimiter=',', ...): self.delimiter = delimiter : >>> funky = csv.copy_dialect('excel') >>> funky.quotes = "'" Not as nice as subclassing, but probably good enough. 
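[Editor's note: Dave's `copy_dialect` idea for flat attribute-only objects might look like this. A sketch only — the registry name and trimmed option set are assumptions, not settled API.]

```python
import copy

class dialect:
    def __init__(self, delimiter=",", quotechar='"', lineterminator="\r\n"):
        self.delimiter = delimiter
        self.quotechar = quotechar
        self.lineterminator = lineterminator

settings = {"excel": dialect(), "excel-tsv": dialect(delimiter="\t")}

def copy_dialect(name):
    """Return an independent copy of a registered dialect, safe to mutate."""
    return copy.copy(settings[name])

funky = copy_dialect("excel")
funky.quotechar = "'"  # tweak the copy; the registered original is untouched
```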
- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Thu Jan 30 00:57:57 2003 From: djc at object-craft.com.au (Dave Cole) Date: 30 Jan 2003 10:57:57 +1100 Subject: [Csv] CSV interface question In-Reply-To: <20030129234526.2B1943C32B@coffee.object-craft.com.au> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> <20030129234526.2B1943C32B@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >> With the class technique this would become: >> >> def reader(fileobj, dialect=excel, **kwds): >> kwargs = {} >> for key, value in dialect.__dict__.iteritems(): >> if not key.startswith('_'): >> kwargs[key] = value >> kwargs.update(kwds) >> return _csv.reader(fileobj, **kwargs) Andrew> BTW, your method of extracting directly from the instance's Andrew> __dict__ doesn't pick up class attributes. In my prototype Andrew> implementation, I used getattr instead. Ahhh... So does this mean that we can go back to classes? 
class dialect: quotechar = '"' delimiter = ',' lineterminator = '\r\n' dialect_opts = [attr for attr in dir(dialect) if not attr.startswith('_')] excel = dialect class excel_tsv(excel): delimiter = '\t' def reader(fileobj, dialectobj=excel, **kwds): kwargs = {} for opt in dialect_opts: kwargs[opt] = getattr(dialectobj, opt) kwargs.update(kwds) return _csv.reader(fileobj, **kwargs) -- http://www.object-craft.com.au From skip at pobox.com Thu Jan 30 02:47:20 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 19:47:20 -0600 Subject: [Csv] CSV interface question In-Reply-To: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> Message-ID: <15928.33832.843555.867347@montanaro.dyndns.org> Andrew> In the proposed PEP, we have separate instances for reading and Andrew> writing. In the Object Craft csv module, a single instance is Andrew> shared by the parse and join methods - the only virtue of this Andrew> is config is shared (so the same options are used to write the Andrew> file as were used to read the file). ... Andrew> The idea being you'd then re-write the file with the same Andrew> sniffed options. In my work, I rarely read and write the same file. I either read a file, then shoot it to a database or go the other way. In situations where the input and output are both CSV files, at least one is stdout, and there is almost always something different about the reading and writing parameters. Andrew> Another idea occurs - looping over an iteratable is going to be Andrew> common - we could probably supply a convenience function, say Andrew> "writelines(iteratable)"? Seems reasonable. 
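[Editor's note: the `writelines(iterable)` convenience Andrew floats could wrap the row-at-a-time writer like this. A sketch — the join logic stands in for the real `_csv` engine, and the class name is illustrative.]

```python
import io

class Writer:
    """Toy writer: one write() per row, plus a writelines() convenience."""
    def __init__(self, fileobj, delimiter=",", lineterminator="\r\n"):
        self.fileobj = fileobj
        self.delimiter = delimiter
        self.lineterminator = lineterminator

    def write(self, row):
        # Stand-in for the real quoting/escaping engine.
        self.fileobj.write(self.delimiter.join(map(str, row)) + self.lineterminator)

    def writelines(self, iterable):
        # Loop over any iterable of rows - the common bulk-export case.
        for row in iterable:
            self.write(row)

buf = io.StringIO()
Writer(buf).writelines([["a", 1], ["b", 2]])
```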
Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 02:57:30 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 19:57:30 -0600 Subject: [Csv] CSV interface question In-Reply-To: <1043867895.16012.87.camel@software1.logiplex.internal> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> <1043867895.16012.87.camel@software1.logiplex.internal> Message-ID: <15928.34442.337899.905054@montanaro.dyndns.org> Cliff> Consider now the programmer actually defining a new dialect: Cliff> Passing a class or other structure (a dict is fine), they can Cliff> create this on the fly with minimal work. Using a *string*, they Cliff> must first "register" that string somewhere (probably in the Cliff> mapping we agree upon) before they can actually make the function Cliff> call. Granted, it's only a an extra step, but it requires a bit Cliff> more knowledge (of the mapping) and doesn't seem to provide a Cliff> real benefit. If you prefer a mapping to a class, that is fine, Cliff> but lets pass the mapping rather than a string referring to it: Somewhere I think we still need to associate string names with these beasts. Maybe it's just another attribute: class dialect: name = None class excel(dialect): name = "excel" ... They should all be collected together for operation as a group. This could be so a GUI knows all the names to present or so a sniffer can return all the dialects with which a sample file is compatible. Both operations suggest the need to register dialects somehow. 
Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 03:07:37 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 20:07:37 -0600 Subject: [Csv] CSV interface question In-Reply-To: References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> Message-ID: <15928.35049.893899.848768@montanaro.dyndns.org> Skip> I was thinking of dialects as dicts.... Dave> Note the spelling error in "linetermintor" - user constructed Dave> dictionaries are not good. Yeah, but Cliff's dialect validator would have caught that. ;-) Dave> Whenever I find myself using dictionaries for storing values as Dave> opposed to indexing data I can't escape the feeling that my past Dave> as a Perl programmer is coming back to haunt me. At least with Dave> Perl there is some syntactic sugar to make this type of thing less Dave> ugly: Dave> excel_dialect = { quotechar => '"', Dave> delimiter => ',', Dave> linetermintor => '\r\n' } Other than losing a couple quote marks and substituting => for : I don't see how the Perl syntax is any better. Note also that with dicts you can simply pass them as keyword args: return _csv.reader(..., **kwdargs) You'll have to do a little more work with classes to make that work (a subclass's __dict__ attribute does not include the parent class's __dict__ contents) and with the possibility of new-style classes you will have to work even harder. Dave> Maybe we could include a name attribute which allowed us to use Dave> 'excel-tsv' as a dialect identifier. As I mentioned in my last post, I think name attributes will be necessary, at least for human consumption. 
Dave> def reader(fileobj, dialect=excel, **kwds): Dave> kwargs = {} Dave> for key, value in dialect.__dict__.iteritems(): Dave> if not key.startswith('_'): Dave> kwargs[key] = value Dave> kwargs.update(kwds) Dave> return _csv.reader(fileobj, **kwargs) Not quite. You need to traverse the bases to pick up everything. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 03:10:58 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 20:10:58 -0600 Subject: [Csv] Sniffing dialects In-Reply-To: References: <15927.65132.432457.594501@montanaro.dyndns.org> Message-ID: <15928.35250.269725.510622@montanaro.dyndns.org> Dave> I am all for dialects as attribute only objects. You get the same Dave> effect as a dict but with less Perlish syntax. I'll cave on this one, but I still think dicts are the better solution, especially if dialects might be read from data files. There's also the issue of mapping dialects as classes onto keyword argument dicts. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From andrewm at object-craft.com.au Thu Jan 30 03:21:13 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 30 Jan 2003 13:21:13 +1100 Subject: [Csv] Sniffing dialects In-Reply-To: Message from Skip Montanaro <15928.35250.269725.510622@montanaro.dyndns.org> References: <15927.65132.432457.594501@montanaro.dyndns.org> <15928.35250.269725.510622@montanaro.dyndns.org> Message-ID: <20030130022113.7E9E93C32B@coffee.object-craft.com.au> > Dave> I am all for dialects as attribute only objects. You get the same > Dave> effect as a dict but with less Perlish syntax. > >I'll cave on this one, but I still think dicts are the better solution, >especially if dialects might be read from data files. 
There's also the >issue of mapping dialects as classes onto keyword argument dicts. Have you had a look at the code I checked in as csv.py in the sandbox? Aside from the inheritance, I prefer dicts. But the inheritance feels like a valuable addition. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 03:21:51 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 20:21:51 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: References: <15926.1287.36487.12649@montanaro.dyndns.org> <15926.64576.481489.373053@montanaro.dyndns.org> <15927.15551.93504.635849@montanaro.dyndns.org> <15928.271.283784.851985@montanaro.dyndns.org> Message-ID: <15928.35903.293437.273039@montanaro.dyndns.org> Skip> We could do like some of the DB API modules do and provide Skip> mappings which take the types of objects and see if a function Skip> exists to handle that type. Dave> I think this would make things too slow. Not in the typical case. The typical case would be the null converter case. Dave> The Python core already has a convenience function for doing the Dave> necessary conversion; PyObject_Str(). This smacks of implicit type conversions to me, which has been the bane of my interaction with Perl (via XML-RPC). I still think we have no business writing anything but strings, Unicode strings (encoded by codecs.open()), ints and floats to CSV files. Exceptions should be raised for anything else, even None. An empty field is "". Dave> If we are in a hurry we could document the existing low level Dave> writer behaviour which is to invoke PyObject_Str() for all Dave> non-string values except None. None is translated to ''. I really still dislike this whole None thing. Whose use case is that anyway? Skip> Needless to say, our csv module should *not* do that. 
Fried data, Skip> when accompanied by angry mobs, doesn't taste too good. If the Skip> user specifies "never", I think an exception should be raised if Skip> no escape character is defined and fields containing the delimiter Skip> are encountered. Dave> Should the _csv parser should sanity check the combination of Dave> options in the constructor, or when told to write data which is Dave> broken? I think only when a row is written which would create an ambiguous row. Upon reading you have no real choice. If there's an unescaped embedded delimiter in an unquoted field, how is the reader object to know the user doesn't want multiple fields? Dave> It is possible to define no quote or escape character but still Dave> write valid data. Dave> 1,2,3,4 Yup, and it should work okay, only barfing when there is an actual ambiguity. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 03:23:10 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 20:23:10 -0600 Subject: [Csv] CSV interface question In-Reply-To: References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <15928.3265.630020.528438@montanaro.dyndns.org> <20030129234345.3CE6D3C32B@coffee.object-craft.com.au> Message-ID: <15928.35982.667260.27999@montanaro.dyndns.org> Dave> I would prefer attribute only flat objects. Sounds like a dictionary to me. 
;-) Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 03:29:00 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 20:29:00 -0600 Subject: [Csv] Sniffing dialects In-Reply-To: <20030130022113.7E9E93C32B@coffee.object-craft.com.au> References: <15927.65132.432457.594501@montanaro.dyndns.org> <15928.35250.269725.510622@montanaro.dyndns.org> <20030130022113.7E9E93C32B@coffee.object-craft.com.au> Message-ID: <15928.36332.883241.640958@montanaro.dyndns.org> >> I'll cave on this one, but I still think dicts are the better >> solution, especially if dialects might be read from data files. >> There's also the issue of mapping dialects as classes onto keyword >> argument dicts. Andrew> Have you had a look at the code I checked in as csv.py in the Andrew> sandbox? Not since midday, and I wasn't looking closely. Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 03:48:59 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 20:48:59 -0600 Subject: [Csv] Status Message-ID: <15928.37531.445243.692589@montanaro.dyndns.org> It would appear we are converging on dialects as data-only classes (subclassable but with no methods). I'll update the PEP. Many other ideas have been floating through the list, and while I haven't been deleting the messages, I haven't been adding them to the PEP either. Can someone help with that? I'd like to get the wording in the PEP to converge on our current thoughts and announce it on c.l.py and python-dev sometime tomorrow. I think we will get a lot of feedback from both camps, hopefully some of it useful. ;-) Sound like a plan? 
I just finished making a pass through the messages I hadn't deleted (and
then saved them to a csv mbox file since the list appears to still not be
archiving). Here's what I think we've concluded:

 * Dialects are a set of defaults, probably implemented as classes (which
   allows subclassing, whereas dicts wouldn't) and the default dialect
   named as something like csv.dialects.excel or "excel" if we allow
   string specifiers. (I think strings work well at the API, simply
   because they are shorter and can more easily be presented in GUI
   tools.)

 * A csvutils module should be at least scoped out which might do a fair
   number of things:
   - Implements one or more sniffers for parameter types
   - Validates CSV files (e.g., constant number of columns, type
     constraints on column values, compares against given dialect)
   - Generate a sniffer from a CSV file

 * These individual parameters are necessary (hopefully the names will be
   enough clue as to their meaning): quote_char, quoting ("auto",
   "always", "nonnumeric", "never"), delimiter, line_terminator,
   skip_whitespace, escape_char, hard_return. Are there others?

 * We're still undecided about None (I certainly don't think it's a valid
   value to be writing to CSV files)

 * Rows can have variable numbers of columns and the application is
   responsible for deciding on and enforcing max_rows or max_cols.

 * Don't raise exceptions needlessly. For example, specifying
   quoting="never" and not specifying a value for escape_char would be
   okay until you encounter a field when writing which contains the
   delimiter.

 * Files have to be opened in binary mode (we can check the mode
   attribute I believe) so we can do the right thing with line
   terminators.

 * Data values should always be returned as strings, even if they are
   valid numbers. Let the application do data conversion.

Other stuff we haven't talked about much:

 * Unicode. I think we punt on this for now and just pretend that
   passing codecs.open(csvfile, mode, encoding) is sufficient.
I'm sure Martin von Löwis will let us know if it isn't. ;-) Dave said,
"The low level parser (C code) is probably going to need to handle
unicode." Let's wait and see how well codecs.open() works for us.

 * We know we need tests but haven't talked much about them. I vote for
   PyUnit as much as possible, though a certain amount of manual testing
   using existing spreadsheets and databases will be required.

 * Exceptions. We know we need some. We should start with CSVError and
   try to avoid getting carried away with things. If need be, we can add
   a code field to the class. I don't like the idea of having 17
   different subclasses of CSVError though. It's too much complexity for
   most users.

Skip

_______________________________________________
Csv mailing list
Csv at mail.mojam.com
http://manatee.mojam.com/mailman/listinfo/csv

From altis at semi-retired.com Thu Jan 30 04:01:19 2003
From: altis at semi-retired.com (Kevin Altis)
Date: Wed, 29 Jan 2003 19:01:19 -0800
Subject: [Csv] Re: First Cut at CSV PEP
In-Reply-To: <15928.35903.293437.273039@montanaro.dyndns.org>
Message-ID:

> From: Skip Montanaro
>
> Dave> The Python core already has a convenience function for doing the
> Dave> necessary conversion; PyObject_Str().
>
> This smacks of implicit type conversions to me, which has been the bane of
> my interaction with Perl (via XML-RPC). I still think we have no business
> writing anything but strings, Unicode strings (encoded by codecs.open()),
> ints and floats to CSV files. Exceptions should be raised for anything
> else, even None. An empty field is "".
>
> Dave> If we are in a hurry we could document the existing low level
> Dave> writer behaviour which is to invoke PyObject_Str() for all
> Dave> non-string values except None. None is translated to ''.
>
> I really still dislike this whole None thing. Whose use case is that
> anyway?

I think I brought up None. There was some initial confusion because Cliff's
DSV exporter was doing the wrong thing.
My feeling is that if you have a list [5, 'Bob', None, 1.1] as a csv with
the Excel dialect that becomes 5,Bob,,1.1 Are you saying that you want to
throw an exception instead? Booleans may also present a problem.

I was mostly thinking in terms of importing and exporting data from
embedded databases like MetaKit, my own list of dictionaries (flatfile
stuff), PySQLite, Gadfly. Anyway, the implication might be that it is
necessary for the user to sanitize data as part of the export operation
too. Have to ponder that. Regardless, we have to be careful to not make
this too complicated or it will be worse than nothing.

Quotes aren't going to get used in the case above unless you've specified
to always use them (overridden part of the Excel dialect), because no field
contains the comma separator character. Now that I look at this again the
Access export dialog I sent in an earlier email shows that the default
Access csv is actually a separate dialect because they specifically call
out the "Text qualifier" while numbers, empty fields (probably NULLS in
SQL?) will not have quotes, only text fields will.

To further complicate things I'm now wondering what happens with numbers in
Europe or elsewhere where the comma is used instead of a decimal point so
1.1 is 1,1 or does that not actually occur and I'm remembering some
localization issues incorrectly?

Reading in 5,Bob,,1.1 becomes ['5', 'Bob', '', '1.1'] because we said we
weren't going to do further processing, the user code should do further
conversions as part of the iteration.

I'm way behind on reading all the emails. I got bogged down in a bunch of
Mac OS X testing... I'll try and dig through them a little tomorrow and
Friday. If we put together the unittest test cases first then our input,
output, and expected results for processing would be clear for a given
dialect.
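[Editor's note: the behaviour Kevin describes can be sketched with a tiny hypothetical helper — not the module's API — showing both sides of the debate: silently writing None as an empty field, versus raising as Skip prefers:]

```python
# Hypothetical formatter (naive: no quoting), illustrating the two
# behaviours under discussion for None values.
def format_row(row, strict=False):
    fields = []
    for value in row:
        if value is None:
            if strict:
                # Skip's preference: None in a row is a bug in the caller.
                raise ValueError("None is not a valid CSV value")
            # Kevin's reading: None becomes an empty field.
            fields.append('')
        else:
            fields.append(str(value))
    return ','.join(fields)

print(format_row([5, 'Bob', None, 1.1]))   # 5,Bob,,1.1

# Reading back yields only strings; conversion is the application's job:
print(format_row([5, 'Bob', None, 1.1]).split(','))   # ['5', 'Bob', '', '1.1']
```

Note the round trip is lossy either way: an empty string and None both come back as `''`.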
ka

_______________________________________________
Csv mailing list
Csv at mail.mojam.com
http://manatee.mojam.com/mailman/listinfo/csv

From andrewm at object-craft.com.au Thu Jan 30 04:12:54 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 30 Jan 2003 14:12:54 +1100
Subject: [Csv] Status
In-Reply-To: Message from Skip Montanaro <15928.37531.445243.692589@montanaro.dyndns.org>
References: <15928.37531.445243.692589@montanaro.dyndns.org>
Message-ID: <20030130031254.D2E853C32B@coffee.object-craft.com.au>

>I'd like to get the wording in the PEP to converge on our current thoughts
>and announce it on c.l.py and python-dev sometime tomorrow. I think we will
>get a lot of feedback from both camps, hopefully some of it useful. ;-)
>
>Sound like a plan?

Yep, pending an ACK from the others.

>I just finished making a pass through the messages I hadn't deleted (and
>then saved them to a csv mbox file since the list appears to still not be
>archiving). Here's what I think we've concluded:

I have all the messages archived, which I can forward to you in a
convenient form for feeding to mailman.

> * Dialects are a set of defaults, probably implemented as classes (which
>   allows subclassing, whereas dicts wouldn't) and the default dialect
>   named as something like csv.dialects.excel or "excel" if we allow
>   string specifiers. (I think strings work well at the API, simply
>   because they are shorter and can more easily be presented in GUI
>   tools.)

I think you are right - we need strings as well, and a way to list them.
But exposing the "dialects are classes" to the user of the module is
valuable. I'd vote +1 on giving the class a "name" attribute, and the
dialects should probably share a common null root class (say "dialect") -
the "list_dialects()" function could then walk the csv.dialects namespace
returning the names of any classes found that are subclasses of dialect?
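[Editor's note: a rough sketch of the scheme Andrew describes — the class and attribute names here are assumptions, not a settled API:]

```python
# A null root class, dialect subclasses carrying a "name" attribute, and a
# list_dialects() that walks a namespace for Dialect subclasses.
class Dialect:
    pass

class excel(Dialect):
    name = 'excel'
    delimiter = ','
    quotechar = '"'

class excel_tsv(excel):
    name = 'excel-tsv'
    delimiter = '\t'

def list_dialects(namespace):
    # Collect the name of every Dialect subclass found in the namespace,
    # skipping the root class itself.
    return sorted(
        obj.name
        for obj in namespace.values()
        if isinstance(obj, type) and issubclass(obj, Dialect)
        and obj is not Dialect
    )

print(list_dialects(globals()))   # ['excel', 'excel-tsv']
```

In the real module the namespace walked would presumably be `vars(csv.dialects)` rather than `globals()`.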
> * These individual parameters are necessary (hopefully the names will be
>   enough clue as to their meaning): quote_char, quoting ("auto",
>   "always", "nonnumeric", "never"), delimiter, line_terminator,
>   skip_whitespace, escape_char, hard_return. Are there others?

Not that I can think of at the moment. As other dialects appear, we may
want to add new parameters anyway.

> * We're still undecided about None (I certainly don't think it's a valid
>   value to be writing to CSV files)

I suspect we're in violent agreement? If the user happens to pass None, it
should be written as a null field. On input, a null field should be
returned as a zero length string. Is that what you were suggesting?

> * Don't raise exceptions needlessly. For example, specifying
>   quoting="never" and not specifying a value for escape_char would be
>   okay until you encounter a field when writing which contains the
>   delimiter.

I don't like this specific one. Because it depends on the data, the module
user may not pick up their error during testing. Better to raise an
exception immediately if we know the format is invalid.

This is an argument I have over and over - I believe it's nearly always
better to push errors back towards their source. In spite of how it
sounds, this isn't really at odds with "be liberal in what you accept, be
strict in what you generate".

> * Files have to be opened in binary mode (we can check the mode
>   attribute I believe) so we can do the right thing with line
>   terminators.

We need to be a little careful when using uncommon interfaces on the file
class, because file-like classes may not have implemented them (for
example, StringIO doesn't have the mode attribute).

> * Data values should always be returned as strings, even if they are
>   valid numbers. Let the application do data conversion.

Yes. +1

>Other stuff we haven't talked about much:
>
> * Unicode. I think we punt on this for now and just pretend that
>   passing codecs.open(csvfile, mode, encoding) is sufficient.
> I'm sure Martin von Löwis will let us know if it isn't. ;-) Dave said,
> "The low level parser (C code) is probably going to need to handle
> unicode." Let's wait and see how well codecs.open() works for us.

I'm almost 100% certain the C code will need work. But it should be the
sort of work that can be done without disturbing the interface too much?

> * We know we need tests but haven't talked much about them. I vote for
>   PyUnit as much as possible, though a certain amount of manual testing
>   using existing spreadsheets and databases will be required.

This is the big one - tests are absolutely essential. I put a bit of effort
into coming up with a bunch of "this is how Excel does it with this unusual
case" tests for our csv module - we can use this as a start. I haven't
investigated how the official python test harness works - it predates
pyunit.

> * Exceptions. We know we need some. We should start with CSVError and
>   try to avoid getting carried away with things. If need be, we can add
>   a code field to the class. I don't like the idea of having 17
>   different subclasses of CSVError though. It's too much complexity for
>   most users.

Agreed.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

_______________________________________________
Csv mailing list
Csv at mail.mojam.com
http://manatee.mojam.com/mailman/listinfo/csv

From skip at pobox.com Thu Jan 30 04:35:49 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 29 Jan 2003 21:35:49 -0600
Subject: [Csv] Status
In-Reply-To: <20030130031254.D2E853C32B@coffee.object-craft.com.au>
References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au>
Message-ID: <15928.40341.991680.82247@montanaro.dyndns.org>

>> * We're still undecided about None (I certainly don't think it's a
>> valid value to be writing to CSV files)

Andrew> I suspect we're in violent agreement?
Andrew> If the user happens to pass None, it should be written as a null
Andrew> field. On input, a null field should be returned as a zero length
Andrew> string. Is that what you were suggesting?

Not really. In my mind, if I try to write

    [5.0, "marty", "golden slippers", None]

then I have a bug somewhere. I *don't* want None silently converted to ''.

>> * Don't raise exceptions needlessly. For example, specifying
>> quoting="never" and not specifying a value for escape_char would be
>> okay until you encounter a field when writing which contains the
>> delimiter.

Andrew> I don't like this specific one. Because it depends on the data,
Andrew> the module user may not pick up their error during testing.
Andrew> Better to raise an exception immediately if we know the format
Andrew> is invalid.

I can live with that. I would propose then that escape_char default to
something reasonable, not None.

Andrew> This is an argument I have over and over - I believe it's nearly
Andrew> always better to push errors back towards their source. In spite
Andrew> of how it sounds, this isn't really at odds with "be liberal in
Andrew> what you accept, be strict in what you generate".

If I cave on this, then you have to cave on None. ;-)

>> * Files have to be opened in binary mode (we can check the mode
>> attribute I believe) so we can do the right thing with line
>> terminators.

Andrew> We need to be a little careful when using uncommon interfaces on
Andrew> the file class, because file-like classes may not have
Andrew> implemented them (for example, StringIO doesn't have the mode
Andrew> attribute).

Correct. That occurred to me as well. Do we just punt if
hasattr(fileobj, "mode") returns False?

>> * Unicode. I think we punt on this for now and just pretend that
>> passing codecs.open(csvfile, mode, encoding) is sufficient. I'm sure
>> Martin von Löwis will let us know if it isn't. ;-) Dave said, "The low
>> level parser (C code) is probably going to need to handle unicode."
>> Let's wait and see how well codecs.open() works for us. Andrew> I'm almost 100% certain the C code will need work. But it should Andrew> the sort of work that can be done without disturbing the Andrew> interface too much? "Handle Unicode" probably doesn't mean messing with encoding/decoding issues though. Let the user deal with them. Andrew> I haven't investigated how the official python test harness Andrew> works - it predates pyunit. Most new tests are written using unittest (nee PyUnit) and many existing tests are getting converted. If we use the core test framework for as much as we can, our unit tests will just move cleanly from the sandbox to Lib/test/. Now to see about Mailman 2.1... Skip From skip at pobox.com Thu Jan 30 04:49:44 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 21:49:44 -0600 Subject: [Csv] Mailman upgrade Thursday on manatee.mojam.com Message-ID: <15928.41176.310569.780900@montanaro.dyndns.org> To all people subscribed to mailing lists hosted on manatee.mojam.com: I plan to upgrade the Mailman software on manatee.mojam.com (aka mail.mojam.com) sometime Thursday. I don't know the exact time because it will be a sort of as-I-have-time sort of thing. To perform the upgrade I will have to shut down mail service on the system for a time. I hope to keep that time to a minimum, but it will depend on what problems I encounter. During that time mail should queue up on remote hosts. Don't be alarmed if mail messages from your favorite mailing list stops arriving for awhile. I'll send out another message once the upgrade is complete or I've utterly failed and fallen back to the older version. 
--
Skip Montanaro
skip at pobox.com
http://www.musi-cal.com/

_______________________________________________
Csv mailing list
Csv at mail.mojam.com
http://manatee.mojam.com/mailman/listinfo/csv

From andrewm at object-craft.com.au Thu Jan 30 04:51:41 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 30 Jan 2003 14:51:41 +1100
Subject: [Csv] Status
In-Reply-To: Message from Skip Montanaro <15928.40341.991680.82247@montanaro.dyndns.org>
References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au> <15928.40341.991680.82247@montanaro.dyndns.org>
Message-ID: <20030130035141.271EA3C32B@coffee.object-craft.com.au>

>Not really. In my mind, if I try to write
>
>    [5.0, "marty", "golden slippers", None]
>
>then I have a bug somewhere. I *don't* want None silently converted to ''.

I think you might be right.

[invalid combinations of options]
>I can live with that. I would propose then that escape_char default to
>something reasonable, not None.

That's a little hairy, because the resulting file can't be parsed correctly
by Excel. But it should be safe if the escape_char is only emitted if quote
is set to none.

>If I cave on this, then you have to cave on None. ;-)

*-)

[binary file mode, StringIO has no mode attribute]
>Correct. That occurred to me as well. Do we just punt if
>hasattr(fileobj, "mode") returns False?

Yes (or just catch the AttributeError and ignore it).

>"Handle Unicode" probably doesn't mean messing with encoding/decoding
>issues though. Let the user deal with them.

But the C code will care if it's passed a unicode string (which, I
understand, are not 8 bits per character - typically 16 bits). And the
escape_char, etc, will be 16 bits. I understand that some of the other C
modules are compiled twice and #define tricks are used to produce two
versions that perform optimally on their respective string type.

>Now to see about Mailman 2.1...

Did you try my suggestion?
I have a vague memory of there being an earlier version of Mailman that forgot to create that file. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 05:09:06 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 22:09:06 -0600 Subject: [Csv] Status In-Reply-To: <20030130035141.271EA3C32B@coffee.object-craft.com.au> References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au> <15928.40341.991680.82247@montanaro.dyndns.org> <20030130035141.271EA3C32B@coffee.object-craft.com.au> Message-ID: <15928.42338.298849.316715@montanaro.dyndns.org> >> "Handle Unicode" probably doesn't mean messing with encoding/decoding >> issues though. Let the user deal with them. Andrew> But the C code will care if it's passed a unicode string (which, Andrew> I understand, are not 8 bits per character - typically 16 Andrew> bits). And the escape_char, etc, will be 16 bits. I understand Andrew> that some of the other C modules are compiled twice and #define Andrew> tricks are used to produce two versions that perform optimally Andrew> on their respective string type. In the C code can't you just look up "split", "join", "__add__" and such and not care that you are dealing with string or unicode objects? Even better, can't you just make heavy use of the abstract interface which implements many of the things that are trivial in Python code? >> Now to see about Mailman 2.1... Andrew> Did you try my suggestion? I have a vague memory of there being Andrew> an earlier version of Mailman that forgot to create that file. Yup. Now there's an empty csv.mbox file available on the web... 
Skip _______________________________________________ Csv mailing list Csv at mail.mojam.com http://manatee.mojam.com/mailman/listinfo/csv From skip at pobox.com Thu Jan 30 06:53:55 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 29 Jan 2003 23:53:55 -0600 Subject: [Csv] test message Message-ID: <15928.48627.343496.857274@montanaro.dyndns.org> Test of reconstituted csv list under Mailman 2.1 S From andrewm at object-craft.com.au Thu Jan 30 06:58:39 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 30 Jan 2003 16:58:39 +1100 Subject: [Csv] Status In-Reply-To: Message from Skip Montanaro <15928.42338.298849.316715@montanaro.dyndns.org> References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au> <15928.40341.991680.82247@montanaro.dyndns.org> <20030130035141.271EA3C32B@coffee.object-craft.com.au> <15928.42338.298849.316715@montanaro.dyndns.org> Message-ID: <20030130055839.64E903C32B@coffee.object-craft.com.au> >In the C code can't you just look up "split", "join", "__add__" and such and >not care that you are dealing with string or unicode objects? Even better, >can't you just make heavy use of the abstract interface which implements >many of the things that are trivial in Python code? Currently the C module just deals with raw strings. I suspect there would be a fair performance cost to using the string object's methods (I should have a look at how strings and unicode strings are implemented internally these days). Suffice to say, it's a reasonable amount of work. We probably should be focusing on refining the PEP and writing some tests at this stage... 8-) Regarding the PEP - - are we going to retain the ability to pass keyword arguments, that override the dialect, to the factory functions (the pep doesn't mention this)? - we could make the dialect parameter accept either a string dialect name or a dialect instance - is this a good idea? 
- regarding the dialect list function - this probably should be called
  list_dialects(), yes?

- should we call the delimiter parameter "field_sep" instead (I notice you
  haven't used underscores in the parameter names - is this deliberate)?

Thinking about the tests, I envisage a bunch of tests for the underlying C
module, and tests for each dialect (just the basic dialect with no
additional parameters)?

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Thu Jan 30 06:59:10 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 30 Jan 2003 16:59:10 +1100
Subject: [Csv] Module question...
Message-ID: <20030130055910.828A43C32B@coffee.object-craft.com.au>

The way we've specced it, the module only deals with file objects. I wonder
if there's any need to deal with strings, rather than files? What was the
rationale for using files, rather than making the user do their own
readline(), etc?

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com Thu Jan 30 06:59:49 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 29 Jan 2003 23:59:49 -0600
Subject: [Csv] Looks like we're live...
Message-ID: <15928.48981.251010.410861@montanaro.dyndns.org>

It looks like I successfully migrated this mailing list to Mailman 2.1. We
have archives and everything.

Andrew, you said you had an archive of all the messages. Can you pass that
along to me with any tips you feel worthwhile about incorporating that
archive into pipermail?
Thx, Skip From skip at pobox.com Thu Jan 30 07:08:02 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 00:08:02 -0600 Subject: [Csv] Status In-Reply-To: <20030130055839.64E903C32B@coffee.object-craft.com.au> References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au> <15928.40341.991680.82247@montanaro.dyndns.org> <20030130035141.271EA3C32B@coffee.object-craft.com.au> <15928.42338.298849.316715@montanaro.dyndns.org> <20030130055839.64E903C32B@coffee.object-craft.com.au> Message-ID: <15928.49474.186478.320826@montanaro.dyndns.org> Andrew> We probably should be focusing on refining the PEP and writing Andrew> some tests at this stage... 8-) That sounds like a good idea. Andrew> Regarding the PEP - Andrew> - are we going to retain the ability to pass keyword arguments, Andrew> that override the dialect, to the factory functions (the pep Andrew> doesn't mention this)? Yes, I thought that was the plan. The current text under Module Interface gives an incomplete function prototype: reader(fileobj [, dialect='excel2000']) but in the text below it says: The optional dialect parameter is discussed below. It also accepts several keyword parameters which define specific format settings (see the section "Formatting Parameters"). I'd like not to enumerate all the possible keyword parameters, especially since that list may grow. How should I write the synopsis? reader(fileobj [, dialect='excel2000'] [, keyword parameters]) ? Andrew> - we could make the dialect parameter accept either a string Andrew> dialect name or a dialect instance - is this a good idea? It can pretty easily do both. Perhaps we should present the pros and cons in the PEP and see what kind of feedback we get. Andrew> - regarding the dialect list function - this probably should be Andrew> called list_dialects(), yes? Where do you see dialect_list()? Maybe I need to cvs up. In any case, I like list_dialects() better. 
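[Editor's note: accepting either a dialect name or a dialect class, as discussed above, might look like this — a hypothetical sketch; the registry and names are assumptions, not the module's API:]

```python
# Hypothetical registry mapping string names to dialect classes.
class Dialect:
    pass

class excel(Dialect):
    name = 'excel'
    delimiter = ','

_dialects = {'excel': excel}

def resolve_dialect(dialect):
    # Accept either a registered name or a Dialect subclass directly.
    if isinstance(dialect, str):
        try:
            return _dialects[dialect]
        except KeyError:
            raise ValueError('unknown dialect: %r' % dialect)
    if isinstance(dialect, type) and issubclass(dialect, Dialect):
        return dialect
    raise TypeError('dialect must be a name or a Dialect subclass')

print(resolve_dialect('excel') is excel)   # True
print(resolve_dialect(excel) is excel)     # True
```

The pro of strings is brevity and easy presentation in GUI tools; the pro of classes is that users can subclass and pass their own without registering anything.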
Andrew> - should we call the delimiter parameter "field_sep" instead (I Andrew> notice you haven't used underscores in the parameter names - Andrew> is this deliberate)? I don't have a big preference one way or the other. I've been calling it "delimiter" though. Andrew> Thinking about the tests, I envisage a bunch of tests for the Andrew> underlying C module, and tests for each dialect (just the basic Andrew> dialect with no additional parameters)? Give me one test you'd like to run and one set of inputs and expected outputs. I'll set up a module tomorrow which should just drop into Lib/test. I'm kind of running out of steam. (It's Thursday 12:07am here.) Skip From altis at semi-retired.com Thu Jan 30 07:33:49 2003 From: altis at semi-retired.com (Kevin Altis) Date: Wed, 29 Jan 2003 22:33:49 -0800 Subject: [Csv] Access Products sample Message-ID: I created a db and table in Access (products.mdb) using one of the built-in samples. I created two rows, one that is mostly empty. I used the default CSV export to create(Products.csv) and also output the table as an Excel 97/2000 XLS file (Products.xls). Finally, I had Excel export as CSV (ProductsExcel.csv). They are all contained in the attached zip. The currency column in the table is actually written out with formatting ($5.66 instead of just 5.66). Note that when Excel exports this column it has a trailing space for some reason (,$5.66 ,). While exporting it reminded me that unless a column in the data set contains an embedded newline or carriage return it shouldn't matter whether the file is opened in binary mode for reading. Without a schema we don't know what each column is supposed to contain, so that is outside the domain of the csv import parser and export writer. 
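[Editor's note: the embedded-newline point above is the crux of the line-terminator issue — a quoted field may span what looks like two lines, so splitting the raw data on '\n' mis-parses the record. A small illustration with a hypothetical helper, not the parser itself:]

```python
# One logical record whose quoted second field contains a newline.
data = 'Test 1,"two\nlines",last\n'

# Naive line splitting sees two broken "rows" (plus a trailing empty piece).
print(data.split('\n'))

# A record-aware scan has to track whether it is inside quotes.
def count_records(text, quotechar='"'):
    records = 0
    in_quotes = False
    for ch in text:
        if ch == quotechar:
            in_quotes = not in_quotes
        elif ch == '\n' and not in_quotes:
            records += 1
    return records

print(count_records(data))   # 1 record, despite two raw lines
```

This is why the parser wants the file object itself (opened in binary mode) rather than being fed pre-split lines by the caller.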
The values exported by both Access and Excel are designed to prevent
information loss within the constraints of the CSV format, thus a field
with no value (what I think of as None in Python) is empty in the CSV.
Should we be able to import and then export using a given dialect, such
that there would be no differences between the original csv and the
exported one? Actually, using the Access default of quoting strings it
isn't possible to do that because it implies having a schema to know that a
given column is a string. With the Excel csv format it is possible because
a column that doesn't contain a comma won't be quoted. Just thinking out
loud.

ka

-------------- next part --------------
A non-text attachment was scrubbed...
Name: products.zip
Type: application/x-zip-compressed
Size: 17035 bytes
Desc: not available
Url : http://mail.python.org/pipermail/csv/attachments/20030129/e29424d8/attachment.bin

From altis at semi-retired.com Thu Jan 30 07:55:03 2003
From: altis at semi-retired.com (Kevin Altis)
Date: Wed, 29 Jan 2003 22:55:03 -0800
Subject: [Csv] Module question...
In-Reply-To: <20030130055910.828A43C32B@coffee.object-craft.com.au>
Message-ID:

> From: Andrew McNamara
>
> The way we've specced it, the module only deals with file objects. I
> wonder if there's any need to deal with strings, rather than files?

A string can be wrapped as StringIO to appear as a file and there may also
be other file-like objects that people might want to pass in.

> What was the rationale for using files, rather than making the user do
> their own readline(), etc?

I'll try and summarize, if this is too simplistic or incorrect I'm sure
someone will speak up :)

The simplest solution might have been to provide a file path and then let
the parser handle all the opening, reading, and closing, returning a result
list.
However, that is far too limiting: if you want to parse a string or
something that isn't a physical file on disk, you have to collect the raw
data, write it to a temp file and then pass the path of the temp file in.
Definitely too cumbersome.

It would be possible to require the user code to supply one large string
to parse, thus putting the burden of opening, reading, and closing the
file-like object on the user code. This wastes memory, which can be a
problem especially for large data files.

One other possibility would be for the parser to only deal with one row at
a time, leaving it up to the user code to feed the parser the row strings.
But given the various possible line endings for a row of data and the fact
that a column of a row may contain a line ending, not to mention all the
other escape character issues we've discussed, this would be error-prone.

The solution was to simply accept a file-like object and let the parser do
the interpretation of a record. By having the parser present an iterable
interface, the user code still gets the convenience of processing per row
if needed, or, if no processing is desired, a result list can easily be
obtained. This should provide the most flexibility while still being easy
to use.

ka

From altis at semi-retired.com Thu Jan 30 07:58:22 2003
From: altis at semi-retired.com (Kevin Altis)
Date: Wed, 29 Jan 2003 22:58:22 -0800
Subject: [Csv] change of Sender address
Message-ID: 

Skip,

the mailing list Sender: is now csv-bounces at mail.mojam.com while
previously it was csv-admin at mail.mojam.com. Is that intentional?
ka From andrewm at object-craft.com.au Thu Jan 30 08:24:57 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 30 Jan 2003 18:24:57 +1100 Subject: [Csv] Status In-Reply-To: Message from Skip Montanaro <15928.49474.186478.320826@montanaro.dyndns.org> References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au> <15928.40341.991680.82247@montanaro.dyndns.org> <20030130035141.271EA3C32B@coffee.object-craft.com.au> <15928.42338.298849.316715@montanaro.dyndns.org> <20030130055839.64E903C32B@coffee.object-craft.com.au> <15928.49474.186478.320826@montanaro.dyndns.org> Message-ID: <20030130072457.25FEB3C32B@coffee.object-craft.com.au> > Andrew> - are we going to retain the ability to pass keyword arguments, > Andrew> that override the dialect, to the factory functions (the pep > Andrew> doesn't mention this)? > >Yes, I thought that was the plan. Just checking... 8-) >I'd like not to enumerate all the possible keyword parameters, especially >since that list may grow. How should I write the synopsis? > > reader(fileobj [, dialect='excel2000'] [, keyword parameters]) > >? Maybe make it "optional keyword parameters"... implied, I know, but... > Andrew> - we could make the dialect parameter accept either a string > Andrew> dialect name or a dialect instance - is this a good idea? > >It can pretty easily do both. Perhaps we should present the pros and cons >in the PEP and see what kind of feedback we get. Sometimes you can give people too much choice. We don't have time for an endless discussion. If we don't think we're going to be crucified, we should just pick something that's tasteful. Dave? > Andrew> - regarding the dialect list function - this probably should be > Andrew> called list_dialects(), yes? > >Where do you see dialect_list()? Maybe I need to cvs up. In any case, I >like list_dialects() better. 
Ah - I mean "dialect list function" in the generic sense - we need one,
and I was proposing to call it list_dialects, or maybe that should be
listdialects to be like listdir... nah, looks ugly.

> Andrew> - should we call the delimiter parameter "field_sep" instead (I
> Andrew> notice you haven't used underscores in the parameter names -
> Andrew> is this deliberate)?
>
> >I don't have a big preference one way or the other. I've been calling
> >it "delimiter" though.

Is there any precedent in the other modules? Our module called it
field_sep, and I noticed you called it that in the description.

> Andrew> Thinking about the tests, I envisage a bunch of tests for the
> Andrew> underlying C module, and tests for each dialect (just the basic
> Andrew> dialect with no additional parameters)?
>
> >Give me one test you'd like to run and one set of inputs and expected
> >outputs. I'll set up a module tomorrow which should just drop into
> >Lib/test. I'm kind of running out of steam. (It's Thursday 12:07am
> >here.)

I might be able to work it out myself... we'll see.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Thu Jan 30 08:33:52 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 30 Jan 2003 18:33:52 +1100
Subject: [Csv] Module question...
In-Reply-To: Message from "Kevin Altis" 
References: 
Message-ID: <20030130073352.A55953C32B@coffee.object-craft.com.au>

>> The way we've speced it, the module only deals with file objects. I
>> wonder if there's any need to deal with strings, rather than files?

BTW, I'm asking this because it's something that will come back to haunt
us if we get it wrong - it's something we need to make the right call on.

>A string can be wrapped as StringIO to appear as a file and there may also
>be other file-like objects that people might want to pass in.
Yes - if the most common use by far is reading and writing files, then
this is the right answer (i.e., say "use StringIO if you really need to do
a string").

>> What was the rationale for using files, rather than making the user do
>> their own readline(), etc?
>
>I'll try and summarize, if this is too simplistic or incorrect I'm sure
>someone will speak up :)
>
>The simplest solution might have been to provide a file path and then let
>the parser handle all the opening, reading, and closing, returning a
>result list. However, that is far too limiting since then if you do want
>to parse a string or something that isn't a physical file on disk you
>have to collect the raw data, write it to a temp file and then pass the
>path of the temp file in. Definitely too cumbersome.

Yeah - I'm certainly not suggesting that.

>It would be possible to require the user code to supply one large string
>to parse, thus putting the burden of opening, reading, and closing the
>file-like object. This wastes memory, which can be a problem especially
>for large data files.

Agreed.

>One other possibility would be for the parser to only deal with one row
>at a time, leaving it up to the user code to feed the parser the row
>strings. But given the various possible line endings for a row of data
>and the fact that a column of a row may contain a line ending, not to
>mention all the other escape character issues we've discussed, this would
>be error-prone.

This is the way the Object Craft module has worked - it works well enough,
and the universal end-of-line stuff in 2.3 makes it more seamless. Not
saying I'm wedded to this scheme, but I'd just like to be clear about why
we've chosen one over the other.

I'm trying to think of an example where operating on a file-like object
would be too restricting, and I can't - oh, here's one: what if you wanted
to do some pre-processing on the data (say it was uuencoded)?
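The pre-processing case is easy enough if the reader accepts any iterable
of lines, as Dave suggests later in the thread. A sketch (here a generator
strips comment lines rather than uudecoding, purely for illustration):

```python
import csv

raw = ['# a comment the CSV parser should never see\n',
       'a,b,"c,d"\n',
       '1,2,3\n']

def strip_comments(lines):
    # Pre-processing step: drop comment lines before handing the
    # stream to the reader. Any generator of line strings works,
    # since the reader only needs an iterable of lines.
    for line in lines:
        if not line.startswith('#'):
            yield line

rows = list(csv.reader(strip_comments(raw)))
```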
>The solution was to simply accept a file-like object and let the parser
>do the interpretation of a record. By having the parser present an
>iterable interface, the user code still gets the convenience of
>processing per row if needed or if no processing is desired a result list
>can easily be obtained.
>
>This should provide the most flexibility while still being easy to use.

Should the object just be defined as an iterable, and leave closing, etc,
up to the user of the module? One downside of this is you can't rewind an
iterator, so things like the sniffer would be SOL. We can't ensure that
the passed file is rewindable either. Hmmm.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Thu Jan 30 08:36:37 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 30 Jan 2003 18:36:37 +1100
Subject: [Csv] Access Products sample
In-Reply-To: Message from "Kevin Altis" 
References: 
Message-ID: <20030130073637.CA4DD3C32B@coffee.object-craft.com.au>

>The currency column in the table is actually written out with formatting
>($5.66 instead of just 5.66). Note that when Excel exports this column it
>has a trailing space for some reason (,$5.66 ,).

I think you'll find that if you enter a negative amount, that space turns
into a minus sign (not verified).

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Thu Jan 30 08:37:53 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 30 Jan 2003 18:37:53 +1100
Subject: [Csv] change of Sender address
In-Reply-To: Message from "Kevin Altis" 
References: 
Message-ID: <20030130073754.02FEF3C32B@coffee.object-craft.com.au>

>the mailing list Sender: is now csv-bounces at mail.mojam.com while
>previously it was csv-admin at mail.mojam.com. Is that intentional?

Mailman attempts to handle the bounces itself - I guess that's just
something that has changed between 2.0 and 2.1.
--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From altis at semi-retired.com Thu Jan 30 09:54:16 2003
From: altis at semi-retired.com (Kevin Altis)
Date: Thu, 30 Jan 2003 00:54:16 -0800
Subject: [Csv] Module question...
In-Reply-To: <20030130073352.A55953C32B@coffee.object-craft.com.au>
Message-ID: 

> From: Andrew McNamara
>
> >> The way we've speced it, the module only deals with file objects. I
> >> wonder if there's any need to deal with strings, rather than files?
>
> BTW, I'm asking this because it's something that will come back to haunt
> us if we get it wrong - it's something we need to make the right call on.

Agreed, in fact I'm now reconsidering my position.

> >One other possibility would be for the parser to only deal with one row
> >at a time, leaving it up to the user code to feed the parser the row
> >strings. But given the various possible line endings for a row of data
> >and the fact that a column of a row may contain a line ending, not to
> >mention all the other escape character issues we've discussed, this
> >would be error-prone.
>
> This is the way the Object Craft module has worked - it works well
> enough, and the universal end-of-line stuff in 2.3 makes it more
> seamless. Not saying I'm wedded to this scheme, but I'd just like to be
> clear about why we've chosen one over the other.

I'm tempted to agree that maybe your original way would be better, but I
haven't caught up on some of the discussion the last couple of days. Skip
and Cliff can probably argue effectively for not doing it that way if they
really want.

> I'm trying to think of an example where operating on a file-like object
> would be too restricting, and I can't - oh, here's one: what if you
> wanted to do some pre-processing on the data (say it was uuencoded)?

That seems to be stretching things a bit, but even then wouldn't you
simply pass the uuencoded file-like object to uu.decode and then pass the
out_file file-like object to the parser?
I haven't used uu myself, so maybe that wouldn't work. Regardless, the csv
module should be focused on one task.

> >The solution was to simply accept a file-like object and let the parser
> >do the interpretation of a record. By having the parser present an
> >iterable interface, the user code still gets the convenience of
> >processing per row if needed or if no processing is desired a result
> >list can easily be obtained.
> >
> >This should provide the most flexibility while still being easy to use.
>
> Should the object just be defined as an iterable, and leave closing,
> etc, up to the user of the module? One downside of this is you can't
> rewind an iterator, so things like the sniffer would be SOL. We can't
> ensure that the passed file is rewindable either. Hmmm.

Given a file-like object, you might not be able to rewind anyway. This
might be another argument for just parsing line by line, but does that
make using the module too complex and error-prone? We probably have to
provide some use-case examples. Putting the whole operation in a
try/except/finally block with the file close in finally is probably the
safe way to do this type of operation.

In the PEP we need to make clear the benefits of the csv module over a
user simply trying to use split(',') and such, which I think Skip has
already done to a certain extent. We are also trying to address export as
well, which is actually quite important. If people simply try to export
with only a simplistic understanding of the edge cases, then they
potentially end up with unusable csv files. This is the same kind of thing
you see with XML where people start writing out data or whatever thinking
that is all there is to it and then they end up with something that isn't
really XML. I wouldn't be surprised if there is more invalid XML out there
than valid. In our case I think we are identifying some pretty clearly
defined dialects of csv, so that if you use those you are going to be in
good shape.
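The try/finally pattern mentioned above might look like this (a sketch
against the modern csv module; the load_rows helper is hypothetical):

```python
import csv
import io

def load_rows(fileobj):
    # Close the file-like object even if parsing raises part-way
    # through - the try/finally pattern suggested above.
    try:
        return list(csv.reader(fileobj))
    finally:
        fileobj.close()

rows = load_rows(io.StringIO('x,y\n1,2\n'))
```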
We will also be able to tell someone whether in fact a file is well-formed and/or throw an exception if it doesn't match the chosen dialect, which again, seems simple, but that's a pretty big deal. Ugh, I need sleep, any stupidity above is just me being tired ;-) ka From djc at object-craft.com.au Thu Jan 30 11:06:23 2003 From: djc at object-craft.com.au (Dave Cole) Date: 30 Jan 2003 21:06:23 +1100 Subject: [Csv] CSV interface question In-Reply-To: <15928.4659.449989.410123@montanaro.dyndns.org> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Cliff> Actually, there is a downside to using strings, as you will see Cliff> if you look at the code I posted a little while ago. By taking Cliff> dialect as a string, it basically precludes the user rolling Cliff> their own dialect except as keyword arguments. After working Cliff> on this, I'm inclined to have the programmer pass a class or Cliff> other structure. Skip> Don't forget we have the speedy Object Craft _csv engine sitting Skip> underneath the covers. Under the assumption that all the actual Skip> processing goes on at that level, I see no particular reason Skip> dialect info needs to be anything other than a collection of Skip> keyword arguments. I view csv.reader and csv.writer as factory Skip> functions which return functional readers and writers defined in Skip> _csv.c. The Python level serves simply to paper over the Skip> low-level extension module. I have been going through the messages again to see if I can build up a TODO list. I missed something on the first reading of this message. 
In the current version of the code sitting in the sandbox the reader
factory is actually a class:

    class reader(OCcvs):
        def __init__(self, fileobj, dialect = 'excel2000', **options):
            self.fileobj = fileobj
            OCcvs.__init__(self, dialect, **options)

        def __iter__(self):
            return self

        def next(self):
            while 1:
                fields = self.parser.parse(self.fileobj.next())
                if fields:
                    return fields

Your message above talks about the _csv parser exposing the iterator
interface, not the Python layer. I wonder how much of a measurable
performance difference there would be by leaving the code as is.

- Dave

--
http://www.object-craft.com.au

From djc at object-craft.com.au Thu Jan 30 11:56:36 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 30 Jan 2003 21:56:36 +1100
Subject: [Csv] Status
In-Reply-To: <20030130072457.25FEB3C32B@coffee.object-craft.com.au>
References: <15928.37531.445243.692589@montanaro.dyndns.org>
 <20030130031254.D2E853C32B@coffee.object-craft.com.au>
 <15928.40341.991680.82247@montanaro.dyndns.org>
 <20030130035141.271EA3C32B@coffee.object-craft.com.au>
 <15928.42338.298849.316715@montanaro.dyndns.org>
 <20030130055839.64E903C32B@coffee.object-craft.com.au>
 <15928.49474.186478.320826@montanaro.dyndns.org>
 <20030130072457.25FEB3C32B@coffee.object-craft.com.au>
Message-ID: 

>>>>> "Andrew" == Andrew McNamara writes:

>> I'd like not to enumerate all the possible keyword parameters,
>> especially since that list may grow. How should I write the
>> synopsis?
>>
>> reader(fileobj [, dialect='excel2000'] [, keyword parameters])
>>
>> ?

Andrew> Maybe make it "optional keyword parameters"... implied, I
Andrew> know, but...

[I have been frantically trying to reread all of the messages again.
Other work has made me fall behind and lose context.]

Is there any harm in just doing something like this:

    The basic reading interface is::

        reader(fileobj [, **kwargs])

    The dialect keyword argument identifies the CSV dialect which will
    be implemented by the reader.
The dialect corresponds to a set of parameters which are set in the low
level CSV parsing engine. Variants of a dialect can be specified by
passing additional keyword arguments which serve to override the
parameters defined by the dialect argument. The parser parameters are
catalogued below.

Andrew> - we could make the dialect parameter accept either a string
Andrew> dialect name or a dialect instance - is this a good idea?

+1 from me.

csv.py::

    class dialect:
        name = None
        quotechar = "'"
        delimiter = ","

    excel2000 = dialect

yourcode.py::

    import csv

    my_dialect = csv.dialect()
    my_dialect.delimiter = '\t'

    # or

    class my_dialect(csv.dialect):
        delimiter = '\t'

    csvreader = csv.reader(file("some.csv"), dialect=my_dialect)

>> It can pretty easily do both. Perhaps we should present the pros
>> and cons in the PEP and see what kind of feedback we get.

Andrew> Sometimes you can give people too much choice. We don't have
Andrew> time for an endless discussion. If we don't think we're going
Andrew> to be crucified, we should just pick something that's
Andrew> tasteful. Dave?

If we had to choose one, I would say pass a class or instance rather than
a string.

Andrew> - regarding the dialect list function - this probably should
Andrew> be called list_dialects(), yes?

>> Where do you see dialect_list()? Maybe I need to cvs up. In any
>> case, I like list_dialects() better.

Andrew> Ah - I mean "dialect list function" in the generic sense - we
Andrew> need one, and I was proposing to call it list_dialects, or
Andrew> maybe that should be listdialects to be like listdir... nah,
Andrew> looks ugly.

+1 list_dialects()

Andrew> - should we call the delimiter parameter "field_sep" instead
Andrew> (I notice you haven't used underscores in the parameter names
Andrew> - is this deliberate)?

>> I don't have a big preference one way or the other. I've been
>> calling it "delimiter" though.
+1 delimiter (I think :-)

- Dave

--
http://www.object-craft.com.au

From djc at object-craft.com.au Thu Jan 30 12:31:06 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 30 Jan 2003 22:31:06 +1100
Subject: [Csv] Module question...
In-Reply-To: 
References: 
Message-ID: 

>>>>> "Kevin" == Kevin Altis writes:

>> From: Andrew McNamara
>>
>> >> The way we've speced it, the module only deals with file objects.
>> >> I wonder if there's any need to deal with strings, rather than
>> >> files?
>>
>> BTW, I'm asking this because it's something that will come back to
>> haunt us if we get it wrong - it's something we need to make the
>> right call on.

Kevin> Agreed, in fact I'm now reconsidering my position.

When I originally wrote the Object Craft parser I thought about these
things too. I eventually settled on the current interface. To use the
stuff in CVS now, this is what the interface looks like:

    csvreader = _csv.parser()
    for line in file("some.csv"):
        row = csvreader.parse(line)
        if row:
            process(row)

The reason I settled on this interface was that it placed only the
performance critical code into the extension module. All policy decisions
about where the CSV data would come from were pushed back into the
application.

The current PEP is only a slight variation on this, but it is a nice
variation. The variation pushes the conditional in the loop into the
reader and thereby exposes a much nicer interface.

Hmmm... The argument to the PEP reader() should not be a file object, it
should be an iterator which returns lines. There really is no reason why
it should not handle the following:

    lines = ('1,2,3,"""I see,""\n',
             'said the blind man","as he picked up his\n',
             'hammer and saw"\n')
    csvreader = csv.reader(lines)
    for row in csvreader:
        process(row)

>> >One other possibility would be for the parser to only deal with
>> >one row at a time, leaving it up to the user code to feed the
>> >parser the row strings.
>> >But given the various possible line endings for a row of data and
>> >the fact that a column of a row may contain a line ending, not to
>> >mention all the other escape character issues we've discussed, this
>> >would be error-prone.
>>
>> This is the way the Object Craft module has worked - it works well
>> enough, and the universal end-of-line stuff in 2.3 makes it more
>> seamless. Not saying I'm wedded to this scheme, but I'd just like to
>> be clear about why we've chosen one over the other.

You might have missed it but the Object Craft parser is designed to be
fed one line at a time. It actually raises an exception if you pass more
than one line to it. Internally it collects fields from lines until it
detects end of record, at which point it returns the record to the caller.

>> I'm trying to think of an example where operating on a file-like
>> object would be too restricting, and I can't - oh, here's one: what
>> if you wanted to do some pre-processing on the data (say it was
>> uuencoded)?

I think this could be solved by changing the reader() fileobj argument to
an iterable.

>> >The solution was to simply accept a file-like object and let the
>> >parser do the interpretation of a record. By having the parser
>> >present an iterable interface, the user code still gets the
>> >convenience of processing per row if needed or if no processing is
>> >desired a result list can easily be obtained.

Is this the same thing as what I said above?

>> Should the object just be defined as an iterable, and leave
>> closing, etc, up to the user of the module? One downside of this is
>> you can't rewind an iterator, so things like the sniffer would be
>> SOL. We can't ensure that the passed file is rewindable
>> either. Hmmm.
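One way around the rewind problem, when the underlying object really is a
seekable file, is to sniff a bounded sample and then seek back. A sketch
using the modern csv.Sniffer as a stand-in for the csvutils sniffer being
discussed:

```python
import csv
import io

fileobj = io.StringIO('a;b;c\n1;2;3\n4;5;6\n')

# Sniff a bounded sample, then rewind and parse the whole stream.
# This only works because StringIO (like a real file) is seekable;
# a bare iterator could not be rewound.
sample = fileobj.read(1024)
dialect = csv.Sniffer().sniff(sample)
fileobj.seek(0)
rows = list(csv.reader(fileobj, dialect))
```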
Application code will just have to be aware of this and arrange to do
something like the following:

    sniffer_input = [fileobj.readline() for i in range(20)]
    dialect = csvutils.sniff(sniffer_input)
    csvreader = csv.reader(sniffer_input, dialect=dialect)
    for row in csvreader:
        process(row)

Then we have two problems (our principal weapons are surprise and fear):

* The sniffer_input might have a partial record (multi-line record
  spanning last line read out of file).

* We do not have a way to continue using a reader with additional input.

* The list comprehension may be longer than the file :-)

This could be solved by exposing a further method on the reader.

    sniffer_input = [fileobj.readline() for i in range(20)]
    dialect = csvutils.sniff(sniffer_input)
    csvreader = csv.reader(sniffer_input, dialect=dialect)
    for row in csvreader:
        process(row)
    # now continue on with the rest of the file
    csvreader.use(fileobj)
    for row in csvreader:
        process(row)

Given the above, is it reasonable to say that the above logic could be
hardened and placed into a csvutils function?

- Dave

--
http://www.object-craft.com.au

From djc at object-craft.com.au Thu Jan 30 13:13:10 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 30 Jan 2003 23:13:10 +1100
Subject: [Csv] Made some changes to the PEP
Message-ID: 

Here is the commit message:

    Trying to bring PEP up to date with discussions on mailing list. I
    hope I have not misinterpreted the conclusions.

    * dialect argument is now either a string identifying one of the
      internally defined parameter sets, or an object which contains
      attributes corresponding to the parameter set.

    * Altered set_dialect() to take dialect name and dialect object.

    * Altered get_dialect() to take dialect name and return dialect
      object.

    * Fleshed out formatting parameters, adding escapechar,
      lineterminator, quoting.
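A minimal sketch of the set_dialect()/get_dialect()/list_dialects()
registry named in the commit message (the function names come from the
discussion; the internals here are assumptions, not the PEP's actual
code):

```python
# Hypothetical registry internals - a dict keyed by dialect name,
# holding classes whose attributes are the formatting parameters.
_dialects = {}

class Dialect:
    delimiter = ','
    quotechar = '"'

class excel(Dialect):
    pass

class excel_tab(Dialect):
    delimiter = '\t'

def set_dialect(name, dialect):
    _dialects[name] = dialect

def get_dialect(name):
    return _dialects[name]

def list_dialects():
    return sorted(_dialects)

set_dialect('excel', excel)
set_dialect('excel-tab', excel_tab)
```

This also shows why accepting either a name or a class is cheap: the
string form is just one dictionary lookup away from the class form.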
- Dave

--
http://www.object-craft.com.au

From skip at pobox.com Thu Jan 30 13:21:18 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 30 Jan 2003 06:21:18 -0600
Subject: [Csv] We have archives
Message-ID: <15929.6334.716837.600555@montanaro.dyndns.org>

Thanks to Andrew saving messages, we have archives. There are probably a
few duplicates around the transition to MM 2.1, but I decided to not worry
about it.

Skip

From skip at pobox.com Thu Jan 30 13:28:39 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 30 Jan 2003 06:28:39 -0600
Subject: [Csv] change of Sender address
In-Reply-To: 
References: 
Message-ID: <15929.6775.921999.863765@montanaro.dyndns.org>

Kevin> the mailing list Sender: is now csv-bounces at mail.mojam.com while
Kevin> previously it was csv-admin at mail.mojam.com. Is that intentional?

It appears to be a side effect of the transition from Mailman 2.0.9 to
Mailman 2.1. I suspect it was intentional on Barry Warsaw's part. ;-)

The old version of the list had these aliases:

    csv
    csv-admin
    csv-request
    csv-owner

while the new version has many more:

    csv
    csv-admin
    csv-bounces
    csv-confirm
    csv-join
    csv-leave
    csv-owner
    csv-request
    csv-subscribe
    csv-unsubscribe

It seems the system now has more fine-grained control over the disposition
of admin messages.

Skip

From skip at pobox.com Thu Jan 30 13:37:46 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 30 Jan 2003 06:37:46 -0600
Subject: [Csv] Module question...
In-Reply-To: <20030130073352.A55953C32B@coffee.object-craft.com.au>
References: <20030130073352.A55953C32B@coffee.object-craft.com.au>
Message-ID: <15929.7322.744669.187499@montanaro.dyndns.org>

>> One other possibility would be for the parser to only deal with one
>> row at a time, leaving it up to the user code to feed the parser the
>> row strings.
>> But given the various possible line endings for a row of data and the
>> fact that a column of a row may contain a line ending, not to mention
>> all the other escape character issues we've discussed, this would be
>> error-prone.

Andrew> This is the way the Object Craft module has worked - it works
Andrew> well enough, and the universal end-of-line stuff in 2.3 makes it
Andrew> more seamless. Not saying I'm wedded to this scheme, but I'd
Andrew> just like to be clear about why we've chosen one over the other.

You have to be careful. I think the Universal eol stuff might bite you in
the arse here. Recall that in Excel, the default line terminator (record
separator?) is CRLF, but that a hard return within a cell is simply LF. I
don't know what Universal eol handling will do with that. In any case,
because you have to have full control over line termination, I think you
have to start dealing just with binary files.

Andrew> I'm trying to think of an example where operating on a file-like
Andrew> object would be too restricting, and I can't - oh, here's one:
Andrew> what if you wanted to do some pre-processing on the data (say it
Andrew> was uuencoded)?

Then you force the user to uudecode the file and stuff it into a StringIO
object. ;-)

Andrew> Should the object just be defined as an iterable,

I had envisioned that the object the csv.reader() factory function (or
class) returned would be an iterable and that the object the csv.writer()
factory function (or class) returned would accept an iterable.

Andrew> closing, etc, up to the user of the module? One downside of this
Andrew> is you can't rewind an iterator, so things like the sniffer
Andrew> would be SOL. We can't ensure that the passed file is rewindable
Andrew> either. Hmmm.

The sniffer is going to be in a csvutils module, correct? It could
certainly accept either a filename or a string containing some subset of
the rows in the file to be sniffed. I see no reason to constrain it to
the csv.reader()'s interface.
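The hard-return-within-a-cell case is exactly why the reader, not the
caller, has to decide where a record ends. A sketch with the modern csv
module, where one logical record spans two physical lines:

```python
import csv
import io

# One logical record spanning two physical lines: the quoted field
# contains a hard return (LF) while the record itself ends in CRLF.
data = 'Test 1,"line one\nline two",end\r\n'
rows = list(csv.reader(io.StringIO(data)))
```

A line-at-a-time caller splitting on newlines would hand the parser half
a record here; the reader interface hides that bookkeeping.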
Skip

From djc at object-craft.com.au Thu Jan 30 13:40:33 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 30 Jan 2003 23:40:33 +1100
Subject: [Csv] Moving _csv.c closer to PEP
Message-ID: 

In the process of fixing _csv.c so it will handle the parameters specified
in the PEP I came across yet another configurable dialect setting.

    doublequote
        When True, a quotechar in a field value is represented by two
        consecutive quotechars.

I will continue fixing _csv.c on the assumption that we want to keep this
tweakable parameter.

- Dave

--
http://www.object-craft.com.au

From skip at pobox.com Thu Jan 30 13:54:43 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 30 Jan 2003 06:54:43 -0600
Subject: [Csv] Made some changes to the PEP
In-Reply-To: 
References: 
Message-ID: <15929.8339.826486.231614@montanaro.dyndns.org>

Dave> Here is the commit message:

Dave> Trying to bring PEP up to date with discussions on mailing list...

Much appreciated. I just added a todo section near the top. Anyone can
feel free to add to the list or take care of any items.

Skip

From skip at pobox.com Thu Jan 30 13:57:27 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 30 Jan 2003 06:57:27 -0600
Subject: [Csv] Cutting out for a bit...
Message-ID: <15929.8503.814261.580267@montanaro.dyndns.org>

As masochistic as it may seem, I am currently working on two PEPs. I'm
going to cut out for awhile to work on PEP 304. I need to make some
progress on that if it's going to have more than a snowball's chance in
hell of making it into 2.3. At some point today it would be good if we
could announce PEP 305 to the world and start to get some feedback from
the unwashed masses.
Skip

From djc at object-craft.com.au Thu Jan 30 14:17:59 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 31 Jan 2003 00:17:59 +1100
Subject: [Csv] Moving _csv.c closer to PEP
In-Reply-To: 
References: 
Message-ID: 

>>>>> "Dave" == Dave Cole writes:

Dave> In the process of fixing _csv.c so it will handle the parameters
Dave> specified in the PEP I came across yet another configurable
Dave> dialect setting.

Dave> doublequote
Dave>     When True, a quotechar in a field value is represented by
Dave>     two consecutive quotechars.

Dave> I will continue fixing _csv.c on the assumption that we want to
Dave> keep this tweakable parameter.

Here is the commit message:

    * More formatting changes to bring code closer to the Guido style.

    * Changed all internal parser settings to match those in the PEP.

    * Added PEP settings to allow _csv use by csv.py - new parameters
      are not handled yet (skipinitialspace, lineterminator, quoting).

    * Removed overloading of quotechar and escapechar values by
      introducing have_quotechar and have_escapechar attributes.

Barest minimum of testing has been done.

- Dave

--
http://www.object-craft.com.au

From djc at object-craft.com.au Thu Jan 30 14:30:10 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 31 Jan 2003 00:30:10 +1100
Subject: [Csv] Made some changes to the PEP
In-Reply-To: <15929.8339.826486.231614@montanaro.dyndns.org>
References: <15929.8339.826486.231614@montanaro.dyndns.org>
Message-ID: 

>>>>> "Skip" == Skip Montanaro writes:

Dave> Here is the commit message:

Dave> Trying to bring PEP up to date with discussions on mailing
Dave> list...

Skip> Much appreciated. I just added a todo section near the top.
Skip> Anyone can feel free to add to the list or take care of any
Skip> items.

From the TODO:

- Need to complete initial list of formatting parameters and settle on
  names.

This is what I have done in the _csv module:

    >>> import _csv
    >>> help(_csv)
    [snip]
    delimiter
        Defines the character that will be used to separate fields in
        the CSV record.
    quotechar
        Defines the character used to quote fields that contain the
        field separator or newlines. If set to None special characters
        will be escaped using the escapechar.

    escapechar
        Defines the character used to escape special characters. Only
        used if quotechar is None.

    doublequote
        When True, quotes in a field must be doubled up.

    skipinitialspace
        When True spaces following the delimiter are ignored.

    lineterminator
        The string used to terminate records.

    quoting
        Controls the generation of quotes around fields when writing
        records. This is only used when quotechar is not None.

    autoclear
        When True, calling parse() will automatically call the clear()
        method if the previous call to parse() raised an exception
        during parsing.

    strict
        When True, the parser will raise an exception on malformed
        fields rather than attempting to guess the right behavior.
    [snip]

Not sure that we need to keep the last two... When the parser fails you
are able to look at the fields it managed to parse before the problem was
encountered. This might be useful for the sniffer. The autoclear parameter
controls whether or not you must manually clear() the partial record
before trying to parse more data.

The strict parameter controls what happens when you see data like this:

    "blah","oops" blah"

If strict is False then the " after the oops is included as part of the
field 'oops" blah'. If strict is True, an exception is raised.

- Dave

--
http://www.object-craft.com.au

From djc at object-craft.com.au Thu Jan 30 15:03:57 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 31 Jan 2003 01:03:57 +1100
Subject: [Csv] Devil in the details, including the small one between
 delimiters and quotechars
In-Reply-To: <15928.4083.834299.369381@montanaro.dyndns.org>
References: <1043859517.16012.14.camel@software1.logiplex.internal>
 <15928.4083.834299.369381@montanaro.dyndns.org>
Message-ID: 

Checking against the current version of the CSV parser.
Cliff> 1, "not quoted","quoted" Cliff> It seems reasonable to parse this as: Cliff> [1, ' "not quoted"', "quoted"] Cliff> which is the described Excel behavior. >>> import _csv >>> p = _csv.parser() >>> p.parse('1, "not quoted","quoted"') ['1', ' "not quoted"', 'quoted'] Looks OK. Cliff> Now consider Cliff> 1,"not quoted" ,"quoted" Cliff> Is the second field quoted or not? If it is, do we discard the Cliff> extraneous whitespace following it or raise an exception? The current version of the _csv parser can do two things depending upon the value of the strict parameter. >>> p.strict 0 >>> p.parse('1,"not quoted" ,"quoted"') ['1', 'not quoted ', 'quoted'] >>> p.strict = 1 >>> p.parse('1,"not quoted" ,"quoted"') Traceback (most recent call last): File "<stdin>", line 1, in ? _csv.Error: , expected after " Skip> Well, there's always the "be flexible in what you accept, Skip> strict in what you generate" school of thought. In the above, Skip> that would suggest the list returned would be Skip> ['1', 'not quoted', 'quoted'] Why wouldn't you include the trailing space on the second field? Andrew, what does Excel do here? Hmm... I was sort of expecting _csv to do this: ['1', 'not quoted" ', 'quoted'] Skip> It seems like a minor formatting glitch. How about a warning? Skip> Or a "strict" flag for the parser? I think that there are enough variations here that strict is not enough. The second one does look a bit bogus... ['1', '"not quoted" ', 'quoted'] ['1', 'not quoted" ', 'quoted'] ['1', 'not quoted ', 'quoted'] Cliff> Worse, consider this Cliff> "quoted", "not quoted, but this ""field"" has delimiters and quotes" Skip> Depends on the setting of skipinitialspaces.
If false, you get Skip> ['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"'] parser does this: ['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"'] Skip> if True, I think you get Skip> ['quoted', 'not quoted, but this "field" has delimiters and quotes'] Yeah, but the doublequote stuff is only meant for quoted fields (or is it). Cliff> How should this parse? I say free exceptions for everyone. Don't know if exceptions are what we need. We just need to come up with parameters which control the parser to sufficient detail to handle the dialect variations. Cliff> I propose space between delimiters and quotes raise an exception Cliff> and let's be done with it. I don't think this really affects Cliff> Excel compatibility since Excel will never generate this type of Cliff> file and doesn't require it for import. It's true that some Cliff> files that Excel would import (probably incorrectly) won't import Cliff> in CSV, but I think that's outside the scope of Excel Cliff> compatibility. Skip> Sounds good to me. I dunno. We should look at the corner cases and handle as many as we can in the dialect. That is sort of the whole point of why we are here. Cliff> Anyway, I know no one has said "On your mark, get set" yet, but I Cliff> can't think without code sitting in front of me, breaking worse Cliff> with every keystroke, so in addition to creating some test cases, Cliff> I've hacked up a very preliminary CSV module so we have something Cliff> to play with. I was up til 6am so if there's anything odd, I Cliff> blame it on lack of sleep and the feverish optimism and glossing Cliff> of detail that comes with it. Skip> Perhaps you and Dave were in a race but didn't know it? ;-) When Skip mentioned that we were going to have the speedy Object Craft parser I just checked in the _csv module. It does not handle all of what we have been discussing, but it is close. 
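The two strict behaviours Dave shows map directly onto the csv module that eventually shipped in the standard library; a minimal sketch for comparison (the reader-based spelling below is the later API, not code from this thread):

```python
import csv

# Non-strict (the default): the stray space after the closing quote is
# folded back into the field rather than rejected.
rows = list(csv.reader(['1,"not quoted" ,"quoted"']))
print(rows)  # [['1', 'not quoted ', 'quoted']]

# Strict: the same input raises csv.Error instead.
try:
    list(csv.reader(['1,"not quoted" ,"quoted"'], strict=True))
except csv.Error as exc:
    print('rejected:', exc)
```

So of the three candidate parses Dave lists, the shipped non-strict parser picks the third ('not quoted ' with the trailing space kept).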
- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Thu Jan 30 15:18:50 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 01:18:50 +1100 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: References: Message-ID: >>>>> "Kevin" == Kevin Altis writes: >> Exceptions should be raised for anything else, even None. An empty >> field is "". [snip] >> I really still dislike this whole None thing. Whose use case is >> that anyway? [snip] Kevin> Are you saying that you want to throw an exception instead? Kevin> Booleans may also present a problem. I was mostly thinking in Kevin> terms of importing and exporting data from embedded databases Kevin> like MetaKit, my own list of dictionaries (flatfile stuff), Kevin> PySQLite, Gadfly. Anyway, the implication might be that it is Kevin> necessary for the user to sanitize data as part of the export Kevin> operation too. Have to ponder that. The penny finally dropped!!! The None thing and the implicit __str__ conversion is there in the Object Craft parser to be compatible with the DB-API. 
Consider the following code (which is close to something I had to do a couple of years ago): import csv import Sybase db = Sybase.connect(server, user, passwd, database) c = db.cursor() c.execute('select some stuff from the database') p = csv.parser() fp = open('results.csv', 'w') for row in c.fetchall(): fp.write(p.join(row)) fp.write('\n') We would be doing it slightly better now: import csv import Sybase db = Sybase.connect(server, user, passwd, database) c = db.cursor() c.execute('select some stuff from the database') csvwriter = csv.writer(file('results.csv', 'w')) for row in c.fetchall(): csvwriter.write(row) Or even: import csv import Sybase db = Sybase.connect(server, user, passwd, database) c = db.cursor() c.execute('select some stuff from the database') csvwriter = csv.writer(file('results.csv', 'w')) csvwriter.writelines(c) Now without the implicit __str__ and conversion of None to '' we would require a shirtload of code to do the same thing, only it would be as slow as a slug on valium. - Dave -- http://www.object-craft.com.au From skip at pobox.com Thu Jan 30 15:18:44 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 08:18:44 -0600 Subject: [Csv] Completely off-topic... In-Reply-To: References: Message-ID: <15929.13380.487790.427756@montanaro.dyndns.org> Saving useful commentary for later... Dave> ...""""I see,""\n', Dave> 'said the blind man","as he picked up his\n', Dave> 'hammer and saw"\n') My father used to use this expression all the time. I have no idea of its origins (though his dad was a carpenter and he started out life as one). He's been dead and gone for over 30 years now so I can't easily ask him. Any time I've used it people always looked at me like I was nuts. This is the first instance where I've actually encountered another person using it.
Skip From djc at object-craft.com.au Thu Jan 30 15:20:27 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 01:20:27 +1100 Subject: [Csv] Re: Completely off-topic... In-Reply-To: <15929.13380.487790.427756@montanaro.dyndns.org> References: <15929.13380.487790.427756@montanaro.dyndns.org> Message-ID: Skip> Saving useful commentary for later... Dave> ...""""I see,""\n', Dave> 'said the blind man","as he picked up his\n', Dave> 'hammer and saw"\n') Skip> My father used to use this expression all the time. I have no Skip> idea of its origins (though his dad was a carpenter and he Skip> started out life as one). He's been dead and gone for over 30 Skip> years now so I can't easily ask him. Any time I've used it Skip> people always looked at me like I was nuts. This is the first Skip> instance where I've actually encountered another person Skip> using it. Your mind is failing... http://www.object-craft.com.au/projects/csv/ :-) - Dave -- http://www.object-craft.com.au From skip at pobox.com Thu Jan 30 15:25:58 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 08:25:58 -0600 Subject: [Csv] Moving _csv.c closer to PEP In-Reply-To: References: Message-ID: <15929.13814.339184.359208@montanaro.dyndns.org> Dave> In the process of fixing _csv.c so it will handle the parameters Dave> specified in the PEP I came across yet another configurable Dave> dialect setting. Dave> doublequote Dave> When True quotechar in a field value is represented by two Dave> consecutive quotechar. Isn't that implied as long as quoting is not "never" and escapechar is None? If so, and we decide to have a separate doublequote parameter anyway, checking that relationship should be part of validating the parameter set. Speaking of doubling things, can the low-level parser support multi-character quotechar or delimiter strings? Recall I mentioned the previous client who didn't quote anything in their private file format and used ::: as the field separator.
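On Skip's ::: question: the parser that eventually shipped answers no; delimiter and quotechar are validated as single characters, so a multi-character separator has to be handled outside the csv machinery. A sketch (the pre-split workaround is my illustration, not from the thread):

```python
import csv

# The shipped parser only accepts 1-character delimiters:
try:
    csv.reader([], delimiter=':::')
except TypeError as exc:
    print('rejected:', exc)

# so a :::-separated format with unquoted fields gets split by hand:
line = 'field one:::field two:::field three'
print(line.split(':::'))  # ['field one', 'field two', 'field three']
```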
Skip From skip at pobox.com Thu Jan 30 15:40:31 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 08:40:31 -0600 Subject: [Csv] Devil in the details, including the small one between delimiters and quotechars In-Reply-To: References: <1043859517.16012.14.camel@software1.logiplex.internal> <15928.4083.834299.369381@montanaro.dyndns.org> Message-ID: <15929.14687.742062.136173@montanaro.dyndns.org> Dave> The current version of the _csv parser can do two things depending Dave> upon the value of the strict parameter. >>> p.strict 0 >>> p.parse('1,"not quoted" ,"quoted"') ['1', 'not quoted ', 'quoted'] Hmmm... I think this is wrong. You treated " as the quote character but tacked the space onto the field even though it occurred after the " which should have terminated the field. I would have expected: ['1', 'not quoted', 'quoted'] Barfing when p.strict == 1 seems correct to me. Skip> ['1', 'not quoted', 'quoted'] Dave> Why wouldn't you include the trailing space on the second field? Because the quoting tells you the field has ended. Dave> I think that there are enough variations here that strict is not Dave> enough. I think that when strict == 0, extra whitespace between the terminating quote and the delimiter or between the delimiter and the first quote should be discarded. If the field is not quoted, leading or trailing whitespace is ignored. I think that makes the treatment of whitespace near delimiters uniform (principle of least surprise?). If that's not what the user wants, she can damn well set the strict flag to True and catch the exception. ;-) (Speaking of exceptions, should there be a field in _csv.Error which holds the raw text which causes the exception?) Skip> Depends on the setting of skipinitialspaces. 
If false, you get Skip> ['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"'] Dave> parser does this: Dave> ['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"'] Skip> if True, I think you get Skip> ['quoted', 'not quoted, but this "field" has delimiters and quotes'] Dave> Yeah, but the doublequote stuff is only meant for quoted fields Dave> (or is it). Damn, yeah. Maybe we have overspecified the parameter set. Do we need both strict and skipinitialspaces? I'd say keep strict and dump skipinitialspaces, then define fairly precisely what to do when strict==False. Cliff> I propose space between delimiters and quotes raise an exception Cliff> and let's be done with it. I don't think this really affects Cliff> Excel compatibility since Excel will never generate this type of Cliff> file and doesn't require it for import. It's true that some Cliff> files that Excel would import (probably incorrectly) won't import Cliff> in CSV, but I think that's outside the scope of Excel Cliff> compatibility. Skip> Sounds good to me. I can never remember my past train of thought from one day to the next. :-( can-you-hear-me-waffling?-ly y'rs, Skip From skip at pobox.com Thu Jan 30 15:51:03 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 08:51:03 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: References: Message-ID: <15929.15319.901753.91284@montanaro.dyndns.org> Dave> The None thing and the implicit __str__ conversion is there in the Dave> Object Craft parser to be compatible with the DB-API.... Hmmm... I've used MySQLdb and psycopg and don't recall my queries returning None. (He furiously searches for None in PEP 249...) Ah, I see: SQL NULL values are represented by the Python None singleton on input and output. I generally have always defined my fields to have defaults and usually also declare them NOT NULL, so I wouldn't expect to see None in my query results. 
Still, the current treatment of None doesn't successfully round-trip ("select * ...", dump to csv, load from csv, repopulate database). Do you distinguish an empty field from a quoted field printed as ""? That is, are these output rows different? 5.0,,"Mary, Mary, quite contrary"\r\n 5.0,"","Mary, Mary, quite contrary"\r\n the former parsing into [5.0, None, "Mary, Mary, quite contrary"] and the latter into [5.0, "", "Mary, Mary, quite contrary"] ? Dave> Now without the implicit __str__ and conversion of None to '' we Dave> would require a shirtload of code to do the same thing, only it Dave> would be as slow as a slug on valium. How about we let the user define how to handle None? I would always want None's appearing in my data to raise an exception. You clearly have a use case for automatically mapping to the empty string. Skip From skip at pobox.com Thu Jan 30 15:52:24 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 08:52:24 -0600 Subject: [Csv] Re: Completely off-topic... In-Reply-To: References: <15929.13380.487790.427756@montanaro.dyndns.org> Message-ID: <15929.15400.755945.151484@montanaro.dyndns.org> Skip> This is the first instance where I've actually encountered Skip> another person using it. Dave> Your mind is failing... Dave> http://www.object-craft.com.au/projects/csv/ Dave> :-) I never read the instructions. I just click the "Download" link. ;-) Skip From LogiplexSoftware at earthlink.net Thu Jan 30 18:57:45 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 09:57:45 -0800 Subject: [Csv] Status In-Reply-To: <15928.37531.445243.692589@montanaro.dyndns.org> References: <15928.37531.445243.692589@montanaro.dyndns.org> Message-ID: <1043949465.16012.101.camel@software1.logiplex.internal> On Wed, 2003-01-29 at 18:48, Skip Montanaro wrote: > It would appear we are converging on dialects as data-only classes > (subclassable but with no methods). I'll update the PEP.
Many other ideas > have been floating through the list, and while I haven't been deleting the > messages, I haven't been adding them to the PEP either. Can someone help > with that? A comment on the dialect classes: I think a validate() method would be good in the base dialect class. A separate validate function would do just as well, but it seems logical to make it part of the class. > I'd like to get the wording in the PEP to converge on our current thoughts > and announce it on c.l.py and python-dev sometime tomorrow. I think we will > get a lot of feedback from both camps, hopefully some of it useful. ;-) Undoubtedly Timothy Rue will inform us that we are wasting our time as the VIC will solve this problem as well (after all, input->9 commands->output), but if you think you can live with that, sure. > I just finished making a pass through the messages I hadn't deleted (and > then saved them to a csv mbox file since the list appears to still not be > archiving). Here's what I think we've concluded: > > * Dialects are a set of defaults, probably implemented as classes (which > allows subclassing, whereas dicts wouldn't) and the default dialect > named as something like csv.dialects.excel or "excel" if we allow > string specifiers. (I think strings work well at the API, simply > because they are shorter and can more easily be presented in GUI > tools.) Agreed. Just to clarify, these strings will still be stored in a dictionary ("settings" or "dialects")?
> * A csvutils module should be at least scoped out which might do a fair > number of things: > > - Implements one or more sniffers for parameter types > > - Validates CSV files (e.g., constant number of columns, type > constraints on column values, compares against given dialect) > > - Generate a sniffer from a CSV file > > * These individual parameters are necessary (hopefully the names will be > enough clue as to their meaning): quote_char, quoting ("auto", > "always", "nonnumeric", "never"), delimiter, line_terminator, > skip_whitespace, escape_char, hard_return. Are there others? > > * We're still undecided about None (I certainly don't think it's a valid > value to be writing to CSV files) IMO, None should be mapped to '', so [None, None, None] would be saved as ,, or "","","" if quoting="always". I can't think of any reasonable alternative. However, it is arguable whether reading ,, should return [None,None,None] or ['','','']. I'd vote for the latter since we explicitly are not doing conversions between strings and Python types ('6' doesn't become 6). > * Rows can have variable numbers of columns and the application is > responsible for deciding on and enforcing max_rows or max_cols. > > * Don't raise exceptions needlessly. For example, specifying > quoting="never" and not specifying a value for escape_char would be > okay until you encounter a field when writing which contains the > delimiter. > > * Files have to be opened in binary mode (we can check the mode > attribute I believe) so we can do the right thing with line > terminators. > > * Data values should always be returned as strings, even if they are > valid numbers. Let the application do data conversion. > > Other stuff we haven't talked about much: > > * Unicode. I think we punt on this for now and just pretend that > passing codecs.open(csvfile, mode, encoding) is sufficient. I'm sure > Martin von Löwis will let us know if it isn't.
;-) Dave said, "The low > level parser (C code) is probably going to need to handle unicode." > Let's wait and see how well codecs.open() works for us. > > * We know we need tests but haven't talked much about them. I vote for > PyUnit as much as possible, though a certain amount of manual testing > using existing spreadsheets and databases will be required. +1. Testing all the corner cases is going to take some care. > * Exceptions. We know we need some. We should start with CSVError and > try to avoid getting carried away with things. If need be, we can add > a code field to the class. I don't like the idea of having 17 > different subclasses of CSVError though. It's too much complexity for > most users. I can only count to 12 (or was it 11?), so this would be good for me as well. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Thu Jan 30 20:58:25 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 11:58:25 -0800 Subject: [Csv] Module question... In-Reply-To: <20030130073352.A55953C32B@coffee.object-craft.com.au> References: <20030130073352.A55953C32B@coffee.object-craft.com.au> Message-ID: <1043956705.16012.112.camel@software1.logiplex.internal> On Wed, 2003-01-29 at 23:33, Andrew McNamara wrote: > >> What was the rational for using files, rather making the user do their > >> own readline(), etc? > > > >I'll try and summarize, if this is too simplistic or incorrect I'm sure > >someone will speak up :) > >One other possibility would be for the parser to only deal with one row at a > >time, leaving it up to the user code to feed the parser the row strings. But > >given the various possible line endings for a row of data and the fact that > >a column of a row may contain a line ending, not to mention all the other > >escape character issues we've discussed, this would be error-prone. 
> > This is the way the Object Craft module has worked - it works well enough, > and the universal end-of-line stuff in 2.3 makes it more seamless. Not > saying I'm wedded to this scheme, but I'd just like to have clear why > we've chosen one over the other. It simplifies use for the programmer not to have to feed one line at a time to the parser. If the programmer needs to generate data one line at a time, they can pass a pipe and feed data into that. > I'm trying to think of an example where operating on a file-like object > would be too restricting, and I can't - oh, here's one: what if you > wanted to do some pre-processing on the data (say it was uuencoded)? Then they can uudecode it, write it to a temp file and pass that instead of the original. I think the file-like object is the best compromise between ease-of-use and flexibility. > >The solution was to simply accept a file-like object and let the parser do > >the interpretation of a record. By having the parser present an iterable > >interface, the user code still gets the convenience of processing per row if > >needed or if no processing is desired a result list can easily be obtained. > > > >This should provide the most flexibility while still being easy to use. Hey, that's what I was thinking > Should the object just be defined as an iteratable, and leave closing, > etc, up to the user of the module? One downside of this is you can't > rewind an iterator, so things like the sniffer would be SOL. We can't > ensure that the passed file is rewindable either. Hmmm. -1. If it isn't sniffable, I'd end up having to write another CSV parser to support the features DSV currently has. 
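The rewind problem Andrew and Cliff are circling is why the sniffer that eventually shipped works on a string sample rather than on the iterator itself: the caller reads a bounded sample, sniffs it, then seeks back before parsing. A sketch against the later csv.Sniffer API (sample data invented for illustration):

```python
import csv
import io

data = 'name;qty;price\r\nwidget;2;5.66\r\n'
f = io.StringIO(data)

# Sniff a bounded sample, then rewind so the reader sees every row.
dialect = csv.Sniffer().sniff(f.read(1024))
f.seek(0)

print(dialect.delimiter)             # ;
print(list(csv.reader(f, dialect)))
```

This keeps the sniffer out of the parser's way entirely, at the cost of requiring a seekable source, which is exactly the trade-off under discussion.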
-- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Thu Jan 30 21:22:14 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 12:22:14 -0800 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <15929.15319.901753.91284@montanaro.dyndns.org> References: <15929.15319.901753.91284@montanaro.dyndns.org> Message-ID: <1043958134.15753.132.camel@software1.logiplex.internal> On Thu, 2003-01-30 at 06:51, Skip Montanaro wrote: > Dave> The None thing and the implicit __str__ conversion is there in the > Dave> Object Craft parser to be compatible with the DB-API.... > > Hmmm... I've used MySQLdb and psycopg and don't recall my queries returning > None. (He furiously searches for None in PEP 249...) Ah, I see: > > SQL NULL values are represented by the Python None singleton on input > and output. > > I generally have always defined my fields to have defaults and usually also > declare them NOT NULL, so I wouldn't expect to see None in my query results. > > Still, the current treatment of None doesn't successfully round-trip > ("select * ...", dump to csv, load from csv, repopulate database). Do you > distinguish an empty field from a quoted field printed as ""? That is, are > these output rows different? > > 5.0,,"Mary, Mary, quite contrary"\r\n > 5.0,"","Mary, Mary, quite contrary"\r\n > > the former parsing into > > [5.0, None, "Mary, Mary, quite contrary"] > > and the latter into > > [5.0, "", "Mary, Mary, quite contrary"] I'd suggest *not* mapping anything to any object but a string on *import*. CSV files don't have any way of carrying type information (except perhaps on an application-by-application basis, but I don't think that's where we're going here) so it's best to treat *everything* as a string. Export is a slightly different story. 
I do think None should be mapped to '' on export since that is the only reasonable value for it, and there are enough existing modules that use None to represent an empty value that this would be a reasonable thing for us to handle. > > Dave> Now without the implicit __str__ and conversion of None to '' we > Dave> would require a shirtload of code to do the same thing, only it > Dave> would be as slow as a slug on valium. > > How about we let the user define how to handle None? I would always want > None's appearing in my data to raise and exception. You clearly have a use > case for automatically mapping to the empty string. This might not affect performance too badly if we *always* raise an exception when passed anything but a string, and do the conversion (which would involve a table lookup) in the exception handler. Anything not in the table would cause the exception to be passed up to the caller. That being said, this might complicate things too much for many people. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Thu Jan 30 21:10:10 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 12:10:10 -0800 Subject: [Csv] Access Products sample In-Reply-To: References: Message-ID: <1043957410.16012.122.camel@software1.logiplex.internal> On Wed, 2003-01-29 at 22:33, Kevin Altis wrote: > I created a db and table in Access (products.mdb) using one of the built-in > samples. I created two rows, one that is mostly empty. I used the default > CSV export to create(Products.csv) and also output the table as an Excel > 97/2000 XLS file (Products.xls). Finally, I had Excel export as CSV > (ProductsExcel.csv). They are all contained in the attached zip. > > The currency column in the table is actually written out with formatting > ($5.66 instead of just 5.66). 
Note that when Excel exports this column it > has a trailing space for some reason (,$5.66 ,). So we've actually found an application that puts an extraneous space around the data, and it's our primary target. Figures. > While exporting it reminded me that unless a column in the data set contains > an embedded newline or carriage return it shouldn't matter whether the file > is opened in binary mode for reading. > > Without a schema we don't know what each column is supposed to contain, so > that is outside the domain of the csv import parser and export writer. Agreed. > The values exported by both Access and Excel are designed to prevent > information loss within the constraints of the CSV format, thus a field with > no value (what I think of as None in Python) is empty in the CSV Something just occurred to me: say someone is controlling Excel via win32com and obtains their data that way. Do the empty cells in that list appear as '' or None? If they do appear as None, then I'd be inclined to again raise the argument that we should map None => '' on export. Unless, of course, someone else has an idea they want to trade +1 votes on again > Should we be able to import and then export using a given dialect, such > that there would be no differences between the original csv and the exported > one? Actually, using the Access default of quoting strings it isn't possible > to do that because it implies having a schema to know that a given column is > a string. With the Excel csv format it is possible because a column that > doesn't contain a comma won't be quoted. I don't think that we need to worry about whether checksum(original) == checksum(output) to claim compatibility, only that we can read and write files compatible with said application.
If they turn out to be identical, that's just a side-effect ;) -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Thu Jan 30 21:23:27 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 12:23:27 -0800 Subject: [Csv] CSV interface question In-Reply-To: <15928.34442.337899.905054@montanaro.dyndns.org> References: <20030129060501.DB9193C1F4@coffee.object-craft.com.au> <20030129101658.4E95C3C1F4@coffee.object-craft.com.au> <1043861462.16012.46.camel@software1.logiplex.internal> <15928.4659.449989.410123@montanaro.dyndns.org> <1043863704.16012.64.camel@software1.logiplex.internal> <15928.6851.934680.995625@montanaro.dyndns.org> <1043867895.16012.87.camel@software1.logiplex.internal> <15928.34442.337899.905054@montanaro.dyndns.org> Message-ID: <1043958206.15753.134.camel@software1.logiplex.internal> On Wed, 2003-01-29 at 17:57, Skip Montanaro wrote: > Cliff> Consider now the programmer actually defining a new dialect: > Cliff> Passing a class or other structure (a dict is fine), they can > Cliff> create this on the fly with minimal work. Using a *string*, they > Cliff> must first "register" that string somewhere (probably in the > Cliff> mapping we agree upon) before they can actually make the function > Cliff> call. Granted, it's only a an extra step, but it requires a bit > Cliff> more knowledge (of the mapping) and doesn't seem to provide a > Cliff> real benefit. If you prefer a mapping to a class, that is fine, > Cliff> but lets pass the mapping rather than a string referring to it: > > Somewhere I think we still need to associate string names with these > beasts. Maybe it's just another attribute: > > class dialect: > name = None > > class excel(dialect): > name = "excel" > ... > > They should all be collected together for operation as a group. 
This could > be so a GUI knows all the names to present or so a sniffer can return all > the dialects with which a sample file is compatible. Both operations > suggest the need to register dialects somehow. +1 on this. Hm. If I keep trying I might get you to agree with everything just out of exhaustion -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Thu Jan 30 21:45:26 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 14:45:26 -0600 Subject: [Csv] Module question... In-Reply-To: <1043956705.16012.112.camel@software1.logiplex.internal> References: <20030130073352.A55953C32B@coffee.object-craft.com.au> <1043956705.16012.112.camel@software1.logiplex.internal> Message-ID: <15929.36582.100675.643804@montanaro.dyndns.org> Cliff> -1. If it isn't sniffable, I'd end up having to write another Cliff> CSV parser to support the features DSV currently has. Or approach the problem differently? Try asking the low-level parser to return a few rows of the file using different parameters. The low-level parser is fast enough that you can (given a filename) attempt to parse it many times in fairly short order. See what works. ;-) Skip From skip at pobox.com Thu Jan 30 22:02:27 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 15:02:27 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <1043958134.15753.132.camel@software1.logiplex.internal> References: <15929.15319.901753.91284@montanaro.dyndns.org> <1043958134.15753.132.camel@software1.logiplex.internal> Message-ID: <15929.37603.217278.623650@montanaro.dyndns.org> Cliff> Export is a slightly different story. I do think None should be Cliff> mapped to '' on export since that is the only reasonable value Cliff> for it, and there are enough existing modules that use None to Cliff> represent an empty value that this would be a reasonable thing Cliff> for us to handle. 
How is a database (that was Dave's use case) supposed to distinguish '' as SQL NULL vs '' as an empty string though? This is the sort of thing that bothers me about mapping None to ''. Cliff> This might not affect performance too badly if we *always* raise Cliff> an exception when passed anything but a string, ... except float and int values will be prevalent in the data. Can we limit the data to float, int, plain strings, Unicode and None? If so, I think you can just test the object types and do the right thing. In the case of None, I'd like to see a parameter which would allow me to flag that as an error. The extra complication might be limited to map_none_to='some string, possibly empty' in the writer() constructor and interpret_empty_string_as= in the reader() constructor. Skip From skip at pobox.com Thu Jan 30 22:03:51 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 15:03:51 -0600 Subject: [Csv] Access Products sample In-Reply-To: <1043957410.16012.122.camel@software1.logiplex.internal> References: <1043957410.16012.122.camel@software1.logiplex.internal> Message-ID: <15929.37687.44696.305338@montanaro.dyndns.org> >> The currency column in the table is actually written out with >> formatting ($5.66 instead of just 5.66). Note that when Excel exports >> this column it has a trailing space for some reason (,$5.66 ,). Cliff> So we've actually found an application that puts an extraneous Cliff> space around the data, and it's our primary target. Figures. So we just discovered we need an "access" dialect. ;-) Skip From LogiplexSoftware at earthlink.net Thu Jan 30 22:10:28 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 13:10:28 -0800 Subject: [Csv] Module question... 
In-Reply-To: <15929.36582.100675.643804@montanaro.dyndns.org> References: <20030130073352.A55953C32B@coffee.object-craft.com.au> <1043956705.16012.112.camel@software1.logiplex.internal> <15929.36582.100675.643804@montanaro.dyndns.org> Message-ID: <1043961028.15753.148.camel@software1.logiplex.internal> On Thu, 2003-01-30 at 12:45, Skip Montanaro wrote: > Cliff> -1. If it isn't sniffable, I'd end up having to write another > Cliff> CSV parser to support the features DSV currently has. > > Or approach the problem differently? Try asking the low-level parser to > return a few rows of the file using different parameters. The low-level > parser is fast enough that you can (given a filename) attempt to parse it > many times in fairly short order. See what works. ;-) I'm not sure that would be a good approach, as passing incorrect arguments to the parser might cause problems (it *is* written in C) and given the number of possible variations, it would be inefficient no matter how fast the parser. However, it is certainly possible to sniff the file prior to passing it to the parser. I suppose there is no reason the sniffer has to take the same type of file (or iterator) argument the parser does, although it would be nice, for consistency. Okay: -0 to whatever someone said that I was arguing about.
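[Editor's note: Cliff's sniff-before-parse idea, and Skip's try-everything variant of it, can be sketched in a few lines of pure Python. Everything below is illustrative rather than the module's API: the function name, the candidate list, and the consistent-column-count scoring are invented for the example, and the csv module that eventually shipped stands in for the low-level parser under discussion.]

```python
import csv

def sniff_delimiter(sample_lines, candidates=',;\t|'):
    # Try each candidate delimiter on the sample rows and keep the one
    # that yields a consistent field count greater than one.
    best, best_count = ',', 1
    for delim in candidates:
        rows = list(csv.reader(sample_lines, delimiter=delim))
        counts = set(len(row) for row in rows)
        if len(counts) == 1:
            count = counts.pop()
            if count > best_count:
                best, best_count = delim, count
    return best

print(sniff_delimiter(['a;b;c', '1;2;3']))  # prints ;
```

A real sniffer would probe the quote character and initial-space handling the same way; the shape of the try-and-score loop is the point.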
-- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Thu Jan 30 22:45:53 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 13:45:53 -0800 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <15929.37603.217278.623650@montanaro.dyndns.org> References: <15929.15319.901753.91284@montanaro.dyndns.org> <1043958134.15753.132.camel@software1.logiplex.internal> <15929.37603.217278.623650@montanaro.dyndns.org> Message-ID: <1043963153.16012.159.camel@software1.logiplex.internal> On Thu, 2003-01-30 at 13:02, Skip Montanaro wrote: > Cliff> Export is a slightly different story. I do think None should be > Cliff> mapped to '' on export since that is the only reasonable value > Cliff> for it, and there are enough existing modules that use None to > Cliff> represent an empty value that this would be a reasonable thing > Cliff> for us to handle. > > How is a database (that was Dave's use case) supposed to distinguish '' as > SQL NULL vs '' as an empty string though? This is the sort of thing that > bothers me about mapping None to ''. The database not being able to distinguish '' from SQL NULL is inherent in the file format. CSV files have no concept of '' vs None vs NULL. There is only ,, or ,"", which I think should be considered the same (because the same data [or lack of] can be expressed either way by tweaking the quote settings). If we don't want them to be considered the same, then we need YAO to specify whether to interpret them differently. > > Cliff> This might not affect performance too badly if we *always* raise > Cliff> an exception when passed anything but a string, ... > > except float and int values will be prevalent in the data. Well, right =) > Can we limit the data to float, int, plain strings, Unicode and None? If > so, I think you can just test the object types and do the right thing. 
In > the case of None, I'd like to see a parameter which would allow me to flag > that as an error. The extra complication might be limited to > > map_none_to='some string, possibly empty' This seems reasonable. > in the writer() constructor and > > interpret_empty_string_as= > > in the reader() constructor. Okay. > Skip Sure. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From LogiplexSoftware at earthlink.net Thu Jan 30 23:34:32 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 14:34:32 -0800 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <1043963153.16012.159.camel@software1.logiplex.internal> References: <15929.15319.901753.91284@montanaro.dyndns.org> <1043958134.15753.132.camel@software1.logiplex.internal> <15929.37603.217278.623650@montanaro.dyndns.org> <1043963153.16012.159.camel@software1.logiplex.internal> Message-ID: <1043966071.15753.177.camel@software1.logiplex.internal> On Thu, 2003-01-30 at 13:45, Cliff Wells wrote: > On Thu, 2003-01-30 at 13:02, Skip Montanaro wrote: > > Cliff> Export is a slightly different story. I do think None should be > > Cliff> mapped to '' on export since that is the only reasonable value > > Cliff> for it, and there are enough existing modules that use None to > > Cliff> represent an empty value that this would be a reasonable thing > > Cliff> for us to handle. > > > > How is a database (that was Dave's use case) supposed to distinguish '' as > > SQL NULL vs '' as an empty string though? This is the sort of thing that > > bothers me about mapping None to ''. > > The database not being able to distinguish '' from SQL NULL is inherent > in the file format. CSV files have no concept of '' vs None vs NULL. > There is only ,, or ,"", which I think should be considered the same > (because the same data [or lack of] can be expressed either way by > tweaking the quote settings). 
> > If we don't want them to be considered the same, then we need YAO to > specify whether to interpret them differently. Hm. Something has occurred to me. How about treating None as a true null value. That is, we never quote it. So, even if alwaysquote == true [1,2,3,'',None] would get exported as '1','2','3','', That way the difference between the two is saved in the CSV file. Obviously not all programs would be able to take advantage of this implicit information, but it seems likely some would (does Excel differentiate between an empty string and a null value? It wouldn't surprise me to discover that the '' becomes an empty *character* cell and the null value is simply ignored). Clearly this behavior is not desirable in all circumstances. However, the workaround in any case is to not have None values in the data to be exported. This punts any possible issues with it back into user-space. The only problem I have with this is that the behavior is sort of implicit. It saves us a couple of options but it puts the settings in the data, which I am not sure is a good idea. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From andrewm at object-craft.com.au Fri Jan 31 00:19:37 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 10:19:37 +1100 Subject: [Csv] Devil in the details, including the small one between delimiters and quotechars In-Reply-To: Message from Dave Cole References: <1043859517.16012.14.camel@software1.logiplex.internal> <15928.4083.834299.369381@montanaro.dyndns.org> Message-ID: <20030130231937.C26B83C32B@coffee.object-craft.com.au> >Cliff> 1,"not quoted" ,"quoted" > >Why wouldn't you include the trailing space on the second field? > >Andrew, what does Excel do here? Excel returns the trailing space, and honours the quote: ['1', 'not quoted ', 'quoted'] I've checked that it does this consistently (at end of line, etc). >Hmm...
I was sort of expecting _csv to do this: > >['1', 'not quoted" ', 'quoted'] That would have been something I fixed when doing the extensive Excel comparison - it's one of the tests. >Cliff> Worse, consider this >Cliff> "quoted", "not quoted, but this ""field"" has delimiters and quotes" > >Skip> Depends on the setting of skipinitialspaces. If false, you get >Skip> ['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"'] > >parser does this: > >['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"'] If we implement the "leading whitespace strip" then it would return: ['quoted', 'not quoted, but this "field" has delimiters and quotes'] -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Fri Jan 31 00:23:05 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 10:23:05 +1100 Subject: [Csv] Weird dialects? Message-ID: <20030130232305.7D1253C32B@coffee.object-craft.com.au> Something that occurred to me last night - we might find that there are strange dialects that we can't easily parse with the C parser (without making it ugly). It occurred to me that maybe the dialect should contain some sort of specification of the parser to use. But my feeling is that if it's too hard to parse with the C parser, it isn't a CSV file, and it should therefore be someone else's problem. Agreed? -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From LogiplexSoftware at earthlink.net Fri Jan 31 00:33:09 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 15:33:09 -0800 Subject: [Csv] Weird dialects?
In-Reply-To: <20030130232305.7D1253C32B@coffee.object-craft.com.au> References: <20030130232305.7D1253C32B@coffee.object-craft.com.au> Message-ID: <1043969589.15753.181.camel@software1.logiplex.internal> On Thu, 2003-01-30 at 15:23, Andrew McNamara wrote: > Something that occurred to me last night - we might find that there are > strange dialects that we can't easily parse with the C parser (without > making it ugly). It occurred to me that maybe the dialect should contain > some sort of specification of the parser to use. But my feeling is that > if it's too hard to parse with the C parser, it isn't a CSV file, and > it should therefore be someone else's problem. Agreed? Now there's a concrete definition of CSV -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From andrewm at object-craft.com.au Fri Jan 31 00:35:01 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 10:35:01 +1100 Subject: [Csv] Devil in the details, including the small one between delimiters and quotechars In-Reply-To: Message from Skip Montanaro <15929.14687.742062.136173@montanaro.dyndns.org> References: <1043859517.16012.14.camel@software1.logiplex.internal> <15928.4083.834299.369381@montanaro.dyndns.org> <15929.14687.742062.136173@montanaro.dyndns.org> Message-ID: <20030130233501.6DD5C3C32B@coffee.object-craft.com.au> > >>> p.parse('1,"not quoted" ,"quoted"') > ['1', 'not quoted ', 'quoted'] > >Hmmm... I think this is wrong. You treated " as the quote character but >tacked the space onto the field even though it occurred after the " which >should have terminated the field. I would have expected: "Wrong" it might be, but that's what Excel does... >Damn, yeah. Maybe we have overspecified the parameter set. Do we need both >strict and skipinitialspaces? I'd say keep strict and dump >skipinitialspaces, then define fairly precisely what to do when >strict==False.
I'd go for fine grained in the back end module - remember we have the "dialects" stuff to hide the complexity from the average user. If anything, strict should be broken up so a given flag only enables one feature. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Fri Jan 31 00:39:14 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 10:39:14 +1100 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: Message from Skip Montanaro <15929.15319.901753.91284@montanaro.dyndns.org> References: <15929.15319.901753.91284@montanaro.dyndns.org> Message-ID: <20030130233914.A854F3C32B@coffee.object-craft.com.au> >How about we let the user define how to handle None? I would always want >None's appearing in my data to raise an exception. You clearly have a use >case for automatically mapping to the empty string. Maybe just add an "allow_none" flag - if false, it raises an exception on None; if true, it emits a null string? Sure it doesn't survive the round trip - if you care, you probably should post/pre process data. We can't be all things to all people. As mentioned earlier - True and False are also potentially a problem. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Fri Jan 31 00:43:37 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 10:43:37 +1100 Subject: [Csv] Status In-Reply-To: Message from Cliff Wells <1043949465.16012.101.camel@software1.logiplex.internal> References: <15928.37531.445243.692589@montanaro.dyndns.org> <1043949465.16012.101.camel@software1.logiplex.internal> Message-ID: <20030130234337.230973C32B@coffee.object-craft.com.au> >A comment on the dialect classes: I think a validate() method would be >good in the base dialect class. A separate validate function would do >just as well, but it seems logical to make it part of the class.
The underlying C module currently validates all the options and will raise an exception if an unknown option is set, etc. Should we change this - I'd hate to duplicate the tests? -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Fri Jan 31 00:57:04 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 10:57:04 +1100 Subject: [Csv] Devil in the details, including the small one between delimiters and quotechars In-Reply-To: <20030130233501.6DD5C3C32B@coffee.object-craft.com.au> References: <1043859517.16012.14.camel@software1.logiplex.internal> <15928.4083.834299.369381@montanaro.dyndns.org> <15929.14687.742062.136173@montanaro.dyndns.org> <20030130233501.6DD5C3C32B@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >> >>> p.parse('1,"not quoted" ,"quoted"') ['1', 'not quoted ', >> 'quoted'] >> >> Hmmm... I think this is wrong. You treated " as the quote >> character but tacked the space onto the field even though it >> occurred after the " which should have terminated the field. I >> would have expected: Andrew> "Wrong" it might be, but that's what Excel does... I thought so. How are we going to go about building up some dialect test cases? >> Damn, yeah. Maybe we have overspecified the parameter set. Do we >> need both strict and skipinitialspaces? I'd say keep strict and >> dump skipinitialspaces, then define fairly precisely what to do >> when strict==False. Andrew> I'd go for fine grained in the back end module - remember we Andrew> have the "dialects" stuff to hide the complexity from the Andrew> average user. Andrew> If anything, strict should be broken up so a given flag only Andrew> enables one feature. +1 I agree with that. Until we have a few dialects and a test suite we should hold off on trying to lock down all of the parameters. That would be placing the cart before the horse in my opinion. 
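[Editor's note: Dave's question about building up dialect test cases suggests a table-driven layout: one row per (input line, reader options, expected fields). The sketch below is illustrative - the `CASES` table and `run_cases` helper are invented names - but the option names come from this discussion and match the csv module as it eventually shipped, and the cases themselves are behaviors debated in this thread.]

```python
import csv

# (input line, reader options, expected fields)
CASES = [
    # Excel keeps text after a closing quote, including the space:
    ('1,"not quoted" ,"quoted"', {}, ['1', 'not quoted ', 'quoted']),
    # doubled quote characters collapse to one inside a quoted field:
    ('"he said ""hi"""', {}, ['he said "hi"']),
    # skipinitialspace eats whitespace right after the delimiter:
    ('a, b, c', {'skipinitialspace': True}, ['a', 'b', 'c']),
]

def run_cases(cases):
    # Parse each one-line sample and collect any mismatches.
    failures = []
    for line, options, expected in cases:
        got = next(csv.reader([line], **options))
        if got != expected:
            failures.append((line, options, got, expected))
    return failures

assert run_cases(CASES) == []
```

Collecting dialects then becomes a matter of growing the table, one block of cases per dialect and per back-end parser.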
- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Fri Jan 31 01:04:24 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 11:04:24 +1100 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <15929.15319.901753.91284@montanaro.dyndns.org> References: <15929.15319.901753.91284@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Dave> Now without the implicit __str__ and conversion of None to '' we Dave> would require a shirtload of code to do the same thing, only it Dave> would be as slow as a slug on valium. Skip> How about we let the user define how to handle None? I would Skip> always want None's appearing in my data to raise an exception. Skip> You clearly have a use case for automatically mapping to the Skip> empty string. I suspect that programs which combine the DB-API and CSV files are probably quite common. I agree that the round trip fails, but not all of those programs need to make the round trip. What the current behaviour does is "solve" the following: DB-API -> CSV I think you would find it hard to come up with a meaningful way to handle NULL columns for any variant of CSV -> DB-API Regardless of the source of the CSV. The only thing I can think of which makes even partial sense is the following field translation (for CSV -> DB-API): null -> None "null" -> "null" Does that mean that we should have an option on the reader/writer which provides this functionality? I don't know. I would probably use it if it were there.
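[Editor's note: Dave's null/None translation only works when both ends of the round trip agree on it. A toy illustration of the scheme - the function names are invented here, this is not the module's API, and a real implementation would fold it into the writer/reader options:]

```python
def encode_field(value):
    # Writer side of Dave's proposal: None becomes the bare token null,
    # while a genuine 'null' string is quoted so the two can be told
    # apart on the way back in.
    if value is None:
        return 'null'
    text = str(value)
    if text == 'null' or '"' in text or ',' in text:
        return '"%s"' % text.replace('"', '""')
    return text

def decode_field(token):
    # Reader side: the bare token null maps back to None; quoted
    # tokens just lose their quoting.
    if token == 'null':
        return None
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        return token[1:-1].replace('""', '"')
    return token
```

With this convention, `encode_field(None)` yields the unquoted token `null` and `encode_field('null')` yields `"null"`, so `decode_field` can recover None versus the literal string unambiguously - but only for files this pair of functions produced.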
- Dave -- http://www.object-craft.com.au From skip at pobox.com Fri Jan 31 01:10:13 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 18:10:13 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <1043966071.15753.177.camel@software1.logiplex.internal> References: <15929.15319.901753.91284@montanaro.dyndns.org> <1043958134.15753.132.camel@software1.logiplex.internal> <15929.37603.217278.623650@montanaro.dyndns.org> <1043963153.16012.159.camel@software1.logiplex.internal> <1043966071.15753.177.camel@software1.logiplex.internal> Message-ID: <15929.48869.249366.775005@montanaro.dyndns.org> Cliff> Hm. Something has occurred to me. How about treating None as a Cliff> true null value. That is, we never quote it. So, even if Cliff> alwaysquote == true Cliff> [1,2,3,'',None] Cliff> would get exported as Cliff> '1','2','3','', Too fragile, methinks. Also, as I've said before, in my application domain at least, trying to write None to a CSV file is a bug. Skip From djc at object-craft.com.au Fri Jan 31 01:12:31 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 11:12:31 +1100 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <15929.37603.217278.623650@montanaro.dyndns.org> References: <15929.15319.901753.91284@montanaro.dyndns.org> <1043958134.15753.132.camel@software1.logiplex.internal> <15929.37603.217278.623650@montanaro.dyndns.org> Message-ID: Cliff> Export is a slightly different story. I do think None should be Cliff> mapped to '' on export since that is the only reasonable value Cliff> for it, and there are enough existing modules that use None to Cliff> represent an empty value that this would be a reasonable thing Cliff> for us to handle. Skip> How is a database (that was Dave's use case) supposed to Skip> distinguish '' as SQL NULL vs '' as an empty string though? Skip> This is the sort of thing that bothers me about mapping None to Skip> ''.
Cliff> This might not affect performance too badly if we *always* raise Cliff> an exception when passed anything but a string, ... Skip> except float and int values will be prevalent in the data. Skip> Can we limit the data to float, int, plain strings, Unicode and Skip> None? If so, I think you can just test the object types and do Skip> the right thing. In the case of None, I'd like to see a Skip> parameter which would allow me to flag that as an error. The Skip> extra complication might be limited to Skip> Skip> map_none_to='some string, possibly empty' Skip> Skip> in the writer() constructor and Skip> Skip> interpret_empty_string_as= Skip> Skip> in the reader() constructor. I think that we should have an option (or set of options) which causes the following: * In the writer, export None as the unquoted string 'null'. * In the writer, export the string 'null' as the quoted string "null". * In the reader, import the unquoted string 'null' as None. * In the reader, import the quoted string "null" as 'null'. This solves the ambiguity for the case when we are in control of the round trip. When we are not in control of the round trip all bets are off anyway since there is no standard (that I know of) for expressing this. - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Fri Jan 31 01:16:00 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 11:16:00 +1100 Subject: [Csv] Status In-Reply-To: Message from Andrew McNamara <20030130035141.271EA3C32B@coffee.object-craft.com.au> References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au> <15928.40341.991680.82247@montanaro.dyndns.org> <20030130035141.271EA3C32B@coffee.object-craft.com.au> Message-ID: <20030131001600.C81D83C32B@coffee.object-craft.com.au> >>I can live with that. I would propose then that escape_char default to >>something reasonable, not None.
> That's a little hairy, because the resulting file can't be parsed >correctly by Excel. But it should be safe if the escape_char is only >emitted if quote is set to none. Hmmm - I just realised this isn't safe where the excel dialect is concerned - excel does no special processing of backslash, so our parser shouldn't either. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Fri Jan 31 01:19:34 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 11:19:34 +1100 Subject: [Csv] Weird dialects? In-Reply-To: <20030130232305.7D1253C32B@coffee.object-craft.com.au> References: <20030130232305.7D1253C32B@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: Andrew> Something that occurred to me last night - we might find that Andrew> there are strange dialects that we can't easily parse with the Andrew> C parser (without making it ugly). It occurred to me that maybe Andrew> the dialect should contain some sort of specification of the Andrew> parser to use. But my feeling is that if it's too hard to Andrew> parse with the C parser, it isn't a CSV file, and it should Andrew> therefore be someone else's problem. Agreed? Why not allow the parser factory function to be an optional argument to the reader and writer factory functions?

    class csvreader:
        def __init__(self, fileobj, dialect='excel2000', parser=_csv.parser,
                     **options):
            :
            self.parser = parser(**parser_options)

This would allow pluggable parsers.
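[Editor's note: Dave's factory-argument idea can be exercised with a stand-in back end. Everything below is hypothetical - the real _csv parser has a different construction protocol, and `Reader`/`SplitParser` are names invented for this sketch - it only demonstrates the pluggable-parser pattern:]

```python
class SplitParser:
    # Trivial stand-in back end: naive split on a delimiter, no quoting.
    def __init__(self, delimiter=','):
        self.delimiter = delimiter

    def parse(self, line):
        return line.split(self.delimiter)

class Reader:
    # The reader takes a parser *factory* and builds the back end from
    # the remaining keyword options, as in Dave's csvreader sketch.
    def __init__(self, lines, dialect='excel', parser=SplitParser, **options):
        self.lines = iter(lines)
        self.parser = parser(**options)

    def __iter__(self):
        for line in self.lines:
            yield self.parser.parse(line.rstrip('\n'))

rows = list(Reader(['a:b:c\n', '1:2:3\n'], parser=SplitParser, delimiter=':'))
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```

Swapping in a different back end is then just `Reader(f, parser=OtherParser, ...)`, which is the flexibility Andrew suspects is more than the module needs.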
- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Fri Jan 31 01:21:38 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 11:21:38 +1100 Subject: [Csv] Status In-Reply-To: <20030131001600.C81D83C32B@coffee.object-craft.com.au> References: <15928.37531.445243.692589@montanaro.dyndns.org> <20030130031254.D2E853C32B@coffee.object-craft.com.au> <15928.40341.991680.82247@montanaro.dyndns.org> <20030130035141.271EA3C32B@coffee.object-craft.com.au> <20030131001600.C81D83C32B@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >>> I can live with that. I would propose then that escape_char >>> default to something reasonable, not None. >> That's a little hairy, because the resulting file can't be parsed >> correctly by Excel. But it should be safe if the escape_char is >> only emitted if quote is set to none. Andrew> Hmmm - I just realised this isn't safe where the excel dialect Andrew> is concerned - excel does no special processing of backslash, Andrew> so our parser shouldn't either. That is why for the 'excel2000' dialect you set the escapechar to None. Excel has no escapechar so we do not set one in the parser. Am I missing something? - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Fri Jan 31 01:25:57 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 11:25:57 +1100 Subject: [Csv] Weird dialects? In-Reply-To: Message from Dave Cole References: <20030130232305.7D1253C32B@coffee.object-craft.com.au> Message-ID: <20030131002557.8ABCB3C32B@coffee.object-craft.com.au> >Andrew> Something that occured to me last night - we might find that >Andrew> there are strange dialects that we can't easily parse with the >Andrew> C parser (without make it ugly). It occured to me that maybe >Andrew> the dialect should contain some sort of specification of the >Andrew> parser to use. 
But my feeling is that if it's too hard to >Andrew> parse with the C parser, it isn't a CSV file, and it should >Andrew> therefore be someone else's problem. Agreed? > >Why not allow the parser factory function to be an optional argument >to the reader and writer factory functions? > > class csvreader: > def __init__(self, fileobj, dialect='excel2000', parser=_csv.parser, > **options): > : > self.parser = parser(**parser_options) > >This would allow pluggable parsers. Well, that's essentially what I was suggesting, but I suspect it's too much flexibility - we're not trying to build a general purpose parser framework. And on further thought, this is something that can be addressed later, if need be. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Fri Jan 31 01:28:08 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 11:28:08 +1100 Subject: [Csv] Moving _csv.c closer to PEP In-Reply-To: <15929.13814.339184.359208@montanaro.dyndns.org> References: <15929.13814.339184.359208@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Dave> In the process of fixing _csv.c so it will handle the parameters Dave> specified in the PEP I came across yet another configurable Dave> dialect setting. Dave> doublequote Dave> When True quotechar in a field value is represented by two Dave> consecutive quotechar. Skip> Isn't that implied as long as quoting is not "never" and Skip> escapechar is None? If so, and we decide to have a separate Skip> doublequote parameter anyway, checking that relationship should Skip> be part of validating the parameter set. Checking against a dialect, or just as a collection of parameters? I think we are fast reaching the point where the only meaningful way forward is to start collecting dialects. Skip> Speaking of doubling things, can the low-level parser support Skip> multi-character quotechar or delimiter strings?
Recall I Skip> mentioned the previous client who didn't quote anything in their Skip> private file format and used ::: as the field separator. Currently the parser only handles single character quotechar, delimiter, and escapechar. I suspect that quotechar, delimiter, and escapechar of more than a single character might be stretching the bounds of what you could reasonably call a CSV parser. - Dave -- http://www.object-craft.com.au From skip at pobox.com Fri Jan 31 01:39:06 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 30 Jan 2003 18:39:06 -0600 Subject: [Csv] Re: First Cut at CSV PEP In-Reply-To: <20030130233914.A854F3C32B@coffee.object-craft.com.au> References: <15929.15319.901753.91284@montanaro.dyndns.org> <20030130233914.A854F3C32B@coffee.object-craft.com.au> Message-ID: <15929.50602.984909.597305@montanaro.dyndns.org> Andrew> Maybe just add an "allow_none" flag Good enough for me. Andrew> As mentioned earlier - True and False are also potentially a Andrew> problem. You could add allow_booleans, which would have them written as True and False (those will be grokked by many SQL dialects), otherwise they map to 1 and 0. Skip From LogiplexSoftware at earthlink.net Fri Jan 31 01:48:48 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 30 Jan 2003 16:48:48 -0800 Subject: [Csv] Status In-Reply-To: <20030130234337.230973C32B@coffee.object-craft.com.au> References: <15928.37531.445243.692589@montanaro.dyndns.org> <1043949465.16012.101.camel@software1.logiplex.internal> <20030130234337.230973C32B@coffee.object-craft.com.au> Message-ID: <1043974128.16012.184.camel@software1.logiplex.internal> On Thu, 2003-01-30 at 15:43, Andrew McNamara wrote: > >A comment on the dialect classes: I think a validate() method would be > >good in the base dialect class. A separate validate function would do > >just as well, but it seems logical to make it part of the class.
> > The underlying C module currently validates all the options and will raise > an exception if an unknown option is set, etc. Should we change this - I'd > hate to duplicate the tests? I think having it outside the parser is preferable since it allows for easier customization (especially for the user). I can't think of any useful cases off the top of my head, but my over-engineering instinct tells me this is so. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From andrewm at object-craft.com.au Fri Jan 31 02:27:05 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 12:27:05 +1100 Subject: [Csv] StringIO a bit of a lemon... Message-ID: <20030131012705.9EC753C32B@coffee.object-craft.com.au> Not only does StringIO lack a "mode" attribute, it also can't be used as an iterator (like real file objects), as it lacks a .next() method. This is somewhat annoying: if we accept an iterator, rather than specifically a file, it makes the module more generally useful. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Fri Jan 31 03:00:05 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 13:00:05 +1100 Subject: [Csv] Access Products sample In-Reply-To: <15929.37687.44696.305338@montanaro.dyndns.org> References: <1043957410.16012.122.camel@software1.logiplex.internal> <15929.37687.44696.305338@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: >>> The currency column in the table is actually written out with >>> formatting ($5.66 instead of just 5.66). Note that when Excel >>> exports this column it has a trailing space for some reason >>> (,$5.66 ,). Cliff> So we've actually found an application that puts an extraneous Cliff> space around the data, and it's our primary target. Figures. Skip> So we just discovered we need an "access" dialect. ;-) Not really.
Python has no concept of currency types (last time I looked). The '$5.66 ' thing is an artifact of converting currency to string, not float to string. - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Fri Jan 31 03:06:15 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 13:06:15 +1100 Subject: [Csv] StringIO a bit of a lemon... In-Reply-To: Message from Andrew McNamara <20030131012705.9EC753C32B@coffee.object-craft.com.au> References: <20030131012705.9EC753C32B@coffee.object-craft.com.au> Message-ID: <20030131020615.6CAC03C32B@coffee.object-craft.com.au> >Not only does StringIO lack a "mode" attribute, it also can't be used as >an iterator (like real file objects), as it lacks a .next() method. This >is somewhat annoying: if we accept an iterator, rather than specifically >a file, it makes the module more generally useful. Ignore me. I should be calling iter(fileobj). -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From altis at semi-retired.com Fri Jan 31 05:35:00 2003 From: altis at semi-retired.com (Kevin Altis) Date: Thu, 30 Jan 2003 20:35:00 -0800 Subject: [Csv] Moving _csv.c closer to PEP In-Reply-To: Message-ID: > From: Dave Cole > > Skip> Speaking of doubling things, can the low-level parser support > Skip> multi-character quotechar or delimiter strings? Recall I > Skip> mentioned the previous client who didn't quote anything in their > Skip> private file format and used ::: as the field separator. > > Currently the parser only handles single character quotechar, > delimiter, and escapechar. > > I suspect that quotechar, delimiter, and escapechar of more than a > single character might be stretching the bounds of what you could > reasonably call a CSV parser. Agreed! Double-byte Unicode characters would still be one character in case we do have to do something special for unicode support.
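[Editor's note: Kevin's point - a multi-byte character is still a single character once decoded - is easy to check. The ideographic comma below is an arbitrary example, and the csv module as it eventually shipped does accept any one-character string as a delimiter:]

```python
import csv

delim = '\u3001'  # ideographic comma: one character, three bytes in UTF-8
assert len(delim) == 1
assert len(delim.encode('utf-8')) == 3

rows = list(csv.reader(['a\u3001b\u3001c'], delimiter=delim))
print(rows)  # [['a', 'b', 'c']]
```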
ka From andrewm at object-craft.com.au Fri Jan 31 06:01:14 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 16:01:14 +1100 Subject: [Csv] Added some tests Message-ID: <20030131050114.BCF583C32B@coffee.object-craft.com.au> If you've missed the check-in message, I've added some tests finally (essentially just the tests from the Object Craft CSV module stripped down to just those relevant for the excel dialect). I'm thinking we should organise the tests as: - a bunch of tests for each dialect - a bunch of tests for each backend parser -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Fri Jan 31 06:10:32 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 31 Jan 2003 16:10:32 +1100 Subject: [Csv] Excel - trademark... Message-ID: <20030131051032.45F053C32B@coffee.object-craft.com.au> Are we going to get into any trademark poo by calling the dialect "excel"? Should we call it something else to avoid problems (sigh)? -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Fri Jan 31 12:55:47 2003 From: djc at object-craft.com.au (Dave Cole) Date: 31 Jan 2003 22:55:47 +1100 Subject: [Csv] Excel - trademark... In-Reply-To: <20030131051032.45F053C32B@coffee.object-craft.com.au> References: <20030131051032.45F053C32B@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: Andrew> Are we going to get into any trademark poo by calling the Andrew> dialect "excel"? Should we call it something else to avoid Andrew> problems (sigh)? Dunno. Importers in applications for foreign application data files usually name the foreign application. I just fired up Gnumeric and looked at the import dialog. It says "MS Excel (tm)" Should we call the dialect "excel(tm)"? 
- Dave

-- 
http://www.object-craft.com.au

From skip at pobox.com  Fri Jan 31 13:10:57 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 31 Jan 2003 06:10:57 -0600
Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv csv.py,1.4,1.5
In-Reply-To:
References:
Message-ID: <15930.26577.952898.246807@montanaro.dyndns.org>

    andrew> Modified Files:
    andrew> 	csv.py
    andrew> Log Message:
    andrew> Rename dialects from excel2000 to excel. Rename Error to be
    andrew> CSVError. Explicitly fetch iterator in reader class, rather than
    andrew> simply calling next() (which only works for self-iterators).

Minor nit. I think Error was fine. That's the standard for most
extension modules. I would normally import csv, then reference its
objects through it; csv.CSVError looks redundant to me. I'm not a
"from csv import CSVError" kind of guy, however, so I can understand
the desire to make the name more explicit when considered alone.

Skip

From skip at pobox.com  Fri Jan 31 14:07:01 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 31 Jan 2003 07:07:01 -0600
Subject: [Csv] Excel - trademark...
In-Reply-To: <20030131051032.45F053C32B@coffee.object-craft.com.au>
References: <20030131051032.45F053C32B@coffee.object-craft.com.au>
Message-ID: <15930.29941.147355.904094@montanaro.dyndns.org>

Andrew> Are we going to get into any trademark poo by calling the
Andrew> dialect "excel"? Should we call it something else to avoid
Andrew> problems (sigh)?

I wouldn't worry about it.
Here's a CPAN search for Excel:

    cpan> i /Excel/
    Distribution    I/IS/ISTERIN/XML-Excel-0.02.tar.gz
    Distribution    I/IS/ISTERIN/XML-SAXDriver-Excel-0.06.tar.gz
    Distribution    J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz
    Distribution    K/KW/KWITKNR/DBD-Excel-0.06.tar.gz
    Distribution    K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz
    Distribution    R/RK/RKITOVER/Spreadsheet-ParseExcel_XLHTML-0.02.tar.gz
    Distribution    T/TM/TMTM/Spreadsheet-ParseExcel-Simple-1.01.tar.gz
    Distribution    T/TM/TMTM/Spreadsheet-WriteExcel-FromDB-0.09.tar.gz
    Distribution    T/TM/TMTM/Spreadsheet-WriteExcel-Simple-0.03.tar.gz
    Module          DBD::Excel (K/KW/KWITKNR/DBD-Excel-0.06.tar.gz)
    Module          Spreadsheet::Excel (Contact Author Rachel McGregor Rawlings)
    Module          Spreadsheet::ParseExcel (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel::Dump (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel::FmtDefault (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel::FmtJapan (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel::FmtJapan2 (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel::FmtUnicode (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel::SaveParser (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel::Simple (T/TM/TMTM/Spreadsheet-ParseExcel-Simple-1.01.tar.gz)
    Module          Spreadsheet::ParseExcel::Utility (K/KW/KWITKNR/Spreadsheet-ParseExcel-0.2602.tar.gz)
    Module          Spreadsheet::ParseExcel_XLHTML (R/RK/RKITOVER/Spreadsheet-ParseExcel_XLHTML-0.02.tar.gz)
    Module          Spreadsheet::WriteExcel (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::BIFFwriter (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::Big (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::Format (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::Formula (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::FromDB (T/TM/TMTM/Spreadsheet-WriteExcel-FromDB-0.09.tar.gz)
    Module          Spreadsheet::WriteExcel::FromDB::Oracle (T/TM/TMTM/Spreadsheet-WriteExcel-FromDB-0.09.tar.gz)
    Module          Spreadsheet::WriteExcel::FromDB::Pg (T/TM/TMTM/Spreadsheet-WriteExcel-FromDB-0.09.tar.gz)
    Module          Spreadsheet::WriteExcel::FromDB::column_finder (T/TM/TMTM/Spreadsheet-WriteExcel-FromDB-0.09.tar.gz)
    Module          Spreadsheet::WriteExcel::FromDB::mysql (T/TM/TMTM/Spreadsheet-WriteExcel-FromDB-0.09.tar.gz)
    Module          Spreadsheet::WriteExcel::FromDB::sybase (T/TM/TMTM/Spreadsheet-WriteExcel-FromDB-0.09.tar.gz)
    Module          Spreadsheet::WriteExcel::OLEwriter (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::Simple (T/TM/TMTM/Spreadsheet-WriteExcel-Simple-0.03.tar.gz)
    Module          Spreadsheet::WriteExcel::Utility (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::Workbook (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::WorkbookBig (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Spreadsheet::WriteExcel::Worksheet (J/JM/JMCNAMARA/Spreadsheet-WriteExcel-0.40.tar.gz)
    Module          Win32::ShellExt::ExcelToClipboard (J/JB/JBNIVOIT/Win32-ShellExt-0.1.zip)
    Module          XML::Excel (I/IS/ISTERIN/XML-Excel-0.02.tar.gz)
    Module          XML::SAXDriver::Excel (I/IS/ISTERIN/XML-SAXDriver-Excel-0.06.tar.gz)

    41 items found

In short, Microsoft will have a field day with the Perl folks long
before they notice us.
Skip

From LogiplexSoftware at earthlink.net  Fri Jan 31 19:17:21 2003
From: LogiplexSoftware at earthlink.net (Cliff Wells)
Date: 31 Jan 2003 10:17:21 -0800
Subject: [Csv] Access Products sample
In-Reply-To:
References: <1043957410.16012.122.camel@software1.logiplex.internal>
	<15929.37687.44696.305338@montanaro.dyndns.org>
Message-ID: <1044037040.15753.190.camel@software1.logiplex.internal>

On Thu, 2003-01-30 at 18:00, Dave Cole wrote:
> >>>>> "Skip" == Skip Montanaro writes:
>
> >>> The currency column in the table is actually written out with
> >>> formatting ($5.66 instead of just 5.66). Note that when Excel
> >>> exports this column it has a trailing space for some reason
> >>> (,$5.66 ,).
>
> Cliff> So we've actually found an application that puts an extraneous
> Cliff> space around the data, and it's our primary target. Figures.
>
> Skip> So we just discovered we need an "access" dialect. ;-)
>
> Not really. Python has no concept of currency types (last time I
> looked). The '$5.66 ' thing is an artifact of converting currency to
> string, not float to string.

I'm not sure what you mean. A trailing space is a trailing space,
regardless of data type. In this case it isn't too important, as the
data isn't quoted (we can just consider the space part of the data),
but it shows that extraneous spaces might not be outside the scope of
our problem.

-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308
(800) 735-0555 x308

From skip at pobox.com  Fri Jan 31 22:39:12 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 31 Jan 2003 15:39:12 -0600
Subject: [Csv] csv.QUOTE_NEVER?
Message-ID: <15930.60672.18719.407166@montanaro.dyndns.org>

The three quoting constants are currently defined as QUOTE_MINIMAL,
QUOTE_ALL and QUOTE_NONNUMERIC. Didn't we decide there would be a
QUOTE_NEVER constant as well?
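[Archive note: the QUOTE_NEVER idea discussed here eventually shipped in the standard library under the name csv.QUOTE_NONE. A sketch of how two of the quoting modes behave in the released csv module (behaviour per the module as it shipped, not the 2003 sandbox):]

```python
import csv
import io

row = ["Test 1", 'Fred said "hey!"', 5.66]

# QUOTE_NONNUMERIC quotes every non-numeric field and doubles any
# embedded quote characters; the float is written bare.
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_NONNUMERIC).writerow(row)
print(out.getvalue())  # "Test 1","Fred said ""hey!""",5.66

# QUOTE_NONE (the eventual name for QUOTE_NEVER) never quotes;
# an escapechar is then required for quotes or delimiters in fields.
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_NONE, escapechar="\\").writerow(row)
print(out.getvalue())
```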
Skip

From skip at pobox.com  Fri Jan 31 22:59:40 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 31 Jan 2003 15:59:40 -0600
Subject: [Csv] PEP 305 - CSV File API
Message-ID: <15930.61900.995242.11815@montanaro.dyndns.org>

A new PEP (305), "CSV File API", is available for reader feedback. This
PEP describes an API and implementation for reading and writing CSV
files. There is a sample implementation available as well, which you
can take out for a spin.

The PEP is available at

    http://www.python.org/peps/pep-0305.html

(The latest version as of this note is 1.9. Please wait until that is
available to grab a copy on which to comment.)

The sample implementation, which is heavily based on Object Craft's
existing csv module, is available at

    http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/

To those people who are already using the Object Craft module: make
sure you rename your csv.so file before trying this one out.

Please send feedback to csv at mail.mojam.com. You can subscribe to
that list at

    http://manatee.mojam.com/mailman/listinfo/csv

That page contains a pointer to the list archives. (Many thanks, BTW,
to Barry Warsaw and the Mailman crew for Mailman 2.1. It looks
awesome.)

-- 
Skip Montanaro
skip at pobox.com
http://www.musi-cal.com/
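[Archive note: the API announced above is essentially what shipped as Python's standard csv module. A minimal usage sketch in the modern spelling, using the skipinitialspace option to handle the post-delimiter whitespace issue raised in the Access/DSV discussion earlier in this thread:]

```python
import csv
import io

data = "Test 1, Fred said hi, 5\nTest 2, plain, 6\n"

# skipinitialspace=True makes the reader ignore whitespace immediately
# following each delimiter, similar to DSV's more forgiving behaviour.
rows = list(csv.reader(io.StringIO(data), skipinitialspace=True))
print(rows)  # [['Test 1', 'Fred said hi', '5'], ['Test 2', 'plain', '6']]

# Without it, the leading spaces become part of each field.
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['Test 1', ' Fred said hi', ' 5'], ...]
```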