From skip at pobox.com Sat Feb 1 00:18:13 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri, 31 Jan 2003 17:18:13 -0600 Subject: [Csv] RE: [Python-Dev] PEP 305 - CSV File API In-Reply-To: <000101c2c97d$19cf82c0$8901a8c0@ERICDESKTOP> References: <15930.61900.995242.11815@montanaro.dyndns.org> <000101c2c97d$19cf82c0$8901a8c0@ERICDESKTOP> Message-ID: <15931.1077.597442.713603@montanaro.dyndns.org> eric> Travis Oliphant made a nice package for reading and writing eric> numeric arrays in scipy called scipy.io.... I wanted everyone eric> aware of the available alternative solutions so we can minimize eric> duplicated effort. Eric, Thanks for the heads up. Travis, why don't you subscribe to the csv at mail.mojam.com mailing list and join the fun? We're already considering how the csv module will interface with DB-API-based modules, and of course, Excel is central to our thoughts. It would be good to have the perspective of someone used to slinging scientific data around. The csv mailing list page is at http://manatee.mojam.com/mailman/listinfo/csv Skip From djc at object-craft.com.au Sat Feb 1 06:20:23 2003 From: djc at object-craft.com.au (Dave Cole) Date: 01 Feb 2003 16:20:23 +1100 Subject: [Csv] csv.QUOTE_NEVER? In-Reply-To: <15930.60672.18719.407166@montanaro.dyndns.org> References: <15930.60672.18719.407166@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> The three quoting constants are currently defined as Skip> QUOTE_MINIMAL, QUOTE_ALL and QUOTE_NONNUMERIC. Didn't we decide Skip> there would be a QUOTE_NEVER constant as well? I was going to define QUOTE_NEVER then realised that all you have to do is set quotechar to None. Why add the effort of implementing two ways to achieve the same thing. - Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Sat Feb 1 06:26:11 2003 From: djc at object-craft.com.au (Dave Cole) Date: 01 Feb 2003 16:26:11 +1100 Subject: [Csv] Access Products sample In-Reply-To: <1044037040.15753.190.camel@software1.logiplex.internal> References: <1043957410.16012.122.camel@software1.logiplex.internal> <15929.37687.44696.305338@montanaro.dyndns.org> <1044037040.15753.190.camel@software1.logiplex.internal> Message-ID: >>>>> "Cliff" == Cliff Wells writes: Cliff> On Thu, 2003-01-30 at 18:00, Dave Cole wrote: >> >>>>> "Skip" == Skip Montanaro writes: >> >> >>> The currency column in the table is actually written out with >> >>> formatting ($5.66 instead of just 5.66). Note that when Excel >> >>> exports this column it has a trailing space for some reason >>> >> (,$5.66 ,). >> Cliff> So we've actually found an application that puts an extraneous Cliff> space around the data, and it's our primary target. Figures. >> Skip> So we just discovered we need an "access" dialect. ;-) >> Not really. Python has no concept of currency types (last time I >> looked). The '$5.66 ' thing is an artifact of converting currency >> to string, not float to string. Cliff> I'm not sure what you mean. A trailing space is a trailing Cliff> space, regardless of data type. In this case, it isn't too Cliff> important as the data isn't quoted (we can just consider the Cliff> space part of the data), but it shows that extraneous spaces Cliff> might not be outside the scope of our problem. In my typically clumsy way I was trying to say that Excel has more type information available to it regarding the data being exported. The fact that the data has been formatted as currency tells Excel that it is not just a float, it is a money. 
Python does not have a money type. It seems that Excel then exports the money in a way which allows it to restore the formatting/type on import. Mind you I have not tried export/import on Excel, I am just guessing that the type is restored on import. - Dave -- http://www.object-craft.com.au From skip at pobox.com Sat Feb 1 16:05:27 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat, 1 Feb 2003 09:05:27 -0600 Subject: [Csv] csv.QUOTE_NEVER? In-Reply-To: References: <15930.60672.18719.407166@montanaro.dyndns.org> Message-ID: <15931.57911.857151.359281@montanaro.dyndns.org> Dave> I was going to define QUOTE_NEVER then realised that all you have Dave> to do is set quotechar to None. Why add the effort of Dave> implementing two ways to achieve the same thing. I think there's a certain uniformity in having the full spectrum of quote behaviors defined (from QUOTE_ALL ... QUOTE_NEVER). I skimmed the _csv.c source quickly just now but didn't see self->quoting used anywhere. Skip From skip at pobox.com Sat Feb 1 16:12:00 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat, 1 Feb 2003 09:12:00 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15931.58304.338512.44007@montanaro.dyndns.org> Fool that I am, when I announced PEP 305 I didn't set my Reply-To header to this list. I'm forwarding a few responses that have turned up on c.l.py. Skip -------------- next part -------------- An embedded message was scrubbed... From: Andrew Dalke Subject: Re: PEP 305 - CSV File API Date: Fri, 31 Jan 2003 17:17:48 -0700 Size: 7894 Url: http://mail.python.org/pipermail/csv/attachments/20030201/6af945ac/attachment.mht From skip at pobox.com Sat Feb 1 16:12:09 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat, 1 Feb 2003 09:12:09 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15931.58313.445471.334543@montanaro.dyndns.org> An embedded message was scrubbed... From: Ian Bicking Subject: Re: PEP 305 - CSV File API Date: 31 Jan 2003 20:03:10 -0600 Size: 5991 Url: http://mail.python.org/pipermail/csv/attachments/20030201/a519514f/attachment.mht From skip at pobox.com Sat Feb 1 16:14:01 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat, 1 Feb 2003 09:14:01 -0600 Subject: [Csv] RE: [Python-Dev] PEP 305 - CSV File API (fwd) Message-ID: <15931.58425.521404.154286@montanaro.dyndns.org> Passing this along as well. Travis Oliphant from the SciPy bunch joined the group. He's the author of scipy.io which includes facilities to read and write data in various formats. I haven't looked at the package. I'll let Travis summarize its relevant capabilities. Skip -------------- next part -------------- An embedded message was scrubbed... From: Travis Oliphant Subject: RE: [Python-Dev] PEP 305 - CSV File API Date: 31 Jan 2003 19:55:33 -0700 Size: 5100 Url: http://mail.python.org/pipermail/csv/attachments/20030201/924db701/attachment.mht From skip at pobox.com Sat Feb 1 16:14:11 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat, 1 Feb 2003 09:14:11 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15931.58435.789383.690793@montanaro.dyndns.org> An embedded message was scrubbed... 
From: Max M Subject: Re: PEP 305 - CSV File API Date: Sat, 01 Feb 2003 13:43:01 +0100 Size: 4518 Url: http://mail.python.org/pipermail/csv/attachments/20030201/0592e1ed/attachment.mht From skip at pobox.com Sat Feb 1 16:14:18 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat, 1 Feb 2003 09:14:18 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15931.58442.615341.935260@montanaro.dyndns.org> An embedded message was scrubbed... From: Roman Suzi Subject: Re: PEP 305 - CSV File API Date: Sat, 1 Feb 2003 16:50:12 +0300 (MSK) Size: 4160 Url: http://mail.python.org/pipermail/csv/attachments/20030201/397dff5b/attachment.mht From djc at object-craft.com.au Sun Feb 2 10:42:59 2003 From: djc at object-craft.com.au (Dave Cole) Date: 02 Feb 2003 20:42:59 +1100 Subject: [Csv] csv.QUOTE_NEVER? In-Reply-To: <15931.57911.857151.359281@montanaro.dyndns.org> References: <15930.60672.18719.407166@montanaro.dyndns.org> <15931.57911.857151.359281@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Dave> I was going to define QUOTE_NEVER then realised that all you Dave> have to do is set quotechar to None. Why add the effort of Dave> implementing two ways to achieve the same thing. Skip> I think there's a certain uniformity in having the full spectrum Skip> of quote behaviors defined (from QUOTE_ALL ... QUOTE_NEVER). I Skip> skimmed the _csv.c source quickly just now but didn't see Skip> self->quoting used anywhere. Not implemented yet. The options on quoting are extensions to the current module behaviour. - Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Sun Feb 2 12:04:42 2003 From: djc at object-craft.com.au (Dave Cole) Date: 02 Feb 2003 22:04:42 +1100 Subject: [Csv] Added code to implement quoting styles Message-ID:

>>> import _csv
>>>
>>> p = _csv.parser(escapechar='\\')
>>> l = ('a',2,'hello, there')
>>>
>>> for i in range(4):
...     p.quoting = i
...     print p.join(l)
...
a,2,"hello, there"
"a","2","hello, there"
"a",2,"hello, there"
a,2,hello\, there
>>> p.escapechar = None
>>> print p.join(l)
Traceback (most recent call last):
  File "", line 1, in ?
_csv.Error: delimter must be quoted or escaped

Ooops - just noticed the spelling error - I will fix that. - Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Sun Feb 2 12:15:45 2003 From: djc at object-craft.com.au (Dave Cole) Date: 02 Feb 2003 22:15:45 +1100 Subject: [Csv] Implemented skipinitialspace Message-ID: Well that was easy, just one extra test.

>>> import _csv
>>> p = _csv.parser()
>>> s = '"quoted", "not quoted, but this ""field"" has delimiters and quotes"'
>>> p.skipinitialspace = 0
>>> p.parse(s)
['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"']
>>> p.skipinitialspace = 1
>>> p.parse(s)
['quoted', 'not quoted, but this "field" has delimiters and quotes']

- Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Sun Feb 2 13:00:08 2003 From: djc at object-craft.com.au (Dave Cole) Date: 02 Feb 2003 23:00:08 +1100 Subject: [Csv] Implemented lineterminator Message-ID: The _csv.parser.join() now appends the lineterminator to the resulting record.

>>> import _csv
>>> p = _csv.parser()
>>> p.join([1,2,3])
'1,2,3\r\n'
>>> p.lineterminator = '\n'
>>> p.join([1,2,3])
'1,2,3\n'

I have not put any code into the parser to detect and report/fix fields which contain newlines which do not match the lineterminator. What should be happening there? - Dave -- http://www.object-craft.com.au
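One possible policy, sketched in pure Python rather than in the C parser (the check_newlines helper and its behaviour are hypothetical - nothing like it exists in the module):

    def check_newlines(fields, lineterminator='\r\n'):
        # A newline inside a field is legal CSV (the field will be
        # quoted on output), but a bare '\r' or '\n' that does not
        # form the configured terminator is worth flagging, since a
        # reader may mangle it on the round trip.
        for field in fields:
            rest = str(field).replace(lineterminator, '')
            if '\r' in rest or '\n' in rest:
                raise ValueError('field %r contains a newline that '
                                 'does not match the lineterminator' % (field,))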
From djc at object-craft.com.au Sun Feb 2 13:26:11 2003 From: djc at object-craft.com.au (Dave Cole) Date: 02 Feb 2003 23:26:11 +1100 Subject: [Csv] Made some small changes to the PEP Message-ID: Here is the commit message: Changed the csv.reader() fileobj argument to iterable. This gives us much more flexibility in processing filtered data. Made the example excel dialect match the dialect in csv.py. Added explanation of doublequote. Added explanation of csv.QUOTE_NONE. - Dave -- http://www.object-craft.com.au
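To see why an iterable argument is more flexible than a bare file object, here is a small sketch against the PEP 305 interface (assuming csv.reader accepts any iterable of lines, as the commit message says):

    import csv

    # Feed the reader from a plain list of strings - no file needed.
    lines = [
        '# a comment the CSV parser should never see',
        'a,b,c',
        '1,"hello, there",3',
    ]
    # Filter the lines before the parser sees them.
    filtered = [line for line in lines if not line.startswith('#')]
    for row in csv.reader(filtered):
        print row

The same trick works for decompressing or decoding data on the way in, which a file-object-only interface would not allow.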
From skip at pobox.com Sun Feb 2 15:22:58 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 08:22:58 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.10690.465556.671158@montanaro.dyndns.org> Passing along so we get it in the list archive... Skip -------------- next part -------------- An embedded message was scrubbed... From: Tyler Eaves Subject: Re: PEP 305 - CSV File API Date: Sun, 02 Feb 2003 03:14:17 GMT Size: 7256 Url: http://mail.python.org/pipermail/csv/attachments/20030202/cf2350b1/attachment.mht From skip at pobox.com Sun Feb 2 15:25:37 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 08:25:37 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.10849.138459.812300@montanaro.dyndns.org> Another for the archive. -------------- next part -------------- An embedded message was scrubbed... From: Jack Diederich Subject: Re: PEP 305 - CSV File API Date: Sat, 1 Feb 2003 22:43:37 -0500 Size: 4895 Url: http://mail.python.org/pipermail/csv/attachments/20030202/5fecd21c/attachment.mht From skip at pobox.com Sun Feb 2 18:35:02 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 11:35:02 -0600 Subject: [Csv] Re: [Python-Dev] PEP 305 - CSV File API Message-ID: <15933.22214.952419.308149@montanaro.dyndns.org> I don't see this one in the archives. I think Travis sent it to me but meant to send to the entire list. Skip -------------- next part -------------- An embedded message was scrubbed... From: Travis Oliphant Subject: Re: [Csv] RE: [Python-Dev] PEP 305 - CSV File API (fwd) Date: 01 Feb 2003 22:27:35 -0700 Size: 7775 Url: http://mail.python.org/pipermail/csv/attachments/20030202/1d774a74/attachment.mht From skip at pobox.com Sun Feb 2 19:31:07 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 12:31:07 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.25579.554586.898615@montanaro.dyndns.org> Pushing over to archive -------------- next part -------------- An embedded message was scrubbed... From: Jarek Zgoda Subject: Re: PEP 305 - CSV File API Date: Sun, 2 Feb 2003 07:42:32 +0000 (UTC) Size: 5335 Url: http://mail.python.org/pipermail/csv/attachments/20030202/36a3a238/attachment.mht From skip at pobox.com Sun Feb 2 19:32:18 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 12:32:18 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.25650.724660.838965@montanaro.dyndns.org> for the archives -------------- next part -------------- An embedded message was scrubbed... From: Dave Cole Subject: Re: PEP 305 - CSV File API Date: 02 Feb 2003 20:04:20 +1100 Size: 6567 Url: http://mail.python.org/pipermail/csv/attachments/20030202/9e36fe4f/attachment.mht From skip at pobox.com Mon Feb 3 00:20:10 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 17:20:10 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.42922.225695.699946@montanaro.dyndns.org> archive... -------------- next part -------------- An embedded message was scrubbed... From: Dave Cole Subject: Re: PEP 305 - CSV File API Date: 02 Feb 2003 20:40:34 +1100 Size: 6638 Url: http://mail.python.org/pipermail/csv/attachments/20030202/5c6d20a7/attachment.mht From skip at pobox.com Mon Feb 3 00:22:46 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 17:22:46 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.43078.350784.454286@montanaro.dyndns.org> archive... -------------- next part -------------- An embedded message was scrubbed... From: Ian Bicking Subject: Re: PEP 305 - CSV File API Date: 02 Feb 2003 04:01:42 -0600 Size: 7540 Url: http://mail.python.org/pipermail/csv/attachments/20030202/f7d53ca3/attachment.mht From skip at pobox.com Mon Feb 3 00:28:04 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 17:28:04 -0600 Subject: [Csv] Added code to implement quoting styles In-Reply-To: References: Message-ID: <15933.43396.469377.935888@montanaro.dyndns.org> Looks good. To avoid thinking of us having two ways to specify don't quote, I was thinking of quotechar as what to quote with if quoting isn't QUOTE_NEVER. Skip From skip at pobox.com Mon Feb 3 00:17:58 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 17:17:58 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.42790.923347.973573@montanaro.dyndns.org> More for the archives... -------------- next part -------------- An embedded message was scrubbed... From: Dave Cole Subject: Re: PEP 305 - CSV File API Date: 02 Feb 2003 20:25:38 +1100 Size: 7760 Url: http://mail.python.org/pipermail/csv/attachments/20030202/90188c61/attachment.mht From skip at pobox.com Mon Feb 3 03:13:09 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 20:13:09 -0600 Subject: [Csv] weird default dialects Message-ID: <15933.53301.891154.795964@montanaro.dyndns.org> I know the behavior is reasonable, but this code

    class Dialect:
        delimiter = ','
        quotechar = '"'
        escapechar = None
        doublequote = True
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = QUOTE_MINIMAL

    class excel(Dialect):
        pass

looks really weird to me. I'd prefer it if the Dialect class simply defined the various parameters, but gave them invalid values like None or NotImplemented and then have the excel class fill in its values:

    class Dialect:
        delimiter = None
        quotechar = None
        escapechar = None
        doublequote = None
        skipinitialspace = None
        lineterminator = None
        quoting = None

    class excel(Dialect):
        delimiter = ','
        quotechar = '"'
        escapechar = None
        doublequote = True
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = QUOTE_MINIMAL

I know that's a bit more verbose, but people probably shouldn't be able to use Dialect directly, and if they subclass incompletely from Dialect, I think they should get exceptions. If what they want is "just like Excel except ...", they shouldn't be able to get away with subclassing Dialect. They should have to subclass excel. I suggested NotImplemented as a possible default value because None *is* a valid value for at least one of the parameters. Make sense? Skip
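A sketch of the exception behaviour Skip is after (the _parameters tuple and the validate() method are hypothetical illustrations, not the sandbox code; ValueError stands in for the module's own exception):

    class Dialect:
        _parameters = ('delimiter', 'quotechar', 'escapechar',
                       'doublequote', 'skipinitialspace',
                       'lineterminator', 'quoting')

        # NotImplemented is the sentinel; None cannot be, because it
        # is a legitimate value for escapechar (and quotechar).
        delimiter = NotImplemented
        quotechar = NotImplemented
        escapechar = NotImplemented
        doublequote = NotImplemented
        skipinitialspace = NotImplemented
        lineterminator = NotImplemented
        quoting = NotImplemented

        def validate(self):
            for name in self._parameters:
                if getattr(self, name) is NotImplemented:
                    raise ValueError("dialect did not define %s" % name)

An incomplete subclass of Dialect then blows up at validate() time, while a subclass of excel inherits a complete set of parameters and can override just the one it cares about.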
From andrewm at object-craft.com.au Mon Feb 3 03:46:45 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Mon, 03 Feb 2003 13:46:45 +1100 Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv csv.py,1.4,1.5 In-Reply-To: Message from Skip Montanaro <15930.26577.952898.246807@montanaro.dyndns.org> References: <15930.26577.952898.246807@montanaro.dyndns.org> Message-ID: <20030203024645.70B6E3C1F4@coffee.object-craft.com.au> > andrew> Rename dialects from excel2000 to excel. Rename Error to be > andrew> CSVError. Explicitly fetch iterator in reader class, rather than > andrew> simply calling next() (which only works for self-iterators). > >Minor nit. I think Error was fine. That's the standard for most extension >modules. I would normally import csv then reference its objects through it. >csv.CSVError looks redundant to me. I'm not a "from csv import CSVError" >kind of guy however, so I can understand the desire to make the name more >explicit when considered alone. I'm inclined to agree, although "Error" tends to be a bit of a show-stopper for people who want to do "from csv import ..." Anyone object to me changing it back to "Error"? -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From skip at pobox.com Mon Feb 3 04:14:10 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 21:14:10 -0600 Subject: [Csv] Various changes Message-ID: <15933.56962.107343.919881@montanaro.dyndns.org> Folks, I made a number of changes this evening.

    * renamed set_dialect() to register_dialect()
    * defined the public API using csv.__all__
    * hid "dialects" and "OCcsv" with leading underscores so it's clear
      (even without __all__) that they are not part of the public API
    * added a first stab at a section for the library reference manual
    * added a couple conditional macro def'ns to _csv.c so it would
      compile using Python 2.2.2
    * added a few test cases for dialects and writing array.array objects

You might want to "csv up". ;-) Skip
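For illustration, a hedged sketch of how the renamed hook might be used (the pipes dialect is invented here; register_dialect and string dialect names are taken from the message above and PEP 305):

    import csv

    class pipes(csv.excel):
        # just like Excel, but pipe-delimited
        delimiter = '|'

    csv.register_dialect('pipes', pipes)

    for row in csv.reader(open('data.txt'), dialect='pipes'):
        print row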
From skip at pobox.com Mon Feb 3 04:23:30 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 21:23:30 -0600 Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv _csv.c,1.8,1.9 In-Reply-To: References: Message-ID: <15933.57522.334574.306027@montanaro.dyndns.org> dave> Modified Files: dave> _csv.c dave> Log Message: dave> Fixed refcount bug in constructor regarding lineterminator string. dave> Implemented lineterminator functionality - appends lineterminator dave> to end of joined record. Not sure what to do with \n which do not dave> match the lineterminator string... I'm not sure what you mean with that last sentence. Are you worried about distinguishing the line terminator from a hard return? Skip From skip at pobox.com Mon Feb 3 04:24:28 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 21:24:28 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.57580.49194.72218@montanaro.dyndns.org> for the archive... -------------- next part -------------- An embedded message was scrubbed... From: Roman Suzi Subject: Re: PEP 305 - CSV File API Date: Sun, 2 Feb 2003 13:54:57 +0300 (MSK) Size: 5250 Url: http://mail.python.org/pipermail/csv/attachments/20030202/a449ff4f/attachment.mht From skip at pobox.com Mon Feb 3 04:28:52 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 21:28:52 -0600 Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv _csv.c,1.9,1.10 In-Reply-To: References: Message-ID: <15933.57844.794060.305738@montanaro.dyndns.org> dave> Oops - forgot to check for '+-.' when quoting is QUOTE_NONNUMERIC. Looking at the code, I wonder whether, when quoting is set to NONNUMERIC, a single attempt to call PyFloat_FromString(field) should be made, with the result used to decide whether the field is numeric. (Not for performance, but for accuracy of the setting.) Skip
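In Python terms the suggestion amounts to something like this (a sketch only; the C code would call PyFloat_FromString rather than float()):

    def _is_numeric(field):
        # One parse attempt replaces the per-character check for
        # digits, '+', '-' and '.', and as a side effect it gets
        # cases like '1e3' or '.5e-2' right.
        try:
            float(field)
        except ValueError:
            return 0
        return 1

Under QUOTE_NONNUMERIC a field would then be quoted exactly when _is_numeric() returns false.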
From skip at pobox.com Mon Feb 3 04:29:35 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 21:29:35 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.57887.991404.553688@montanaro.dyndns.org> for the archive. -------------- next part -------------- An embedded message was scrubbed... From: Dave Cole Subject: Re: PEP 305 - CSV File API Date: 02 Feb 2003 23:46:26 +1100 Size: 7499 Url: http://mail.python.org/pipermail/csv/attachments/20030202/cf5c4281/attachment.mht From skip at pobox.com Mon Feb 3 04:32:39 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 21:32:39 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.58071.69450.29922@montanaro.dyndns.org> archive... -------------- next part -------------- An embedded message was scrubbed... From: Skip Montanaro Subject: Re: PEP 305 - CSV File API Date: Sat, 1 Feb 2003 20:41:59 -0600 Size: 5208 Url: http://mail.python.org/pipermail/csv/attachments/20030202/ca6223ad/attachment.mht From skip at pobox.com Mon Feb 3 04:35:25 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 21:35:25 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.58237.872242.701037@montanaro.dyndns.org> archive -------------- next part -------------- An embedded message was scrubbed... From: Andrew Dalke Subject: Re: PEP 305 - CSV File API Date: Sun, 02 Feb 2003 11:51:47 -0700 Size: 10225 Url: http://mail.python.org/pipermail/csv/attachments/20030202/450bddf7/attachment.mht From skip at pobox.com Mon Feb 3 04:38:24 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 21:38:24 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.58416.721311.156005@montanaro.dyndns.org> archive -------------- next part -------------- An embedded message was scrubbed... From: Ian Bicking Subject: Re: PEP 305 - CSV File API Date: 02 Feb 2003 15:01:04 -0600 Size: 6300 Url: http://mail.python.org/pipermail/csv/attachments/20030202/19030e85/attachment.mht From skip at pobox.com Mon Feb 3 04:39:28 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 21:39:28 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.58480.217758.128918@montanaro.dyndns.org> archive -------------- next part -------------- An embedded message was scrubbed... From: Alex Martelli Subject: Re: PEP 305 - CSV File API Date: Sun, 02 Feb 2003 22:41:02 GMT Size: 4159 Url: http://mail.python.org/pipermail/csv/attachments/20030202/f17c3392/attachment.mht From skip at pobox.com Mon Feb 3 04:40:01 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 21:40:01 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.58513.404485.292538@montanaro.dyndns.org> archive -------------- next part -------------- An embedded message was scrubbed... From: Dennis Lee Bieber Subject: Re: PEP 305 - CSV File API Date: Sun, 02 Feb 2003 13:55:21 -0800 Size: 6560 Url: http://mail.python.org/pipermail/csv/attachments/20030202/4e9b6fe0/attachment.mht From skip at pobox.com Mon Feb 3 04:40:23 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 21:40:23 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.58535.543168.298725@montanaro.dyndns.org> archive -------------- next part -------------- An embedded message was scrubbed... From: Ian Bicking Subject: Re: PEP 305 - CSV File API Date: 02 Feb 2003 17:07:44 -0600 Size: 5146 Url: http://mail.python.org/pipermail/csv/attachments/20030202/0273ee91/attachment.mht From skip at pobox.com Mon Feb 3 04:42:01 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 21:42:01 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15933.58633.201457.636414@montanaro.dyndns.org> archive -------------- next part -------------- An embedded message was scrubbed... From: Carlos Ribeiro Subject: Re: PEP 305 - CSV File API Date: Mon, 3 Feb 2003 00:22:52 +0000 Size: 5956 Url: http://mail.python.org/pipermail/csv/attachments/20030202/7a67e779/attachment.mht From andrewm at object-craft.com.au Mon Feb 3 04:51:02 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Mon, 03 Feb 2003 14:51:02 +1100 Subject: [Csv] csv.QUOTE_NEVER? In-Reply-To: Message from Dave Cole References: <15930.60672.18719.407166@montanaro.dyndns.org> Message-ID: <20030203035102.34B183C1F4@coffee.object-craft.com.au> >Skip> The three quoting constants are currently defined as >Skip> QUOTE_MINIMAL, QUOTE_ALL and QUOTE_NONNUMERIC. Didn't we decide >Skip> there would be a QUOTE_NEVER constant as well? > >I was going to define QUOTE_NEVER then realised that all you have to >do is set quotechar to None. Why add the effort of implementing two >ways to achieve the same thing. "quotechar" as None probably should be illegal in the new module, and the "quoting" parameter used exclusively. This would be consistent with the direction we've taken with other parameters. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Mon Feb 3 05:15:01 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Mon, 03 Feb 2003 15:15:01 +1100 Subject: [Csv] The writer class Message-ID: <20030203041501.3F6FC3C1F4@coffee.object-craft.com.au> We document this as a wrapper around a file-like object - I'd assumed it should be providing a file-like interface itself (in particular, I gave it a close() method, and was attempting to close the file when the destructor was called), but I now think this is wrong. I propose to remove the following code from the writer class:

    def close(self):
        self.fileobj.close()
        del self.fileobj

    def __del__(self):
        if hasattr(self, 'fileobj'):
            try:
                self.close()
            except:
                pass

Comments? I also noticed some negative comments regarding the choice of the name "write" for the method that writes fields. The comments essentially said that this method name is used by other classes where strings are being written. I agree - we probably should call it something like "writefields" or "write_fields". Comments? What should we call the "writelines" method (that accepts an iterable and writes multiple "lines") in this case? -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/
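A minimal sketch of the wrapper being discussed, using the row-oriented names the thread settles on below (the _csv.parser wiring is illustrative, not the sandbox code):

    import _csv

    class writer:
        # Wraps a file-like object without pretending to be one:
        # no close() and no __del__ - the caller owns the file.
        def __init__(self, fileobj, **options):
            self.fileobj = fileobj
            self.parser = _csv.parser(**options)

        def writerow(self, row):
            # parser.join() already appends the lineterminator
            self.fileobj.write(self.parser.join(row))

        def writerows(self, rows):
            for row in rows:
                self.writerow(row)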
From skip at pobox.com Mon Feb 3 06:03:29 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 2 Feb 2003 23:03:29 -0600 Subject: [Csv] The writer class In-Reply-To: <20030203041501.3F6FC3C1F4@coffee.object-craft.com.au> References: <20030203041501.3F6FC3C1F4@coffee.object-craft.com.au> Message-ID: <15933.63521.813128.412444@montanaro.dyndns.org> Andrew> .... I propose to remove the following code from the writer Andrew> class: ... Andrew> Comments? Agreed. This bothered me as well. Andrew> .... we probably should call it something like "writefields" or Andrew> "write_fields". Comments? Someone on c.l.py suggested writerow(s). I sort of liked that. As you noted about write(), both it and append() carry enough baggage from other usage. Andrew> What should we call the "writelines" method (that accepts an Andrew> iterable and writes multiple "lines") in this case? How about "writerow" for the singular and "writerows" for the plural? Skip From andrewm at object-craft.com.au Mon Feb 3 06:35:16 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Mon, 03 Feb 2003 16:35:16 +1100 Subject: [Csv] The writer class In-Reply-To: Message from Skip Montanaro <15933.63521.813128.412444@montanaro.dyndns.org> References: <20030203041501.3F6FC3C1F4@coffee.object-craft.com.au> <15933.63521.813128.412444@montanaro.dyndns.org> Message-ID: <20030203053516.2BD573C1F4@coffee.object-craft.com.au> > Andrew> .... I propose to remove the following code from the writer > Andrew> class: > ... > Andrew> Comments? > >Agreed. This bothered me as well. Done (damn, forgot to mention that in the check-in comment). > Andrew> .... we probably should call it something like "writefields" or > Andrew> "write_fields". Comments? > >Someone on c.l.py suggested writerow(s). I sort of liked that. As you >noted about write(), both it and append() carry enough baggage from >other usage. I like that. Done. > Andrew> What should we call the "writelines" method (that accepts an > Andrew> iterable and writes multiple "lines") in this case? > >How about "writerow" for the singular and "writerows" for the plural? Yep. Done. I've also changed CSVError back to just Error for the sake of consistency, if nothing else. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From skip at pobox.com Mon Feb 3 13:46:51 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 3 Feb 2003 06:46:51 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15934.25787.133963.848679@montanaro.dyndns.org> archive -------------- next part -------------- An embedded message was scrubbed... From: sjmachin at lexicon.net (John Machin) Subject: Re: PEP 305 - CSV File API Date: 3 Feb 2003 02:15:17 -0800 Size: 4476 Url: http://mail.python.org/pipermail/csv/attachments/20030203/338781c8/attachment.mht From skip at pobox.com Mon Feb 3 16:31:40 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 3 Feb 2003 09:31:40 -0600 Subject: [Csv] Re: PEP 305 - CSV File API (fwd) Message-ID: <15934.35676.989162.259027@montanaro.dyndns.org> archive -------------- next part -------------- An embedded message was scrubbed...
From: "John Roth" Subject: Re: PEP 305 - CSV File API Date: Mon, 3 Feb 2003 09:28:16 -0500 Size: 5305 Url: http://mail.python.org/pipermail/csv/attachments/20030203/f701eda4/attachment.mht From skip at pobox.com Mon Feb 3 16:38:04 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 3 Feb 2003 09:38:04 -0600 Subject: [Csv] Re: PEP 305 - CSV File API In-Reply-To: References: <9ng0h-nh3.ln1@beastie.ix.netcom.com> Message-ID: <15934.36060.434826.450769@montanaro.dyndns.org> I think I have the attributions right. Carlos> that happen to be problematic, and that are locale-related: Carlos> - reading dates from a CSV file JohnM> Certainly dates are a problem ... however, in what way is reading JohnM> dates from a CSV-format file any different to reading them from JohnM> any other format? JohnR> It's not particularly different. What is needed is the ability to JohnR> associate the necessary parameters with a date column to do the JohnR> application dependent "correct" transformation, based on the JohnR> available date libraries. I will note that the csv module under development makes *no* attempts at any kind of data conversion when reading CSV files. Even ints and floats are returned as strings. It's left up to the application programmer to perform type conversions. On output, the situation is similar. For some passing compatibility with the DB-API (which represents SQL NULL values as None), None is currently being written as the empty string (though this is perhaps still subject to change). Other than that, str() is simply called for all data being written to the file. Skip From skip at pobox.com Mon Feb 3 17:38:18 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 3 Feb 2003 10:38:18 -0600 Subject: [Csv] passing dialects directly - class or instance? Message-ID: <15934.39674.453212.506375@montanaro.dyndns.org> I thought users were supposed to pass dialect classes when not using strings. I see, however, that _OCcsv.__init__ calls isinstance() instead of issubclass(). Which is it supposed to be? Skip From skip at pobox.com Mon Feb 3 17:57:04 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 3 Feb 2003 10:57:04 -0600 Subject: [Csv] test coverage Message-ID: <15934.40800.760370.712420@montanaro.dyndns.org> Attached is the output of running a gcov-instrumented version of Python and the _csv module against the current test suite. FYI. Skip -------------- next part -------------- /* TODO: + Add reader() and writer() functions which return CSV reader/writer objects which implement the PEP interface: csvreader = csv.reader(file("blah.csv", "rb"), kwargs) for row in csvreader: process(row) csvwriter = csv.writer(file("some.csv", "wb"), kwargs) for row in someiter: csvwriter.write(row) + Add CsvWriter.writelines(someiter) */ #include "Python.h" #include "structmember.h" /* begin 2.2 compatibility macros */ #ifndef PyDoc_STRVAR /* Define macros for inline documentation. 
*/ #define PyDoc_VAR(name) static char name[] #define PyDoc_STRVAR(name,str) PyDoc_VAR(name) = PyDoc_STR(str) #ifdef WITH_DOC_STRINGS #define PyDoc_STR(str) str #else #define PyDoc_STR(str) "" #endif #endif /* ifndef PyDoc_STRVAR */ #ifndef PyMODINIT_FUNC # if defined(__cplusplus) # define PyMODINIT_FUNC extern "C" void # else /* __cplusplus */ # define PyMODINIT_FUNC void # endif /* __cplusplus */ #endif /* end 2.2 compatibility macros */ static PyObject *error_obj; /* CSV exception */ typedef enum { START_RECORD, START_FIELD, ESCAPED_CHAR, IN_FIELD, IN_QUOTED_FIELD, ESCAPE_IN_QUOTED_FIELD, QUOTE_IN_QUOTED_FIELD } ParserState; typedef enum { QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC, QUOTE_NONE } QuoteStyle; typedef struct { PyObject_HEAD int doublequote; /* is " represented by ""? */ char delimiter; /* field separator */ int have_quotechar; /* is a quotechar defined */ char quotechar; /* quote character */ int have_escapechar; /* is an escapechar defined */ char escapechar; /* escape character */ int skipinitialspace; /* ignore spaces following delimiter? */ PyObject *lineterminator; /* string to write between records */ QuoteStyle quoting; /* style of quoting to write */ ParserState state; /* current CSV parse state */ PyObject *fields; /* field list for current record */ int autoclear; /* should fields be cleared on next parse() after exception? */ int strict; /* raise exception on bad CSV */ int had_parse_error; /* did we have a parse error? */ char *field; /* build current field in here */ int field_size; /* size of allocated buffer */ int field_len; /* length of current field */ char *rec; /* buffer for parser.join */ int rec_size; /* size of allocated record */ int rec_len; /* length of record */ int num_fields; /* number of fields in record */ } ParserObj; staticforward PyTypeObject Parser_Type; static PyObject * raise_exception(char *fmt, ...) 
1 { 1 va_list ap; 1 char msg[512]; 1 PyObject *pymsg; 1 va_start(ap, fmt); #ifdef _WIN32 _vsnprintf(msg, sizeof(msg), fmt, ap); #else 1 vsnprintf(msg, sizeof(msg), fmt, ap); #endif 1 va_end(ap); 1 pymsg = PyString_FromString(msg); 1 PyErr_SetObject(error_obj, pymsg); 1 Py_XDECREF(pymsg); 1 return NULL; } static void parse_save_field(ParserObj *self) 39 { 39 PyObject *field; 39 field = PyString_FromStringAndSize(self->field, self->field_len); 39 if (field != NULL) { 39 PyList_Append(self->fields, field); 39 Py_XDECREF(field); } 39 self->field_len = 0; } static int parse_grow_buff(ParserObj *self) 12 { 12 if (self->field_size == 0) { 12 self->field_size = 4096; 12 self->field = PyMem_Malloc(self->field_size); } else { ###### self->field_size *= 2; ###### self->field = PyMem_Realloc(self->field, self->field_size); } 12 if (self->field == NULL) { ###### PyErr_NoMemory(); ###### return 0; } 12 return 1; } static void parse_add_char(ParserObj *self, char c) 192 { 192 if (self->field_len == self->field_size && !parse_grow_buff(self)) ###### return; 192 self->field[self->field_len++] = c; } static void parse_prepend_char(ParserObj *self, char c) ###### { ###### if (self->field_len == self->field_size && !parse_grow_buff(self)) ###### return; ###### memmove(self->field + 1, self->field, self->field_len); ###### self->field[0] = c; ###### self->field_len++; } static void parse_process_char(ParserObj *self, char c) 262 { 262 switch (self->state) { case START_RECORD: /* start of record */ 17 if (c == '\0') /* empty line - return [] */ ###### break; /* normal character - handle as START_FIELD */ 17 self->state = START_FIELD; /* fallthru */ case START_FIELD: /* expecting field */ 39 if (c == '\0') { /* save empty field - return [fields] */ 3 parse_save_field(self); 3 self->state = START_RECORD; } 36 else if (c == self->quotechar) { /* start quoted field */ 12 self->state = IN_QUOTED_FIELD; } 24 else if (c == self->escapechar) { /* possible escaped character */ ###### self->state = ESCAPED_CHAR; } 24 else if (c == self->delimiter) { /* save empty field */ 2 parse_save_field(self); } 22 else if (c == ' ' && self->skipinitialspace) /* ignore space at start of field */ ; else { /* begin new unquoted field */ 22 parse_add_char(self, c); 22 self->state = IN_FIELD; } 22 break; case ESCAPED_CHAR: ###### if (c != self->escapechar && c != self->delimiter && c != self->quotechar) ###### parse_add_char(self, self->escapechar); ###### parse_add_char(self, c); ###### self->state = IN_FIELD; ###### break; case IN_FIELD: /* in unquoted field */ 42 if (c == '\0') { /* end of line - return [fields] */ 8 parse_save_field(self); 8 self->state = START_RECORD; } 34 else if (c == self->escapechar) { /* possible escaped character */ ###### self->state = ESCAPED_CHAR; } 34 else if (c == self->delimiter) { /* save field - wait for new field */ 16 parse_save_field(self); 16 self->state = START_FIELD; } else { /* normal character - save in field */ 18 parse_add_char(self, c); } 18 break; case IN_QUOTED_FIELD: /* in quoted field */ 162 if (c == '\0') { /* end of line - save '\n' in field */ 2 parse_add_char(self, '\n'); } 160 else if (c == self->escapechar) { /* Possible escape character */ ###### self->state = ESCAPE_IN_QUOTED_FIELD; } 160 else if (c == self->quotechar) { 19 if (self->doublequote) { /* doublequote; " represented by "" */ 19 self->state = QUOTE_IN_QUOTED_FIELD; } else { /* end of quote part of field */ ###### self->state = IN_FIELD; } } else { /* normal character - save in field */ 141 parse_add_char(self, c); } 
141 break; case ESCAPE_IN_QUOTED_FIELD: ###### if (c != self->escapechar && c != self->delimiter && c != self->quotechar) ###### parse_add_char(self, self->escapechar); ###### parse_add_char(self, c); ###### self->state = IN_QUOTED_FIELD; ###### break; case QUOTE_IN_QUOTED_FIELD: /* doublequote - seen a quote in an quoted field */ 19 if (self->have_quotechar && c == self->quotechar) { /* save "" as " */ 7 parse_add_char(self, c); 7 self->state = IN_QUOTED_FIELD; } 12 else if (c == self->delimiter) { /* save field - wait for new field */ 4 parse_save_field(self); 4 self->state = START_FIELD; } 8 else if (c == '\0') { /* end of line - return [fields] */ 6 parse_save_field(self); 6 self->state = START_RECORD; } 2 else if (!self->strict) { 2 parse_add_char(self, c); 2 self->state = IN_FIELD; } else { /* illegal */ ###### self->had_parse_error = 1; ###### raise_exception("%c expected after %c", self->delimiter, self->quotechar); } break; } } static void clear_fields_and_status(ParserObj *self) ###### { ###### if (self->fields) { ###### Py_XDECREF(self->fields); } ###### self->fields = PyList_New(0); ###### self->field_len = 0; ###### self->state = START_RECORD; ###### self->had_parse_error = 0; } /* ---------------------------------------------------------------- */ PyDoc_STRVAR(Parser_parse_doc, "parse(s) -> list of strings\n" "\n" "CSV parse the single line in the string s and return a\n" "list of string fields. If the CSV record contains multi-line\n" "fields, the function will return None until all lines of the\n" "record have been parsed."); static PyObject * Parser_parse(ParserObj *self, PyObject *args) 19 { 19 char *line; 19 if (!PyArg_ParseTuple(args, "s", &line)) ###### return NULL; 19 if (self->autoclear && self->had_parse_error) ###### clear_fields_and_status(self); /* Process line of text - send '\0' to processing code to represent end of line. End of line which is not at end of string is an error. */ 262 while (*line) { 246 char c; 246 c = *line++; 246 if (c == '\r') { ###### c = *line++; ###### if (c == '\0') /* macintosh end of line */ ###### break; ###### if (c == '\n') { ###### c = *line++; ###### if (c == '\0') /* DOS end of line */ ###### break; } ###### self->had_parse_error = 1; ###### return raise_exception("newline inside string"); } 246 if (c == '\n') { 3 c = *line++; 3 if (c == '\0') /* unix end of line */ 3 break; ###### self->had_parse_error = 1; ###### return raise_exception("newline inside string"); } 243 parse_process_char(self, c); 243 if (PyErr_Occurred()) ###### return NULL; } 19 parse_process_char(self, '\0'); 19 if (self->state == START_RECORD) { 17 PyObject *fields = self->fields; 17 self->fields = PyList_New(0); 17 return fields; } 2 Py_INCREF(Py_None); 2 return Py_None; } /* ---------------------------------------------------------------- */ PyDoc_STRVAR(Parser_clear_doc, "clear() -> None\n" "\n" "Discard partially parsed record. This must be called to reset\n" "parser state after an exception."); static PyObject * Parser_clear(ParserObj *self) ###### { ###### clear_fields_and_status(self); ###### Py_INCREF(Py_None); ###### return Py_None; } /* ---------------------------------------------------------------- */ static void join_reset(ParserObj *self) 11 { 11 self->rec_len = 0; 11 self->num_fields = 0; } #define MEM_INCR 32768 /* Calculate new record length or append field to record. Return new * record length. 
*/ static int join_append_data(ParserObj *self, char *field, int quote_empty, int *quoted, int copy_phase) 270 { 270 int i, rec_len; 270 rec_len = self->rec_len; /* If this is not the first field we need a field separator. */ 270 if (self->num_fields > 0) { 248 if (copy_phase) 124 self->rec[rec_len] = self->delimiter; 248 rec_len++; } /* Handle preceding quote. */ 270 switch (self->quoting) { case QUOTE_ALL: ###### *quoted = 1; ###### if (copy_phase) ###### self->rec[rec_len] = self->quotechar; ###### rec_len++; ###### break; case QUOTE_MINIMAL: case QUOTE_NONNUMERIC: /* We only know about quoted in the copy phase. */ 270 if (copy_phase && *quoted) { 3 self->rec[rec_len] = self->quotechar; 3 rec_len++; } break; case QUOTE_NONE: 270 break; } /* Copy/count field data. */ 1090 for (i = 0;; i++) { 1090 char c = field[i]; 1090 if (c == '\0') 270 break; /* If in doublequote mode we escape quote chars with a * quote. */ 820 if (self->have_quotechar && c == self->quotechar && self->doublequote) { 4 if (copy_phase) 2 self->rec[rec_len] = self->quotechar; 4 *quoted = 1; 4 rec_len++; 816 } else if (self->quoting == QUOTE_NONNUMERIC && !*quoted && !(isdigit(c) || c == '+' || c == '-' || c == '.')) ###### *quoted = 1; /* Some special characters need to be escaped. If we have a * quote character switch to quoted field instead of escaping * individual characters. */ 820 if (!*quoted && (c == self->delimiter || c == self->escapechar || c == '\n' || c == '\r')) { 2 if (self->have_quotechar && self->quoting != QUOTE_NONE) 2 *quoted = 1; ###### else if (self->escapechar) { ###### if (copy_phase) ###### self->rec[rec_len] = self->escapechar; ###### rec_len++; } else { ###### raise_exception("delimiter must be quoted or escaped"); ###### return -1; } } /* Copy field character into record buffer. */ 820 if (copy_phase) 410 self->rec[rec_len] = c; 820 rec_len++; } /* If field is empty check if it needs to be quoted. */ 270 if (i == 0 && quote_empty && self->have_quotechar) ###### *quoted = 1; /* Handle final quote character on field. */ 270 if (*quoted) { 6 if (copy_phase) 3 self->rec[rec_len] = self->quotechar; else /* Didn't know about leading quote until we found it * necessary in field data - compensate for it now. 
*/ 3 rec_len++; 6 rec_len++; } 270 return rec_len; } static int join_check_rec_size(ParserObj *self, int rec_len) 146 { 146 if (rec_len > self->rec_size) { 11 if (self->rec_size == 0) { 11 self->rec_size = (rec_len / MEM_INCR + 1) * MEM_INCR; 11 self->rec = PyMem_Malloc(self->rec_size); } else { ###### char *old_rec = self->rec; ###### self->rec_size = (rec_len / MEM_INCR + 1) * MEM_INCR; ###### self->rec = PyMem_Realloc(self->rec, self->rec_size); ###### if (self->rec == NULL) ###### free(old_rec); } 11 if (self->rec == NULL) { ###### PyErr_NoMemory(); ###### return 0; } } 146 return 1; } static int join_append(ParserObj *self, char *field, int quote_empty) 135 { 135 int rec_len, quoted; 135 quoted = 0; 135 rec_len = join_append_data(self, field, quote_empty, "ed, 0); 135 if (rec_len < 0) ###### return 0; /* grow record buffer if necessary */ 135 if (!join_check_rec_size(self, rec_len)) ###### return 0; 135 self->rec_len = join_append_data(self, field, quote_empty, "ed, 1); 135 self->num_fields++; 135 return 1; } static int join_append_lineterminator(ParserObj *self) 11 { 11 int terminator_len; 11 terminator_len = PyString_Size(self->lineterminator); /* grow record buffer if necessary */ 11 if (!join_check_rec_size(self, self->rec_len + terminator_len)) ###### return 0; 11 memmove(self->rec + self->rec_len, PyString_AsString(self->lineterminator), terminator_len); 11 self->rec_len += terminator_len; 11 return 1; } static PyObject * join_string(ParserObj *self) 11 { 11 return PyString_FromStringAndSize(self->rec, self->rec_len); } PyDoc_STRVAR(Parser_join_doc, "join(sequence) -> string\n" "\n" "Construct a CSV record from a sequence of fields. Non-string\n" "elements will be converted to string."); static PyObject * Parser_join(ParserObj *self, PyObject *seq) 12 { 12 int len, i; 12 if (!PySequence_Check(seq)) 1 return raise_exception("sequence expected"); 11 len = PySequence_Length(seq); 11 if (len < 0) ###### return NULL; /* Join all fields in internal buffer. */ 11 join_reset(self); 146 for (i = 0; i < len; i++) { 135 PyObject *field; 135 int append_ok; 135 field = PySequence_GetItem(seq, i); 135 if (field == NULL) ###### return NULL; 135 if (PyString_Check(field)) { 59 append_ok = join_append(self, PyString_AsString(field), len == 1); 59 Py_DECREF(field); } 76 else if (field == Py_None) { ###### append_ok = join_append(self, "", len == 1); ###### Py_DECREF(field); } else { 76 PyObject *str; 76 str = PyObject_Str(field); 76 Py_DECREF(field); 76 if (str == NULL) ###### return NULL; 76 append_ok = join_append(self, PyString_AsString(str), len == 1); 76 Py_DECREF(str); } 135 if (!append_ok) ###### return NULL; } /* Add line terminator. 
*/ 11 if (!join_append_lineterminator(self)) ###### return 0; 11 return join_string(self); } static struct PyMethodDef Parser_methods[] = { { "parse", (PyCFunction)Parser_parse, METH_VARARGS, Parser_parse_doc }, { "clear", (PyCFunction)Parser_clear, METH_NOARGS, Parser_clear_doc }, { "join", (PyCFunction)Parser_join, METH_O, Parser_join_doc }, { NULL, NULL } }; static void Parser_dealloc(ParserObj *self) 30 { 30 if (self->field) 12 free(self->field); 30 Py_XDECREF(self->fields); 30 Py_XDECREF(self->lineterminator); 30 if (self->rec) 11 free(self->rec); 30 PyMem_DEL(self); } #define OFF(x) offsetof(ParserObj, x) static struct memberlist Parser_memberlist[] = { { "quotechar", T_CHAR, OFF(quotechar) }, { "delimiter", T_CHAR, OFF(delimiter) }, { "escapechar", T_CHAR, OFF(escapechar) }, { "skipinitialspace", T_INT, OFF(skipinitialspace) }, { "lineterminator", T_OBJECT, OFF(lineterminator) }, { "quoting", T_INT, OFF(quoting) }, { "doublequote", T_INT, OFF(doublequote) }, { "fields", T_OBJECT, OFF(fields) }, { "autoclear", T_INT, OFF(autoclear) }, { "strict", T_INT, OFF(strict) }, { "had_parse_error", T_INT, OFF(had_parse_error), RO }, { NULL } }; static PyObject * Parser_getattr(ParserObj *self, char *name) 48 { 48 PyObject *rv; 48 if ((strcmp(name, "quotechar") == 0 && !self->have_quotechar) || (strcmp(name, "escapechar") == 0 && !self->have_escapechar)) { ###### Py_INCREF(Py_None); ###### return Py_None; } 48 rv = PyMember_Get((char *)self, Parser_memberlist, name); 48 if (rv) ###### return rv; 48 PyErr_Clear(); 48 return Py_FindMethod(Parser_methods, (PyObject *)self, name); } static int _set_char_attr(char *attr, int *have_attr, PyObject *v) 60 { /* Special case for constructor - NULL == use default. */ 60 if (v == NULL) ###### return 0; 60 if (v == Py_None) { 30 *have_attr = 0; 30 *attr = 0; 30 return 0; } 30 else if (PyString_Check(v) && PyString_Size(v) == 1) { 30 *attr = PyString_AsString(v)[0]; 30 *have_attr = 1; 30 return 0; } else { ###### PyErr_BadArgument(); ###### return -1; } } static int Parser_setattr(ParserObj *self, char *name, PyObject *v) ###### { ###### if (v == NULL) { ###### PyErr_SetString(PyExc_AttributeError, "Cannot delete attribute"); ###### return -1; } ###### if (strcmp(name, "quotechar") == 0) ###### return _set_char_attr(&self->quotechar, &self->have_quotechar, v); ###### else if (strcmp(name, "escapechar") == 0) ###### return _set_char_attr(&self->escapechar, &self->have_escapechar, v); ###### else if (strcmp(name, "quoting") == 0 && PyInt_Check(v)) { ###### int n = PyInt_AsLong(v); ###### if (n < 0 || n > QUOTE_NONE) { ###### PyErr_BadArgument(); ###### return -1; } ###### if (n == QUOTE_NONE) ###### self->have_quotechar = 0; ###### self->quoting = n; ###### return 0; } ###### else if (strcmp(name, "lineterminator") == 0 && !PyString_Check(v)) { ###### PyErr_BadArgument(); ###### return -1; } else ###### return PyMember_Set((char *)self, Parser_memberlist, name, v); } static PyObject * csv_parser(PyObject *module, PyObject *args, PyObject *keyword_args); PyDoc_STRVAR(Parser_Type_doc, "CSV parser"); static PyTypeObject Parser_Type = { PyObject_HEAD_INIT(0) 0, /*ob_size*/ "_csv.parser", /*tp_name*/ sizeof(ParserObj), /*tp_basicsize*/ 0, /*tp_itemsize*/ /* methods */ (destructor)Parser_dealloc, /*tp_dealloc*/ (printfunc)0, /*tp_print*/ (getattrfunc)Parser_getattr, /*tp_getattr*/ (setattrfunc)Parser_setattr, /*tp_setattr*/ (cmpfunc)0, /*tp_compare*/ (reprfunc)0, /*tp_repr*/ 0, /*tp_as_number*/ 0, /*tp_as_sequence*/ 0, /*tp_as_mapping*/ (hashfunc)0, /*tp_hash*/ 
(ternaryfunc)0, /*tp_call*/ (reprfunc)0, /*tp_str*/ 0L, 0L, 0L, 0L, Parser_Type_doc }; PyDoc_STRVAR(csv_parser_doc, "parser(delimiter=',', quotechar='\"', escapechar=None,\n" " doublequote=1, lineterminator='\\r\\n', quoting='minimal',\n" " autoclear=1, strict=0) -> Parser\n" "\n" "Constructs a CSV parser object.\n" "\n" " delimiter\n" " Defines the character that will be used to separate\n" " fields in the CSV record.\n" "\n" " quotechar\n" " Defines the character used to quote fields that\n" " contain the field separator or newlines. If set to None\n" " special characters will be escaped using the escapechar.\n" "\n" " escapechar\n" " Defines the character used to escape special\n" " characters. Only used if quotechar is None.\n" "\n" " doublequote\n" " When True, quotes in a field must be doubled up.\n" "\n" " skipinitialspace\n" " When True spaces following the delimiter are ignored.\n" "\n" " lineterminator\n" " The string used to terminate records.\n" "\n" " quoting\n" " Controls the generation of quotes around fields when writing\n" " records. This is only used when quotechar is not None.\n" "\n" " autoclear\n" " When True, calling parse() will automatically call\n" " the clear() method if the previous call to parse() raised an\n" " exception during parsing.\n" "\n" " strict\n" " When True, the parser will raise an exception on\n" " malformed fields rather than attempting to guess the right\n" " behavior.\n"); static PyObject * csv_parser(PyObject *module, PyObject *args, PyObject *keyword_args) 30 { static char *keywords[] = { "quotechar", "delimiter", "escapechar", "skipinitialspace", "lineterminator", "quoting", "doublequote", "autoclear", "strict", NULL 30 }; 30 PyObject *quotechar, *escapechar; 30 ParserObj *self = PyObject_NEW(ParserObj, &Parser_Type); 30 if (self == NULL) ###### return NULL; 30 self->quotechar = '"'; 30 self->have_quotechar = 1; 30 self->delimiter = ','; 30 self->escapechar = '\0'; 30 self->have_escapechar = 0; 30 self->skipinitialspace = 0; 30 self->lineterminator = NULL; 30 self->quoting = QUOTE_MINIMAL; 30 self->doublequote = 1; 30 self->autoclear = 1; 30 self->strict = 0; 30 self->state = START_RECORD; 30 self->fields = PyList_New(0); 30 if (self->fields == NULL) { ###### Py_DECREF(self); ###### return NULL; } 30 self->had_parse_error = 0; 30 self->field = NULL; 30 self->field_size = 0; 30 self->field_len = 0; 30 self->rec = NULL; 30 self->rec_size = 0; 30 self->rec_len = 0; 30 self->num_fields = 0; 30 quotechar = escapechar = NULL; 30 if (PyArg_ParseTupleAndKeywords(args, keyword_args, "|OcOiSiiii", keywords, "echar, &self->delimiter, &escapechar, &self->skipinitialspace, &self->lineterminator, &self->quoting, &self->doublequote, &self->autoclear, &self->strict) && !_set_char_attr(&self->quotechar, &self->have_quotechar, quotechar) && !_set_char_attr(&self->escapechar, &self->have_escapechar, escapechar)) { 30 if (self->lineterminator == NULL) ###### self->lineterminator = PyString_FromString("\r\n"); else { 30 Py_INCREF(self->lineterminator); } 30 if (self->quoting < 0 || self->quoting > QUOTE_NONE) ###### PyErr_SetString(PyExc_ValueError, "bad quoting value"); else { 30 if (self->quoting == QUOTE_NONE) ###### self->have_quotechar = 0; 30 else if (!self->have_quotechar) ###### self->quoting = QUOTE_NONE; 30 return (PyObject*)self; } } ###### Py_DECREF(self); ###### return NULL; } static struct PyMethodDef csv_methods[] = { { "parser", (PyCFunction)csv_parser, METH_VARARGS | METH_KEYWORDS, csv_parser_doc }, { NULL, NULL } }; PyDoc_STRVAR(csv_module_doc, 
"This module provides class for performing CSV parsing and writing.\n" "\n" "The CSV parser object (returned by the parser() function) supports the\n" "following methods:\n" " clear()\n" " Discards all fields parsed so far. If autoclear is set to\n" " zero. You should call this after a parser exception.\n" "\n" " parse(string) -> list of strings\n" " Extracts fields from the (partial) CSV record in string.\n" " Trailing end of line characters are ignored, so you do not\n" " need to strip the string before passing it to the parser. If\n" " you pass more than a single line of text, a _csv.Error\n" " exception will be raised.\n" "\n" " join(sequence) -> string\n" " Construct a CSV record from a sequence of fields. Non-string\n" " elements will be converted to string.\n" "\n" "Typical usage:\n" "\n" " import _csv\n" " p = _csv.parser()\n" " fp = open('afile.csv', 'U')\n" " for line in fp:\n" " fields = p.parse(line)\n" " if not fields:\n" " # multi-line record\n" " continue\n" " # process the fields\n"); PyMODINIT_FUNC init_csv(void) 1 { 1 PyObject *mod; 1 PyObject *dict; 1 PyObject *rev; 1 if (PyType_Ready(&Parser_Type) < 0) ###### return; /* Create the module and add the functions */ 1 mod = Py_InitModule3("_csv", csv_methods, csv_module_doc); 1 if (mod == NULL) ###### return; /* Add version to the module. */ 1 dict = PyModule_GetDict(mod); 1 if (dict == NULL) ###### return; 1 rev = PyString_FromString("1.0"); 1 if (rev == NULL) ###### return; 1 if (PyDict_SetItemString(dict, "__version__", rev) < 0) ###### return; /* Add the CSV exception object to the module. */ 1 error_obj = PyErr_NewException("_csv.Error", NULL, NULL); 1 if (error_obj == NULL) ###### return; 1 PyDict_SetItemString(dict, "Error", error_obj); 1 Py_XDECREF(rev); 1 Py_XDECREF(error_obj); } From skip at pobox.com Mon Feb 3 21:23:46 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon, 3 Feb 2003 14:23:46 -0600 Subject: [Csv] Something's fishy... Message-ID: <15934.53202.757331.426317@montanaro.dyndns.org> I rearranged the dialect initialization stuff a bit earlier today, then started getting segfaults from the interpreter. From gdb I was able to track it down to what looked like an invalid keyword_args parameter passed to csv_parser(): (gdb) p args $1 = (PyObject *) 0x418030 (gdb) pyo args object : () type : tuple refcount: 1911 address : 0x418030 $2 = void (gdb) p keyword_args $3 = (PyObject *) 0xb3b0c0 (gdb) pyo keyword_args object : {'delimiter': ',', 'escapechar': None, 'lineterminator': Program received signal EXC_BAD_INSTRUCTION, Illegal instruction/operand. 0x0067d940 in ?? () The program being debugged was signaled while in a function called from GDB. GDB remains in the frame where the signal was received. To change this behavior use "set unwindonsignal on" Evaluation of the expression containing the function (_PyObject_Dump) will be abandoned. pyo is a user-defined gdb command: define pyo print _PyObject_Dump($arg0) end Figuring there was maybe something wrong there, I stuck a print statement in csv.py:_OCcsv:__init__ just before the last line of the method: print ">>", parser_options The segfault went away. 
I was left with a lot of output and one error:

% python test_csv.py
>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
..>> {'delimiter': '\t', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
...E>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': '\\', 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 3, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': '\\', 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 3, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': '\\', 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 2, 'doublequote': True}
.>> {'delimiter': ',', 'escapechar': '\\', 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 2, 'doublequote': True}
.
======================================================================
ERROR: test_register (__main__.TestDialects)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_csv.py", line 200, in test_register
    csv.unregister_dialect("myexceltsv")
AttributeError: 'module' object has no attribute 'unregister_dialect'

----------------------------------------------------------------------
Ran 38 tests in 0.314s

FAILED (errors=1)

I added the missing function. Everything seems fine once again. I'm
checking in what I have, but _csv.c should probably be carefully inspected
to see if there's an argument out of place, a missing INCREF, an array
bounds violation, or something similar. Errors don't just magically go
away. Whatever was wrong is still wrong, just hiding.

Skip

From skip at pobox.com  Mon Feb 3 21:52:05 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 14:52:05 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15934.54901.429816.726580@montanaro.dyndns.org>

archive

-------------- next part --------------
An embedded message was scrubbed...
From: sjmachin at lexicon.net (John Machin)
Subject: Re: PEP 305 - CSV File API
Date: 3 Feb 2003 12:12:50 -0800
Size: 5533
Url: http://mail.python.org/pipermail/csv/attachments/20030203/63ca6a7f/attachment.mht

From skip at pobox.com  Mon Feb 3 21:53:03 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 14:53:03 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15934.54959.153280.689174@montanaro.dyndns.org>

archive

-------------- next part --------------
An embedded message was scrubbed...
From: sjmachin at lexicon.net (John Machin)
Subject: Re: PEP 305 - CSV File API
Date: 3 Feb 2003 12:33:25 -0800
Size: 6295
Url: http://mail.python.org/pipermail/csv/attachments/20030203/331df17e/attachment.mht

From andrewm at object-craft.com.au  Mon Feb 3 23:57:48 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Tue, 04 Feb 2003 09:57:48 +1100
Subject: [Csv] passing dialects directly - class or instance?
In-Reply-To: Message from Skip Montanaro <15934.39674.453212.506375@montanaro.dyndns.org>
References: <15934.39674.453212.506375@montanaro.dyndns.org>
Message-ID: <20030203225748.766A63C1F4@coffee.object-craft.com.au>

>I thought users were supposed to pass dialect classes when not using
>strings. I see, however, that _OCcsv.__init__ calls isinstance() instead of
>issubclass(). Which is it supposed to be?

An instance, I think - the PEP needs to be updated.

The code started to look really messy when I allowed it to accept either a
class, an instance, or a string. It seemed a small cost to lose the
"class" option. Thoughts?

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Tue Feb 4 00:00:21 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Tue, 04 Feb 2003 10:00:21 +1100
Subject: [Csv] test coverage
In-Reply-To: Message from Skip Montanaro <15934.40800.760370.712420@montanaro.dyndns.org>
References: <15934.40800.760370.712420@montanaro.dyndns.org>
Message-ID: <20030203230021.61E103C1F4@coffee.object-craft.com.au>

>Attached is the output of running a gcov-instrumented version of Python and
>the _csv module against the current test suite.

That's very handy - I'll have to build a gcov python for myself.

My thought was there would be dialect tests, and a set of tests for the
underlying module. The underlying module tests would probably be in a
better position to get coverage.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Tue Feb 4 00:38:14 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 17:38:14 -0600
Subject: [Csv] Re: Something's fishy...
Message-ID: <15934.64870.901035.20516@montanaro.dyndns.org>

FYI, after another little bit of dialect reshuffling, the segfault is back:

    % python test_csv.py
    ESegmentation fault

Skip

From skip at pobox.com  Tue Feb 4 05:00:20 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 22:00:20 -0600
Subject: [Csv] Re: PEP 305 - CSV File API
In-Reply-To: <200302040146.27744.cribeiro@mail.inet.com.br>
References: <200302040146.27744.cribeiro@mail.inet.com.br>
Message-ID: <15935.15060.878910.808643@montanaro.dyndns.org>

Carlos> The problem is, almost all my intermediate files have both
Carlos> 'date' and 'float' columns. This is highly common in business,
Carlos> especially if you are looking at sales figures and stuff like
Carlos> that.

Carlos> To compound my problem, Python writes floats with a period (.)
Carlos> as a decimal separator. However, my copy of Excel is configured
Carlos> for the Brazilian locale, and it expects a comma (,) as the
Carlos> decimal separator.

Can't you simply set the locale in your scripts so Python and Excel agree?

Carlos> Now for the real issue. If I convert my floats to strings
Carlos> *before* writing the CSV file, it will end up quoted (for
Carlos> example, '3,1416') - assuming that the CSV library will work as
Carlos> Skip said.
Carlos> This is not what I would expect, and in fact, it's not what
Carlos> anyone working with different locale settings would say.

It would only be quoted if you had comma as the delimiter or had set the
quoting parameter to QUOTE_ALWAYS. What delimiter do you use in your CSV
files?

Carlos> Last, even if Python just wrote floats with the 'right' decimal
Carlos> separator - comma, in my case - there still would be other
Carlos> software packages that would expect to get periods.

How would you like us to handle this? Sounds like a case of being "damned
if we do, damned if we don't".

Carlos> Or worse, I could try to send my data files to people in other
Carlos> countries that would be unable to read it. In any event, there
Carlos> is no automatic solution, but the ability to quickly adjust the
Carlos> CSV library to get the correct behavior would be highly useful.

We have to come back to the fundamental issue that CSV files as commonly
understood contain no data type information. It's possible that type
information could be passed in during write operations which would govern
the way the data is formatted when written. (We've discussed it, but it's
not likely to be in the first release.)

Even if we solve the formatting issue, once the data is written out to the
file, if you ship it out of your locale, no information remains in the
file to indicate that 3,1416 is a number instead of a string containing
digits and a comma. Similarly, if you choose to write dates out in an
ambiguous format, at the receiving end, the reader won't be able to tell
what date "02/03/03" represents.

Skip

From skip at pobox.com  Tue Feb 4 04:09:18 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 21:09:18 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15935.11998.292771.881378@montanaro.dyndns.org>

archive

-------------- next part --------------
An embedded message was scrubbed...
From: Carlos Ribeiro
Subject: Re: PEP 305 - CSV File API
Date: Tue, 4 Feb 2003 01:46:27 +0000
Size: 7000
Url: http://mail.python.org/pipermail/csv/attachments/20030203/94e8d73d/attachment.mht

From skip at pobox.com  Tue Feb 4 05:04:13 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 22:04:13 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15935.15293.42587.634709@montanaro.dyndns.org>

archive

-------------- next part --------------
An embedded message was scrubbed...
From: Carlos Ribeiro
Subject: Re: PEP 305 - CSV File API
Date: Tue, 4 Feb 2003 02:18:03 +0000
Size: 5135
Url: http://mail.python.org/pipermail/csv/attachments/20030203/d37d2220/attachment.mht

From skip at pobox.com  Tue Feb 4 06:05:31 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 23:05:31 -0600
Subject: [Csv] There's definitely something going on
Message-ID: <15935.18971.504878.413266@montanaro.dyndns.org>

A version of Python from CVS configured using --with-pydebug complains
mightily about the test suite.
Here are some messages:

*** malloc[7685]: Deallocation of a pointer not malloced: 0x437008; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x5ef008; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x5f8008; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x601008; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
..*** malloc[7685]: Deallocation of a pointer not malloced: 0x498318; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
..*** malloc[7685]: Deallocation of a pointer not malloced: 0x499338; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x49a358; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x49b378; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x60a008; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
..*** malloc[7685]: Deallocation of a pointer not malloced: 0x49c398; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x613008; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x49d3b8; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x49e3d8; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x49f3f8; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x4a0418; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x61c008; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x4a1438; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x625008; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x62e008; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x4a2458; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
....*** malloc[7685]: Deallocation of a pointer not malloced: 0x4a3478; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
..*** malloc[7685]: Deallocation of a pointer not malloced: 0x637008; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.....*** malloc[7685]: Deallocation of a pointer not malloced: 0x640008; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x649008; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x4a4698; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x4a56b8; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.*** malloc[7685]: Deallocation of a pointer not malloced: 0x652008; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
.

Setting MallocHelp displayed a little help about other settable Malloc
variables, but nothing gave any more useful info.

I had a version of _csv.c which exported the QUOTE_* constants (safer than
defining it in two places I think). That barfed as well, though with a
negative reference count trying to (I think) set the lineterminator
attribute of a Dialect instance.

I'm going to take another look in the morning.

Skip

From andrewm at object-craft.com.au  Tue Feb 4 07:07:57 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Tue, 04 Feb 2003 17:07:57 +1100
Subject: [Csv] Re: Something's fishy...
In-Reply-To: Message from Skip Montanaro <15934.64870.901035.20516@montanaro.dyndns.org>
References: <15934.64870.901035.20516@montanaro.dyndns.org>
Message-ID: <20030204060757.8EE953CA89@coffee.object-craft.com.au>

>FYI, after another little bit of dialect reshuffling, the segfault is back:
>
>    % python test_csv.py
>    ESegmentation fault

Of course, it's working fine here (isn't that always the way).

The source is most likely the C module - what I'd suggest you do is try a
dummy replacement. Presuming that stops the exception, then start
exercising _csv's interface, bit by bit (create parser object, create
parser object with options, etc).

I'll build a version of python with pydebug on.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Tue Feb 4 07:26:53 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Tue, 04 Feb 2003 17:26:53 +1100
Subject: [Csv] Re: Something's fishy...
In-Reply-To: Your message of "Tue, 04 Feb 2003 17:07:57 +1100." <20030204060757.8EE953CA89@coffee.object-craft.com.au>
Message-ID: <20030204062653.60B8A3CA92@coffee.object-craft.com.au>

>I'll build a version of python with pydebug on.

Do I need to do any more than this?

    $ python2.3-pydebug test_csv.py
    ......................................
    ----------------------------------------------------------------------
    Ran 38 tests in 0.066s

    OK
    [10916 refs]
    $

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From ianb at colorstudy.com  Tue Feb 4 09:07:04 2003
From: ianb at colorstudy.com (Ian Bicking)
Date: Tue, 4 Feb 2003 02:07:04 -0600
Subject: [Csv] Re: PEP 305 - CSV File API
In-Reply-To: <15935.15060.878910.808643@montanaro.dyndns.org>
Message-ID:

On Monday, February 3, 2003, at 10:00 PM, Skip Montanaro wrote:
> We have to come back to the fundamental issue that CSV files as
> commonly understood contain no data type information. It's possible
> that type information could be passed in during write operations which
> would govern the way the data is formatted when written. (We've
> discussed it, but it's not likely to be in the first release.)

I think plain strings should be the basic implementation. I see two ways
to provide specialization: for most cases you'd use wrappers, like a
reader that uses the first row as column names. You could even do some
type conversion that way, but the exception would be a place where you
wanted to distinguish between:

    "1","Bob"
    1,"Bob"

A wrapper could potentially handle some conversion, e.g., a CSV reader
from Webware reads column headers like "id:int", and then converts that
column to an integer. Or it could try to convert everything, and those
that fail get left as strings.

I guess the alternatives I see for dealing more directly with quotes
would be (a) having an option to return the string complete with quotes,
and force quotes in the output or (b) if the reader/writer was
implemented with some sort of class interface, a subclass could override
the hypothetical quote/unquote methods.

Except for the quoting issue, I think all other customizations would best
be done with wrappers anyway. You can't magically get locale information
into the file, or any other indication of how to handle the file --
providing a robust reader is the best you can do.

Ian

From skip at pobox.com  Tue Feb 4 14:17:41 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 07:17:41 -0600
Subject: [Csv] Re: Something's fishy...
In-Reply-To: <20030204062653.60B8A3CA92@coffee.object-craft.com.au>
References: <20030204060757.8EE953CA89@coffee.object-craft.com.au> <20030204062653.60B8A3CA92@coffee.object-craft.com.au>
Message-ID: <15935.48501.938077.703028@montanaro.dyndns.org>

Andrew> Do I need to do any more than this?

I don't believe so. Did you cvs up your Python tree?

Maybe it's not us at all.

Skip

From skip at pobox.com  Tue Feb 4 14:19:27 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 07:19:27 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15935.48607.467509.289945@montanaro.dyndns.org>

archive

-------------- next part --------------
An embedded message was scrubbed...
From: Dennis Lee Bieber
Subject: Re: PEP 305 - CSV File API
Date: Mon, 03 Feb 2003 20:56:40 -0800
Size: 5016
Url: http://mail.python.org/pipermail/csv/attachments/20030204/09e9e76c/attachment.mht

From skip at pobox.com  Tue Feb 4 14:25:24 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 07:25:24 -0600
Subject: [Csv] Re: PEP 305 - CSV File API
In-Reply-To:
References: <15935.15060.878910.808643@montanaro.dyndns.org>
Message-ID: <15935.48964.695258.758710@montanaro.dyndns.org>

Ian> Or it could try to convert everything, and those that fail get left
Ian> as strings.

I think that would be a disaster. What if data in one column consisted of
hex numbers? Some would be evaluated as numbers, others left as strings.
The programmer would have to defend against that. The only way I see to
reliably ask the csv module to convert data is to provide a list of type
converters which take a single string as an argument. There are
performance implications of this approach, so it can't be the default.

One reason I've used Object Craft's csv module up to now is that it's
written in C and is at minimum 5-10x faster than the other options
available. I routinely read and write 5-10MB CSV files, so I'm sensitive
to performance degradation.

Skip

From skip at pobox.com  Tue Feb 4 15:55:24 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 08:55:24 -0600
Subject: [Csv] RE: PEP 305 - CSV File API (fwd)
Message-ID: <15935.54364.997807.967672@montanaro.dyndns.org>

archive

-------------- next part --------------
An embedded message was scrubbed...
From: Simon Brunning
Subject: RE: PEP 305 - CSV File API
Date: Tue, 4 Feb 2003 14:27:50 -0000
Size: 5396
Url: http://mail.python.org/pipermail/csv/attachments/20030204/9890e58d/attachment.mht

From skip at pobox.com  Tue Feb 4 15:56:03 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 08:56:03 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15935.54403.326360.118880@montanaro.dyndns.org>

archive

-------------- next part --------------
An embedded message was scrubbed...
From: Carlos Ribeiro
Subject: Re: PEP 305 - CSV File API
Date: Tue, 4 Feb 2003 14:38:49 +0000
Size: 4819
Url: http://mail.python.org/pipermail/csv/attachments/20030204/e759d7fd/attachment.mht

From skip at pobox.com  Tue Feb 4 16:02:29 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 09:02:29 -0600
Subject: [Csv] bug fixed
Message-ID: <15935.54789.655700.175234@montanaro.dyndns.org>

I should have paid closer attention to the errors malloc was giving me:

    *** malloc[1859]: Deallocation of a pointer not malloced: 0x4aa5e8;
    This could be a double free(), or free() called with the middle of an
    allocated block; Try setting environment variable MallocHelp to see
    tools to help debug

especially the bit about "free() called with the middle of an allocated
block". Memory allocated with PyMem_Malloc() was being freed with free().
Since Python now uses its own custom allocator layered on top of malloc,
those calls really need to be balanced. I've no idea what free() on the
platforms you were using was doing (maybe ignoring, maybe scribbling), but
thankfully free() on my Mac OS X machine complained.

Please "cvs up".

BTW, the Object Craft csv module has the same problem. Time to release
1.1? ;-)

Cheers,

Skip

From skip at pobox.com  Tue Feb 4 17:37:11 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 10:37:11 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15935.60471.601265.17928@montanaro.dyndns.org>

archive

-------------- next part --------------
An embedded message was scrubbed...
From: Carlos Ribeiro
Subject: Re: PEP 305 - CSV File API
Date: Tue, 4 Feb 2003 01:10:43 +0000
Size: 7825
Url: http://mail.python.org/pipermail/csv/attachments/20030204/9816f639/attachment.mht

From andrewm at object-craft.com.au  Tue Feb 4 23:08:15 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 09:08:15 +1100
Subject: [Csv] Re: Something's fishy...
In-Reply-To: Message from Skip Montanaro <15935.48501.938077.703028@montanaro.dyndns.org>
References: <20030204060757.8EE953CA89@coffee.object-craft.com.au> <20030204062653.60B8A3CA92@coffee.object-craft.com.au> <15935.48501.938077.703028@montanaro.dyndns.org>
Message-ID: <20030204220815.5509A3CA92@coffee.object-craft.com.au>

> Andrew> Do I need to do any more than this?
>
>I don't believe so. Did you cvs up your Python tree?

I did.

>Maybe it's not us at all.

That's what I'm wondering. Are you testing on an x86 platform (the
endianness could affect obscure bugs)?

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Tue Feb 4 23:36:36 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 09:36:36 +1100
Subject: [Csv] bug fixed
In-Reply-To: Message from Skip Montanaro <15935.54789.655700.175234@montanaro.dyndns.org>
References: <15935.54789.655700.175234@montanaro.dyndns.org>
Message-ID: <20030204223636.850A93CA92@coffee.object-craft.com.au>

>especially the bit about "free() called with the middle of an allocated
>block". Memory allocated with PyMem_Malloc() was being freed with free().
>Since Python now uses its own custom allocator layered on top of malloc,
>those calls really need to be balanced. I've no idea what free() on the
>platforms you were using was doing (maybe ignoring, maybe scribbling), but
>thankfully free() on my Mac OS X machine complained.

It could even be something that endianness triggered, but most likely a
different malloc library.
I guess you've answered my previous question (re x86).

If I remember correctly, Python 2.3 and Python 2.2 are very different with
regard to memory allocation - most of our testing has been done with 2.2.
PyMem_Malloc is a thin layer on top of the system malloc, is it not?

>BTW, the Object Craft csv module has the same problem. Time to release 1.1?
>;-)

Funnily enough, I had a weird heap corruption problem in a python
application that used csv a while back - because csv was the only
extension module I was using, I immediately assumed csv was the source,
and spent many hours trying to find a mishandled memory allocation. I
eventually decided the problem was elsewhere (can't remember details).
Looks like it might have been csv after all.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Tue Feb 4 23:41:03 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 16:41:03 -0600
Subject: [Csv] Re: Something's fishy...
In-Reply-To: <20030204220815.5509A3CA92@coffee.object-craft.com.au>
References: <20030204060757.8EE953CA89@coffee.object-craft.com.au> <20030204062653.60B8A3CA92@coffee.object-craft.com.au> <15935.48501.938077.703028@montanaro.dyndns.org> <20030204220815.5509A3CA92@coffee.object-craft.com.au>
Message-ID: <15936.16767.338077.255345@montanaro.dyndns.org>

>> Maybe it's not us at all.

Andrew> That's what I'm wondering. Are you testing on an x86 platform
Andrew> (the endianness could affect obscure bugs)?

Well, it was us. ;-) At any rate, my day-to-day computer is a Ti Powerbook
running Mac OS X. Hopefully that's different enough from what you all run
so we get reasonable platform coverage.

Skip

From skip at pobox.com  Tue Feb 4 23:56:58 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 16:56:58 -0600
Subject: [Csv] bug fixed
In-Reply-To: <20030204223636.850A93CA92@coffee.object-craft.com.au>
References: <15935.54789.655700.175234@montanaro.dyndns.org> <20030204223636.850A93CA92@coffee.object-craft.com.au>
Message-ID: <15936.17722.20563.750609@montanaro.dyndns.org>

Andrew> If I remember correctly, Python 2.3 and Python 2.2 are very
Andrew> different with regard to memory allocation - most of our testing
Andrew> has been done with 2.2. PyMem_Malloc is a thin layer on top of
Andrew> the system malloc, is it not?

No, it's the portal into PyMalloc, an efficient, generic small block
allocator.

Andrew> Funnily enough, I had a weird heap corruption problem in a
Andrew> python application that used csv a while back - because csv was
Andrew> the only extension module I was using, I immediately assumed csv
Andrew> was the source, and spent many hours trying to find a
Andrew> mishandled memory allocation. I eventually decided the problem
Andrew> was elsewhere (can't remember details). Looks like it might
Andrew> have been csv after all.

And the behavior will be different with different mallocs. free() on Mac
OS X is apparently smart enough to realize it was being handed bad memory
and refused to really free() it. Other free()'s might blindly charge
ahead, corrupting PyMalloc's memory. In my case, I think it mostly just
caused memory leaks because those chunks were never freed as far as
PyMalloc was concerned.

Looking at .../include/python2.N/pymem.h, it looks like PyMem_Malloc and
PyMem_Free have needed to be paired up for a while whenever PyMalloc was
enabled.
From 2.1/2.2:

    extern DL_IMPORT(void *) PyMem_Malloc(size_t);
    extern DL_IMPORT(void *) PyMem_Realloc(void *, size_t);
    extern DL_IMPORT(void) PyMem_Free(void *);

From 2.3:

    PyAPI_FUNC(void *) PyMem_Malloc(size_t);
    PyAPI_FUNC(void *) PyMem_Realloc(void *, size_t);
    PyAPI_FUNC(void) PyMem_Free(void *);

The difference between 2.1/2.2 and 2.3 is that PyMalloc is enabled by
default in 2.3.

Skip

From andrewm at object-craft.com.au  Wed Feb 5 00:16:22 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 10:16:22 +1100
Subject: [Csv] bug fixed
In-Reply-To: Message from Skip Montanaro <15936.17722.20563.750609@montanaro.dyndns.org>
References: <15935.54789.655700.175234@montanaro.dyndns.org> <20030204223636.850A93CA92@coffee.object-craft.com.au> <15936.17722.20563.750609@montanaro.dyndns.org>
Message-ID: <20030204231622.BE98E3CA92@coffee.object-craft.com.au>

>No, it's the portal into PyMalloc, an efficient, generic small block
>allocator.

But only when python is built with --enable-pymalloc - this only became
the default in 2.3 - prior to that, the default was to use the system
malloc.

>From 2.3:
>
>    PyAPI_FUNC(void *) PyMem_Malloc(size_t);
>    PyAPI_FUNC(void *) PyMem_Realloc(void *, size_t);
>    PyAPI_FUNC(void) PyMem_Free(void *);
>
>The difference between 2.1/2.2 and 2.3 is that PyMalloc is enabled by
>default in 2.3.

And not enabled by default in 2.2... which is approximately the point I
was trying to make in my previous e-mail... 8-)

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Feb 5 00:36:48 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 10:36:48 +1100
Subject: [Csv] QUOTE_* constants
In-Reply-To: Message from Skip Montanaro <15935.18971.504878.413266@montanaro.dyndns.org>
References: <15935.18971.504878.413266@montanaro.dyndns.org>
Message-ID: <20030204233648.BFCD83CA92@coffee.object-craft.com.au>

>I had a version of _csv.c which exported the QUOTE_* constants (safer than
>defining it in two places I think). That barfed as well, though with a
>negative reference count trying to (I think) set the lineterminator
>attribute of a Dialect instance.
>
>I'm going to take another look in the morning.

I definitely think this change is a good idea - let me know if you need a
hand to make it work.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Feb 5 02:52:10 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 12:52:10 +1100
Subject: [Csv] _csv bug
Message-ID: <20030205015210.2714C3CA92@coffee.object-craft.com.au>

This is okay:

    >>> p=_csv.parser()
    >>> p.join(['1','2','3,4'])
    '1,2,"3,4"\r\n'
    >>> p=_csv.parser()

As is this:

    >>> p=_csv.parser()
    >>> p.quotechar=None
    >>> p.join(['1','2','3,4'])
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    _csv.Error: delimiter must be quoted or escaped

But this is broken:

    >>> p=_csv.parser(quotechar=None)
    >>> p.quoting
    3
    >>> p.quoting=1
    >>> p.join(['1','2','3,4'])
    '\x001\x00,\x002\x00,\x003,4\x00\r\n'

The obvious fix is to add an additional test to Parser_setattr to disallow
this combination. I've added this:

    if (!self->have_quotechar && n != QUOTE_NONE) {
        PyErr_BadArgument();
        return -1;
    }

But I don't entirely like the idea of raising such a generic error. If
anyone has a better suggestion, let me know.
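(For what it's worth, here is one shape a friendlier check could take,
sketched at the Python level rather than in Parser_setattr. The constant
ordering matches the values seen in the session above, but the validate()
helper and its message text are only illustrative, not part of the module:)

    # Quoting constants in the order the module uses (0..3).
    QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC, QUOTE_NONE = range(4)

    class Error(Exception):
        pass

    def validate(quotechar, quoting):
        # Same condition as the C test above, but the error names the
        # conflicting settings instead of raising a bare bad-argument.
        if quotechar is None and quoting != QUOTE_NONE:
            raise Error("quoting must be QUOTE_NONE when quotechar is None")

    validate('"', QUOTE_MINIMAL)        # acceptable combination
    try:
        validate(None, QUOTE_ALL)       # the broken combination shown above
    except Error, e:
        print e   # quoting must be QUOTE_NONE when quotechar is None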
--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Feb 5 03:05:14 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 13:05:14 +1100
Subject: [Csv] another _csv question
Message-ID: <20030205020514.DC52D3CA92@coffee.object-craft.com.au>

In csv_parser, while validating keyword arguments, we set quoting to
QUOTE_NONE if quotechar is not set - I think we should be raising an
exception in this case (but it must be deferred until all keyword
arguments have been parsed). Any objections?

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Wed Feb 5 03:07:57 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 20:07:57 -0600
Subject: [Csv] _csv bug
In-Reply-To: <20030205015210.2714C3CA92@coffee.object-craft.com.au>
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au>
Message-ID: <15936.29181.916055.866700@montanaro.dyndns.org>

Andrew> But this is broken:
Andrew> [quotechar=None && p.quoting==1]

Andrew> The obvious fix is to add an additional test to Parser_setattr
Andrew> to disallow this combination. I've added this:
...
Andrew> But I don't entirely like the idea of raising such a generic
Andrew> error. If anyone has a better suggestion, let me know.

We never really decided on the split between sanity checks in csv.py
vs. sanity checks in _csv.c did we? I've got a change to csv.py ready to
check in which adds __init__ and _validate methods to the Dialect class.
If we do more elaborate checks there, I think we can get away with coarser
checks and exceptions in _csv.c, basically just stuff to keep the
interpreter from dumping core.

Skip

From skip at pobox.com  Wed Feb 5 03:10:06 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 20:10:06 -0600
Subject: [Csv] another _csv question
In-Reply-To: <20030205020514.DC52D3CA92@coffee.object-craft.com.au>
References: <20030205020514.DC52D3CA92@coffee.object-craft.com.au>
Message-ID: <15936.29310.765282.986147@montanaro.dyndns.org>

Andrew> In csv_parser, while validating keyword arguments ...

Before we add a bunch of checks to _csv.c why don't we decide the split
between the Python and C levels as far as validation is concerned? I have
Dialect validation happening at instantiation time.

I suspect we should provide a __setattr__ that forces Dialect instances to
be read-only.

Skip

From andrewm at object-craft.com.au  Wed Feb 5 03:16:19 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 13:16:19 +1100
Subject: [Csv] _csv bug
In-Reply-To: Message from Skip Montanaro <15936.29181.916055.866700@montanaro.dyndns.org>
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au> <15936.29181.916055.866700@montanaro.dyndns.org>
Message-ID: <20030205021619.6AC2C3CA92@coffee.object-craft.com.au>

>We never really decided on the split between sanity checks in csv.py
>vs. sanity checks in _csv.c did we? I've got a change to csv.py ready to
>check in which adds __init__ and _validate methods to the Dialect class. If
>we do more elaborate checks there, I think we can get away with coarser
>checks and exceptions in _csv.c, basically just stuff to keep the
>interpreter from dumping core.

I'd like the underlying _csv module to be sane in its own right - I'd
really rather these tests were kept in _csv.
It's also where the parameters have meaning - if you're adding a new
parameter to _csv, then you're more likely to add appropriate tests than
if you also have to update csv.py.

I also suspect we can move more functionality from csv.py into _csv to
reduce overhead further. Some benchmarking is required - it might be that
we can become significantly faster by having _csv talk directly to fileobj
when writing, etc.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Feb 5 03:20:35 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 13:20:35 +1100
Subject: [Csv] another _csv question
In-Reply-To: Message from Skip Montanaro <15936.29310.765282.986147@montanaro.dyndns.org>
References: <20030205020514.DC52D3CA92@coffee.object-craft.com.au> <15936.29310.765282.986147@montanaro.dyndns.org>
Message-ID: <20030205022036.F3E3C3CA92@coffee.object-craft.com.au>

>I suspect we should provide a __setattr__ that forces Dialect instances to
>be read-only.

I think this is an unnecessary restriction. You might want to do something
like:

    class SnifferDialect(csv.Dialect):
        pass

    def sniff(...):
        dialect = SnifferDialect()
        ... try stuff ...
        dialect.delimiter = '\t'
        ... try more stuff ...

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Wed Feb 5 03:35:40 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 20:35:40 -0600
Subject: [Csv] _csv bug
In-Reply-To: <20030205021619.6AC2C3CA92@coffee.object-craft.com.au>
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au> <15936.29181.916055.866700@montanaro.dyndns.org> <20030205021619.6AC2C3CA92@coffee.object-craft.com.au>
Message-ID: <15936.30844.628733.452308@montanaro.dyndns.org>

Andrew> I'd like the underlying _csv module to be sane in its own right
Andrew> - I'd really rather these tests were kept in _csv.

No argument here. I'm just thinking that the _csv module only has to
defend against rotten inputs. It can raise a generic error as far as I'm
concerned.

Andrew> I also suspect we can move more functionality from csv.py into
Andrew> _csv to reduce overhead further. Some benchmarking is required -
Andrew> it might be that we can become significantly faster by having
Andrew> _csv talk directly to fileobj when writing, etc.

What I'm talking about happens once, at Dialect instantiation time, so I
doubt performance is going to be a big issue. It's also easier to give
more comprehensive feedback in Python.

Skip

From skip at pobox.com  Wed Feb 5 03:39:27 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 20:39:27 -0600
Subject: [Csv] another _csv question
In-Reply-To: <20030205022036.F3E3C3CA92@coffee.object-craft.com.au>
References: <20030205020514.DC52D3CA92@coffee.object-craft.com.au> <15936.29310.765282.986147@montanaro.dyndns.org> <20030205022036.F3E3C3CA92@coffee.object-craft.com.au>
Message-ID: <15936.31071.461489.580370@montanaro.dyndns.org>

>> I suspect we should provide a __setattr__ that forces Dialect
>> instances to be read-only.

Andrew> I think this is an unnecessary restriction. You might want to do
Andrew> something like:

Andrew>     class SnifferDialect(csv.Dialect):
Andrew>         pass

Andrew>     def sniff(...):
Andrew>         dialect = SnifferDialect()
Andrew>         ... try stuff ...
Andrew>         dialect.delimiter = '\t'
Andrew>         ... try more stuff ...

I can buy that.
Maybe what we need then is some way to force validation after changes are
made, but before the dialect info is tossed over the wall to the low-level
module.

Skip

From andrewm at object-craft.com.au  Wed Feb 5 04:26:17 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 14:26:17 +1100
Subject: [Csv] another _csv question
In-Reply-To: Message from Skip Montanaro <15936.31071.461489.580370@montanaro.dyndns.org>
References: <20030205020514.DC52D3CA92@coffee.object-craft.com.au> <15936.29310.765282.986147@montanaro.dyndns.org> <20030205022036.F3E3C3CA92@coffee.object-craft.com.au> <15936.31071.461489.580370@montanaro.dyndns.org>
Message-ID: <20030205032617.2D5333CA92@coffee.object-craft.com.au>

> Andrew> I think this is an unnecessary restriction. You might want to do
> Andrew> something like:
>
> Andrew>     class SnifferDialect(csv.Dialect):
> Andrew>         pass
>
> Andrew>     def sniff(...):
> Andrew>         dialect = SnifferDialect()
> Andrew>         ... try stuff ...
> Andrew>         dialect.delimiter = '\t'
> Andrew>         ... try more stuff ...
>
>I can buy that. Maybe what we need then is some way to force validation
>after changes are made, but before the dialect info is tossed over the wall
>to the low-level module.

I think we're trying too hard - it's acceptable for the validation to only
occur when the reader or writer factories are called, I think.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Feb 5 04:27:31 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 14:27:31 +1100
Subject: [Csv] _csv bug
In-Reply-To: Message from Skip Montanaro <15936.30844.628733.452308@montanaro.dyndns.org>
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au> <15936.29181.916055.866700@montanaro.dyndns.org> <20030205021619.6AC2C3CA92@coffee.object-craft.com.au> <15936.30844.628733.452308@montanaro.dyndns.org>
Message-ID: <20030205032732.049303CA92@coffee.object-craft.com.au>

> Andrew> I also suspect we can move more functionality from csv.py into
> Andrew> _csv to reduce overhead further. Some benchmarking is required -
> Andrew> it might be that we can become significantly faster by having
> Andrew> _csv talk directly to fileobj when writing, etc.
>
>What I'm talking about happens once, at Dialect instantiation time, so I
>doubt performance is going to be a big issue. It's also easier to give more
>comprehensive feedback in Python.

What sort of comprehensive feedback did you have in mind?

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Wed Feb 5 04:41:59 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 21:41:59 -0600
Subject: [Csv] _csv bug
In-Reply-To: <20030205032732.049303CA92@coffee.object-craft.com.au>
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au> <15936.29181.916055.866700@montanaro.dyndns.org> <20030205021619.6AC2C3CA92@coffee.object-craft.com.au> <15936.30844.628733.452308@montanaro.dyndns.org> <20030205032732.049303CA92@coffee.object-craft.com.au>
Message-ID: <15936.34823.558627.998337@montanaro.dyndns.org>

>> It's also easier to give more comprehensive feedback in Python.

Andrew> What sort of comprehensive feedback did you have in mind?

Stuff like:

    class myexcel(csv.excel):
        quotechar = ','
        ...

    quotechar and delimiter must be different

or

    class myexcel(csv.excel):
        lineterminator = '\n'
        ...

    lineterminator and the hard return character should be different

That sort of thing.
(Speaking of which, we should probably allow the user to specify the hard
(embedded) return character.) It's tough enough in C to generate really
good messages (because it often requires pasting strings together
on-the-fly to provide the necessary context) that it frequently doesn't
get done. For example, if I pass None instead of an int for parameters
with 'i' format characters, all PyArg_PTAK says is "int was required".
However, there are nine args to the constructor, five of which are ints.

Skip

From andrewm at object-craft.com.au  Wed Feb 5 04:49:32 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 14:49:32 +1100
Subject: [Csv] more tests
Message-ID: <20030205034932.3A1C83CA92@coffee.object-craft.com.au>

I've checked in some more tests - while not comprehensive, they get us
close to 90% coverage, as calculated by gcov. The remaining untested lines
are mainly checks for failed memory allocations.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Feb 5 04:54:24 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 14:54:24 +1100
Subject: [Csv] _csv bug
In-Reply-To: Message from Skip Montanaro <15936.34823.558627.998337@montanaro.dyndns.org>
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au> <15936.29181.916055.866700@montanaro.dyndns.org> <20030205021619.6AC2C3CA92@coffee.object-craft.com.au> <15936.30844.628733.452308@montanaro.dyndns.org> <20030205032732.049303CA92@coffee.object-craft.com.au> <15936.34823.558627.998337@montanaro.dyndns.org>
Message-ID: <20030205035424.48D253CA92@coffee.object-craft.com.au>

>That sort of thing. (Speaking of which, we should probably allow the user
>to specify the hard (embedded) return character.) It's tough enough in C
>to generate really good messages (because it often requires pasting
>strings together on-the-fly to provide the necessary context) that it
>frequently doesn't get done. For example, if I pass None instead of an int
>for parameters with 'i' format characters, all PyArg_PTAK says is "int was
>required". However, there are nine args to the constructor, five of which
>are ints.

I'm not sure this is a good enough reason to move the checks away from the
"coalface" - with a little more work, we can generate friendly messages
from the C level, while at the same time keeping them tightly coupled to
the implementation. I'd certainly agree the PyArg_PTAK validation is less
than useful in our context - but I think it highlights a more fundamental
problem in the way the C code is structured. I'll talk to Dave tonight and
see if we can come up with something better.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From djc at object-craft.com.au  Wed Feb 5 11:11:21 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 05 Feb 2003 21:11:21 +1100
Subject: [Csv] _csv bug
In-Reply-To: <20030205035424.48D253CA92@coffee.object-craft.com.au>
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au> <15936.29181.916055.866700@montanaro.dyndns.org> <20030205021619.6AC2C3CA92@coffee.object-craft.com.au> <15936.30844.628733.452308@montanaro.dyndns.org> <20030205032732.049303CA92@coffee.object-craft.com.au> <15936.34823.558627.998337@montanaro.dyndns.org> <20030205035424.48D253CA92@coffee.object-craft.com.au>
Message-ID:

>>>>> "Andrew" == Andrew McNamara writes:

>> That sort of thing.
>> (Speaking of which, we should probably allow the user to specify the
>> hard (embedded) return character.) It's tough enough in C to generate
>> really good messages (because it often requires pasting strings
>> together on-the-fly to provide the necessary context) that it
>> frequently doesn't get done. For example, if I pass None instead of an
>> int for parameters with 'i' format characters, all PyArg_PTAK says is
>> "int was required". However, there are nine args to the constructor,
>> five of which are ints.

Andrew> I'm not sure this is a good enough reason to move the checks
Andrew> away from the "coalface" - with a little more work, we can
Andrew> generate friendly messages from the C level, while at the same
Andrew> time keeping them tightly coupled to the implementation. I'd
Andrew> certainly agree the PyArg_PTAK validation is less than useful
Andrew> in our context - but I think it highlights a more fundamental
Andrew> problem in the way the C code is structured. I'll talk to Dave
Andrew> tonight and see if we can come up with something better.

We spoke for a short while and decided that it might make more sense to
remove the PyArg_PTAK stuff altogether and just use the __setattr__ stuff
in _csv.

One of the problems in this approach is that PyArg_PTAK allows you to set
multiple attributes simultaneously while __setattr__ is one attribute at a
time. This means that it is not really feasible to validate settings in
the __setattr__ method - the user would have to work out a sequence of
__setattr__ steps to go from one dialect to the next without ever having
illegal parameter settings.

There are two obvious ways around this that I can see.

1. Mark the parser dirty whenever __setattr__ is called then check the
   dirty flag on the next method call which uses the parser. If parser is
   dirty, check that the parameter set is valid.

2. Only check the legality of the parameter set when the user calls the
   check_attrs() (or whatever) method.

- Dave

--
http://www.object-craft.com.au

From djc at object-craft.com.au  Wed Feb 5 11:14:04 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 05 Feb 2003 21:14:04 +1100
Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv _csv.c,1.9,1.10
In-Reply-To: <15933.57844.794060.305738@montanaro.dyndns.org>
References: <15933.57844.794060.305738@montanaro.dyndns.org>
Message-ID:

>>>>> "Skip" == Skip Montanaro writes:

dave> Oops - forgot to check for '+-.' when quoting is
dave> QUOTE_NONNUMERIC.

Skip> Looking at the code, I wonder if when quoting is set to
Skip> NONNUMERIC a single attempt to call PyFloat_FromString(field)
Skip> should be made and the result used to identify the field as
Skip> numeric or not. (Not for performance, but for accuracy of the
Skip> setting.)

You are probably right. The current code is completely ignorant of locale
settings.

- Dave

--
http://www.object-craft.com.au

From djc at object-craft.com.au  Wed Feb 5 11:30:12 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 05 Feb 2003 21:30:12 +1100
Subject: [Csv] csv.QUOTE_NEVER?
In-Reply-To: <20030203035102.34B183C1F4@coffee.object-craft.com.au>
References: <15930.60672.18719.407166@montanaro.dyndns.org> <20030203035102.34B183C1F4@coffee.object-craft.com.au>
Message-ID:

>>>>> "Andrew" == Andrew McNamara writes:

Skip> The three quoting constants are currently defined as
Skip> QUOTE_MINIMAL, QUOTE_ALL and QUOTE_NONNUMERIC. Didn't we decide
Skip> there would be a QUOTE_NEVER constant as well?
>> I was going to define QUOTE_NEVER then realised that all you have
>> to do is set quotechar to None. Why add the effort of implementing
>> two ways to achieve the same thing.

Andrew> "quotechar" as None probably should be illegal in the new
Andrew> module, and the "quoting" parameter used exclusively. This
Andrew> would be consistent with the direction we've taken with other
Andrew> parameters.

OK.

I have made the changes to the _csv module. I am not sure what to do with
the tests. It seems a shame to delete them - can you have a look and see
if there is some way you can change the failing tests to meaningful tests
which succeed with the new module?

- Dave

--
http://www.object-craft.com.au

From skip at pobox.com  Wed Feb 5 15:28:00 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 5 Feb 2003 08:28:00 -0600
Subject: [Csv] _csv bug
In-Reply-To:
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au> <15936.29181.916055.866700@montanaro.dyndns.org> <20030205021619.6AC2C3CA92@coffee.object-craft.com.au> <15936.30844.628733.452308@montanaro.dyndns.org> <20030205032732.049303CA92@coffee.object-craft.com.au> <15936.34823.558627.998337@montanaro.dyndns.org> <20030205035424.48D253CA92@coffee.object-craft.com.au>
Message-ID: <15937.8048.817556.835043@montanaro.dyndns.org>

Dave> We spoke for a short while and decided that it might make more
Dave> sense to remove the PyArg_PTAK stuff altogether and just use the
Dave> __setattr__ stuff in _csv.

Why not replace the 'i' and 'S' flags with 'O' flags in PyArg_PTAK, then
validate on the resulting group of Python objects?

Skip

From andrewm at object-craft.com.au  Wed Feb 5 23:35:48 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 06 Feb 2003 09:35:48 +1100
Subject: [Csv] csv.QUOTE_NEVER?
In-Reply-To: Message from Dave Cole
References: <15930.60672.18719.407166@montanaro.dyndns.org> <20030203035102.34B183C1F4@coffee.object-craft.com.au>
Message-ID: <20030205223548.6E3E73CA92@coffee.object-craft.com.au>

>I have made the changes to the _csv module. I am not sure what to do
>with the tests. It seems a shame to delete them - can you have a look
>and see if there is some way you can change the failing tests to
>meaningful tests which succeed with the new module?

Nearly all of those tests were no longer relevant when the parameters
were rationalised - they were doing things like checking that "quoting"
was changed to something reasonable when "quotechar" was changed, or that
an exception was raised when their values conflicted. The new scheme
prevents that happening in a far more reasonable manner.
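(For illustration, this is how the rationalised scheme reads through the
reader/writer interface PEP 305 describes - a sketch only, assuming the
QUOTE_* constants end up exported by the csv module, as discussed earlier
in the thread. "Never quote" is spelled with the quoting parameter rather
than quotechar=None, and an escapechar is then needed for delimiters
embedded in fields:)

    import csv, StringIO

    buf = StringIO.StringIO()
    # QUOTE_NONE replaces the old quotechar=None spelling.
    writer = csv.writer(buf, quoting=csv.QUOTE_NONE, escapechar='\\')
    writer.writerow(['1', '2', '3,4'])
    print repr(buf.getvalue())    # -> '1,2,3\\,4\r\n'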
BTW, I think we've introduced a bug when we split some of the variables into "have_<attr>" and "<attr>", where <attr> could then contain a null - in many places, we assume we're dealing with null terminated C strings, and I suspect the user might now be able to inject a null in places we don't expect it. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From cribeiro at mail.inet.com.br Wed Feb 5 23:52:18 2003 From: cribeiro at mail.inet.com.br (Carlos Ribeiro) Date: Wed, 5 Feb 2003 22:52:18 +0000 Subject: [Csv] PEP 305 - Comments (really long post) Message-ID: <200302052252.18694.cribeiro@mail.inet.com.br> To the CSV'ers, I was discussing the CSV implementation with Skip 'et alii' over the Python list, but I decided to wait a little bit to put my ideas in a better form and then contribute to the PEP. Please bear with me as this message is rather long and may be confusing at times, but I sincerely hope it helps. MOTIVATION I use CSV files (almost) daily. I have written a few CSV parsers in my life, and for reasons that should be apparent to anyone who ever worked with them, I never managed to put everything in a single generic library. It's at the same time simple but 'catchy' - when you think you've done it, there appears some other different piece of software that manages it differently. CONCEPTUAL ISSUES ------------ [1] I know that localization is a hard problem, and that it will probably not be directly supported, but let me explain why it is important and what the issues at hand are. One of the biggest issues with reading and writing CSV data is localization. In countries where the comma is used as a decimal separator, it is common to have some other character to serve as a delimiter. The semicolon is the standard choice for MS packages using the Brazilian locale; I'm not aware of the settings for other countries. So far, fine, because the csv library can be configured for alternative field delimiters. But parsing numbers is a real problem. For example, look at these lines:

"row 1";10
"row 2";3,1416

--> I assume that the csv library will parse the first line as ("row 1",10); the number 10 will probably be returned as an integer (which is not the correct interpretation for this particular file - more on this in item [2]). --> The second line will probably be parsed as ("row 2","3,1416"); it may even raise an exception, depending on the implementation details! What do you intend to do in this case? Another point that you should bear in mind: even here in Brazil, some programs will use the standard (US) delimiters and number formats, while others will use the localized ones. So we end up needing to read/write both formats - for example, when reading data from Excel, and then exporting the same data to some scientific package that is not locale-aware. So any localization-related parameters have to be flexible and easily customizable. ------------ [2] I assume that the csv library will convert any numbers read from the csv file to some of the numeric types available in Python upon reading. There are some issues here. In most cases, it is important to keep a regular conversion rule inside a given 'column' of the csv file. For example, in this file:

"row 1";10;1
"row 2";3,1416;2
"row 3";-1;3

The obvious choice is to parse column 1 as strings; column 2 as floats; and column 3 as integers. But the problem is, how is the csv library supposed to know that the second column holds float values, and not integers?
Look-ahead is out of the question - after all, the only line containing a decimal point may be the last one in a 10 GB file. For this problem, I propose the following semantics:

a) Numbers will be interpreted according to a parameter set in the dialect:
   - NUMBER_AS_AUTO: (default value) numeric values will be converted to the simplest type available.
   - NUMBER_AS_FLOAT: all numeric values will be converted to floats.
   - NUMBER_AS_INT: all numeric values will be converted to ints.

b) assuming that the default column types were not supplied, the csv library will try to detect the correct values from the ones read from the first line of the file, but respecting the parameters mentioned above. If the first line contains column headers, then it will use the second line for this purpose.

c) from the second line onwards, the csv library will keep the same conversion used for the first line. In case of error (for example, a float is found in an integer-only column), the library may take one of these actions:
   - raise an exception
   - coerce the value to the standard type for that particular column
   - return the value as read, even if using a different type

------------ [3] I really liked the concept of the csvreader interface. I liked the fact that the constructor takes a file object, not a file name, for the simple reason that it leads to a very nice design pattern. It makes it easier to compose objects and to reuse the csvreader with things other than files (just one idea: why not read CSV values directly from the clipboard? It's a good application of this design). ------------ [4] That said, I have one concern: setting the line terminator in the CSV library (using the dialect class) does not seem right. If I pass a generic iterable object as the CSV file parameter, then it implies that the iterable itself should bear the choice on how to break lines. For instance, one should be able to write code like this:

    # export a file with CR/LF linebreaks (DOS/Windows style)
    csvwriter = csv.writer(file_crlf("some.csv", "w"))
    for row in mydata:
        csvwriter.write(row)

    # export a file with LF linebreaks (Unix style)
    csvwriter = csv.writer(file_lf_only("some.csv", "w"))
    for row in mydata:
        csvwriter.write(row)

... where 'file_crlf' and 'file_lf_only' are subclasses of 'file' that implement different line terminators (it has to be independent of the underlying OS and/or C library, of course). [Of course, this is religious stuff - the old Unix vs DOS/Windows line break debate. But please let us avoid it and focus on the problem at hand.] My point here is that the line terminator in the CSV library will end up being useless, as it depends ultimately on the ability of the csvwriter.write() method to convince the file object to use the 'correct' line terminator. I'm not sure if this can be done in a generic fashion, unless more restrictions are placed on the 'file-like' object that can be passed to the constructor. In other words: IF the csv library takes a file object as a parameter, IN SUCH A WAY that all that the csv library sees are entire lines (as strings), THEN it has to delegate line termination to the file object. On the other hand, if the csv library wants to have full control of line delimiters, then it should take a file name only and treat the file as a binary stream. ------------ [5] It is not clear to me what is returned as a row in the example given:

    csvreader = csv.reader(file("some.csv"))
    for row in csvreader:
        process(row)

It is obvious to assume that 'row' is a sequence, probably a tuple.
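For the first line of my earlier example, I would then expect something like:

    row = ('row 1', '10')

...but that is only my assumption.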
Anyway, it should be clearly stated in the PEP. ------------ [6] Empty strings can be mistaken for NULL fields (or None values in Python). How do you intend to manage this case, both when reading and writing? Please note that, depending on the selection of the quote behavior and due to some side effects, it may be impossible for the reader to discern the two cases; so the library will need to be informed about the default choice. For example, for a given quote and delimiter choice, what does the reader do?

Example (a): [quotechar=", quoting=QUOTE_ALL]

    "",,"" --> ("", None, "")

In this case, the reader can safely assume that the empty field holds 'None', because empty strings should be quoted in this case.

Example (b): [quotechar=", quoting=QUOTE_NONE]

    ,,     --> (None, None, None) or ("","","")?
    "",,"" --> ("", None, "") or ("", "", "")?

The example shows the ambiguity of empty fields when quotes are not mandatory (as with QUOTE_NONE or QUOTE_MINIMAL). In this example, I still think that the reader should interpret empty fields as None; in the second case, it's easier to guess, but the first case is open for debate. My suggestion is to add a parameter in the dialect to set the correct behavior:

    class excel:
        delimiter = ','
        quotechar = '"'
        escapechar = None
        doublequote = True
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = QUOTE_MINIMAL
        emptyfield = None  # add this definition to the dialect
        emptyfield = ""    # another possible choice

BTW, the same reasoning may be applied to the decision between returning 'None' or 'zero' when reading an empty numeric field. ------------ [7] A minor suggestion, why not number the items in the "Issues" section? It would make it easier to reference comments... For example, 'issue #1', etc... ------------ [8] My comments on the last issue of the list - rows of different lengths: It depends on the goals of the csv library. If the library is intended to be small and simple, and not to do anything automatically, then the reader should simply return a sequence of fields read, independent of the length. It should be left to the programmer to handle special cases. On the other hand, if the csv library is being proposed as a more generic solution, it may be interesting to study the options presented (raise an exception, fill in short lines, etc.). However, in this case, the csv reader will need to know more about the particular structure of the csv file in order to be able to make the correct choice. It may include things such as knowing the column type, etc. That's complex, and it is one of the reasons why special treatment for floats and dates is being left out of this implementation. ------------ [9] A very similar architecture can be used to handle fixed-width text files. It can be done in a separate library, but using a similar interface; or it could be part of the csv library, either as another class, or by means of a proper selection of the parameters passed to the constructor. It would be useful as some applications may prefer fixed-width files to delimited ones (old COBOL programs are likely to behave this way; this format is still common when passing data to/from mainframes). Carlos Ribeiro cribeiro at mail.inet.com.br From andrewm at object-craft.com.au Thu Feb 6 02:04:09 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 06 Feb 2003 12:04:09 +1100 Subject: [Csv] Dialect passing...
Message-ID: <20030206010409.BE0F23CA92@coffee.object-craft.com.au> Just to throw the cat amongst the pigeons, it occurred to me that my logic for making the dialect an instance rather than a dict was slightly bogus: the inheritance can still be done with a dictionary, simply by copying it:

    excel = { 'delimiter': ',' }
    excel_tab = excel.copy()
    excel_tab['delimiter'] = '\t'

That said, the instance approach looks a little more natural to my eye. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Thu Feb 6 02:29:04 2003 From: djc at object-craft.com.au (Dave Cole) Date: 06 Feb 2003 12:29:04 +1100 Subject: [Csv] Dialect passing... In-Reply-To: <20030206010409.BE0F23CA92@coffee.object-craft.com.au> References: <20030206010409.BE0F23CA92@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: Andrew> Just to throw the cat amongst the pigeons, it occurred to me Andrew> that my logic for making the dialect an instance rather than a Andrew> dict was slightly bogus: the inheritance can still be done Andrew> with a dictionary, simply by copying it:

    Andrew> excel = { 'delimiter': ',' }
    Andrew> excel_tab = excel.copy()
    Andrew> excel_tab['delimiter'] = '\t'

Noooo..... I hereby sentence you to 2 weeks hard labour writing Perl so you can learn the error of your ways! Andrew> That said, the instance approach looks a little more natural Andrew> to my eye. I can see you have expressed remorse. I commute your punishment to community work - to be served online. - Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Thu Feb 6 02:29:36 2003 From: djc at object-craft.com.au (Dave Cole) Date: 06 Feb 2003 12:29:36 +1100 Subject: [Csv] _csv bug In-Reply-To: <15937.9216.15907.519181@montanaro.dyndns.org> References: <20030205015210.2714C3CA92@coffee.object-craft.com.au> <15936.29181.916055.866700@montanaro.dyndns.org> <20030205021619.6AC2C3CA92@coffee.object-craft.com.au> <15936.30844.628733.452308@montanaro.dyndns.org> <20030205032732.049303CA92@coffee.object-craft.com.au> <15936.34823.558627.998337@montanaro.dyndns.org> <20030205035424.48D253CA92@coffee.object-craft.com.au> <15937.9216.15907.519181@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Dave> We spoke for a short while and decided that it might make more Dave> sense to remove the PyArg_PTAK stuff altogether and just use the Dave> __setattr__ stuff in _csv. Skip> Why not replace the 'i' and 'S' flags with 'O' flags in Skip> PyArg_PTAK, then validate on the resulting group of Python Skip> objects? That idea is even more gooder. - Dave -- http://www.object-craft.com.au From skip at pobox.com Thu Feb 6 05:54:11 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 5 Feb 2003 22:54:11 -0600 Subject: [Csv] Please check this out... Message-ID: <15937.60019.749774.278638@montanaro.dyndns.org> Gang, I just checked in an update to csv.py and test/test_csv.py which allows csv.reader objects to return dicts. In much the same way that the writer can write a dict if told what the field name order is, the reader, if given a list of fieldnames to use as keys, can map the incoming list to a dictionary. There's just one little hitch. I see a negative ref count abort in a CVS debug build *if* there is a typo in the call to csv.reader().
This short_test.py script demonstrates it on my Mac: import sys import unittest from StringIO import StringIO import csv class TestDictFields(unittest.TestCase): def test_read_short_with_rest(self): reader = csv.reader(StringIO("1,2,abc,4,5,6\r\n"), dialect="excel", fieldnames=["f1", "f2"], restfields="_rest") self.assertEqual(reader.next(), {"f1": '1', "f2": '2', "_rest": ["abc", "4", "5", "6"]}) def _testclasses(): mod = sys.modules[__name__] return [getattr(mod, name) for name in dir(mod) if name.startswith('Test')] def suite(): suite = unittest.TestSuite() for testclass in _testclasses(): suite.addTest(unittest.makeSuite(testclass)) return suite if __name__ == '__main__': unittest.main(defaultTest='suite') Compare the csv.reader() call with the declaration of the __init__ method. You'll see I've misspelled "restfield", giving it a needless 's'. This pushes it into the **options dict, and since that's not an understood keyword arg, _csv.parser() complains, like so: Traceback (most recent call last): File "short_test.py", line 9, in test_read_short_with_rest fieldnames=["f1", "f2"], restfields="_rest") File "/Users/skip/local/lib/python2.2/site-packages/csv.py", line 102, in __init__ _OCcsv.__init__(self, dialect, **options) File "/Users/skip/local/lib/python2.2/site-packages/csv.py", line 93, in __init__ self.parser = _csv.parser(**parser_options) TypeError: 'restfields' is an invalid keyword argument for this function Under 2.2, all I get is the above traceback (haven't yet tried a 2.2 debug build). With the latest CVS and a debug build I get: % /usr/local/bin/python short_test.py E ====================================================================== ERROR: test_read_short_with_rest (__main__.TestDictFields) ---------------------------------------------------------------------- Traceback (most recent call last): File "short_test.py", line 9, in test_read_short_with_rest fieldnames=["f1", "f2"], restfields="_rest") File "/usr/local/lib/python2.3/site-packages/csv.py", line 102, in __init__ _OCcsv.__init__(self, dialect, **options) File "/usr/local/lib/python2.3/site-packages/csv.py", line 93, in __init__ self.parser = _csv.parser(**parser_options) TypeError: 'restfields' is an invalid keyword argument for this function ---------------------------------------------------------------------- Ran 1 test in 0.029s FAILED (errors=1) Fatal Python error: Objects/dictobject.c:686 object at 0x476e98 has negative ref count -606348326 Abort trap "-606348326" expressed as hex is '0xdbdbdbda' which looks suspiciously like the 0xdb bytes which debug Pythons scribble in freed memory. It's time for a long winter's nap here. I'm sure you'll have it figured out by the time I check my mail in the morning. Actually, I'm suspicious there's a refcounting bug in 2.3a1... Thx, Skip From andrewm at object-craft.com.au Thu Feb 6 05:56:11 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 06 Feb 2003 15:56:11 +1100 Subject: [Csv] dict argument to writer.writerow Message-ID: <20030206045611.930023CA92@coffee.object-craft.com.au> I don't think this belongs in writer.writerow - I'd suggest it belongs in the as yet unwritten csv.util module. The problem is that it's going to have an appreciable impact on the normal case of writing a tuple. There's no need to have the code auto-detect a dictionary - the user of the module will know before-hand whether they have a dict or tuple, and can use the appropriate layer. 
    # if fields is a dict, we need a valid fieldnames list
    # if self.fieldnames is None we'll get a TypeError in the for stmt
    # if fields is not a dict we'll get an AttributeError on .get()
    try:
        flist = []
        for k in self.fieldnames:
            flist.append(fields.get(k, ""))
        fields = flist
    except (TypeError, AttributeError):
        pass

-- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From skip at pobox.com Thu Feb 6 06:59:02 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 5 Feb 2003 23:59:02 -0600 Subject: [Csv] Re: dict argument to writer.writerow In-Reply-To: <20030206045611.930023CA92@coffee.object-craft.com.au> References: <20030206045611.930023CA92@coffee.object-craft.com.au> Message-ID: <15937.63910.387970.198505@montanaro.dyndns.org> Andrew> I don't think this belongs in writer.writerow - I'd suggest it Andrew> belongs in the as yet unwritten csv.util module. The problem is Andrew> that it's going to have an appreciable impact on the normal case Andrew> of writing a tuple. Hmmm... I think of reading/writing dicts as more integration with the DB API. I rarely use plain fetchall() when getting rows from a table. Dictionaries are much saner objects. Accordingly, I'd like it to be as painless as possible for people to write them out to CSV files. Also, one can frequently think of CSV files as a file of dicts with the simple optimization that the dictionary keys are only written once, in the first row. That's not to say my code couldn't have been done differently. I was trying hard to avoid testing the type of the object being written. In retrospect the code I have will cause an exception to be raised and caught most of the time. Perhaps it would be better as:

    if hasattr(fields, "has_key"):
        # if fields is a dict, we need a valid fieldnames list
        # if self.fieldnames is None we'll get a TypeError in the for stmt
        # if fields is not a dict we'll get an AttributeError on .get()
        try:
            flist = []
            for k in self.fieldnames:
                flist.append(fields.get(k, ""))
            fields = flist
        except (TypeError, AttributeError):
            pass

That should lessen the load in the common case (call to hasattr() vs raised and caught exception). Alternatively, perhaps a writedict() method makes sense. It would be extremely rare (and nearly insane) for a user to write a mixture of lists and dicts. The user could either know what type of row to write and call the proper method or test the type of the first row outside the loop and assign a variable to the appropriate method. In any case, I'd like it to be as easy as possible for people to write dicts to CSV files and to read rows into dicts. Skip From skip at pobox.com Thu Feb 6 08:06:02 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 6 Feb 2003 01:06:02 -0600 Subject: [Csv] Re: PEP 305 - Comments (really long post) In-Reply-To: <200302052252.18694.cribeiro@mail.inet.com.br> References: <200302052252.18694.cribeiro@mail.inet.com.br> Message-ID: <15938.2394.88405.59280@montanaro.dyndns.org> Carlos> I was discussing the CSV implementation with Skip 'et alii' over Carlos> the Python list, but I decided to wait a little bit to put my Carlos> ideas in a better form and then contribute to the PEP. Please Carlos> bear with me as this message is rather long and may be confusing Carlos> at times, but I sincerely hope it helps. Don't worry. Most of us read the list, and I've been forwarding all messages which were sent to c.l.py but not cc'd to the csv list so we at least have them archived. Accordingly, we're already familiar with your plight.
;-) Carlos> For example, look at these lines:

Carlos> "row 1";10
Carlos> "row 2";3,1416

Carlos> I assume that the csv library will parse the first line as ("row Carlos> 1",10); the number 10 will probably be returned as an integer Carlos> (which is not the correct interpretation for this particular Carlos> file - more on this in item [2]). You'd be wrong to assume that. The csv reader will return a list of two strings, "row 1" and "10". How to interpret the contents of the strings is completely up to you. Carlos> The second line will probably be parsed as ("row 2","3,1416"); Carlos> it may even raise an exception, depending on the implementation Carlos> details! What do you intend to do in this case? No exception will be raised. Assuming you have the quotechar set to '"' and the delimiter set to ';', you will, as you surmised, get the pair of strings you indicated. You are completely free to call locale.atof() with "3,1416" as an argument. As long as your locale is set correctly, it will work. Carlos> Another point that you should bear in mind: even here in Brazil, Carlos> some programs will use the standard (US) delimiters and number Carlos> formats, while others will use the localized ones. So we end up Carlos> needing to read/write both formats - for example, when reading Carlos> data from Excel, and then exporting the same data to some Carlos> scientific package that is not locale-aware. So any Carlos> localization-related parameters have to be flexible and easily Carlos> customizable. I understand this is going to be a problem, however I have no way of solving it for you in a way that will make everybody happy, so I'm not going to even try. The csv module is about abstracting away all the little weirdnesses which crop up in different dialects of delimited files. You, as the application programmer, have to be sensitive to the locales in which your data will be interpreted. If you expect to dump an Excel spreadsheet to a CSV file for analysis by a colleague in the US, everyone's going to be a lot happier if you send the data encoded for either the en_US or C locales. If that's not possible, you need to transmit locale information along with the data. If you have your locale set appropriately, when writing numeric data, the csv module should just do the right thing. It calls str() on all numeric data to write it out. I believe str() is locale-sensitive. Carlos> [2] I assume that the csv library will convert any numbers read Carlos> from the csv file to some of the numeric types available in Python Carlos> upon reading. There are some issues here. Nope. You get strings.

Carlos> "row 1";10;1
Carlos> "row 2";3,1416;2
Carlos> "row 3";-1;3

Carlos> The obvious choice is to parse column 1 as strings; column 2 Carlos> as floats; and column 3 as integers. But the problem is, how is Carlos> the csv library supposed to know that the second column holds Carlos> float values, and not integers? Look-ahead is out of the question - Carlos> after all, the only line containing a decimal point may be the Carlos> last one in a 10 GB file. Carlos> For this problem, I propose the following semantics: ... Just apply the necessary semantics yourself. Here's a suggestion. Suppose you know you want the first column to be strings, the second floats and the third ints. Code your read loop something like so:

    types = (str, float, int)
    reader = csv.reader(myfile)
    for row in reader:
        row = [t(v) for (t, v) in zip(types, row)]
        process(row)

That way you have complete control over the interpretation of the data. Nobody guesses.
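If your data uses localized numbers, the same pattern absorbs that as well - just swap in a locale-aware converter. A sketch (untested, and the locale name is platform-dependent):

    import locale
    locale.setlocale(locale.LC_NUMERIC, "pt_BR")  # name varies by platform

    types = (str, locale.atof, int)

locale.atof("3,1416") will then hand you back the float you were after.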
No decisions have to be made at the csv level when a piece of data doesn't fit the mold. Carlos> b) assuming that the default column types were not supplied, the Carlos> csv library will try to detect the correct values from the Carlos> ones read from the first line of the file, but respecting the Carlos> parameters mentioned above. If the first line contains column Carlos> headers, then it will use the second line for this purpose. This is bound to fail. You showed an example where floats and ints were mixed up. What if I had a column containing hex digits? Heck, make it more likely to guess wrong and make them base 9 or base 11 digits. Most of the time base 11 numbers will consist only of the digits 0 through 9. No 'a' will appear. With base 9 numbers it's even worse. They can always be interpreted as decimal numbers, but that interpretation will always be incorrect. Carlos> c) from the second line onwards, the csv library will keep the Carlos> same conversion used for the first line. In case of error Carlos> (for example, a float is found in an integer-only column), the Carlos> library may take one of these actions:

Carlos> - raise an exception
Carlos> - coerce the value to the standard type for that particular column
Carlos> - return the value as read, even if using a different type

You're asking us to do way too much. It is just not going to work in the general case, and you can do a much better job much more simply at the application level, because you know the properties of your data. If we attempted to do something very elaborate, we'd probably get it wrong. Even if we managed to get it right, it would probably be slow. Carlos> [4] That said, I have one concern: setting the line terminator Carlos> in the CSV library (using the dialect class) does not seem Carlos> right. One thing (among many) that's still missing from the PEP is the admonition that you have to pass in files opened in binary mode. That lets the csv module have complete control over line endings using the lineterminator attribute. Carlos> My point here is that the line terminator in the CSV library Carlos> will end up being useless, as it depends ultimately on the Carlos> ability of the csvwriter.write() method to convince the file Carlos> object to use the 'correct' line terminator. That's why we expect you to open files in binary mode. I plan to make another pass through the PEP tomorrow. I will make sure I add this. Carlos> ------------ Carlos> [5] It is not clear to me what is returned as a row in the Carlos> example given:

    Carlos> csvreader = csv.reader(file("some.csv"))
    Carlos> for row in csvreader:
    Carlos>     process(row)

Carlos> It is obvious to assume that 'row' is a sequence, probably a Carlos> tuple. Anyway, it should be clearly stated in the PEP. Thanks, will do. I'm trying to twist my colleagues' arms into letting the reader return dicts and the writer accept dicts under the proper circumstances, but the default case is that the reader will return lists and the writer will accept sequences (lists, tuples, strings, unicode objects and arrays from the standard library, though any other sequence should do as well). Carlos> [6] Empty strings can be mistaken for NULL fields (or None Carlos> values in Python). How do you intend to manage this case, both Carlos> when reading and writing?
Please note that, depending on the Carlos> selection of the quote behavior and due to some side effects, it Carlos> may be impossible for the reader to discern the two cases; so Carlos> the library will need to be informed about the default choice. I don't like writing None out at all, but my colleagues assure me the SQL people want SQL's NULL to map to None and that the most reasonable text representation of None is the empty string. Quoting doesn't count. We have no intention of implying semantics using quotes. I believe we still have some thinking to do about whether to allow the user to specify the actual string representation of None. ... Carlos> BTW, the same reasoning may be applied to the decision between Carlos> returning 'None' or 'zero' when reading an empty numeric field. Again, don't forget that the csv module does not infer types. When you read a row you get a list of strings. It's up to the application to decide how to interpret it. Carlos> [7] A minor suggestion, why not number the items in the "Issues" Carlos> section? It would make it easier to reference comments... For Carlos> example, 'issue #1', etc... Thanks, that's a good idea. Carlos> [8] My comments on the last issue of the list - rows of Carlos> different lengths: Rows of different lengths can be returned. How to deal with short or long rows is the job of the application. Carlos> [9] A very similar architecture can be used to handle Carlos> fixed-width text files. It can be done in a separate library, Carlos> but using a similar interface; or it could be part of the csv Carlos> library, either as another class, or by means of a proper Carlos> selection of the parameters passed to the constructor. It would Carlos> be useful as some applications may prefer fixed-width Carlos> files to delimited ones (old COBOL programs are Carlos> likely to behave this way; this format is still common when Carlos> passing data to/from mainframes). We thought about this briefly, but fixed-width data is not what CSV files are all about. The csv module is about parsing tabular data which uses various delimiters, quoting and escaping techniques. In addition, fixed-width data is pretty trivial to read anyway, and probably doesn't deserve a module of its own. There are no issues of quoting or delimiters. You just need to read the file in chunks of the row size and split each row along chunks of the element size. Thanks for your comments, Skip From andrewm at object-craft.com.au Thu Feb 6 14:01:09 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 07 Feb 2003 00:01:09 +1100 Subject: [Csv] Re: dict argument to writer.writerow In-Reply-To: Message from Skip Montanaro <15937.63910.387970.198505@montanaro.dyndns.org> References: <20030206045611.930023CA92@coffee.object-craft.com.au> <15937.63910.387970.198505@montanaro.dyndns.org> Message-ID: <20030206130109.16FD23CA92@coffee.object-craft.com.au> > Andrew> I don't think this belongs in writer.writerow - I'd suggest it > Andrew> belongs in the as yet unwritten csv.util module. The problem is > Andrew> that it's going to have an appreciable impact on the normal case > Andrew> of writing a tuple. There's an even better reason now - I've almost completely re-written the C module so that the reader and writer classes are implemented in C. I haven't checked the changes in yet, because I need to do some cleaning up, and I'm too tired - don't make any conflicting changes to _csv.c or they will be lost.
This should help performance slightly, but the real reason was to sweep a whole bunch of giblets out of the dialect parsing - my feeling is that it's a lot cleaner now, but only time will tell. >Hmmm... I think of reading/writing dicts as more integration with the DB >API. I rarely use plain fetchall() when getting rows from a table. >Dictionaries are much saner objects. Accordingly, I'd like it to be as >painless as possible for people to write them out to CSV files. Sure. I don't think it's too big an ask that they use an alternate interface, however. >Also, one can frequently think of CSV files as a file of dicts with the >simple optimization that the dictionary keys are only written once, in the >first row. Yeah, but then it's something more than a CSV file, isn't it.. 8-) >That's not to say my code couldn't have been done differently. I was trying >hard to avoid testing the type of the object being written. In retrospect >the code I have will cause an exception to be raised and caught most of the >time. Perhaps it would be better as: > > if hasattr(fields, "has_key"): The hasattr is about 4 times faster, but by having two interfaces, we don't even have to pay that cost. >In any case, I'd like it to be as easy as possible for people to write dicts >to CSV files and to read rows into dicts. Sure. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From cribeiro at mail.inet.com.br Thu Feb 6 14:36:21 2003 From: cribeiro at mail.inet.com.br (Carlos Ribeiro) Date: Thu, 6 Feb 2003 13:36:21 +0000 Subject: [Csv] Re: PEP 305 - Comments (really long post) In-Reply-To: <15938.2394.88405.59280@montanaro.dyndns.org> References: <200302052252.18694.cribeiro@mail.inet.com.br> <15938.2394.88405.59280@montanaro.dyndns.org> Message-ID: <200302061336.21724.cribeiro@mail.inet.com.br> Skip, Well, nobody can say that I didn't try :-) I am almost giving up on my crusade to convince you that numbers should be converted by the csv library. It seems that we started from different assumptions, but now I think I've understood what your objectives are. I still have a few points to make, though: 1) There is one reason left to convert numbers before returning them, and this has a lot to do with information that is discarded in the process. Let us follow this example:

    "row 1";10 --> ("row 1", "10")

The second item of the returned tuple is a string, as you stated in your answer. The problem is that my application has no way to know if the value was originally written in the csv file with or without quotes; this information is lost because all values are 'normalized' by the csv library. If I know the structure of the csv file, then it's fine, but it's not so nice when you're trying to detect the structure of an arbitrary csv file. Take a look at another example, where the first column is called 'code', the second column is 'description', and the third one is 'cost'. Note that this example is similar to the structure used for files exported from project management software:

    "1", "Project phase", 2000
    "1.1", "Requirement analysis", 1000
    "1.1", "Architectural design", 1000

In this case, MS Excel will detect the first column as a string, but will convert values in the third one to numeric format. It can do that because it knows that the first column's values were quoted, and the third one's aren't. Now, when you return a tuple of strings, the user has no way to know whether or not the quotes were present in the original file.
There are a few solutions for this problem, none of them fully satisfactory:

a) return the strings as proposed by you, which leaves the library unusable for situations as described above;

b) return strings in such a way that the original quotes are preserved. Then it will be up to the user to remove the extra quotes from the "real" strings;

c) convert unquoted numeric values to native numbers (ints or floats) when returning the row (as proposed by myself in my previous messages);

d) provide an alternative method to retrieve more information - for example, a second tuple with a more detailed description of how the line was analysed. While more complex, this approach has some advantages: (1) it does not make the usual code any more complex, and (2) the extra information will help to implement 'smarter' csvreaders.

Other alternatives may exist, but I think that the list above sums up very well the practical options. 2) In your answer, you cite the case where some numeric values can be hex, or in some other base. Well, I don't agree with your argument. One of Python's mottos is "to make simple things simple". The simplest case is base 10 integers; if the library can deal with them in a sane way, you're solving the problems of the vast majority of the users. Special cases are just that, special, and will be treated in a special fashion anyway. 3) I'm not sure if str() is localized for floats. Using the standard installation of PythonWin with a fully localized copy of Windows, it still uses periods as the decimal point - not commas. I didn't try to change the locale manually (I never did that before for Python); I'll try and tell you what happens. BTW, I'm sure that repr() isn't localized, because the syntax for floats is not locale-dependent, but you are probably aware of this fact. But I'm afraid that str() and repr() calls may end up calling the same function in the case of floats. 4) I'm not convinced that passing a binary file is a good idea. Reading the PEP I assumed that the csvreader constructor just takes any object that can return lines. Well, binary file objects do not meet this definition. It would make the system much less flexible, making it more difficult to pass arbitrary iterables to the csv library. For the sake of simplicity and clarity, why not leave the line termination option out of the csv library, in such a way that it can be implemented in the file object passed to the reader? The csv library would then be less dependent on implementation details of the file, focusing more on how to interpret the content of the lines. 5) I agree that fixed width text files are different beasts. Anyway, it should be possible to implement it using the same interface (or API, whatever you like calling it). Things like that make the learning curve smoother. But we can leave this discussion for a later time.
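Just to make the 'same interface' point concrete, here is a rough sketch (untested) of a fixed-width reader that mimics the csvreader protocol, with a list of field widths taking the place of the delimiter:

    class fixedwidthreader:
        def __init__(self, fileobj, widths):
            self.lines = iter(fileobj)
            self.widths = widths        # e.g. [10, 30, 8]
        def __iter__(self):
            return self
        def next(self):
            # raises StopIteration at end of input, like the csv reader
            line = self.lines.next()
            row, pos = [], 0
            for w in self.widths:
                row.append(line[pos:pos + w].strip())
                pos = pos + w
            return row

    # usage, mirroring the csv.reader example:
    # for row in fixedwidthreader(file("data.txt"), [10, 30, 8]):
    #     process(row)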
Thanks for your comments, and please forgive my insistence :-) Carlos Ribeiro cribeiro at mail.inet.com.br From skip at pobox.com Thu Feb 6 17:07:01 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 6 Feb 2003 10:07:01 -0600 Subject: [Csv] Re: PEP 305 - Comments (really long post) In-Reply-To: <200302061336.21724.cribeiro@mail.inet.com.br> References: <200302052252.18694.cribeiro@mail.inet.com.br> <15938.2394.88405.59280@montanaro.dyndns.org> <200302061336.21724.cribeiro@mail.inet.com.br> Message-ID: <15938.34853.575324.942183@montanaro.dyndns.org> Carlos> 1) There is one reason left to convert numbers before returning Carlos> them, and this has a lot to do with information that is Carlos> discarded in the process. Let us follow this example: Carlos> "row 1";10 --> ("row 1", "10") Carlos> The second item of the returned tuple is a string, as you stated Carlos> in your answer. The problem is that my application has no way to Carlos> know if the value was originally written in the csv file with or Carlos> without quotes; this information is lost because all values are Carlos> 'normalized' by the csv library. Carlos, You're interpreting the quote character incorrectly. Quotes are necessary only to disambiguate fields which contain the delimiter character. There is no restriction that they be used minimally, however. Your example can just as easily (and just as correctly) have been written as any of the following:

    "row 1";10
    "row 1";"10"
    row 1;10
    row 1;"10"

All have precisely the same meaning. We do have plans to implement a csvutils module. One of the things it will contain is a "sniffer" (actually, it may contain multiple sniffers to sniff out different properties of the file). One thing a sniffer might do is try to determine column types by looking at a relatively short prefix of a CSV file (20 rows or so). This may be helpful to you in situations where your application doesn't know the type information, but in general, your application should know column types better than the csv module.

Carlos> "1", "Project phase", 2000
Carlos> "1.1", "Requirement analysis", 1000
Carlos> "1.1", "Architectural design", 1000

Carlos> In this case, MS Excel will detect the first column as a Carlos> string, but will convert values in the third one to numeric Carlos> format. Perhaps, but Microsoft has the advantage of arrogance. ;-) MS is the 800-pound gorilla, and can thus assume that any CSV data which is fed to Excel must be in a format Excel understands. We don't have that luxury. We want to make sure people can read CSV data generated by many different applications, many of which are incompatible with Excel's assumptions. Carlos> There are a few solutions for this problem, none of them fully Carlos> satisfactory: ... There's the key: "none of them fully satisfactory". If there was a satisfactory solution, we'd be more open to extracting type information from the raw data. Since there isn't, we will limit this csv module to just parsing the data. Carlos> 2) In your answer, you cite the case where some numeric values Carlos> can be hex, or in some other base. Well, I don't agree with Carlos> your argument. One of Python's mottos is "to make simple Carlos> things simple". The simplest case is base 10 integers; if the Carlos> library can deal with them in a sane way, you're solving the Carlos> problems of the vast majority of the users. Special cases are Carlos> just that, special, and will be treated in a special fashion Carlos> anyway. True, the simplest case is base 10.
However, like I said above, many different applications may be the source of this data (or may want to read the CSV data we write). It's just not possible to be all things to all people. We're doing what we feel we can do better than anyone else. Carlos> 3) I'm not sure if str() is localized for floats. Using the Carlos> standard installation of PythonWin with a fully localized copy Carlos> of Windows, it still uses periods as the decimal point - not Carlos> commas. I didn't try to change the locale manually (I never did Carlos> that before for Python); I'll try and tell you what happens. That would be much appreciated. Another area we need to deal with but which we have avoided so far is Unicode. Carlos> 4) I'm not convinced that passing a binary file is a good Carlos> idea. Reading the PEP I assumed that the csvreader constructor Carlos> just takes any object that can return lines. Well, binary file Carlos> objects do not meet this definition. It would make the system Carlos> much less flexible, making it more difficult to pass arbitrary Carlos> iterables to the csv library. The reader takes an iterable object. If that object has a mode flag, we expect it to have been opened in binary mode. This stuff all works fine now. I don't anticipate changes. Carlos> For the sake of simplicity and clarity, why not leave the line Carlos> termination option out of the csv library, in such a way that it Carlos> can be implemented in the file object passed to the reader? Because we might be generating CSV files on a Linux system (LF line terminator) which is supposed to be consumed by a user on a Mac OS 8 system running ClarisWorks 4 which (being the feeble tool it was) doesn't know diddley squat about LF line terminators. Accordingly, we have to set the lineterminator to CR. We can't do that with text mode files. Nor can we assume that a person still running CW4 and Mac OS 8 will have any sort of file conversion tools available. Carlos> 5) I agree that fixed width text files are different beasts. Carlos> Anyway, it should be possible to implement it using the same Carlos> interface (or API, whatever you like calling it). Things like Carlos> that make the learning curve smoother. But we can leave this Carlos> discussion for a later time. Sure, but "same API" != "same module". ;-) Carlos> Thanks for your comments, and please forgive my insistence :-) No problem. Just don't move to New Zealand and change your name to Graham. ;-) [see the recent python-dev flamefest about a native code compiler for Python] Skip From skip at pobox.com Thu Feb 6 23:39:19 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 6 Feb 2003 16:39:19 -0600 Subject: [Csv] multi-character delimiters, take two Message-ID: <15938.58391.603139.209913@montanaro.dyndns.org> At work I've been installing Firewall-1. I finally got it installed and enabled today, protecting a single test machine. Of course, the bad guys are knocking on the door, so I have a growing logfile. "Hmmm, 'twould be nice to pop this data into Python and see what it looks like," I thought. I dumped the logfile on the firewall itself. No dice, it's just plain, undifferentiated text. Damn. So I tried exporting it through the export interface on the management client which runs on Windows. Lo and behold, there was a fairly nice looking CSV file, all fields quoted with '"', except... the delimiter is two spaces. I popped it up in XEmacs to be sure it wasn't a TAB. What are these people thinking?
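For the moment I can beat the data into shape with a throwaway hack like

    import re
    fields = re.findall('"([^"]*)"', line)

(assuming fields never contain embedded quotes), but that's exactly the sort of fragile one-off parsing this module is supposed to make unnecessary.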
So now I've encountered two examples (including my old client in Austria) of honest-to-goodness tabular data (that is, not fabricated by mad perl hackers out to trip us up with "well, what if?" games) where the delimiter between fields is more than a single character. There are probably others out there, just waiting to be discovered. Any chance the len(delimiter) == 1 restriction could be relaxed? Skip From andrewm at object-craft.com.au Thu Feb 6 23:52:59 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Fri, 07 Feb 2003 09:52:59 +1100 Subject: [Csv] multi-character delimiters, take two In-Reply-To: Message from Skip Montanaro <15938.58391.603139.209913@montanaro.dyndns.org> References: <15938.58391.603139.209913@montanaro.dyndns.org> Message-ID: <20030206225259.0E0EB3CA92@coffee.object-craft.com.au> >So I tried exporting >it through the export interface on the management client which runs on >Windows. Lo and behold, there was a fairly nice looking CSV file, all >fields quoted with '"', except... the delimiter is two spaces. I popped it >up in XEmacs to be sure it wasn't a TAB. You might find that two spaces never appear in the data fields, in which case this might work:

    fields = line.split('  ')

BTW, have you tried using the csv parser with delimiter set to space, and skipinitialspace set to true? >So now I've encountered two examples .... Any chance the len(delimiter) == 1 >restriction could be relaxed? Not without some hairy work on the state machine. The more complicated we make the state machine, the more likely we are to let a nasty bug slip through, so I'm rather reluctant. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From LogiplexSoftware at earthlink.net Thu Feb 6 23:55:18 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 06 Feb 2003 14:55:18 -0800 Subject: [Csv] multi-character delimiters, take two In-Reply-To: <15938.58391.603139.209913@montanaro.dyndns.org> References: <15938.58391.603139.209913@montanaro.dyndns.org> Message-ID: <1044572117.23236.1359.camel@software1.logiplex.internal> On Thu, 2003-02-06 at 14:39, Skip Montanaro wrote: > What are these people thinking? "Mmm, spaces..." > So now I've encountered two examples (including my old client in Austria) of > honest-to-goodness tabular data (that is, not fabricated by mad perl hackers > out to trip us up with "well, what if?" games) where the delimiter between > fields is more than a single character. There are probably others out > there, just waiting to be discovered. Any chance the len(delimiter) == 1 > restriction could be relaxed? Or, in this case, the "treat consecutive delimiters as one" might have been useful. This *is* an option (on import) in Excel. BTW, sorry I've gone missing for a while. I've been putting out fires on our customers' systems.
Some of them are still smoking, but I had time to chime in on this =) -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Fri Feb 7 00:07:02 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 6 Feb 2003 17:07:02 -0600 Subject: [Csv] multi-character delimiters, take two In-Reply-To: <20030206225259.0E0EB3CA92@coffee.object-craft.com.au> References: <15938.58391.603139.209913@montanaro.dyndns.org> <20030206225259.0E0EB3CA92@coffee.object-craft.com.au> Message-ID: <15938.60054.445005.692672@montanaro.dyndns.org> >> Lo and behold, there was a fairly nice looking CSV file, all fields >> quoted with '"', except... the delimiter is two spaces. Andrew> You might find that two spaces never appear in the data fields, Andrew> in which case this might work:

    Andrew> fields = line.split('  ')

Sure, it might, but then I'm back to hackish wing-and-a-prayer parsing. Andrew> BTW, have you tried using the csv parser with delimiter set to Andrew> space, and skipinitialspace set to true? Not yet. Good suggestion though. I will give it a try later. >> So now I've encountered two examples .... Any chance the >> len(delimiter) == 1 restriction could be relaxed? Andrew> Not without some hairy work on the state machine. The more Andrew> complicated we make the state machine, the more likely we are to Andrew> let a nasty bug slip through, so I'm rather reluctant. Point taken, and since you guys on summer vacation are the BDFLs of that code, your word is law. Still, don't be surprised to hear someone ask for it just after 2.3 is out. ;-) Skip From skip at pobox.com Fri Feb 7 00:07:44 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 6 Feb 2003 17:07:44 -0600 Subject: [Csv] multi-character delimiters, take two In-Reply-To: <1044572117.23236.1359.camel@software1.logiplex.internal> References: <15938.58391.603139.209913@montanaro.dyndns.org> <1044572117.23236.1359.camel@software1.logiplex.internal> Message-ID: <15938.60096.822480.773111@montanaro.dyndns.org> Cliff> Or, in this case, the "treat consecutive delimiters as one" might Cliff> have been useful. This *is* an option (on import) in Excel. Hmmm... another useful suggestion. Thx, Skip From skip at pobox.com Fri Feb 7 00:59:39 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 6 Feb 2003 17:59:39 -0600 Subject: [Csv] multi-character delimiters, take two In-Reply-To: <15938.60054.445005.692672@montanaro.dyndns.org> References: <15938.58391.603139.209913@montanaro.dyndns.org> <20030206225259.0E0EB3CA92@coffee.object-craft.com.au> <15938.60054.445005.692672@montanaro.dyndns.org> Message-ID: <15938.63211.957812.39094@montanaro.dyndns.org> Andrew> BTW, have you tried using the csv parser with delimiter set to Andrew> space, and skipinitialspace set to true? Skip> Not yet. Good suggestion though. I will give it a try later. Here's the result.
Inputs look like this:

    "842"  "6Feb2003"  "16:22:42"  "ce0"  "log"  "drop"  "1433"  "pD955C67D.dip.t-dialin.net"  "stonewall"  "2"  ""
    "843"  "6Feb2003"  "16:25:21"  "ce0"  "log"  "drop"  "325"  "powered.by.bgames.be"  "129.105.117.83"  ""  " th_flags 14 message_info TCP packet out of state"
    "844"  "6Feb2003"  "16:28:13"  "ce0"  "log"  "drop"  "nbname"  "200.212.86.130"  "stonewall"  "2"  ""

The dialect class was defined as:

    class spc(csv.excel):
        delimiter = ' '
        skipinitialspace = 1

The resulting output looks like:

    ['842', '', '6Feb2003', '', '16:22:42', '', 'ce0', '', 'log', '', 'drop', '', '1433', '', 'pD955C67D.dip.t-dialin.net', '', 'stonewall', '', '2', '', '', '', '']
    ['843', '', '6Feb2003', '', '16:25:21', '', 'ce0', '', 'log', '', 'drop', '', '325', '', 'powered.by.bgames.be', '', '129.105.117.83', '', '', '', ' th_flags 14 message_info TCP packet out of state', '', '']
    ['844', '', '6Feb2003', '', '16:28:13', '', 'ce0', '', 'log', '', 'drop', '', 'nbname', '', '200.212.86.130', '', 'stonewall', '', '2', '', '', '', '']

It didn't actually skip the space, but the data is fairly regular, so I can live with it. Thanks again for the suggestion. Skip From skip at pobox.com Fri Feb 7 01:17:29 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 6 Feb 2003 18:17:29 -0600 Subject: [Csv] Check this out... Message-ID: <15938.64281.99865.883746@montanaro.dyndns.org> I filed a bug report about a negative refcount problem I've been seeing recently. Neal Norwitz replied that he thinks it's a bug in the csv module. This simple example demonstrates the problem:

    % /usr/local/bin/python
    Python 2.3a1 (#2, Feb 5 2003, 20:57:52)
    [GCC 3.1 20020420 (prerelease)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import csv
    [28609 refs]
    >>> csv.reader("1", fieldnames=["f1"], restfields=["_rest"], dialect="excel")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 102, in __init__
        _OCcsv.__init__(self, dialect, **options)
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 93, in __init__
        self.parser = _csv.parser(**parser_options)
    TypeError: 'restfields' is an invalid keyword argument for this function
    [28713 refs]
    >>> csv.reader("1", fieldnames=["f1"], restfields=["_rest"], dialect="excel")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 102, in __init__
        _OCcsv.__init__(self, dialect, **options)
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 93, in __init__
        self.parser = _csv.parser(**parser_options)
    TypeError: 'restfields' is an invalid keyword argument for this function
    [28712 refs]
    >>> csv.reader("1", fieldnames=["f1"], restfields=["_rest"], dialect="excel")
    Fatal Python error: Objects/dictobject.c:373 object at 0x532648 has negative ref count -606348326
    Abort trap

Note that the total number of references decreases each time the reader is instantiated. If I fix the typo in the code ("restfields" -> "restfield"), I don't see the problem:

    >>> import csv
    [28609 refs]
    >>> rdr = csv.reader("1", fieldnames=["f1"], restfield=["_rest"], dialect="excel")
    [28633 refs]
    >>> rdr = csv.reader("1", fieldnames=["f1"], restfield=["_rest"], dialect="excel")
    [28633 refs]
    >>> rdr = csv.reader("1", fieldnames=["f1"], restfield=["_rest"], dialect="excel")
    [28633 refs]
    >>> rdr = csv.reader("1", fieldnames=["f1"], restfield=["_rest"], dialect="excel")
    [28633 refs]

It would appear there is a bug somewhere in the parameter parsing when an invalid keyword parameter is passed.
Since I didn't modify the C code to add the dict support, I don't think that's where the problem lies. In fact, it would appear that any bogus arg causes the abort:

    >>> rdr = csv.reader("1", bogus="hi bob")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 102, in __init__
        _OCcsv.__init__(self, dialect, **options)
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 93, in __init__
        self.parser = _csv.parser(**parser_options)
    TypeError: 'bogus' is an invalid keyword argument for this function
    [28728 refs]
    >>> rdr = csv.reader("1", bogus="hi bob")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 102, in __init__
        _OCcsv.__init__(self, dialect, **options)
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 93, in __init__
        self.parser = _csv.parser(**parser_options)
    TypeError: 'bogus' is an invalid keyword argument for this function
    [28727 refs]
    >>> rdr = csv.reader("1", bogus="hi bob")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 102, in __init__
        _OCcsv.__init__(self, dialect, **options)
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 93, in __init__
        self.parser = _csv.parser(**parser_options)
    TypeError: 'bogus' is an invalid keyword argument for this function
    [28726 refs]
    >>> rdr = csv.reader("1", bogus="hi bob")
    Fatal Python error: Objects/dictobject.c:373 object at 0x532648 has negative ref count -3
    Abort trap

I can provoke this with both debug and non-debug builds of Python CVS as well as Python 2.2 (non-debug). I'll try to take a look at the PyArg_PTAK code. I suspect it's in that region. Skip From skip at pobox.com Fri Feb 7 01:19:02 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 6 Feb 2003 18:19:02 -0600 Subject: [Csv] bye bye 2.1? Message-ID: <15938.64374.523505.231835@montanaro.dyndns.org> Dave & Andrew, I notice I can't build w/ Python 2.1 (I thought I was able to early on):

    % python2.1 setup.py install
    running install
    running build
    running build_py
    creating build/lib.darwin-6.3-Power Macintosh-2.1
    copying csv.py -> build/lib.darwin-6.3-Power Macintosh-2.1
    running build_ext
    building '_csv' extension
    creating build/temp.darwin-6.3-Power Macintosh-2.1
    gcc -g -O2 -Wall -Wstrict-prototypes -no-cpp-precomp -I/Users/skip/local/include/python2.1 -c _csv.c -o build/temp.darwin-6.3-Power Macintosh-2.1/_csv.o
    _csv.c:654: `METH_NOARGS' undeclared here (not in a function)
    _csv.c:654: initializer element is not constant
    _csv.c:654: (near initialization for `Parser_methods[1].ml_flags')
    _csv.c:655: initializer element is not constant
    _csv.c:655: (near initialization for `Parser_methods[1]')
    _csv.c:656: `METH_O' undeclared here (not in a function)
    _csv.c:656: initializer element is not constant
    _csv.c:656: (near initialization for `Parser_methods[2].ml_flags')
    _csv.c:657: initializer element is not constant
    _csv.c:657: (near initialization for `Parser_methods[2]')
    _csv.c:658: initializer element is not constant
    _csv.c:658: (near initialization for `Parser_methods[3]')
    _csv.c: In function `init_csv':
    _csv.c:950: warning: implicit declaration of function `PyType_Ready'
    error: command 'gcc' failed with exit status 1

I don't know how far back you want this code supported. It's up to you guys.
Skip

From andrewm at object-craft.com.au Fri Feb 7 01:32:36 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 07 Feb 2003 11:32:36 +1100
Subject: [Csv] multi-character delimiters, take two
In-Reply-To: Message from Skip Montanaro <15938.63211.957812.39094@montanaro.dyndns.org>
References: <15938.58391.603139.209913@montanaro.dyndns.org> <20030206225259.0E0EB3CA92@coffee.object-craft.com.au> <15938.60054.445005.692672@montanaro.dyndns.org> <15938.63211.957812.39094@montanaro.dyndns.org>
Message-ID: <20030207003236.5D2DE3CA92@coffee.object-craft.com.au>

>Here's the result.  Inputs look like this:
>
>    "842" "6Feb2003" "16:22:42" "ce0" "log" "drop" "1433" "pD955C67D.dip.t-dialin.net" "stonewall" "2" ""
>    "843" "6Feb2003" "16:25:21" "ce0" "log" "drop" "325" "powered.by.bgames.be" "129.105.117.83" "" " th_flags 14 message_info TCP packet out of state"
>    "844" "6Feb2003" "16:28:13" "ce0" "log" "drop" "nbname" "200.212.86.130" "stonewall" "2" ""

Everything is quoted?  Then this will work like a charm:

    line[1:-1].split('" "')

>It didn't actually skip the space, but the data is fairly regular, so I can
>live with it.

Okay - looks like the skipinitialspace stuff needs more testing - I doubt Dave coded it with delimiter=' ' in mind - it's a pretty pathological case... 8-)

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Fri Feb 7 01:34:27 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 07 Feb 2003 11:34:27 +1100
Subject: [Csv] Check this out...
In-Reply-To: Message from Skip Montanaro <15938.64281.99865.883746@montanaro.dyndns.org>
References: <15938.64281.99865.883746@montanaro.dyndns.org>
Message-ID: <20030207003427.E8A373CA92@coffee.object-craft.com.au>

>I can provoke this with both debug and non-debug builds of Python CVS as
>well as Python 2.2 (non-debug).  I'll try to take a look at the PyArg_PTAK
>code.  I suspect it's in that region.

Don't bother - this code has been completely re-written.  I've been watching the refcounts carefully as I wrote the code, and the new code seems to be doing the right thing.

Sorry to waste your time tracking this one... 8-(

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Fri Feb 7 01:40:17 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 07 Feb 2003 11:40:17 +1100
Subject: [Csv] bye bye 2.1?
In-Reply-To: Message from Skip Montanaro <15938.64374.523505.231835@montanaro.dyndns.org>
References: <15938.64374.523505.231835@montanaro.dyndns.org>
Message-ID: <20030207004017.23F0C3CA92@coffee.object-craft.com.au>

> _csv.c:654: `METH_NOARGS' undeclared here (not in a function)
> _csv.c:656: `METH_O' undeclared here (not in a function)

These two are relatively easy to fix (and the others might simply be side-effects of these errors).  I'll have to have a think about whether we should bother.  Python-2.2 is a decent goal - I haven't even tested against it yet.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com Fri Feb 7 05:11:05 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 6 Feb 2003 22:11:05 -0600
Subject: [Csv] Passing along a comment from Tim Peters
Message-ID: <15939.12761.653479.207742@montanaro.dyndns.org>

Andrew, et al,

Here's a comment from Tim Peters regarding the negative ref count problem I reported.
    Comment By: Tim Peters (tim_one)
    Date: 2003-02-06 20:23

    Message:
    Logged In: YES
    user_id=31435

    I think csv_parser is too clever.  If the PyArg_ParseTuple
    call fails, it may have already stored a borrowed reference
    into self->lineterminator, and then it's madness to decref
    that in Parser_dealloc().

    "The usual way" to allocate a new object is not to
    materialize self until *after* PyArg_ParseTuple succeeds.
    Then nothing delicate needs to be done to clean up, since
    nothing was done at all yet.

    Good evidence: adding the pure hack

        Py_XINCREF(self->lineterminator);

    before

        Py_DECREF(self);

    stops the negative refcount errors in Neal's example.

While the current module doesn't seem to exhibit the bug, Tim's advice might still be useful.  The full bug report is at http://python.org/sf/681902

Skip

From andrewm at object-craft.com.au Fri Feb 7 07:05:22 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 07 Feb 2003 17:05:22 +1100
Subject: [Csv] Update PEP?
Message-ID: <20030207060522.706163CA93@coffee.object-craft.com.au>

I think the PEP needs updating - the API hasn't changed too much, but there are a few warts in there.  The docstring for the C module is probably the most accurate reference at the moment.  Can someone give me a hand and go over the PEP?

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com Fri Feb 7 16:16:30 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 7 Feb 2003 09:16:30 -0600
Subject: [Csv] Update PEP?
In-Reply-To: <20030207060522.706163CA93@coffee.object-craft.com.au>
References: <20030207060522.706163CA93@coffee.object-craft.com.au>
Message-ID: <15939.52686.652879.386851@montanaro.dyndns.org>

Andrew> I think the PEP needs updating - the API hasn't changed too
Andrew> much, but there are a few warts in there.  The docstring for the C
Andrew> module is probably the most accurate reference at the
Andrew> moment.  Can someone give me a hand and go over the PEP?

Sure, I'll try to update it today.  Also, note that libcsv.tex is supposed to be a section for the library reference manual.  That probably needs significant attention at this point as well.

Skip

From skip at pobox.com Fri Feb 7 17:43:24 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 7 Feb 2003 10:43:24 -0600
Subject: [Csv] why push so much code into C?
Message-ID: <15939.57900.514900.606475@montanaro.dyndns.org>

I see that csv.py has dwindled down to next-to-nothing.  Even the dialect registry stuff is in C.  (Is the Dialect class in csv.py used anymore?  I see something which looks like a dialect object in the C code.)  It's not obvious to me that there's any performance gain to be had by having anything other than the raw parsing and writing code in the C module.  On the other hand, by pushing code which isn't performance-critical into C it becomes harder to maintain and extend, and significantly limits the number of people who can contribute to the code's growth and maturity.

Skip

From andrewm at object-craft.com.au Sat Feb 8 04:42:29 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Sat, 08 Feb 2003 14:42:29 +1100
Subject: [Csv] why push so much code into C?
In-Reply-To: Message from Skip Montanaro <15939.57900.514900.606475@montanaro.dyndns.org>
References: <15939.57900.514900.606475@montanaro.dyndns.org>
Message-ID: <20030208034229.DD8623CA92@coffee.object-craft.com.au>

>I see that csv.py has dwindled down to next-to-nothing.  Even the dialect
>registry stuff is in C.  (Is the Dialect class in csv.py used anymore?  I
>see something which looks like a dialect object in the C code.)  It's not
>obvious to me that there's any performance gain to be had by having anything
>other than the raw parsing and writing code in the C module.  On the other
>hand, by pushing code which isn't performance-critical into C it becomes
>harder to maintain and extend, and significantly limits the number of people
>who can contribute to the code's growth and maturity.

The dialect registry went into C so that the reader and writer had access to it.  The code involved was trivial, so it made sense to move it.  The underlying modules accept any instance or class that has appropriate attributes as a dialect - they don't compare against Dialect.  But the dialect definitions in Python are still used.

The code that has moved into C is relatively straightforward - the really hairy stuff is the parser and the generator (as it has always been).  Limiting the number of people who can modify the *interface* is not a bad thing... 8-)

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com Sat Feb 8 06:06:17 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 7 Feb 2003 23:06:17 -0600
Subject: [Csv] unicode read test checked in
Message-ID: <15940.36937.471963.983008@montanaro.dyndns.org>

I checked in a separate unicode test (test/unicode_test.csv).  It causes a bus error on my machine, so I figured it was best to keep it separate for now.

Skip

From skip at pobox.com Sat Feb 8 06:35:56 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 7 Feb 2003 23:35:56 -0600
Subject: [Csv] This surprised me
Message-ID: <15940.38716.154285.557948@montanaro.dyndns.org>

This code surprised me:

    >>> class foo: pass
    ...
    >>> csv.register_dialect("excel", foo)
    >>> csv.get_dialect("excel")
    <__main__.foo instance at 0x5309f8>
    >>> import StringIO
    >>> rdr = csv.reader(StringIO.StringIO("1,2,3\r\n"))
    >>> list(rdr)
    [['1', '2', '3']]
    >>> rdr = csv.reader(StringIO.StringIO("1,2,3\r\n"), dialect="excel")
    >>> list(rdr)
    [['1', '2', '3']]
    >>> rdr = csv.reader(StringIO.StringIO("1,2,3\r\n"), dialect=foo)
    >>> list(rdr)
    [['1', '2', '3']]
    >>> rdr = csv.reader(StringIO.StringIO("1,2,3\r\n"), dialect=foo)
    Traceback (most recent call last):
      File "", line 1, in ?
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 27, in __init__
        raise Error, "Dialect did not validate: %s" % ", ".join(errors)
    _csv.Error: Dialect did not validate: delimiter not set, quotechar not set, lineterminator not set, doublequote setting must be True or False, skipinitialspace setting must be True or False

Why didn't it complain anywhere that 'foo' was worthless as a dialect until the last statement?

Skip

From andrewm at object-craft.com.au Sat Feb 8 14:08:18 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Sun, 09 Feb 2003 00:08:18 +1100
Subject: [Csv] This surprised me
In-Reply-To: Message from Skip Montanaro <15940.38716.154285.557948@montanaro.dyndns.org>
References: <15940.38716.154285.557948@montanaro.dyndns.org>
Message-ID: <20030208130818.B61E53CA92@coffee.object-craft.com.au>

>This code surprised me:
>
>    >>> class foo: pass
[...]
>    >>> rdr = csv.reader(StringIO.StringIO("1,2,3\r\n"), dialect=foo)
>    Traceback (most recent call last):
>      File "", line 1, in ?
> File "/usr/local/lib/python2.3/site-packages/csv.py", line 27, in __init__ > raise Error, "Dialect did not validate: %s" % ", ".join(errors) > _csv.Error: Dialect did not validate: delimiter not set, quotechar not set, lineterminator not set, doublequote setting must be True or False, skipinitialspace setting must be True or False > >Why didn't it complain anywhere that 'foo' was worthless as a dialect until >the last statement? Surely there's more to your example than you quoted in this e-mail? The exception you mention came from the python code, not the C module (specifically the Dialect class), but I can't see where it referenced in the quoted code? The C code will instanciate (and thus call Dialect's _validate) when register_dialect is called, or when the class is passed to reader or writer. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From skip at pobox.com Sat Feb 8 16:14:55 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat, 8 Feb 2003 09:14:55 -0600 Subject: [Csv] This surprised me In-Reply-To: <20030208130818.B61E53CA92@coffee.object-craft.com.au> References: <15940.38716.154285.557948@montanaro.dyndns.org> <20030208130818.B61E53CA92@coffee.object-craft.com.au> Message-ID: <15941.7919.794691.236800@montanaro.dyndns.org> >> This code surprised me: ... Andrew> Surely there's more to your example than you quoted in this Andrew> e-mail? The exception you mention came from the python code, not Andrew> the C module (specifically the Dialect class), but I can't see Andrew> where it referenced in the quoted code? Nope, nothing more. I guess the point I was trying to make is that if I pass a dialect object which is not subclassed from csv.Dialect (as you suggested I should be able to do), it seems to be silently accepted. Andrew> The C code will instanciate (and thus call Dialect's _validate) Andrew> when register_dialect is called, or when the class is passed to Andrew> reader or writer. Correct. But you indicated that was no longer necessary. I was wondering where the error checking went to. Skip From skip at pobox.com Sat Feb 8 16:25:16 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat, 8 Feb 2003 09:25:16 -0600 Subject: [Csv] This surprised me Message-ID: <15941.8540.607571.202309@montanaro.dyndns.org> >> This code surprised me: ... Andrew> Surely there's more to your example than you quoted in this Andrew> e-mail? The exception you mention came from the python code, not Andrew> the C module (specifically the Dialect class), but I can't see Andrew> where it referenced in the quoted code? Nope, nothing more. I guess the point I was trying to make is that if I pass a dialect object which is not subclassed from csv.Dialect (as you suggested I should be able to do), it seems to be silently accepted. Andrew> The C code will instanciate (and thus call Dialect's _validate) Andrew> when register_dialect is called, or when the class is passed to Andrew> reader or writer. Correct. But you indicated that was no longer necessary. I was wondering where the error checking went to. Skip From skip at pobox.com Sat Feb 8 19:41:00 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat, 8 Feb 2003 12:41:00 -0600 Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv/test unicode_test.py,NONE,1.1 (fwd) Message-ID: <15941.20284.334750.638247@montanaro.dyndns.org> archive -------------- next part -------------- An embedded message was scrubbed... From: "M.-A. 
Lemburg" Subject: Re: [Python-checkins] python/nondist/sandbox/csv/test unicode_test.py,NONE,1.1 Date: Sat, 08 Feb 2003 18:24:22 +0100 Size: 5008 Url: http://mail.python.org/pipermail/csv/attachments/20030208/2eb0c32f/attachment.mht From skip at pobox.com Sat Feb 8 19:48:17 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat, 8 Feb 2003 12:48:17 -0600 Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv/test unicode_test.py,NONE,1.1 In-Reply-To: <3E453D46.7080307@lemburg.com> References: <15941.8582.901618.823053@montanaro.dyndns.org> <3E453D46.7080307@lemburg.com> Message-ID: <15941.20721.268117.216891@montanaro.dyndns.org> (redirecting to the csv mailing list so this stuff gets archived.) >> http://mail.python.org/pipermail/python-list/2003-February/145151.html mal> Why not convert the input data to UTF-8 and take it from there ? Good suggestion, thanks. The only issue is the variable width nature of utf-8. I think if we are going to convert to a concrete encoding it would be easier to convert to something which has constant-width characters wouldn't it? Of course, if I can convince the guys in Australia writing the actual code to deal with a variable-width encoding, it can't be far from there to allowing multi-character delimiters. ;-) mal> Are you sure that Unicode objects will be lower in processing ? Operating on Python string or unicode objects without converting them to some sort of C string will almost certainly be slower than the current code which is a relatively modest finite state machine operating on individual bytes. mal> (Is there a standard for encodings in CSV files ?) No, there is none, hence the use of codecs.EncodedFile to allow the programmer to specify the encoding. Excel can export to two formats it calls "Unicode CSV" and "Unicode Text". Exporting a spreadsheet containing nothing but ASCII as Unicode CSV produced exactly the same comma-separated file as would have been dumped using the usual CSV export format. Exporting the same spreadsheet as Unicode Text produced a tab-separated file which I guessed to be utf-16. It started with a little-endian utf-16 BOM and all the characters were two bytes wide with one byte being an ASCII NUL. Thanks for the feedback, Skip From skip at pobox.com Sat Feb 8 21:13:01 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat, 8 Feb 2003 14:13:01 -0600 Subject: [Csv] confused about wrapping readers and writers Message-ID: <15941.25805.234008.663342@montanaro.dyndns.org> I still want to be able to read from and write to dictionaries. ;-) I would like to add a pair of classes to csv.py which implement this, but I don't quite know what's required, never having written any iterators before. If I create a reader: >>> rdr = csv.reader(["a,b,c\r\n"]) and ask for its attributes, all I get back are the data attributes: >>> dir(rdr) ['delimiter', 'doublequote', 'escapechar', 'lineterminator', 'quotechar', 'quoting', 'skipinitialspace', 'strict'] Does the underlying reader object need to expose its Reader_iternext function as a next() method? Based upon http://www.python.org/doc/current/lib/typeiter.html I sort of suspect it does. It looks like it also needs an __iter__() method which just returns self. 
I thought a DictReader would look something like

    class DictReader:
        def __init__(self, f, fieldnames, rest=None, dialect="excel", *args):
            self.fieldnames = fieldnames    # list of keys for the dict
            self.rest = rest                # key to catch long rows
            self.reader = reader(f, dialect, *args)

        def next(self):
            row = self.reader.next()
            d = dict(zip(self.fieldnames, row))
            if len(self.fieldnames) < len(row):
                d[self.rest] = row[len(self.fieldnames):]
            return d

Is all that's missing a next() method for reader objects?

Thx,

Skip

From skip at pobox.com Sat Feb 8 21:38:17 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 8 Feb 2003 14:38:17 -0600
Subject: [Csv] Re: How best to handle Unicode where only 8-bit chars are now?
In-Reply-To: 
References: <15940.37847.263991.794301@montanaro.dyndns.org>
Message-ID: <15941.27321.855319.916966@montanaro.dyndns.org>

>> Option 3 seems the cleanest, but would slow everything down
>> significantly because character extraction and comparison would
>> require a function call instead of an array index operation or a
>> simple comparison.

Fredrik> what makes you think 8-bit == fast and unicode == slow?

Nothing, just unfamiliarity.  That's why I was asking.

Fredrik> have you looked at SRE? it compiles portions of itself twice,
Fredrik> to get 8-bit and unicode versions of the core engine.  on
Fredrik> modern machines, the unicode version often runs *faster* than
Fredrik> the corresponding 8-bit code.

I'll refer the csv authors to this.

Thx,

Skip

From skip at pobox.com Sat Feb 8 21:38:25 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 8 Feb 2003 14:38:25 -0600
Subject: [Csv] Re: How best to handle Unicode where only 8-bit chars are now? (fwd)
Message-ID: <15941.27329.404107.28526@montanaro.dyndns.org>

archive

-------------- next part --------------
An embedded message was scrubbed...
From: "Fredrik Lundh"
Subject: Re: How best to handle Unicode where only 8-bit chars are now?
Date: Sat, 8 Feb 2003 15:54:41 +0100
Size: 6728
Url: http://mail.python.org/pipermail/csv/attachments/20030208/eb9895ea/attachment.mht

From djc at object-craft.com.au Sun Feb 9 01:56:46 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 09 Feb 2003 11:56:46 +1100
Subject: [Csv] multi-character delimiters, take two
In-Reply-To: <20030207003236.5D2DE3CA92@coffee.object-craft.com.au>
References: <15938.58391.603139.209913@montanaro.dyndns.org> <20030206225259.0E0EB3CA92@coffee.object-craft.com.au> <15938.60054.445005.692672@montanaro.dyndns.org> <15938.63211.957812.39094@montanaro.dyndns.org> <20030207003236.5D2DE3CA92@coffee.object-craft.com.au>
Message-ID: 

>>>>> "Andrew" == Andrew McNamara writes:

>> Here's the result.  Inputs look like this:
>>
>> "842" "6Feb2003" "16:22:42" "ce0" "log" "drop" "1433"
>> "pD955C67D.dip.t-dialin.net" "stonewall" "2" "" "843" "6Feb2003"
>> "16:25:21" "ce0" "log" "drop" "325" "powered.by.bgames.be"
>> "129.105.117.83" "" " th_flags 14 message_info TCP packet out of
>> state" "844" "6Feb2003" "16:28:13" "ce0" "log" "drop" "nbname"
>> "200.212.86.130" "stonewall" "2" ""

Andrew> Everything is quoted?  Then this will work like a charm:

Andrew>     line[1:-1].split('" "')

>> It didn't actually skip the space, but the data is fairly regular,
>> so I can live with it.

Andrew> Okay - looks like the skipinitialspace stuff needs more
Andrew> testing - I doubt Dave coded it with delimiter=' ' in mind -
Andrew> it's a pretty pathological case... 8-)
It might be as simple as swapping the following tests:

    case START_FIELD:
        :
        :
        else if (c == self->dialect.delimiter) {
            /* save empty field */
            parse_save_field(self);
        }
        else if (c == ' ' && self->dialect.skipinitialspace)
            /* ignore space at start of field */
            ;

The state machine for handling multi-character delimiters is not necessarily much more complicated.  Instead of switching to new state on the basis of a single character, the state machine would have to introduce transitional states which iterate over the multi-character delimiter before going to the destination state.  There would have to be some very basic backtracking which allowed the parser state machine to indicate a false match of delimiter in the transitional state.  This would rewind the input stream (careful about infinite loops).

Looking at the state machine for code which reacts to the delimiter, we would need the following transitional states:

    DELIMITER_START_FIELD
    DELIMITER_ESCAPED_CHAR
    DELIMITER_IN_FIELD
    DELIMITER_ESCAPE_IN_QUOTED_FIELD
    DELIMITER_QUOTE_IN_QUOTED_FIELD

Mind you all of this code falls over once you decide to allow multiple characters in the quotechar as well.  What happens when delimiter = 'DD' and quotechar = 'DQ' (where D and Q are some arbitrary character)?  You start building a partial regex engine.

- Dave

-- 
http://www.object-craft.com.au

From mal at lemburg.com Sun Feb 9 12:38:03 2003
From: mal at lemburg.com (M.-A. Lemburg)
Date: Sun, 09 Feb 2003 12:38:03 +0100
Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv/test unicode_test.py,NONE,1.1
In-Reply-To: <15941.20721.268117.216891@montanaro.dyndns.org>
References: <15941.8582.901618.823053@montanaro.dyndns.org> <3E453D46.7080307@lemburg.com> <15941.20721.268117.216891@montanaro.dyndns.org>
Message-ID: <3E463D9B.40007@lemburg.com>

Skip Montanaro wrote:
> (redirecting to the csv mailing list so this stuff gets archived.)
>
> >> http://mail.python.org/pipermail/python-list/2003-February/145151.html
>
> mal> Why not convert the input data to UTF-8 and take it from there ?
>
> Good suggestion, thanks.  The only issue is the variable width nature of
> utf-8.  I think if we are going to convert to a concrete encoding it would
> be easier to convert to something which has constant-width characters,
> wouldn't it?  Of course, if I can convince the guys in Australia writing the
> actual code to deal with a variable-width encoding, it can't be far from
> there to allowing multi-character delimiters. ;-)

We chose UTF-8 in the Python tokenizer/compiler to turn a previously byte-based program part into a Unicode-capable one.  Many other tools have used the same approach.  Variable length encodings have problems with slicing and indexing, but unless you need these, I don't see much of a problem.

> mal> Are you sure that Unicode objects will be lower in processing ?
>
> Operating on Python string or unicode objects without converting them to
> some sort of C string will almost certainly be slower than the current code,
> which is a relatively modest finite state machine operating on individual
> bytes.

You could use a hybrid approach similar to sre or mxTextTools for dealing with both base types (char vs. Py_UNICODE).

> mal> (Is there a standard for encodings in CSV files ?)
>
> No, there is none, hence the use of codecs.EncodedFile to allow the
> programmer to specify the encoding.  Excel can export to two formats it
> calls "Unicode CSV" and "Unicode Text".  Exporting a spreadsheet containing
Exporting a spreadsheet containing > nothing but ASCII as Unicode CSV produced exactly the same comma-separated > file as would have been dumped using the usual CSV export format. Exporting > the same spreadsheet as Unicode Text produced a tab-separated file which I > guessed to be utf-16. It started with a little-endian utf-16 BOM and all > the characters were two bytes wide with one byte being an ASCII NUL. The BOM mark is what MS uses to indicate Unicode in text files. It's a rather practical approach to the problem, but it works :-) Perhaps you could add some magic to detect these BOM marks and then default to UTF-16 input ?! There's also the possiblity to use UTF-8 BOMs, BTW. See codecs.py for a list of possible BOM marks. > Thanks for the feedback, You're welcome, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From skip at pobox.com Sun Feb 9 17:47:05 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 9 Feb 2003 10:47:05 -0600 Subject: [Csv] multi-character delimiters, take two In-Reply-To: References: <15938.58391.603139.209913@montanaro.dyndns.org> <20030206225259.0E0EB3CA92@coffee.object-craft.com.au> <15938.60054.445005.692672@montanaro.dyndns.org> <15938.63211.957812.39094@montanaro.dyndns.org> <20030207003236.5D2DE3CA92@coffee.object-craft.com.au> Message-ID: <15942.34313.543359.488243@montanaro.dyndns.org> Dave> Mind you all of this code falls over once you decide to allow Dave> multiple characters in the quotechar as well. What happens when Dave> delimiter = 'DD' and quotechar = 'DQ' (where D and Q are some Dave> arbitrary character)? You start building a partial regex engine. Would it work to simply use regular expressions to recognize delimiters and quotes? (I'll let you do the math. ;-) Skip From andrewm at object-craft.com.au Mon Feb 10 00:09:05 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Mon, 10 Feb 2003 10:09:05 +1100 Subject: [Csv] confused about wrapping readers and writers In-Reply-To: Message from Skip Montanaro <15941.25805.234008.663342@montanaro.dyndns.org> References: <15941.25805.234008.663342@montanaro.dyndns.org> Message-ID: <20030209230905.816893CA92@coffee.object-craft.com.au> >I still want to be able to read from and write to dictionaries. ;-) I would >like to add a pair of classes to csv.py which implement this, but I don't >quite know what's required, never having written any iterators before. An object that supports iteration needs an __iter__() method. When called, this method returns an object that supports iteration (in other words, has a next() method). __iter__() can return self (in which case, self needs a next() method). >If I create a reader: > > >>> rdr = csv.reader(["a,b,c\r\n"]) > >and ask for its attributes, all I get back are the data attributes: > > >>> dir(rdr) > ['delimiter', 'doublequote', 'escapechar', 'lineterminator', > 'quotechar', 'quoting', 'skipinitialspace', 'strict'] For reasons that I haven't looked into, dir() is not finding methods on the objects we're creating - I suspect this is a hang-over from the type/class unification (i.e., we need to exercise an extended API to get our methods exposted). >Does the underlying reader object need to expose its Reader_iternext >function as a next() method? 
>
>    http://www.python.org/doc/current/lib/typeiter.html
>
>I sort of suspect it does.  It looks like it also needs an __iter__() method
>which just returns self.

Hmmm - the C parts of Python are obviously finding them:

    >>> import csv
    >>> r=csv.reader([])
    >>> iter(r)
    <_csv.reader object at 0x40188810>

>Is all that's missing a next() method for reader objects?

I suspect so... will let you know.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Mon Feb 10 09:47:24 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 10 Feb 2003 19:47:24 +1100
Subject: [Csv] confused about wrapping readers and writers
In-Reply-To: Message from Andrew McNamara <20030209230905.816893CA92@coffee.object-craft.com.au>
References: <15941.25805.234008.663342@montanaro.dyndns.org> <20030209230905.816893CA92@coffee.object-craft.com.au>
Message-ID: <20030210084724.8D6063CA89@coffee.object-craft.com.au>

>>    >>> dir(rdr)
>>    ['delimiter', 'doublequote', 'escapechar', 'lineterminator',
>>     'quotechar', 'quoting', 'skipinitialspace', 'strict']
>
>For reasons that I haven't looked into, dir() is not finding methods
>on the objects we're creating - I suspect this is a hang-over from the
>type/class unification (i.e., we need to exercise an extended API to
>get our methods exposed).

We were using "old-style" getattr/setattr - a day of pawing over python internals showed how to use the new interfaces, so __iter__ and next are now exposed the way they should be, and dir(...) lists all methods.

I also changed the Dialect structure into a fully fledged python type - this was something I'd been considering for a while, but had assumed there would be too much of a performance impact.  Turns out there wasn't, and it's made the code cleaner.

Note that the reader and writer objects no longer have attributes corresponding to the individual settings - instead, they have a "dialect" attribute, which contains the settings.  It would be a relatively trivial matter to proxy getattr/setattr requests from reader/writer to the dialect instance - I can do this if people think it's worthwhile.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From djc at object-craft.com.au Mon Feb 10 10:44:28 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 10 Feb 2003 20:44:28 +1100
Subject: [Csv] confused about wrapping readers and writers
In-Reply-To: <20030210084724.8D6063CA89@coffee.object-craft.com.au>
References: <15941.25805.234008.663342@montanaro.dyndns.org> <20030209230905.816893CA92@coffee.object-craft.com.au> <20030210084724.8D6063CA89@coffee.object-craft.com.au>
Message-ID: 

>>>>> "Andrew" == Andrew McNamara writes:

>>> >>> dir(rdr) ['delimiter', 'doublequote', 'escapechar',
>>> 'lineterminator', 'quotechar', 'quoting', 'skipinitialspace',
>>> 'strict']

>> For reasons that I haven't looked into, dir() is not finding
>> methods on the objects we're creating - I suspect this is a
>> hang-over from the type/class unification (i.e., we need to
>> exercise an extended API to get our methods exposed).

Andrew> We were using "old-style" getattr/setattr - a day of pawing
Andrew> over python internals showed how to use the new interfaces, so
Andrew> __iter__ and next are now exposed the way they should be, and
Andrew> dir(...) lists all methods.
Andrew> I also changed the Dialect structure into a fully fledged
Andrew> python type - this was something I'd been considering for a
Andrew> while, but had assumed there would be too much of a
Andrew> performance impact.  Turns out there wasn't, and it's made the
Andrew> code cleaner.

Andrew> Note that the reader and writer objects no longer have
Andrew> attributes corresponding to the individual settings - instead,
Andrew> they have a "dialect" attribute, which contains the
Andrew> settings.  It would be a relatively trivial matter to proxy
Andrew> getattr/setattr requests from reader/writer to the dialect
Andrew> instance - I can do this if people think it's worthwhile.

I think it is well nigh time to let this code loose on the Python community.  The only possible addition now would be some kind of mechanism whereby something like the db_row could be linked in with the module.

    http://opensource.theopalgroup.com/

Mind you the application might be the best place to do this kind of linkage.

- Dave

-- 
http://www.object-craft.com.au

From andrewm at object-craft.com.au Mon Feb 10 11:26:33 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 10 Feb 2003 21:26:33 +1100
Subject: [Csv] confused about wrapping readers and writers
In-Reply-To: Message from Dave Cole 
References: <15941.25805.234008.663342@montanaro.dyndns.org> <20030209230905.816893CA92@coffee.object-craft.com.au> <20030210084724.8D6063CA89@coffee.object-craft.com.au>
Message-ID: <20030210102633.DDCD03CA89@coffee.object-craft.com.au>

>I think it is well nigh time to let this code loose on the Python
>community.

It now works with Python 2.2, which certainly makes this more feasible.  Supporting versions of Python prior to 2.2 is problematic - the type model is very different, and they don't have iterators (which the C code uses in some key locations).

>The only possible addition now would be some kind of
>mechanism whereby something like the db_row could be linked in with
>the module.
>
>    http://opensource.theopalgroup.com/
>
>Mind you the application might be the best place to do this kind of
>linkage.

Maybe Skip's dictionary stuff would get us closer?

We haven't made any impression on the csv.utils sub-module yet - things like the sniffer.  We want to watch we don't miss the 2.3 boat - what's the next step?

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com Mon Feb 10 15:38:32 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 10 Feb 2003 08:38:32 -0600
Subject: [Csv] confused about wrapping readers and writers
In-Reply-To: <20030210102633.DDCD03CA89@coffee.object-craft.com.au>
References: <15941.25805.234008.663342@montanaro.dyndns.org> <20030209230905.816893CA92@coffee.object-craft.com.au> <20030210084724.8D6063CA89@coffee.object-craft.com.au> <20030210102633.DDCD03CA89@coffee.object-craft.com.au>
Message-ID: <15943.47464.805209.596214@montanaro.dyndns.org>

>> The only possible addition now would be some kind of mechanism
>> whereby something like the db_row could be linked in with the module.
>>
>>    http://opensource.theopalgroup.com/
>>
>> Mind you the application might be the best place to do this kind of
>> linkage.

Andrew> Maybe Skip's dictionary stuff would get us closer?

Maybe, but there are enough object-relational mappers out there (I gather that's sort of what db_row is) that we can't possibly make everyone happy.  I say we punt.  I haven't cvs up'd yet this morning.  Hopefully my DictReader and DictWriter classes still work. ;-)
Andrew> We haven't made any impression on the csv.utils sub-module yet -
Andrew> things like the sniffer.  We want to watch we don't miss the 2.3
Andrew> boat - what's the next step?

That's Cliff's expertise, and judging from his recent silence, I suspect he's still pretty busy with other things.  Cliff, assuming the rest of the code is pretty much set, how are you fixed for time to work on a sniffer?  Should we propose that what's there now be incorporated into 2.3 and then aim for a separate csv.utils module between 2.3 and 2.4 (to be added in 2.4)?

Skip

From skip at pobox.com Mon Feb 10 16:41:13 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 10 Feb 2003 09:41:13 -0600
Subject: [Csv] DictReader/DictWriter behavior question
Message-ID: <15943.51225.611672.963525@montanaro.dyndns.org>

I have DictReader set up to handle short or long rows in a reasonable fashion.  If the number of fields in the input row is more than the length of the "fieldnames" list passed to the constructor, an extra field, keyed by the optional "rest" argument, gathers the remaining data.  For example:

    >>> rdr = csv.DictReader(["a,b,c,d,e\r\n"], fieldnames="1 2 3".split())
    >>> rdr.next()
    {'1': 'a', None: ['d', 'e'], '3': 'c', '2': 'b'}
    >>> rdr = csv.DictReader(["a,b,c,d,e\r\n"], fieldnames="1 2 3".split(), rest="foo")
    >>> rdr.next()
    {'1': 'a', '3': 'c', '2': 'b', 'foo': ['d', 'e']}

Similarly, if the row is short:

    >>> rdr = csv.DictReader(["a,b,c\r\n"], fieldnames="1 2 3 4 5 6".split(), restval="dflt")
    >>> rdr.next()
    {'1': 'a', '3': 'c', '2': 'b', '5': 'dflt', '4': 'dflt', '6': 'dflt'}

(I'm about to change the "rest" parameter to "restkey".)

My problem is the DictWriter.  It uses a similar mechanism to map dicts to output rows:

    >>> f = StringIO.StringIO()
    >>> wrtr = csv.DictWriter(f, fieldnames="1 2 3".split())
    >>> wrtr.writerow({"1":30,"2":20,"3":10})
    >>> f.getvalue()
    '30,20,10\r\n'

When writing though, I face the dilemma of what to do if the dictionary being written has one or more keys which don't appear in the fieldnames list.  I can silently ignore them (that's the current behavior), I can raise an exception, or I can give the user control.  There's no way to actually write that data because you have no obvious way to order those values.  (I could do something hokey like write out the key and the value somehow.)  What do you think is the best behavior, ignore values or raise an exception?  Or do you have other ideas?

Skip

From LogiplexSoftware at earthlink.net Mon Feb 10 19:19:19 2003
From: LogiplexSoftware at earthlink.net (Cliff Wells)
Date: 10 Feb 2003 10:19:19 -0800
Subject: [Csv] confused about wrapping readers and writers
In-Reply-To: <15943.47464.805209.596214@montanaro.dyndns.org>
References: <15941.25805.234008.663342@montanaro.dyndns.org> <20030209230905.816893CA92@coffee.object-craft.com.au> <20030210084724.8D6063CA89@coffee.object-craft.com.au> <20030210102633.DDCD03CA89@coffee.object-craft.com.au> <15943.47464.805209.596214@montanaro.dyndns.org>
Message-ID: <1044901159.1376.74.camel@software1.logiplex.internal>

On Mon, 2003-02-10 at 06:38, Skip Montanaro wrote:
> >> The only possible addition now would be some kind of mechanism
> >> whereby something like the db_row could be linked in with the module.
> >>
> >>    http://opensource.theopalgroup.com/
> >>
> >> Mind you the application might be the best place to do this kind of
> >> linkage.
>
> Andrew> Maybe Skip's dictionary stuff would get us closer?
>
> Maybe, but there are enough object-relational mappers out there (I gather
> that's sort of what db_row is) that we can't possibly make everyone happy.
> I say we punt.  I haven't cvs up'd yet this morning.  Hopefully my
> DictReader and DictWriter classes still work. ;-)
>
> Andrew> We haven't made any impression on the csv.utils sub-module yet -
> Andrew> things like the sniffer.  We want to watch we don't miss the 2.3
> Andrew> boat - what's the next step?
>
> That's Cliff's expertise, and judging from his recent silence, I suspect
> he's still pretty busy with other things.  Cliff, assuming the rest of the
> code is pretty much set, how are you fixed for time to work on a sniffer?
> Should we propose that what's there now be incorporated into 2.3 and then
> aim for a separate csv.utils module between 2.3 and 2.4 (to be added in
> 2.4)?

Hi all,

Sorry about my MIA status.  I've gotten things at work reduced to smoldering ashes, which is the usual state-of-affairs, so hopefully I can actually contribute a bit.

I think we need to once again decide what we want/need in csvutils.  Obvious candidates are:

1. Sniffer for guessing delimiter
2. Sniffer for guessing quotechar
3. Sniffer for guessing whether first row is header

These were easy as they already exist in DSV ;)  We just need to decide what the API will look like for the algorithms.  Right now the DSV stuff just returns char, char, bool, respectively for the above functions.  It would be easy to write a wrapper that calls all three consecutively and returns a dialect object (I don't think it's necessary to match against existing dialects, but maybe we should?).

4. Row -> dict converter.  This should be easy as well.  The user can use the results of the guessHeaders() sniffer or just provide their own list of names to use as keys.  I haven't looked at Skip's code yet, but I don't see how this can be anything but trivial.

What other things are we looking at?

-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308
(800) 735-0555 x308

From skip at pobox.com Mon Feb 10 19:30:44 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 10 Feb 2003 12:30:44 -0600
Subject: [Csv] confused about wrapping readers and writers
In-Reply-To: <1044901159.1376.74.camel@software1.logiplex.internal>
References: <15941.25805.234008.663342@montanaro.dyndns.org> <20030209230905.816893CA92@coffee.object-craft.com.au> <20030210084724.8D6063CA89@coffee.object-craft.com.au> <20030210102633.DDCD03CA89@coffee.object-craft.com.au> <15943.47464.805209.596214@montanaro.dyndns.org> <1044901159.1376.74.camel@software1.logiplex.internal>
Message-ID: <15943.61396.366088.546062@montanaro.dyndns.org>

Cliff> 1. Sniffer for guessing delimiter
Cliff> 2. Sniffer for guessing quotechar
Cliff> 3. Sniffer for guessing whether first row is header

These all sound fine.

Cliff> It would be easy to write a wrapper that calls all three
Cliff> consecutively and returns a dialect object (I don't think it's
Cliff> necessary to match against existing dialects, but maybe we
Cliff> should?).

You'd have to assume reasonable defaults for the other parameters.  How about line terminator and QUOTE_{ALL,MINIMAL,NONE,NONNUMERIC} sniffers?  (Are the QUOTE_* values used by readers?)

Cliff> 4. Row -> dict converter.

I don't think this will be necessary.  I already added DictReader and DictWriter classes to csv.py which do the pretty much obvious (to me) thing.

Cliff> What other things are we looking at?

Some proofreading/editing of the PEP and the libcsv.tex file?
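By the way, Cliff's wrapper idea might end up looking roughly like this.  This is only a sketch: guessDelimiter, guessQuotechar and guessHeaders are stand-in names for whatever the DSV-derived functions are finally called, and the remaining parameters just inherit the excel defaults.

    import csv

    def sniff(sample):
        # Build a dialect from the three DSV-style guessers; anything
        # we can't guess falls back to the excel dialect's defaults.
        # The guess* functions are hypothetical placeholders.
        class sniffed(csv.excel):
            delimiter = guessDelimiter(sample)
            quotechar = guessQuotechar(sample)
        has_header = guessHeaders(sample)
        return sniffed, has_header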
Skip

From LogiplexSoftware at earthlink.net Mon Feb 10 21:50:21 2003
From: LogiplexSoftware at earthlink.net (Cliff Wells)
Date: 10 Feb 2003 12:50:21 -0800
Subject: [Csv] confused about wrapping readers and writers
In-Reply-To: <15943.61396.366088.546062@montanaro.dyndns.org>
References: <15941.25805.234008.663342@montanaro.dyndns.org> <20030209230905.816893CA92@coffee.object-craft.com.au> <20030210084724.8D6063CA89@coffee.object-craft.com.au> <20030210102633.DDCD03CA89@coffee.object-craft.com.au> <15943.47464.805209.596214@montanaro.dyndns.org> <1044901159.1376.74.camel@software1.logiplex.internal> <15943.61396.366088.546062@montanaro.dyndns.org>
Message-ID: <1044910221.2250.6.camel@software1.logiplex.internal>

On Mon, 2003-02-10 at 10:30, Skip Montanaro wrote:
> Cliff> 1. Sniffer for guessing delimiter
> Cliff> 2. Sniffer for guessing quotechar
> Cliff> 3. Sniffer for guessing whether first row is header
>
> These all sound fine.
>
> Cliff> It would be easy to write a wrapper that calls all three
> Cliff> consecutively and returns a dialect object (I don't think it's
> Cliff> necessary to match against existing dialects, but maybe we
> Cliff> should?).
>
> You'd have to assume reasonable defaults for the other parameters.  How
> about line terminator and QUOTE_{ALL,MINIMAL,NONE,NONNUMERIC} sniffers?
> (Are the QUOTE_* values used by readers?)

Line terminator would seem necessary, QUOTE_* doesn't seem necessary for import.

> Cliff> 4. Row -> dict converter.
>
> I don't think this will be necessary.  I already added DictReader and
> DictWriter classes to csv.py which do the pretty much obvious (to me) thing.

Okay.

> Cliff> What other things are we looking at?
>
> Some proofreading/editing of the PEP and the libcsv.tex file?

Can do.  I've got a Python meeting tonight and I'm helping someone clean a barn tomorrow night (the life of a programmer, you know) but I might be able to squeeze a bit of time in to get some of this done at least by Wednesday.

-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308
(800) 735-0555 x308

From skip at pobox.com Mon Feb 10 23:47:16 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 10 Feb 2003 16:47:16 -0600
Subject: [Csv] update - csv.py & libcsv.tex
Message-ID: <15944.11252.212318.864943@montanaro.dyndns.org>

Just checked in new versions of csv.py and libcsv.tex.  The former includes a couple changes from previous notes (the "rest" param is now "restkey", and DictWriter objects now have a user-configurable "extrasaction" to deal with the case of dicts which have keys not in the known fieldnames).  I added text regarding the DictReader and DictWriter classes and fixed a number of LaTeX errors in libcsv.tex.

Skip

From skip at pobox.com Tue Feb 11 15:52:21 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 11 Feb 2003 08:52:21 -0600
Subject: [Csv] writerow() leakage?
Message-ID: <15945.3621.91285.415100@montanaro.dyndns.org>

I just checked in some attempts at leakage testing in test/test_csv.py.  Creating readers and writers appears okay, as does reading data.  It appears that writerow() leaks though.  The new tests will only be run if sys.gettotalrefcount() is available, so you'll need to run them with a --with-pydebug build.

Skip

From skip at pobox.com Tue Feb 11 22:49:58 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 11 Feb 2003 15:49:58 -0600
Subject: [Csv] ignore blank lines?
Message-ID: <15945.28678.631121.25754@montanaro.dyndns.org>

Would there be any value in telling the csv module to ignore blank lines?
I notice that the logfile exporter of Firewall-1 seems to always append three blank lines to its output.  (BTW, I discovered a command-line logfile export capability which runs on Solaris, so I can dispense with the hokey two-space separator the Windows-based log viewer uses.)

Skip

From djc at object-craft.com.au Wed Feb 12 00:02:22 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 12 Feb 2003 10:02:22 +1100
Subject: [Csv] ignore blank lines?
In-Reply-To: <15945.28678.631121.25754@montanaro.dyndns.org>
References: <15945.28678.631121.25754@montanaro.dyndns.org>
Message-ID: 

>>>>> "Skip" == Skip Montanaro writes:

Skip> Would there be any value in telling the csv module to ignore
Skip> blank lines?  I notice that the logfile exporter of Firewall-1
Skip> seems to always append three blank lines to its output.  (BTW, I
Skip> discovered a command-line logfile export capability which runs
Skip> on Solaris, so I can dispense with the hokey two-space separator
Skip> the Windows-based log viewer uses.)

Couldn't the application just ignore records which have zero fields?

- Dave

-- 
http://www.object-craft.com.au

From andrewm at object-craft.com.au Wed Feb 12 00:05:59 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 12 Feb 2003 10:05:59 +1100
Subject: [Csv] writerow() leakage?
In-Reply-To: Message from Skip Montanaro <15945.3621.91285.415100@montanaro.dyndns.org>
References: <15945.3621.91285.415100@montanaro.dyndns.org>
Message-ID: <20030211230559.D0A723CB83@coffee.object-craft.com.au>

>I just checked in some attempts at leakage testing in test/test_csv.py.
>Creating readers and writers appears okay, as does reading data.  It appears
>that writerow() leaks though.

It's the implementation of StringIO that's making it look like writerow is leaking references: StringIO() appends the data you write to a list.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com Wed Feb 12 01:24:56 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 11 Feb 2003 18:24:56 -0600
Subject: [Csv] ignore blank lines?
In-Reply-To: 
References: <15945.28678.631121.25754@montanaro.dyndns.org>
Message-ID: <15945.37976.592926.369940@montanaro.dyndns.org>

Skip> Would there be any value in telling the csv module to ignore
Skip> blank lines?

Dave> Couldn't the application just ignore records which have zero
Dave> fields?

I suppose so, but it seems somehow cleaner to me if I know the parser won't return empty lists.

Skip

From sjmachin at lexicon.net Wed Feb 12 01:46:59 2003
From: sjmachin at lexicon.net (John Machin)
Date: Wed, 12 Feb 2003 11:46:59 +1100
Subject: [Csv] ignore blank lines?
In-Reply-To: <15945.37976.592926.369940@montanaro.dyndns.org>
References: <15945.28678.631121.25754@montanaro.dyndns.org> <15945.37976.592926.369940@montanaro.dyndns.org>
Message-ID: 

On Tue, 11 Feb 2003 18:24:56 -0600, Skip Montanaro wrote:

> Skip> Would there be any value in telling the csv module to ignore
> Skip> blank lines?
>
> Dave> Couldn't the application just ignore records which have zero
> Dave> fields?
>
> I suppose so, but it seems somehow cleaner to me if I know the parser
> won't return empty lists.

Does Skip mean a blank line "...\n \n..." or an empty line "...\n\n..." ???

It might help if this were discussed first (and the answer documented):

When writing, both [] and [""] will produce empty lines.  Dave seems to imply that on reading, an "empty" line will produce [], not [""].
There is some ground for arguing the latter, on the basis that a record with only n delimiters and nothing else should produce (n+1) * [""], and in this case n is zero.

Whatever, I'm with Dave.  The caller can handle this.  In any case, ignoring empty lines (except maybe one or two that have inadvertently appeared at the end of the file) seems perilous to me.  I've seen lots of dud data in my time, but never once a file where I could happily ignore non-terminal empty lines.

--

From djc at object-craft.com.au Wed Feb 12 02:40:15 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 12 Feb 2003 12:40:15 +1100
Subject: [Csv] ignore blank lines?
In-Reply-To: 
References: <15945.28678.631121.25754@montanaro.dyndns.org> <15945.37976.592926.369940@montanaro.dyndns.org>
Message-ID: 

>>>>> "John" == John Machin writes:

John> On Tue, 11 Feb 2003 18:24:56 -0600, Skip Montanaro wrote:
>> Skip> Would there be any value in telling the csv module to ignore
>> Skip> blank lines?
>>
>> Dave> Couldn't the application just ignore records which have zero
>> Dave> fields?
>>
>> I suppose so, but it seems somehow cleaner to me if I know the
>> parser won't return empty lists.

John> Does Skip mean a blank line "...\n \n..." or an empty line
John> "...\n\n..." ???

John> It might help if this were discussed first (and the answer
John> documented):

John> When writing, both [] and [""] will produce empty lines.  Dave
John> seems to imply that on reading, an "empty" line will produce [],
John> not [""].  There is some ground for arguing the latter, on the
John> basis that a record with only n delimiters and nothing else
John> should produce (n+1) * [""], and in this case n is zero.

John> Whatever, I'm with Dave.  The caller can handle this.  In any
John> case, ignoring empty lines (except maybe one or two that have
John> inadvertently appeared at the end of the file) seems perilous to
John> me.  I've seen lots of dud data in my time, but never once a file
John> where I could happily ignore non-terminal empty lines.

Let's try it out:

    >>> import csv
    >>> r = csv.reader(['', '""', ' '])
    >>> r.next()
    []
    >>> r.next()
    ['']
    >>> r.next()
    [' ']
    >>> class F:
    ...     def write(self, s):
    ...         print repr(s)
    ...
    >>> w = csv.writer(F())
    >>> w.writerow([])
    '\r\n'
    >>> w.writerow([''])
    '""\r\n'
    >>> w.writerow([' '])
    ' \r\n'

Seems like the module is doing the right (sensible) thing.

- Dave

-- 
http://www.object-craft.com.au

From skip at pobox.com Wed Feb 12 03:45:10 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 11 Feb 2003 20:45:10 -0600
Subject: [Csv] writerow() leakage?
In-Reply-To: <20030211230559.D0A723CB83@coffee.object-craft.com.au>
References: <15945.3621.91285.415100@montanaro.dyndns.org> <20030211230559.D0A723CB83@coffee.object-craft.com.au>
Message-ID: <15945.46390.240131.505750@montanaro.dyndns.org>

Andrew> It's the implementation of StringIO that's making it look like
Andrew> writerow is leaking references: StringIO() appends the data you
Andrew> write to a list.

Thanks.  Test fixed.

S

From andrewm at object-craft.com.au Wed Feb 12 05:00:17 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 12 Feb 2003 15:00:17 +1100
Subject: [Csv] csv.writer, file must be binary mode...
Message-ID: <20030212040017.5C84A3CB83@coffee.object-craft.com.au>

A posting by Tim Peters on the Python list reminded me that csv.writer() is not the only module that requires it be passed a file in binary mode - Pickle is a classic example.
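In other words, the pattern we're asking users to follow is something like this (a minimal sketch - the filename is made up, the 'b' is the point):

    import csv

    # open the file in *binary* mode - the writer emits '\r\n' itself,
    # and text mode on some platforms would mangle that into '\r\r\n'
    w = csv.writer(open('out.csv', 'wb'))
    w.writerow(['1', 'two', 'three,with,commas'])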
-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Wed Feb 12 07:19:37 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 12 Feb 2003 17:19:37 +1100
Subject: [Csv] This surprised me
In-Reply-To: Message from Skip Montanaro <15941.8540.607571.202309@montanaro.dyndns.org>
References: <15941.8540.607571.202309@montanaro.dyndns.org>
Message-ID: <20030212061937.863823CB83@coffee.object-craft.com.au>

> >> This code surprised me:
> ...
> Andrew> Surely there's more to your example than you quoted in this
> Andrew> e-mail?  The exception you mention came from the python code, not
> Andrew> the C module (specifically the Dialect class), but I can't see
> Andrew> where it's referenced in the quoted code?
>
>Nope, nothing more.  I guess the point I was trying to make is that if I
>pass a dialect object which is not subclassed from csv.Dialect (as you
>suggested I should be able to do), it seems to be silently accepted.

Uh?  If I recall correctly, the exception quoted came from the python Dialect class, but it wasn't involved in the line that threw the exception? 8-)

> Andrew> The C code will instantiate (and thus call Dialect's _validate)
> Andrew> when register_dialect is called, or when the class is passed to
> Andrew> reader or writer.
>
>Correct.  But you indicated that was no longer necessary.  I was wondering
>where the error checking went to.

I decided it wasn't necessary - if the instance has the necessary bits and no more, we can use it as parameters, whether it's a descendant of Dialect or not.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Wed Feb 12 07:49:43 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 12 Feb 2003 17:49:43 +1100
Subject: [Csv] Re: Unicode again
In-Reply-To: Message from Skip Montanaro <15929.5633.929389.67150@montanaro.dyndns.org>
References: <15929.5633.929389.67150@montanaro.dyndns.org>
Message-ID: <20030212064943.6C8C63CB83@coffee.object-craft.com.au>

>I've been thinking a little about the Unicode issue some more.  I really
>think you don't want to dive into picking apart Unicode strings.  If
>nothing else, you'll have to deal with a mixture of wide and narrow
>characters.  How about two paths?  If you know everything's a plain
>string, execute your current code.  If any elements are Unicode strings,
>take the slower, high-level path.

I've had a bit of a chance to look at the C unicode implementation, and it's pretty clean - essentially you just have a string of unsigned shorts (or unsigned longs if python was built with wide support) instead of unsigned chars.  Generally you don't have to worry about variable length data (we'd cover 99.99% of use cases by ignoring the exceptions).

I think I currently favour the approach used in sre, where preprocessor tricks are used to compile two versions of the core, but I'm sure this won't be trivial.  Probably not something we can deal with before 2.3.  Hopefully this won't preclude integration with 2.3.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com Wed Feb 12 15:22:48 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 12 Feb 2003 08:22:48 -0600
Subject: [Csv] csv.writer, file must be binary mode...
In-Reply-To: <20030212040017.5C84A3CB83@coffee.object-craft.com.au>
References: <20030212040017.5C84A3CB83@coffee.object-craft.com.au>
Message-ID: <15946.22712.538271.973935@montanaro.dyndns.org>

Andrew> A posting by Tim Peters on the Python list reminded me that
Andrew> csv.writer() is not the only module that requires it be passed a
Andrew> file in binary mode - Pickle is a classic example.

Thanks for the tip.  I'll mention this in the PEP.

Skip

From skip at pobox.com Wed Feb 12 15:31:25 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 12 Feb 2003 08:31:25 -0600
Subject: [Csv] This surprised me
In-Reply-To: <20030212061937.863823CB83@coffee.object-craft.com.au>
References: <15941.8540.607571.202309@montanaro.dyndns.org> <20030212061937.863823CB83@coffee.object-craft.com.au>
Message-ID: <15946.23229.379024.7389@montanaro.dyndns.org>

>> Correct.  But you indicated that was no longer necessary.  I was
>> wondering where the error checking went to.

Andrew> I decided it wasn't necessary - if the instance has the
Andrew> necessary bits and no more, we can use it as parameters, whether
Andrew> it's a descendant of Dialect or not.

Yeah, but what if it has no necessary bits?  Shouldn't the user be alerted to that fact?

    >>> import csv
    >>> class foo: pass
    ...
    >>> rdr = csv.reader(["a,b,c\r\n"], dialect=foo)
    >>> rdr.next()
    ['a', 'b', 'c']

If nothing else, we need to define the specific defaults for the various parameters.  In the above case, clearly my foo class isn't overriding anything.

Skip

From skip at pobox.com Wed Feb 12 16:02:45 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 12 Feb 2003 09:02:45 -0600
Subject: [Csv] csv.writer, file must be binary mode...
In-Reply-To: <20030212040017.5C84A3CB83@coffee.object-craft.com.au>
References: <20030212040017.5C84A3CB83@coffee.object-craft.com.au>
Message-ID: <15946.25109.366279.925230@montanaro.dyndns.org>

Andrew> A posting by Tim Peters on the Python list reminded me that
Andrew> csv.writer() is not the only module that requires it be passed a
Andrew> file in binary mode - Pickle is a classic example.

On second thought, the only reason Pickle requires binary mode is when the binary pickle format is selected, right?  Hmmm... I don't think we really need to say "Pickle requires binary mode, so we can too."

Skip

From skip at pobox.com Wed Feb 12 16:11:06 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 12 Feb 2003 09:11:06 -0600
Subject: [Csv] ignore blank lines?
In-Reply-To: 
References: <15945.28678.631121.25754@montanaro.dyndns.org> <15945.37976.592926.369940@montanaro.dyndns.org>
Message-ID: <15946.25610.202929.623399@montanaro.dyndns.org>

John> Does Skip mean a blank line "...\n \n..." or an empty line
John> "...\n\n..." ???

I meant a line which consists of just the lineterminator sequence.

Here's my use case.  In the DictReader class, if the underlying reader object returns an empty list and I don't catch it, I wind up returning a dictionary all of whose fields are set to the restval (typically None).  The caller can't simply compare that against {} as the caller of csv.reader() can compare the returned value against [], so it makes sense for me to elide that case in the DictReader code.

I modified DictReader.next() to start like:

    def next(self):
        row = self.reader.next()
        while row == []:
            row = self.reader.next()
        ... process row ...

Does that behavior make sense in this case?
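For the archives, the full method would then look roughly like this - a sketch of what I describe above combined with the restkey/restval handling from the earlier DictReader discussion, not necessarily the exact checked-in code:

    def next(self):
        row = self.reader.next()
        while row == []:
            # swallow rows that contain no fields at all
            row = self.reader.next()
        d = dict(zip(self.fieldnames, row))
        lf = len(self.fieldnames)
        if lf < len(row):
            # long row: gather the leftover fields under restkey
            d[self.restkey] = row[lf:]
        elif lf > len(row):
            # short row: pad the missing fields with restval
            for key in self.fieldnames[len(row):]:
                d[key] = self.restval
        return d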
Skip

From skip at pobox.com Wed Feb 12 18:39:16 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 12 Feb 2003 11:39:16 -0600
Subject: [Csv] Ready for another announcement?
Message-ID: <15946.34500.633614.47737@montanaro.dyndns.org>

Are we ready to make another announcement? It seems most of the PEP 308
furor has died down, so perhaps an announcement will actually be seen.

Skip

From sjmachin at lexicon.net Wed Feb 12 21:29:11 2003
From: sjmachin at lexicon.net (John Machin)
Date: Thu, 13 Feb 2003 07:29:11 +1100
Subject: [Csv] ignore blank lines?
In-Reply-To: <15946.25610.202929.623399@montanaro.dyndns.org>
References: <15945.28678.631121.25754@montanaro.dyndns.org> <15945.37976.592926.369940@montanaro.dyndns.org> <15946.25610.202929.623399@montanaro.dyndns.org>
Message-ID:

On Wed, 12 Feb 2003 09:11:06 -0600, Skip Montanaro wrote:
>
> John> Does Skip mean a blank line "...\n \n..." or an empty line
> John> "...\n\n..." ???
>
> I meant a line which consists of just the lineterminator sequence.
>
> Here's my use case. In the DictReader class, if the underlying reader
> object returns an empty list and I don't catch it, I wind up returning a
> dictionary all of whose fields are set to the restval (typically None).
> The caller can't simply compare that against {} as the caller of
> csv.reader() can compare the returned value against [], so it makes
> sense for me to elide that case in the DictReader code.
>
> I modified DictReader.next() to start like:
>
>     def next(self):
>         row = self.reader.next()
>         while row == []:
>             row = self.reader.next()
>         ... process row ...
>
> Does that behavior make sense in this case?

I am +0 on suppressing empty lines at the end of the input stream, but -1
on suppressing these (especially with neither notice nor option for
non-suppression) if they appear between non-empty data rows.

Rather than petition Dave & Andrew for yet another toggle, I would say
make it easier for the caller to detect this situation ...

    if row == []: return {}

Cheers,
John
--

From andrewm at object-craft.com.au Wed Feb 12 23:23:15 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 13 Feb 2003 09:23:15 +1100
Subject: [Csv] csv.writer, file must be binary mode...
In-Reply-To: <20030212222315.722AA3CB83@coffee.object-craft.com.au>
References: <20030212040017.5C84A3CB83@coffee.object-craft.com.au> <15946.25109.366279.925230@montanaro.dyndns.org> <20030212222315.722AA3CB83@coffee.object-craft.com.au>
Message-ID: <15946.53573.89825.118609@montanaro.dyndns.org>

Andrew> That wasn't really what I was thinking - it was more like
Andrew> "requiring binary mode is going to confuse people and be an
Andrew> endless source of bugs", but then I saw the Pickle stuff and now
Andrew> I'm not so worried.

Ah, okay. It's good I didn't do anything then. ;-)

Skip

From skip at pobox.com Thu Feb 13 02:16:56 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 12 Feb 2003 19:16:56 -0600
Subject: [Csv] ini file fumbling broke
Message-ID: <15946.61960.251535.643909@montanaro.dyndns.org>

Someone recently decreed that all files mentioned in BAYESCUSTOMIZE must
end in ".ini" and modified Options.py (I named my customize file
~/hammie.opt). Was this related to the embedded-spaces-in-paths problem?
Sumthin's gotta give I think. If spaces are common in filenames, we need
to pick a better separator. (Or allow the separator to be
platform-specific.) On Unix systems, ":" is a good path separator (but
would be bad on MacOS < X systems). I think ";" is more common on Windows.

I don't think forcing customize files to end in ".ini" is right. Even one
of the default files searched for in Options.py is "~/.spambayesrc".

Thoughts?

Skip

From sjmachin at lexicon.net Thu Feb 13 12:13:18 2003
From: sjmachin at lexicon.net (John Machin)
Date: Thu, 13 Feb 2003 22:13:18 +1100
Subject: [Csv] non-portable initialisation of types in _csv.c
Message-ID:

    static PyTypeObject Dialect_Type = {
        /* aarrgghh PyObject_HEAD_INIT(&PyType_Type) */
        PyObject_HEAD_INIT(NULL)
        0,                      /* ob_size */

--

From skip at pobox.com Thu Feb 13 15:58:18 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 13 Feb 2003 08:58:18 -0600
Subject: [Csv] non-portable initialisation of types in _csv.c
In-Reply-To:
References:
Message-ID: <15947.45706.139024.281581@montanaro.dyndns.org>

John> static PyTypeObject Dialect_Type = {
John>     /* aarrgghh PyObject_HEAD_INIT(&PyType_Type) */
John>     PyObject_HEAD_INIT(NULL)
John>     0,                      /* ob_size */

John,

Thanks, is this a Windows thing?

What about the head initializer for the Reader_Type and Writer_Type types?

Skip

From skip at pobox.com Thu Feb 13 20:00:36 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 13 Feb 2003 13:00:36 -0600
Subject: [Csv] trial zip/tar packages of csv module available
Message-ID: <15947.60244.162082.486394@montanaro.dyndns.org>

If you are interested in reading or writing CSV files from Python and you
have Python 2.2 or 2.3 available, please take a moment to download,
extract and install either or both of the following URLs:

    http://manatee.mojam.com/~skip/csv.tar.gz
    http://manatee.mojam.com/~skip/csv.zip

If you'd prefer, you can grab the files from the Python CVS sandbox:

    http://sourceforge.net/cvs/?group_id=5470
    http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/

Not included in the above zip/tgz files is the latest version of PEP 305.
You can view it here:

    http://www.python.org/peps/pep-0305.html

The goal is to get this package into Python 2.3, though we've tried to
keep it working under 2.2. It uses iterators, so I don't know if it will
work with anything before 2.2. The package has been built on Linux and
Mac OS X at this point. I think it's been built on Windows though I'm not
positive.
There shouldn't be anything terribly platform-dependent there.

To build and install, just do the usual distutils dance:

    python setup.py install

If you cd to the test subdirectory, you can run the 60 or so unit tests:

    cd test
    python test_csv.py

If your Python interpreter was configured using --with-pydebug it will run
a few memory leak tests. If not it will let you know they are being
skipped. (If you try it both ways, make sure to delete the build
subdirectory between builds, otherwise you'll get link errors.)

Feedback is welcomed on both the package and the PEP, but please remember
to include csv at mail.mojam.com in your mail.

Thanks,

Skip

From sjmachin at lexicon.net Thu Feb 13 23:16:51 2003
From: sjmachin at lexicon.net (John Machin)
Date: Fri, 14 Feb 2003 09:16:51 +1100
Subject: [Csv] trial zip/tar packages of csv module available
In-Reply-To: <15947.60244.162082.486394@montanaro.dyndns.org>
References: <15947.60244.162082.486394@montanaro.dyndns.org>
Message-ID:

On Thu, 13 Feb 2003 13:00:36 -0600, Skip Montanaro wrote:
>
> If you are interested in reading or writing CSV files from Python and
> you have Python 2.2 or 2.3 available, please take a moment to download,
> extract and install either or both of the following URLs:
>
>     http://manatee.mojam.com/~skip/csv.tar.gz
>     http://manatee.mojam.com/~skip/csv.zip
>
> The goal is to get this package into Python 2.3, though we've tried to
> keep it working under 2.2. It uses iterators, so I don't know if it will
> work with anything before 2.2. The package has been built on Linux and
> Mac OS X at this point. I think it's been built on Windows though I'm
> not positive. There shouldn't be anything terribly platform-dependent
> there.

Good news first, whinges at the end of the message :-)

===

Compiles & installs OK out-of-the-box with Python 2.2, Windows 2000, BCC32
(Borland 5.5 freebie command-line compiler) -- thanks to revision 1.30 :-)

===

C:\csv\test>python test_csv.py
*** skipping leakage tests ***
........................................................
----------------------------------------------------------------------
Ran 56 tests in 0.030s

OK

===

Slurped through a 150Mb CSV file at a reasonable speed without any memory
leak that could be detected by the primitive method of watching the Task
Manager memory graph.

===

Doco:

    """0.1.1 Module Contents
    The csv module defines the following functions.
    reader(iterable[, dialect="excel" ] [, fmtparam])
    Return a reader object which will iterate over lines in the given
    csvfile."""

Huh? What "given csvfile"? Need to define carefully what iterable.next()
is expected to deliver; a line, with or without a trailing newline? a
string of 1 or more bytes which may contain embedded line separators,
either as true separators or as (quoted) data? [e.g. iterable could be a
generator which uses say read(16384)]. I have noticed in the csv mailing
list some muttering along the lines of "the iterable's underlying file
must have been opened in binary mode"!? Que? This might necessitate a FAQ
entry:

>>> cr = csv.reader("iterable is string!")
>>> [x for x in cr]
[['i'], ['t'], ['e'], ['r'], ['a'], ['b'], ['l'], ['e'], [' '], ['i'],
['s'], [' '], ['s'], ['t'], ['r'], ['i'], ['n'], ['g'], ['!']]
>>>

===

Does the reader detect any errors at all? E.g. I expected some complaint
here, instead of silently doing nothing:

>>> import csv
>>> cr = csv.reader(['f1,"unterminated quoted field,f3'])
>>> for x in cr: print x
...
>>> cr = csv.reader(['f1,"terminated quoted field",f3'])
>>> for x in cr: print x
...
['f1', 'terminated quoted field', 'f3']
>>> cr = csv.reader(['f1,"unterminated quoted field,f3\n'])
>>> for x in cr: print x
...
>>>

===

Judging by the fact that in _csv.c '\0' is passed around as a line-ending
signal, it's not 8-bit-clean. This fact should be at least documented, if
not fixed (which looks like a bit of a rewrite). Strange behaviour on
embedded '\0' may worry not only pedants but also folk who are recipients
of data files created by J. Random Boofhead III and friends.

===

Cheers,
John

From sjmachin at lexicon.net Thu Feb 13 23:33:22 2003
From: sjmachin at lexicon.net (John Machin)
Date: Fri, 14 Feb 2003 09:33:22 +1100
Subject: [Csv] non-portable initialisation of types in _csv.c
In-Reply-To: <15947.45706.139024.281581@montanaro.dyndns.org>
References: <15947.45706.139024.281581@montanaro.dyndns.org>
Message-ID:

On Thu, 13 Feb 2003 08:58:18 -0600, Skip Montanaro wrote:
>
> John> static PyTypeObject Dialect_Type = {
> John>     /* aarrgghh PyObject_HEAD_INIT(&PyType_Type) */
> John>     PyObject_HEAD_INIT(NULL)
> John>     0,                      /* ob_size */
>
> John,
>
> Thanks, is this a Windows thing?
>
> What about the head initializer for the Reader_Type and Writer_Type
> types?

My understanding is this: The offending code is strictly not correct C --
the initialiser is not a constant; it's the address of a gadget not
declared in the current source file. However some compiler/linker
combinations can nut it out. Some compilers take advantage of this; some
can't or won't; Windows compilers seem to be in the can't or won't
category.

> What about the head initializer for the Reader_Type and Writer_Type
> types?

Skip, what's sauce for the first goose is also sauce for the second and
subsequent geese. You seem to have sauced all 3 birds in rev 1.30.

I notice that it seems to work without the "FooType.ob_type =
&PyType_Type;" incantation in the module initialisation. Perhaps
PyType_Ready() fixes this up.

Cheers,
John
--

From skip at pobox.com Thu Feb 13 23:53:41 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 13 Feb 2003 16:53:41 -0600
Subject: [Csv] non-portable initialisation of types in _csv.c
In-Reply-To:
References: <15947.45706.139024.281581@montanaro.dyndns.org>
Message-ID: <15948.8693.867705.309628@montanaro.dyndns.org>

John> I notice that it seems to work without the "FooType.ob_type =
John> &PyType_Type;" incantation in the module initialisation. Perhaps
John> PyType_Ready() fixes this up.

Yes, that's one of the things it does. Perhaps it would have been better
named "PyType_MakeReady".

Thanks for the other feedback as well. I'll let Dave and Andrew mull over
that stuff. The issue of 8-bit next-to-godliness will probably have to be
addressed once Unicode is tackled.

Skip

From andrewm at object-craft.com.au Fri Feb 14 07:03:18 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 14 Feb 2003 17:03:18 +1100
Subject: [Csv] non-portable initialisation of types in _csv.c
In-Reply-To: Message from John Machin of "Fri, 14 Feb 2003 09:33:22 +1100."
References: <15947.45706.139024.281581@montanaro.dyndns.org>
Message-ID: <20030214060319.F0D933CC5D@coffee.object-craft.com.au>

>>> John> /* aarrgghh PyObject_HEAD_INIT(&PyType_Type) */
[...]
>>The offending code is strictly not correct C -- the initialiser is not a
>>constant; it's the address of a gadget not declared in the current source
>>file. However some compiler/linker combinations can nut it out. Some
>>compilers take advantage of this; some can't or won't; Windows compilers
>>seem to be in the can't or won't category.
Indeed - thanks for picking this up. I can only assume it was a
cut-n-paste accident, because it was originally PyObject_HEAD_INIT(0).

>>Skip, what's sauce for the first goose is also sauce for the second and
>>subsequent geese. You seem to have sauced all 3 birds in rev 1.30.

Thanks Skip - all three needed doing.

>>I notice that it seems to work without the "FooType.ob_type =
>>&PyType_Type;" incantation in the module initialisation. Perhaps
>>PyType_Ready() fixes this up.
>
>Yes, that's one of the things it does. Perhaps it would have been better
>named "PyType_MakeReady".

And the consequences of *not* calling PyType_Ready() are particularly
obscure. There's enough information to allow the Python core to assert if
a type hasn't been finalised - I wonder why it doesn't?

>The issue of 8-bit next-to-godliness will probably have to be
>addressed once Unicode is tackled.

Definitely not this go around, anyway. I doubt its lack is a big deal
(lack of unicode is a bigger deal) - since CSV is a text format, finding a
null in the input would be very unusual (and I wouldn't be surprised if
excel choked too... 8-).

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Fri Feb 14 07:11:30 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 14 Feb 2003 17:11:30 +1100
Subject: [Csv] trial zip/tar packages of csv module available
In-Reply-To: Message from John Machin of "Fri, 14 Feb 2003 09:16:51 +1100."
References: <15947.60244.162082.486394@montanaro.dyndns.org>
Message-ID: <20030214061130.163773CC5D@coffee.object-craft.com.au>

>Slurped through a 150Mb CSV file at a reasonable speed without any memory
>leak that could be detected by the primitive method of watching the Task
>Manager memory graph.

I've been using a --enable-pydebug version of python while working on the
_csv module, and have been watching the reference counts fairly carefully.
While it's still possible there are reference leaks, I'd expect them to be
in code off the main path (exception handling, etc, although I watched
these carefully too).

>"""0.1.1 Module Contents
>The csv module defines the following functions.
>reader(iterable[, dialect="excel" ] [, fmtparam])
>Return a reader object which will iterate over lines in the given
>csvfile."""
>
>Huh? What "given csvfile"?
>Need to define carefully what iterable.next() is expected to deliver; a
>line, with or without a trailing newline?

In the docstring, I changed this to:

    The "iterable" argument can be any object that returns a line
    of input for each iteration, such as a file object or a list. The
    optional "dialect" parameter is discussed below. The function
    also accepts optional keyword arguments which override settings
    provided by the dialect.

    The returned object is an iterator. Each iteration returns a row
    of the CSV file (which can span multiple input lines):

Do you think this is clearer? The reader will cope with a file opened
binary or not - it *should* do the right thing in either case.

>This might necessitate a FAQ entry:
>>>> cr = csv.reader("iterable is string!")
>>>> [x for x in cr]
>[['i'], ['t'], ['e'], ['r'], ['a'], ['b'], ['l'], ['e'], [' '], ['i'],
>['s'], [' '], ['s'], ['t'], ['r'], ['i'], ['n'], ['g'], ['!']]

I don't think there is ever a case where you would want the input iterable
to be a string - I could probably just raise an exception if it is?

>Does the reader detect any errors at all? E.g. I expected some complaint
>here, instead of silently doing nothing:
>>>> import csv
>>>> cr = csv.reader(['f1,"unterminated quoted field,f3'])
>>>> for x in cr: print x
>...
>>>> cr = csv.reader(['f1,"terminated quoted field",f3'])
>>>> for x in cr: print x
>...
>['f1', 'terminated quoted field', 'f3']
>>>> cr = csv.reader(['f1,"unterminated quoted field,f3\n'])
>>>> for x in cr: print x
>...

That's a hang-over from the old Object Craft csv module (where it was the
user's problem), and you are right - it needs to be fixed. I'll look into
it shortly. Thanks for picking it up.

>Judging by the fact that in _csv.c '\0' is passed around as a line-ending
>signal, it's not 8-bit-clean. This fact should be at least documented, if
>not fixed (which looks like a bit of a rewrite). Strange behaviour on
>embedded '\0' may worry not only pedants but also folk who are recipients
>of data files created by J. Random Boofhead III and friends.

Yep - Skip - can you doco the fact that the input should not contain null
characters or be unicode strings?

Null characters in the input will be treated as newlines, if I remember
correctly.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com Fri Feb 14 07:14:58 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 14 Feb 2003 00:14:58 -0600
Subject: [Csv] Out of town, BDFL pronouncement, incorporation, Unicode
Message-ID: <15948.35170.966135.741531@montanaro.dyndns.org>

Folks,

I'll be at work Friday, but will be leaving Saturday for warm, sunny
Mexico for a week of r&r away from Chicago's chilly climate. The latest
version of the code and PEP are "out there", hopefully getting poked and
prodded a bit.

Assuming nothing earth-shattering develops by mid-week, would one of you
like to propose on python-dev that Guido pronounce on the PEP and give a
thumbs-up or -down on the module? I can take care of merging it into the
Python distribution (stitch it into setup.py, the test directory and the
libref manual) when I return.

Any thoughts from Dave and Andrew about Unicode? Marc André Lemburg (or
was it Martin von Löwis?) suggested just encoding Unicode as utf-8.
Someone else (Fredrik Lundh I believe) suggested a double-compilation
scheme such as Modules/_sre.c uses. One pass gets you 8-bit characters,
the other wide characters. Presumably, the correct state machine to
execute would be chosen based upon the input data types.

Skip

From skip at pobox.com Fri Feb 14 07:17:27 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 14 Feb 2003 00:17:27 -0600
Subject: [Csv] non-portable initialisation of types in _csv.c
In-Reply-To: <20030214060319.F0D933CC5D@coffee.object-craft.com.au>
References: <15947.45706.139024.281581@montanaro.dyndns.org> <20030214060319.F0D933CC5D@coffee.object-craft.com.au>
Message-ID: <15948.35319.353684.464773@montanaro.dyndns.org>

>> The issue of 8-bit next-to-godliness will probably have to be
>> addressed once Unicode is tackled.

Andrew> Definitely not this go around, anyway. I doubt its lack is a
Andrew> big deal (lack of unicode is a bigger deal) - since CSV is a
Andrew> text format, finding a null in the input would be very unusual
Andrew> (and I wouldn't be surprised if excel choked too... 8-).

Don't forget that Excel's "Unicode Text" format seems to dump into utf-16,
which is littered with NUL characters (roughly every other character in
the common case where all your text is representable as ascii). Moral of
the story: If Unicode is important, NUL characters will be important.
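A quick interpreter session makes the point (utf-16-le is used here just
to pin down the byte order; Excel's actual output will differ in details
such as the byte-order mark):

>>> u"abc".encode("utf-16-le")
'a\x00b\x00c\x00'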
Skip

From skip at pobox.com Fri Feb 14 07:19:22 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 14 Feb 2003 00:19:22 -0600
Subject: [Csv] trial zip/tar packages of csv module available
In-Reply-To: <20030214061130.163773CC5D@coffee.object-craft.com.au>
References: <15947.60244.162082.486394@montanaro.dyndns.org> <20030214061130.163773CC5D@coffee.object-craft.com.au>
Message-ID: <15948.35434.750651.855382@montanaro.dyndns.org>

Andrew> Yep - Skip - can you doco the fact that the input should not
Andrew> contain null characters or be unicode strings?

Will do.

Skip

From andrewm at object-craft.com.au Fri Feb 14 07:34:45 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 14 Feb 2003 17:34:45 +1100
Subject: [Csv] Out of town, BDFL pronouncement, incorporation, Unicode
In-Reply-To: Message from Skip Montanaro <15948.35170.966135.741531@montanaro.dyndns.org>
References: <15948.35170.966135.741531@montanaro.dyndns.org>
Message-ID: <20030214063445.3A0A73CC5D@coffee.object-craft.com.au>

>Assuming nothing earth-shattering develops by mid-week, would one of you
>like to propose on python-dev that Guido pronounce on the PEP and give a
>thumbs-up or -down on the module? I can take care of merging it into the
>Python distribution (stitch it into setup.py, the test directory and the
>libref manual) when I return.

Okay.

>Any thoughts from Dave and Andrew about Unicode? Marc André Lemburg (or was
>it Martin von Löwis?) suggested just encoding Unicode as utf-8. Someone
>else (Fredrik Lundh I believe) suggested a double-compilation scheme such as
>Modules/_sre.c uses. One pass gets you 8-bit characters, the other wide
>characters. Presumably, the correct state machine to execute would be
>chosen based upon the input data types.

What little I know about utf-8 suggests that the current module should be
safe - nulls won't appear, and subsequent bytes in multi-byte characters
all have their high bit set. None of the special characters can be a
unicode character, of course. The user could do something like:

    csv.reader([line.encode('utf-8') for line in lines])

I think the unicode files emitted by Excel are actually utf-8 encoded, so
this won't even be necessary - the user will just have to decode each
field with the utf-8 codec.

Proper unicode support is something we probably should do (the user might
have a UCS-2 encoded file, etc), but it won't happen in the next week or
so.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Fri Feb 14 07:37:07 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 14 Feb 2003 17:37:07 +1100
Subject: [Csv] non-portable initialisation of types in _csv.c
In-Reply-To: Message from Skip Montanaro <15948.35319.353684.464773@montanaro.dyndns.org>
References: <15947.45706.139024.281581@montanaro.dyndns.org> <20030214060319.F0D933CC5D@coffee.object-craft.com.au> <15948.35319.353684.464773@montanaro.dyndns.org>
Message-ID: <20030214063707.D883A3CC5D@coffee.object-craft.com.au>

> >> The issue of 8-bit next-to-godliness will probably have to be
> >> addressed once Unicode is tackled.
>
> Andrew> Definitely not this go around, anyway. I doubt its lack is a
> Andrew> big deal (lack of unicode is a bigger deal) - since CSV is a
> Andrew> text format, finding a null in the input would be very unusual
> Andrew> (and I wouldn't be surprised if excel choked too... 8-)
>
>Don't forget that Excel's "Unicode Text" format seems to dump into utf-16,
>which is littered with NUL characters (roughly every other character in the
>common case where all your text is representable as ascii). Moral of the
>story: If Unicode is important, NUL characters will be important.

If that's so, you'd have to convert the input to utf-8 first - even
without the null issue, there would be plenty of other issues feeding 16
bit input to an 8 bit parser... 8-)

Once the internals have been modified to support python's internal unicode
representation, the current null handling could even stay... 8-)

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com Fri Feb 14 08:06:10 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 14 Feb 2003 01:06:10 -0600
Subject: [Csv] ignore blank lines?
In-Reply-To:
References: <15945.28678.631121.25754@montanaro.dyndns.org> <15945.37976.592926.369940@montanaro.dyndns.org> <15946.25610.202929.623399@montanaro.dyndns.org>
Message-ID: <15948.38242.974158.425677@montanaro.dyndns.org>

John> ... I would say make it easier for the caller to detect this
John> situation ...

John>     if row == []: return {}

Except the way the DictReader works (and the way I intended for it to
work) is that you specify a default value when creating a reader. When you
encounter a short row, the missing keys are all added to the dictionary,
each associated with the default. An empty dict should never be returned.

You're really trying to treat the CSV file as a table where each row has a
constant number of columns. This makes sense if you think of this as
analogous to using the DB API to fetch rows from a database table as
dictionaries (e.g., c.dictfetchall() with psycopg or using
cursorclass=DictCursor with MySQLdb). You would never get an empty
dictionary (or sequence, for that matter) corresponding to an individual
row of results. Either it's there with content, or it's not there at all.
It's never there and empty.

I don't have Excel handy at the moment, but I just tried a little
experiment with gnumeric. I entered "abc", "def", and "ghi" in the first
three cells of row 1, jumped down to row 3 and entered "123", "456" and
"789" in the first three cells of that row. I then dumped it as CSV.
Here's the result:

    abc,def,ghi
    ,,
    123,456,789

Can someone try this with Excel or some other spreadsheet (I'll try
Appleworks in the morning if it occurs to me before I rush out the door)?
Does it produce truly blank lines or does it prevent that by inserting one
or more field separators?

Skip

From sjmachin at lexicon.net Fri Feb 14 11:30:31 2003
From: sjmachin at lexicon.net (John Machin)
Date: Fri, 14 Feb 2003 21:30:31 +1100
Subject: [Csv] trial zip/tar packages of csv module available
In-Reply-To: <20030214061130.163773CC5D@coffee.object-craft.com.au>
References: <15947.60244.162082.486394@montanaro.dyndns.org> <20030214061130.163773CC5D@coffee.object-craft.com.au>
Message-ID:

On Fri, 14 Feb 2003 17:11:30 +1100, Andrew McNamara wrote:
>> Slurped through a 150Mb CSV file at a reasonable speed without any
>> memory leak that could be detected by the primitive method of watching
>> the Task Manager memory graph.
>
> I've been using a --enable-pydebug version of python while working on
> the _csv module, and have been watching the reference counts fairly
> carefully.

Yes, I'd gathered that from various asides in messages on this list. I was
just being a little ironical about my own primitive way of checking.
> >> """0.1.1 Module Contents
> >> The csv module defines the following functions.
> >> reader(iterable[, dialect="excel" ] [, fmtparam])
> >> Return a reader object which will iterate over lines in the given
> >> csvfile."""
> >>
> >> Huh? What "given csvfile"?
> >> Need to define carefully what iterable.next() is expected to deliver;
> >> a line, with or without a trailing newline?
>
> In the docstring, I changed this to:
>
>     The "iterable" argument can be any object that returns a line
>     of input for each iteration, such as a file object or a list. The
>     optional "dialect" parameter is discussed below. The function
>     also accepts optional keyword arguments which override settings
>     provided by the dialect.
>
>     The returned object is an iterator. Each iteration returns a row
>     of the CSV file (which can span multiple input lines):

There is not necessarily a file involved -- say "returns a row of CSV
data".

> Do you think this is clearer?

Frankly, no. You've dropped the "given csvfile" (almost), but you haven't
said whether a "line" is expected to be terminated, and if so with what:
(a) \n irrespective of platform (b) platform's native terminator (c) \r or
\r\n or \n (don't care which).

My guess is that if the "line" is terminated by \r or \r\n or \n, you'll
ignore the terminator, and if it's not terminated at all, then there's
nothing to ignore, and happiness prevails. Am I correct?

> The reader will cope with a file opened binary or not - it *should*
> do the right thing in either case.

The reader doesn't know what the iterable is iterating over. The behaviour
should be defined in terms of what the reader expects iterable.next() to
deliver.

> >> This might necessitate a FAQ entry: ...
>
> I don't think there is ever a case where you would want the input
> iterable to be a string - I could probably just raise an exception if
> it is?

You certainly wouldn't want the behaviour demonstrated above. However the
punter may get confused and go

    cr = csv.reader(file("raboof.csv").read())

> >> Judging by the fact that in _csv.c '\0' is passed around as a
> >> line-ending signal, it's not 8-bit-clean. This fact should be at
> >> least documented, if not fixed (which looks like a bit of a rewrite).
> >> Strange behaviour on embedded '\0' may worry not only pedants but
> >> also folk who are recipients of data files created by J. Random
> >> Boofhead III and friends.
>
> Yep - Skip - can you doco the fact that the input should not contain
> null characters or be unicode strings?
>
> Null characters in the input will be treated as newlines, if I remember
> correctly.

Docoing that would be useful as well.

Cheers,
John
--

From djc at object-craft.com.au Fri Feb 14 14:39:28 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 15 Feb 2003 00:39:28 +1100
Subject: [Csv] trial zip/tar packages of csv module available
In-Reply-To:
References: <15947.60244.162082.486394@montanaro.dyndns.org> <20030214061130.163773CC5D@coffee.object-craft.com.au>
Message-ID:

>>>>> "John" == John Machin writes:

John> Frankly, no. You've dropped the "given csvfile" (almost), but
John> you haven't said whether a "line" is expected to be terminated,
John> and if so with what: (a) \n irrespective of platform (b)
John> platform's native terminator (c) \r or \r\n or \n (don't care
John> which).
John> My guess is that if the "line" is terminated by \r or \r\n or
John> \n, you'll ignore the terminator, and if it's not terminated at
John> all, then there's nothing to ignore, and happiness prevails. Am
John> I correct?

Almost. Since the parser expects you to deliver a sequence of lines via an
iterable, it requires that line termination be at the end of any string
supplied. The parser will raise an exception if any characters follow a
line terminator on any individual line string.

- Dave

-- 
http://www.object-craft.com.au

From sjmachin at lexicon.net Fri Feb 14 20:44:00 2003
From: sjmachin at lexicon.net (John Machin)
Date: Sat, 15 Feb 2003 06:44:00 +1100
Subject: [Csv] ignore blank lines?
In-Reply-To: <15949.16699.749757.280021@montanaro.dyndns.org>
References: <15945.28678.631121.25754@montanaro.dyndns.org> <15945.37976.592926.369940@montanaro.dyndns.org> <15946.25610.202929.623399@montanaro.dyndns.org> <15948.38242.974158.425677@montanaro.dyndns.org> <15949.16699.749757.280021@montanaro.dyndns.org>
Message-ID:

On Fri, 14 Feb 2003 13:19:23 -0600, Skip Montanaro wrote:
>
> >> Except the way the DictReader works (and the way I intended for it to
> >> work) is that you specify a default value when creating a reader.
> >> When you encounter a short row, the missing keys are all added to the
> >> dictionary, each associated with the default.
>
> John> What software have you met that actually outputs physically short
> John> rows in an environment where you are expecting a constant number
> John> of columns?
>
> Aside from Firewall-1's logfile exporter, none. It generates a bunch of
> rows with a constant number of fields, then mysteriously appends three
> blank lines (no commas) to the end. The change I implemented to
> DictReader was precisely because of this (broken, in my opinion)
> behavior. I doubt a database worth its salt would do anything like that.
>
> John> This is in fact in agreement with the point that I was trying to
> John> make: a completely empty line (as well as a line containing
> John> ",,,,,,,,") is unexpected and/or meaningless in your
> John> dictionary/database paradigm. We just need to agree on whether
> John> such a line should be silently jettisoned, or an easy-to-detect
> John> value should be returned to the caller.
>
> Well, I would argue that a row of commas just means a row of empty
> strings.

It can mean that the database has a row with all values NULL, or some
other equally disturbing circumstance.

> Other than that, I agree, I wouldn't expect blank lines or lines with
> too few columns from properly functioning programs which are supposed
> to dump rows with constant numbers of columns.

Exactly. Which makes me wonder why you have implemented defaults for short
rows.

> I guess my Python aphorism for the day is "Practicality beats purity."

I don't understand this comment. You are advocating (in fact have
implemented) hiding disturbing circumstances from the callers. Do you
classify this as practical or pure?

From sjmachin at lexicon.net Fri Feb 14 23:48:33 2003
From: sjmachin at lexicon.net (John Machin)
Date: Sat, 15 Feb 2003 09:48:33 +1100
Subject: mishandling of embedded NULs (was: Re: [Csv] trial zip/tar packages of csv module available)
In-Reply-To:
References: <15947.60244.162082.486394@montanaro.dyndns.org> <20030214061130.163773CC5D@coffee.object-craft.com.au>
Message-ID:

[John Machin]
>>> Judging by the fact that in _csv.c '\0' is passed around as a
>>> line-ending signal, it's not 8-bit-clean.
>>> This fact should be at least documented, if not fixed (which looks
>>> like a bit of a rewrite). Strange behaviour on embedded '\0' may worry
>>> not only pedants but also folk who are recipients of data files
>>> created by J. Random Boofhead III and friends.

[Andrew McNamara]
>> Yep - Skip - can you doco the fact that the input should not contain
>> null characters or be unicode strings?
>>
>> Null characters in the input will be treated as newlines, if I remember
>> correctly.

[John Machin]
> Docoing that would be useful as well.

[and it's me again:]
Actually it doesn't quite treat a NUL exactly like a newline; it throws
data away without any warning; see below.

>>> import csv
>>> guff = ["aaa\0bbb", "x\0\0y"]
>>> [x for x in csv.reader(guff)]
[['aaa'], ['x']]
>>> guff2 = ["aaa\nbbb", "x\n\ny"]
>>> [x for x in csv.reader(guff2)]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
_csv.Error: newline inside string
>>>

From skip at pobox.com Sat Feb 15 01:54:08 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 14 Feb 2003 18:54:08 -0600
Subject: mishandling of embedded NULs (was: Re: [Csv] trial zip/tar packages of csv module available)
In-Reply-To:
References: <15947.60244.162082.486394@montanaro.dyndns.org> <20030214061130.163773CC5D@coffee.object-craft.com.au>
Message-ID: <15949.36784.151916.149873@montanaro.dyndns.org>

John> Actually it doesn't quite treat a NUL exactly like a newline; it
John> throws data away without any warning; see below.

This is to be expected I think, considering C strings are being
manipulated at the low level. I just added a check to _csv.c and an extra
test. It now raises csv.Error if the file being read contains NUL bytes.
(Should an exception be raised on output as well?)

Skip

From sjmachin at lexicon.net Sat Feb 15 05:31:31 2003
From: sjmachin at lexicon.net (John Machin)
Date: Sat, 15 Feb 2003 15:31:31 +1100
Subject: mishandling of embedded NULs (was: Re: [Csv] trial zip/tar packages of csv module available)
In-Reply-To: <15949.36784.151916.149873@montanaro.dyndns.org>
References: <15947.60244.162082.486394@montanaro.dyndns.org> <20030214061130.163773CC5D@coffee.object-craft.com.au> <15949.36784.151916.149873@montanaro.dyndns.org>
Message-ID:

On Fri, 14 Feb 2003 18:54:08 -0600, Skip Montanaro wrote:
>
> John> Actually it doesn't quite treat a NUL exactly like a newline; it
> John> throws data away without any warning; see below.
>
> This is to be expected I think, considering C strings are being
> manipulated at the low level. I just added a check to _csv.c and an
> extra test. It now raises csv.Error if the file being read contains NUL
> bytes. (Should an exception be raised on output as well?)

Yes, but conditionally -- IMHO the caller should be able to specify
(strictwriting=True) that an exception should be raised on *any* attempt
to write data that could not be read back "sensibly" using the same
dialect etc. Getting exceptions or a different number of rows or columns
when the data are read back are certainly not "sensible". This general
regime would allow someone who must produce (say) a non-quoted
"|"-delimited file format to verify that there were no "|" in the data.
OTOH the caller can specify strictwriting=False if it's a "you asked for
it, you got it" situation.
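To make that concrete: the check I have in mind could be prototyped
outside the module along the following lines (a sketch only -- the
strictwriting option and this helper are my inventions, not anything that
exists in the package):

    import csv, StringIO

    def survives_round_trip(row, **fmtparams):
        # Write the row with the given format parameters, then read it
        # back with the same parameters and see whether it survived.
        buf = StringIO.StringIO()
        csv.writer(buf, **fmtparams).writerow(row)
        lines = buf.getvalue().splitlines(1)   # keep the terminators
        readback = csv.reader(lines, **fmtparams).next()
        return readback == [str(field) for field in row]

A writer created with strictwriting=True would raise csv.Error wherever
this returns false.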
Cheers,
John

From adalke at mindspring.com Sat Feb 15 09:24:45 2003
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat, 15 Feb 2003 01:24:45 -0700
Subject: [Csv] csv
Message-ID: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>

Hi,

I tried out the csv module Skip recently made reference to on c.l.py. I'm
afraid I didn't read the docs too clearly -- wanted to see if I could
figure out how to use the module without documentation ;)

Anyway, my file formats are either space delimited (no quotes -- the
following works: infile.readline().split(' ')) or tab delimited. (Note,
btw, that that is not split() and two adjacent spaces means there is an
empty field.)

I wanted to make a "space" dialect. I thought the following would work,
but it didn't.

>>> class Space(csv.Dialect):
...     delimiter = " "
...     quotechar = False
...     escapechar = False
...     doublequote = False
...     skipinitialspace = False
...     lineterminator = "\n"
...     quoting = csv.QUOTE_NONE
...
>>> Space()
<__main__.Space instance at 0x162ff8>
>>> csv.register_dialect("space", Space)
>>> csv.reader(open("/home/mug/test.smi"))
<_csv.reader object at 0x1df9c0>
>>> q=_
>>> for a in q:
...     pass
...
>>> a
['c1ccccc1 benzene']
>>> len(a)
1
>>> print open("/home/mug/test.smi").read()
c1ccccc1 benzene
>>>

Also, suppose for my own project I have a "SpaceDialect". The current API
requires a global registry for that dialect. I don't like the chance of
clobbering, though I know it to be rare. Would the ability to pass

    dialect = SpaceDialect

(that is, a Dialect subclass) rather than the name be an appropriate
addition to the API?

My apologies for not spending much time on this. I need to catch a plane
in a couple of hours. :(

Andrew
dalke at dalkescientific.com

From skip at pobox.com Sat Feb 15 12:07:18 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 15 Feb 2003 05:07:18 -0600
Subject: mishandling of embedded NULs (was: Re: [Csv] trial zip/tar packages of csv module available)
In-Reply-To:
References: <15947.60244.162082.486394@montanaro.dyndns.org> <20030214061130.163773CC5D@coffee.object-craft.com.au> <15949.36784.151916.149873@montanaro.dyndns.org>
Message-ID: <15950.8038.440199.244791@montanaro.dyndns.org>

(Last message before leaving for the plane...)

>> (Should an exception be raised on output as well?)

John> Yes, but conditionally -- IMHO the caller should be able to
John> specify (strictwriting=True) that an exception should be raised on
John> *any* attempt to write data that could not be read back
John> "sensibly"...

I believe the issue of reading/writing NUL bytes is just a temporary
limitation of the current implementation. It will be fixed in the future
(it has to be, because some Unicode encodings will read or write NULs in
the data stream), so we don't need to get very elaborate with our handling
of NULs. For now, simply raising an exception should suffice.

Skip

From sjmachin at lexicon.net Sat Feb 15 19:14:04 2003
From: sjmachin at lexicon.net (John Machin)
Date: Sun, 16 Feb 2003 05:14:04 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv)
In-Reply-To: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
Message-ID:

On Sat, 15 Feb 2003 01:24:45 -0700, Andrew Dalke wrote:
> Anyway, my file formats are either space delimited (no quotes -- the
> following works: infile.readline().split(' ')) or tab delimited. (Note,
> btw, that that is not split() and two adjacent spaces means there is an
> empty field.)
>
> I wanted to make a "space" dialect. I thought the following would
I thought the following would > work, but it didn't. > >>>> class Space(csv.Dialect): > ... delimiter = " " > ... quotechar = False > ... escapechar = False These should be one-byte strings, not booleans. > ... doublequote = False > ... skipinitialspace = False > ... lineterminator = "\n" > ... quoting = csv.QUOTE_NONE > ... >>>> Space() > <__main__.Space instance at 0x162ff8> >>>> csv.register_dialect("space", Space) >>>> csv.reader(open("/home/mug/test.smi")) You need to tell the reader factory which dialect to use, if you don't want the default ("excel"). csv.reader(open("/home/mug/test.smi"), dialect="space") > > Also, suppose for my own project I have a "SpaceDialect". > The current API requires a global registry for that dialect. > I don't like the chance of clobbering, though I know it to be > rare. Would the ability to pass > > dialect = SpaceDialect > > (that is, a Dialect subclass) rather than the name be > an appropriate addition to the API? > Registration is not persistent. What is the use case for registering a dialect in one module and using it in a csv.reader() or writer() call in another module? If no use case, then registration is pointless, and the class could be passed as the dialect argument. There are various problems brought out by Andrew's example; see attached file dalke.py These are (1) very obscure error message "TypeError: bad argument type for built-in operation" caused by using quotechar = False instead of quotechar = None Also this appears out of the reader() call, not the register_dialect() call!!! *IF* there is a valid use case for registration, then the dialect should be validated then, not when used. (2) says it needs quotechar != None even when quoting=QUOTE_NONE (3) The "quoting" argument is honoured only by writers, not by readers -- i.e. in general you can't reliably read back a file that you've created and in particular to read Andrew D's files you need to set quotechar to some char that you hope is not in the input -- maybe '\0'. (4) Maybe the whole dialect thing is a bit too baroque and Byzantine -- see example 5 in dalke.py. The **dict_of_arguments gadget offers the "don't need to type long list of arguments" advantage claimed for dialect classes, and you get the same obscure error message if you stuff up the type of an argument (see example 6) -- all of this without writing all that register/validate/etc code. Maybe if we jump in quickly we could get an improved error message in the Python core for 2.3: at least identify which arg has the problem, and if lucky get it to say e.g. "expected given " and hey let's go for broke, how about which function is being called and even stop confusing the punters by calling functions in extension modules "built-in". This would benefit all Python users, not just csv users. Cheers, John -------------- next part -------------- A non-text attachment was scrubbed... 
Name: dalke.py
Type: application/octet-stream
Size: 2742 bytes
Desc: not available
Url : http://mail.python.org/pipermail/csv/attachments/20030216/deb84579/attachment.obj

From djc at object-craft.com.au Sun Feb 16 11:59:21 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 16 Feb 2003 21:59:21 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv)
In-Reply-To:
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
Message-ID:

> There are various problems brought out by Andrew's example; see
> attached file dalke.py. These are:
>
> (1) very obscure error message "TypeError: bad argument type for
> built-in operation" caused by using quotechar = False instead of
> quotechar = None. Also this appears out of the reader() call, not the
> register_dialect() call!!! *IF* there is a valid use case for
> registration, then the dialect should be validated then, not when
> used.

+1 on that one. I scratched my head for a while when seeing that error
too. It wasn't until I read through the C code that the penny dropped.

> (2) says it needs quotechar != None even when quoting=QUOTE_NONE

+1

> (3) The "quoting" argument is honoured only by writers, not by
> readers -- i.e. in general you can't reliably read back a file that
> you've created, and in particular to read Andrew D's files you need
> to set quotechar to some char that you hope is not in the input --
> maybe '\0'.

Aside from the suggestion of '\0' as quotechar, I am not sure I follow
what you mean. If you set quoting so that it produces ambiguous output,
that is hardly the fault of the writer.

> (4) Maybe the whole dialect thing is a bit too baroque and Byzantine
> -- see example 5 in dalke.py. The **dict_of_arguments gadget offers
> the "don't need to type long list of arguments" advantage claimed
> for dialect classes, and you get the same obscure error message if
> you stuff up the type of an argument (see example 6) -- all of this
> without writing all that register/validate/etc code.

How much clearer would things be if the validation of dialects were pulled
up into the Python layer? Being able to see the Python code which raised
the exception would be a huge help to the user.

- Dave

-- 
http://www.object-craft.com.au

From djc at object-craft.com.au Sun Feb 16 12:04:59 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 16 Feb 2003 22:04:59 +1100
Subject: mishandling of embedded NULs (was: Re: [Csv] trial zip/tar packages of csv module available)
In-Reply-To: <15950.8038.440199.244791@montanaro.dyndns.org>
References: <15947.60244.162082.486394@montanaro.dyndns.org> <20030214061130.163773CC5D@coffee.object-craft.com.au> <15949.36784.151916.149873@montanaro.dyndns.org> <15950.8038.440199.244791@montanaro.dyndns.org>
Message-ID:

>>>>> "Skip" == Skip Montanaro writes:

Skip> (Last message before leaving for the plane...)

>>> (Should an exception be raised on output as well?)

John> Yes, but conditionally -- IMHO the caller should be able to
John> specify (strictwriting=True) that an exception should be raised
John> on *any* attempt to write data that could not be read back
John> "sensibly"...

Skip> I believe the issue of reading/writing NUL bytes is just a
Skip> temporary limitation of the current implementation. It will be
Skip> fixed in the future (it has to be, because some Unicode
Skip> encodings will read or write NULs in the data stream), so we
Skip> don't need to get very elaborate with our handling of NULs. For
Skip> now, simply raising an exception should suffice.

The '\0' to indicate line termination is a hangover from my original
code.
There is no reason why the code could not just use '\n' to signal end of
line (like everyone else on the planet).

- Dave

-- 
http://www.object-craft.com.au

From sjmachin at lexicon.net Sun Feb 16 12:11:46 2003
From: sjmachin at lexicon.net (John Machin)
Date: Sun, 16 Feb 2003 22:11:46 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv)
In-Reply-To:
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
Message-ID:

On Sun, 16 Feb 2003 05:14:04 +1100, John Machin wrote:
> (4) Maybe the whole dialect thing is a bit too baroque and Byzantine --
> see example 5 in dalke.py. The **dict_of_arguments gadget offers the
> "don't need to type long list of arguments" advantage claimed for
> dialect classes, and you get the same obscure error message if you stuff
> up the type of an argument (see example 6) -- all of this without
> writing all that register/validate/etc code.

I was wrong; I guessed that _csv.c used PyArg_PTAK, but it doesn't -- it
rams the Dialect (Python level) instance's attributes plus the csv.reader
keyword arguments into a DialectType (C level) instance, the setattr being
eventually done either by PyMember_SetOne, or in _csv.c itself -- in both
cases, a type mismatch means a call to PyErr_BadArgument(), which issues
the obscure message "bad argument type for built-in operation".

PyArg_PTAK gives a more meaningful message if the required type is a
single char, for example "argument 2 must be char, not int". However where
the required type is int, you get "an integer is required" ... looks like
a patch wouldn't go astray.

Cheers,
John

From sjmachin at lexicon.net Sun Feb 16 20:43:27 2003
From: sjmachin at lexicon.net (John Machin)
Date: Mon, 17 Feb 2003 06:43:27 +1100
Subject: mishandling of embedded NULs (was: Re: [Csv] trial zip/tar packages of csv module available)
In-Reply-To:
References: <15947.60244.162082.486394@montanaro.dyndns.org> <20030214061130.163773CC5D@coffee.object-craft.com.au> <15949.36784.151916.149873@montanaro.dyndns.org> <15950.8038.440199.244791@montanaro.dyndns.org>
Message-ID:

On 16 Feb 2003 22:04:59 +1100, Dave Cole wrote:
>>>>>> "Skip" == Skip Montanaro writes:
>
> Skip> (Last message before leaving for the plane...)
>>>> (Should an exception be raised on output as well?)
>
> John> Yes, but conditionally -- IMHO the caller should be able to
> John> specify (strictwriting=True) that an exception should be raised
> John> on *any* attempt to write data that could not be read back
> John> "sensibly"...
>
> Skip> I believe the issue of reading/writing NUL bytes is just a
> Skip> temporary limitation of the current implementation. It will be
> Skip> fixed in the future (it has to be, because some Unicode
> Skip> encodings will read or write NULs in the data stream), so we
> Skip> don't need to get very elaborate with our handling of NULs. For
> Skip> now, simply raising an exception should suffice.
>
Dave> The '\0' to indicate line termination is a hangover from my original
Dave> code. There is no reason why the code could not just use '\n' to
Dave> signal end of line (like everyone else on the planet).

Are you sure? I had the impression that it was used as an out-of-band
signal -- something that didn't appear in the data (you hope!) -- so that
you could take exception to newlines that weren't at the end of line.
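To illustrate what I mean by out-of-band (a toy sketch only -- quoting and
escaping ignored, nothing like the real C code):

    def scan(line):
        # assumes the terminator has already been stripped from line;
        # '\0' cannot occur in (sane) data, so the inner loop can treat
        # a real '\n' as suspect data rather than as end-of-record
        fields, field = [], []
        for ch in line + "\0":
            if ch == "\0":          # unambiguous end-of-record marker
                fields.append("".join(field))
            elif ch in "\r\n":      # never a legitimate end-of-record here
                raise ValueError("newline inside string")
            elif ch == ",":
                fields.append("".join(field))
                field = []
            else:
                field.append(ch)
        return fields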
From sjmachin at lexicon.net Sun Feb 16 23:32:09 2003
From: sjmachin at lexicon.net (John Machin)
Date: Mon, 17 Feb 2003 09:32:09 +1100
Subject: [Csv] escapechar confusion
Message-ID:

Docstring:

    " csv.QUOTE_NONE means that quotes are never placed around fields.\n"
    " * escapechar - specifies a one-character string used to escape \n"
    " the delimiter when quoting is set to QUOTE_NONE.\n"

===

libcsv.tex [note especially the alleged treatment of escapechar when
doublequote == False]:

    \begin{memberdesc}[boolean]{doublequote}
    Controls how instances of \var{quotechar} appearing inside a field
    should themselves be quoted. When \constant{True}, the character is
    doubled. When \constant{False}, the \var{escapechar} must be a
    one-character string which is used as a prefix to the \var{quotechar}.
    It defaults to \constant{True}.
    \end{memberdesc}

    \begin{memberdesc}{escapechar}
    A one-character string used to escape the \var{delimiter} if
    \var{quoting} is set to \constant{QUOTE_NONE}. It defaults to
    \constant{None}.
    \end{memberdesc}

===

My attempt at clarifying requirements on fiddling the contents of each
field being written [in the examples, escapechar = '~' (to avoid
backslashorrhea), and assume delimiter = ',' and quotechar = '"']:

if quoting == QUOTE_NONE and escapechar is not None:
    escape the delimiter, lineterminator(s), and the escapechar itself

    Level 3, Macackie Mansions   -> Level 3~, Macackie Mansions
    Level 3, "Macackie Mansions" -> Level 3~, "Macackie Mansions"
    Can~on Grando                -> Can~~on Grando

    # This scheme is plausible, unambiguous and in fact more efficient
    # than the "standard" doubling-of-quotes scheme.

elif quoting != QUOTE_NONE and not doublequote:
    if escapechar is None: raise "..."
    escape the quotechar and the escapechar itself
    Note: there is no *need* to escape the delimiter or line terminators,
    as they are "covered" by the quoting.

    Level 3, Macackie Mansions   -> "Level 3, Macackie Mansions"
    Level 3, "Macackie Mansions" -> "Level 3, ~"Macackie Mansions~""
    Can~on Grando                -> "Can~~on Grando"

    # This scheme is bizarre (like some other CSV mutants) but at least
    # it doesn't cause ambiguity on input.
    # What software does this? Who sponsored its inclusion?
    # Does it need option(s) to cater for (redundantly) escaping
    # (a) the delimiter (b) line terminator(s)?
    # And it hasn't been implemented on output -- see below.

else:
    escapechar is not used

===

What _csv.c does on output:

>>> source = [123456, 'aaa,bbb', 'ccc,"ddd"', '"eee",fff', 9876.5]
>>> csv.writer(sys.stdout, escapechar="~", quoting=csv.QUOTE_NONE,
...            doublequote=False).writerow(source)
123456,aaa~,bbb,ccc~,"ddd","eee"~,fff,9876.5
# as expected
>>> csv.writer(sys.stdout, escapechar="~", quoting=csv.QUOTE_MINIMAL,
...            doublequote=False).writerow(source)
123456,"aaa,bbb","ccc,"ddd"",""eee",fff",9876.5
# No escaping done

===

What _csv.c does on input:

Firstly, the simple escape scheme:

>>> indata1 = ['123456,aaa~,bbb,ccc~,"ddd","eee"~,fff,9876.5']
>>> [x for x in csv.reader(indata1, escapechar="~", quoting=csv.QUOTE_NONE,
...                        doublequote=True)]
[['123456', 'aaa,bbb', 'ccc,"ddd"', 'eee~', 'fff', '9876.5']]
# wrong or confusing: QUOTE_NONE, but still testing for quotechar at
# start of field
>>> [x for x in csv.reader(indata1, escapechar="~", quoting=csv.QUOTE_NONE,
...                        doublequote=False)]
[['123456', 'aaa,bbb', 'ccc,"ddd"', 'eee,fff', '9876.5']]
# wrong or confusing: QUOTE_NONE, but still testing for quotechar at
# start of field
>>> [x for x in csv.reader(indata1, escapechar="~", quoting=csv.QUOTE_NONE,
...                        doublequote=False, quotechar=None)]
TypeError: bad argument type for built-in operation
# already grumbled about this
>>> [x for x in csv.reader(indata1, escapechar="~", quoting=csv.QUOTE_NONE,
...                        doublequote=False, quotechar="!")]
[['123456', 'aaa,bbb', 'ccc,"ddd"', '"eee",fff', '9876.5']]
# actual == expected

Secondly, the bizarre scheme (escaping the quotechar):

>>> indata2 = ['123456,aaa~,bbb,ccc~,"ddd","eee"~,fff,"ggg,~"hhh~"",iii-~"jjj~",9876.5']
>>> [x for x in csv.reader(indata2, escapechar="~",
...                        quoting=csv.QUOTE_MINIMAL, doublequote=False,
...                        quotechar='"')]
[['123456', 'aaa,bbb', 'ccc,"ddd"', 'eee,fff', 'ggg,"hhh"', 'iii-"jjj"',
'9876.5']]
# bizarre + options; this is assuming that the writer was escaping
# delimiters

--

From djc at object-craft.com.au Sun Feb 16 23:52:51 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 17 Feb 2003 09:52:51 +1100
Subject: mishandling of embedded NULs (was: Re: [Csv] trial zip/tar packages of csv module available)
In-Reply-To:
References: <15947.60244.162082.486394@montanaro.dyndns.org> <20030214061130.163773CC5D@coffee.object-craft.com.au> <15949.36784.151916.149873@montanaro.dyndns.org> <15950.8038.440199.244791@montanaro.dyndns.org>
Message-ID:

>>>>> "John" == John Machin writes:

Dave> The '\0' to indicate line termination is a hangover from my
Dave> original code. There is no reason why the code could not just
Dave> use '\n' to signal end of line (like everyone else on the
Dave> planet).

John> Are you sure? I had the impression that it was used as an
John> out-of-band signal -- something that didn't appear in the data
John> (you hope!) -- so that you could take exception to newlines that
John> weren't at the end of line.

The outer loop of the parser detects the end-of-line variations '\n',
'\r\n', and '\r', and checks for following characters. If characters are
discovered, an exception is raised. If no characters follow, the inner
parsing code is passed '\0' to indicate end of line. Since there was no
way for the inner code to ever receive a '\n' as data, I changed the '\0'
special value to '\n'.
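In Python terms, the outer loop amounts to something like this (an
illustrative sketch only -- the real thing is C, and the details differ):

    import csv

    def check_terminator(line):
        # Accept '\r\n', '\r' or '\n' at the very end of the string.
        for term in ("\r\n", "\r", "\n"):
            if line.endswith(term):
                line = line[:-len(term)]
                break
        # Anything following a terminator is an error at this level.
        if "\r" in line or "\n" in line:
            raise csv.Error, "newline inside string"
        # The inner state machine now sees a bare '\n' as end-of-line.
        return line + "\n"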
- Dave -- http://www.object-craft.com.au From sjmachin at lexicon.net Mon Feb 17 00:00:23 2003 From: sjmachin at lexicon.net (John Machin) Date: Mon, 17 Feb 2003 10:00:23 +1100 Subject: Andrew Dalke's space example (was Re: [Csv] csv) In-Reply-To: References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro> Message-ID: On 16 Feb 2003 21:59:21 +1100, Dave Cole wrote: [John Machin] >> There are various problems brought out by Andrew's example; see >> attached file dalke.py >> (3) The "quoting" argument is honoured only by writers, not by >> readers -- i.e. in general you can't reliably read back a file that >> you've created and in particular to read Andrew D's files you need >> to set quotechar to some char that you hope is not in the input -- >> maybe '\0'. [Dave Cole] > Aside from the quote of '\0', I am not sure I follow what you mean. > If you set quoting so that it produces ambiguous output that is hardly > the fault of the writer. Of course not. What I was getting at was that the ability to write various schemes (some ambiguous, some not) is provided, but it is not possible to read back all unambiguous schemes, and there is little if any support for checking that the data corresponds to the scheme the caller thinks was used to write it, and there are no options to drive what to do on input if the writing scheme was ambiguous. [John Machin] >> (4) Maybe the whole dialect thing is a bit too baroque and Byzantine >> -- see example 5 in dalke.py. The **dict_of_arguments gadget offers >> the "don't need to type long list of arguments" advantage claimed >> for dialect classes, and you get the same obscure error message if >> you stuff up the type of an argument (see example 6) -- all of this >> without writing all that register/validate/etc code. > [Dave Cole] > How much clearer would things be if the validation of dialects were > pulled up into the Python? Being able to see the Python code which > raised the exception would be a huge help to the user. How much clearer would things be if the error message said "quotechar must be char, not int"? The clarity should arise from the error message, not from its source. I think it a reasonable goal that a developer should have to inspect the callee's source (if available!) only in desperation. The one line of source that is shown in the traceback from Python modules is sometimes not very helpful e.g. the above reasonably helpful error message could have been produced by something like this: raise NastyError, "%s must be %s, not %s" % (self.attr_name[k], self.attr_type_abbr[k], show_type(input_value)) No comments on the possibility of throwing the whole dialect-via-classes idea away??? -- From andrewm at object-craft.com.au Mon Feb 17 00:17:23 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Mon, 17 Feb 2003 10:17:23 +1100 Subject: Andrew Dalke's space example (was Re: [Csv] csv) In-Reply-To: Message from John Machin of "Sun, 16 Feb 2003 22:11:46 +1100." References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro> Message-ID: <20030216231723.202913CC5C@coffee.object-craft.com.au> >PyArg_PTAK gives a more meaningful message if the required type is a single >char, for example "argument 2 must be char, not int". However where the >required type is int, you get "an integer is required" ... looks like a >patch wouldn't go astray. PyArg_PTAK was originally used, but really isn't well suited to what we're trying to do, and ends up raising obscure errors of it's own (or, more to the point, goes subtly wrong without warning the user). 
Giving the C DialectType a setattr which does the input validation is probably the better answer. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From djc at object-craft.com.au Mon Feb 17 00:30:47 2003 From: djc at object-craft.com.au (Dave Cole) Date: 17 Feb 2003 10:30:47 +1100 Subject: Andrew Dalke's space example (was Re: [Csv] csv) In-Reply-To: References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro> Message-ID: >>>>> "John" == John Machin writes: John> [Dave Cole] >> Aside from the quote of '\0', I am not sure I follow what you mean. >> If you set quoting so that it produces ambiguous output that is >> hardly the fault of the writer. John> Of course not. What I was getting at was that the ability to John> write various schemes (some ambiguous, some not) is provided, John> but it is not possible to read back all unambiguous schemes, and John> there is little if any support for checking that the data John> corresponds to the scheme the caller thinks was used to write John> it, and there are no options to drive what to do on input if the John> writing scheme was ambiguous. I must be a bit thick or something... I have the feeling you are correct, but I just can't see it. Can you provide some (simple) examples and suggest where the code could be improved? John> [Dave Cole] >> How much clearer would things be if the validation of dialects were >> pulled up into the Python? Being able to see the Python code which >> raised the exception would be a huge help to the user. John> How much clearer would things be if the error message said John> "quotechar must be char, not int"? Probably only 7 squillion percent. John> The clarity should arise from the error message, not from its John> source. I think it a reasonable goal that a developer should John> have to inspect the callee's source (if available!) only in John> desperation. The one line of source that is shown in the John> traceback from Python modules is sometimes not very helpful John> e.g. the above reasonably helpful error message could have been John> produced by something like this: John> raise NastyError, "%s must be %s, not %s" % John> (self.attr_name[k], self.attr_type_abbr[k], John> show_type(input_value)) John> No comments on the possibility of throwing the whole John> dialect-via-classes idea away??? The dialect should validate when you instantiate it. This probably means that we should require a csv.Dialect instance rather than a class as the parameter to csv.reader() and csv.writer(). >>> class Space(csv.Dialect): ... delimiter = " " ... quotechar = False ... escapechar = False ... doublequote = False ... skipinitialspace = False ... lineterminator = "\n" ... quoting = csv.QUOTE_NONE ... >>> Space() <__main__.Space instance at 0x401f3dcc> Is it possible for the csv.Dialect to raise an exception when Space is instantiated? I don't know enough about the new style classes. - Dave -- http://www.object-craft.com.au From djc at object-craft.com.au Mon Feb 17 00:35:11 2003 From: djc at object-craft.com.au (Dave Cole) Date: 17 Feb 2003 10:35:11 +1100 Subject: Andrew Dalke's space example (was Re: [Csv] csv) In-Reply-To: <20030216231723.202913CC5C@coffee.object-craft.com.au> References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro> <20030216231723.202913CC5C@coffee.object-craft.com.au> Message-ID: >>>>> "Andrew" == Andrew McNamara writes: >> PyArg_PTAK gives a more meaningful message if the required type is >> a single char, for example "argument 2 must be char, not >> int". 
However where the required type is int, you get "an integer >> is required" ... looks like a patch wouldn't go astray. Andrew> PyArg_PTAK was originally used, but really isn't well suited Andrew> to what we're trying to do, and ends up raising obscure errors Andrew> of it's own (or, more to the point, goes subtly wrong without Andrew> warning the user). Andrew> Giving the C DialectType a setattr which does the input Andrew> validation is probably the better answer. Does that mean that the validation is only on individual attributes, not on the set of attributes? - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Mon Feb 17 00:44:22 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Mon, 17 Feb 2003 10:44:22 +1100 Subject: Andrew Dalke's space example (was Re: [Csv] csv) In-Reply-To: Message from Dave Cole References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro> <20030216231723.202913CC5C@coffee.object-craft.com.au> Message-ID: <20030216234422.6E7AD3CC5C@coffee.object-craft.com.au> >>> PyArg_PTAK gives a more meaningful message if the required type is >>> a single char, for example "argument 2 must be char, not >>> int". However where the required type is int, you get "an integer >>> is required" ... looks like a patch wouldn't go astray. > >Andrew> PyArg_PTAK was originally used, but really isn't well suited >Andrew> to what we're trying to do, and ends up raising obscure errors >Andrew> of it's own (or, more to the point, goes subtly wrong without >Andrew> warning the user). > >Andrew> Giving the C DialectType a setattr which does the input >Andrew> validation is probably the better answer. > >Does that mean that the validation is only on individual attributes, >not on the set of attributes? Yep - at the moment there are no inter-attribute checks. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From sjmachin at lexicon.net Mon Feb 17 00:47:25 2003 From: sjmachin at lexicon.net (John Machin) Date: Mon, 17 Feb 2003 10:47:25 +1100 Subject: Andrew Dalke's space example (was Re: [Csv] csv) In-Reply-To: <20030216231723.202913CC5C@coffee.object-craft.com.au> References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro> <20030216231723.202913CC5C@coffee.object-craft.com.au> Message-ID: On Mon, 17 Feb 2003 10:17:23 +1100, Andrew McNamara wrote: >> PyArg_PTAK gives a more meaningful message if the required type is a >> single char, for example "argument 2 must be char, not int". However >> where the required type is int, you get "an integer is required" ... >> looks like a patch wouldn't go astray. > > PyArg_PTAK was originally used, but really isn't well suited to what > we're > trying to do, Hmmm ... nobody seems to want to discuss my point that what you're trying to do (the whole dialect thing) is a bit over the top. > and ends up raising obscure errors of it's own (or, more to > the point, goes subtly wrong without warning the user). Can you give an example of "goes subtly wrong without warning"? Have you reported these problems? I recall noticing a while back that it would silently truncate a supplied float to fit a desired int w/o any complaint [rationale is evidently : "floats have an int() method, don't they?"] -- is that the sort of thing you mean? 
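To make the setattr suggestion concrete, here is a pure-Python sketch of
per-attribute validation, using John's style of error message (the
attribute table and the checks are illustrative assumptions, not the
module's actual code):

    class ValidatingDialect(object):
        # Hypothetical table: attribute -> (expected description, check).
        _checks = {
            'delimiter':   ('char',
                            lambda v: isinstance(v, str) and len(v) == 1),
            'quotechar':   ('char or None',
                            lambda v: v is None or
                                      (isinstance(v, str) and len(v) == 1)),
            'doublequote': ('bool', lambda v: v in (True, False)),
        }

        def __setattr__(self, name, value):
            if name in self._checks:
                expected, ok = self._checks[name]
                if not ok(value):
                    # The clarity lives in the message, not in the source.
                    raise TypeError('%s must be %s, not %s'
                                    % (name, expected, type(value).__name__))
            object.__setattr__(self, name, value)

Setting ValidatingDialect().quotechar = 5 then raises
"TypeError: quotechar must be char or None, not int".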
-- From djc at object-craft.com.au Mon Feb 17 00:59:27 2003 From: djc at object-craft.com.au (Dave Cole) Date: 17 Feb 2003 10:59:27 +1100 Subject: Andrew Dalke's space example (was Re: [Csv] csv) In-Reply-To: References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro> <20030216231723.202913CC5C@coffee.object-craft.com.au> Message-ID: >>>>> "John" == John Machin writes: John> Hmmm ... nobody seems to want to discuss my point that what John> you're trying to do (the whole dialect thing) is a bit over the John> top. I think that rationale is more along the lines of "validation of "random" objects created in Python is harder than validation of objects created by code we control", but I could be wrong. >> and ends up raising obscure errors of it's own (or, more to the >> point, goes subtly wrong without warning the user). John> Can you give an example of "goes subtly wrong without warning"? John> Have you reported these problems? I recall noticing a while John> back that it would silently truncate a supplied float to fit a John> desired int w/o any complaint [rationale is evidently : "floats John> have an int() method, don't they?"] -- is that the sort of thing John> you mean? When/where did it silently truncate a float? Can you provide an example? - Dave -- http://www.object-craft.com.au From andrewm at object-craft.com.au Mon Feb 17 01:06:13 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Mon, 17 Feb 2003 11:06:13 +1100 Subject: Andrew Dalke's space example (was Re: [Csv] csv) In-Reply-To: Message from John Machin of "Mon, 17 Feb 2003 10:47:25 +1100." References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro> <20030216231723.202913CC5C@coffee.object-craft.com.au> Message-ID: <20030217000613.CFE2D3CC5C@coffee.object-craft.com.au> >> PyArg_PTAK was originally used, but really isn't well suited to what >> we're trying to do, > >Hmmm ... nobody seems to want to discuss my point that what you're trying >to do (the whole dialect thing) is a bit over the top. Yep - we had this discussion early on - the list archives should have details: http://manatee.mojam.com/pipermail/csv/ Note that the registry stuff is entirely optional. You can pass a class or instance as the dialect, and it will work as expected. The doco should probably be updated to mention this. >> and ends up raising obscure errors of it's own (or, more to >> the point, goes subtly wrong without warning the user). > >Can you give an example of "goes subtly wrong without warning"? Have you >reported these problems? What we're trying to do is not what PyArg_PTAK does well - it's not PyArg_PTAK's fault that it doesn't do what we want... One problem was that PyArg_PTAK tries to hide the distinction between positional and keyword arguments - every keyword argument is given a position. This was more of a problem in the old days (the parameters were originally all positional). -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From sjmachin at lexicon.net Mon Feb 17 02:13:17 2003 From: sjmachin at lexicon.net (John Machin) Date: Mon, 17 Feb 2003 12:13:17 +1100 Subject: reading back what you wrote (was Re: Andrew Dalke's space example (was Re: [Csv] csv)) In-Reply-To: References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro> Message-ID: On 17 Feb 2003 10:30:47 +1100, Dave Cole wrote: >>>>>> "John" == John Machin writes: > > John> [Dave Cole] >>> Aside from the quote of '\0', I am not sure I follow what you mean. >>> If you set quoting so that it produces ambiguous output that is >>> hardly the fault of the writer. 
>
> John> Of course not.  What I was getting at was that the ability to
> John> write various schemes (some ambiguous, some not) is provided,
> John> but it is not possible to read back all unambiguous schemes, and
> John> there is little if any support for checking that the data
> John> corresponds to the scheme the caller thinks was used to write
> John> it, and there are no options to drive what to do on input if the
> John> writing scheme was ambiguous.
>
> I must be a bit thick or something... I have the feeling you are
> correct, but I just can't see it.  Can you provide some (simple)
> examples and suggest where the code could be improved?
>

Here is my approach:

(1) Define not only a scheme for writing "standard" CSV but schemes for
writing the various mutations that I have come across.

(2) Have a strict_output option to govern behaviour when the input is such
that output cannot be reversed (exception immediately, exception at end if
error count is not zero, no exception).  Example: (a) someone wants to
write using a no-quoting scheme but they have a delimiter inside a field;
(b) a doublequote=False, escapechar=None scheme but there is a quotechar
in the data.

(3) On input, require the caller to specify exactly what scheme they think
was used to create the data.  Check carefully that the incoming data
corresponds to the alleged scheme.  Again, have a strict_input option.

Here we have some data that was written by a doublequote=False,
escapechar=None, quoting=QUOTE_ALL scheme:

>>> badcsv = ['"quotes not doubled"', '"rear of "Fubar Flats""',
...           '""Thistle Do" RMB 123"']

and it is munged w/o warning if read with standard CSV settings:

>>> [x for x in csv.reader(badcsv)]
[['quotes not doubled'], ['rear of Fubar Flats""'], ['Thistle Do" RMB 123"']]

and trying to tell the csv module what to do doesn't help:

>>> [x for x in csv.reader(badcsv, doublequote=False, escapechar=None)]
[['quotes not doubled'], ['rear of Fubar Flats""'], ['Thistle Do" RMB 123"']]

It is possible to recover the data if each field had an even number of
quotes, but this requires a quite different state machine:

>>> badcsvstr = '"quotes not doubled"\n"rear of "Fubar Flats""\n""Thistle Do" RMB 123"'

# my module requires input iterables only to deliver one or more bytes per
# iteration, i.e. an iteration can yield more or less than exactly one
# line, and the module does the end-of-line detection -- and yes, it
# special-cases the iterable being a string, for obvious efficiency
# reasons.

>>> [x for x in delimited.importer(badcsvstr,
...                                quote_mode=delimited.QUOTE_SINGLE)]
[['quotes not doubled'], ['rear of "Fubar Flats"'], ['"Thistle Do" RMB 123']]
# We've recovered what was most likely to have been in the original data

and will crack it if told that this data is standard CSV:

>>> impo = delimited.importer(badcsvstr)
>>> list(impo)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
delimited.DataError: After rear_quote, expected rear_quote, delimiter or
newline; found (hex 46)

and just in case you're trying to find the offending line in a 100 Mb file:

>>> impo.input_row_number, impo.input_char_column
(1, 10)  # zero-relative

Hope this explains where I'm coming from ...
Cheers,
John

From sjmachin at lexicon.net Mon Feb 17 02:35:30 2003
From: sjmachin at lexicon.net (John Machin)
Date: Mon, 17 Feb 2003 12:35:30 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv)
In-Reply-To: <20030217000613.CFE2D3CC5C@coffee.object-craft.com.au>
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<20030216231723.202913CC5C@coffee.object-craft.com.au>
	<20030217000613.CFE2D3CC5C@coffee.object-craft.com.au>
Message-ID: 

On Mon, 17 Feb 2003 11:06:13 +1100, Andrew McNamara wrote:
>
> Note that the registry stuff is entirely optional. You can pass a class
> or instance as the dialect, and it will work as expected. The doco should
> probably be updated to mention this.

Yes, it should.  What is the use case for the registry, anyway?

From andrewm at object-craft.com.au Mon Feb 17 02:40:37 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 17 Feb 2003 12:40:37 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv)
In-Reply-To: Message from John Machin of "Mon, 17 Feb 2003 12:35:30 +1100."
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<20030216231723.202913CC5C@coffee.object-craft.com.au>
	<20030217000613.CFE2D3CC5C@coffee.object-craft.com.au>
Message-ID: <20030217014037.899903CC5C@coffee.object-craft.com.au>

>> Note that the registry stuff is entirely optional. You can pass a class
>> or instance as the dialect, and it will work as expected. The doco should
>> probably be updated to mention this.
>
>Yes, it should. What is the use case for the registry, anyway?

Actually, it started out being an internal implementation detail.  You're
supposed to be able to specify common dialects via a string (for example
"excel"), and obviously the module needed some way of recording these.

You can, in fact, pretend the dialect classes don't exist.  This works
fine:

    r = csv.reader(input_file, delimiter = '\t')

The module supplies default values for all the parameters - the defaults
correspond to the way Excel parses csv files.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au Mon Feb 17 03:09:34 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 17 Feb 2003 13:09:34 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv)
In-Reply-To: Message from John Machin of "Mon, 17 Feb 2003 12:59:30 +1100."
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<20030216231723.202913CC5C@coffee.object-craft.com.au>
	<20030217000613.CFE2D3CC5C@coffee.object-craft.com.au>
	<20030217014037.899903CC5C@coffee.object-craft.com.au>
Message-ID: <20030217020934.0D5773CC5C@coffee.object-craft.com.au>

>The "supposed to be able to specify common dialects via a string"
>requirement seems rather superfluous when you can pass in a class or an
>instance. Thus the registry caper seems also superfluous.

I think the requirement came from the GUI camp - they want to be able to
provide their users with a pulldown with a list of supported file formats.

>> You can, in fact, pretend the dialect classes don't exist.
>
>Yes, I'd noticed. This just means that you then need extra code (in C!) to
>validate the keyword arguments and cram them into the Dialect instance.
>
>What do you lose, apart from a maintenance headache, if you throw away the
>whole Dialect notion and just stick to key-word arguments (with appropriate
>defaults, of course)?

Well, then you have Object Craft's csv module (on which the current
implementation was based)...
8-) But you still need to do a whole heap of validation whichever way you do it. The current validation could certainly do with more work. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From andrewm at object-craft.com.au Wed Feb 19 01:01:44 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Wed, 19 Feb 2003 11:01:44 +1100 Subject: [Csv] Out of town, BDFL pronouncement, incorporation, Unicode In-Reply-To: Message from Andrew McNamara <20030214063445.3A0A73CC5D@coffee.object-craft.com.au> References: <15948.35170.966135.741531@montanaro.dyndns.org> <20030214063445.3A0A73CC5D@coffee.object-craft.com.au> Message-ID: <20030219000144.AF27C3CC5E@coffee.object-craft.com.au> >>Assuming nothing earth-shattering develops by mid-week, would one of you >>like to propose on python-dev that Guido pronounce on the PEP and give a >>thumbs-up or -down on the module? I can take care of merging it into the >>Python distribution (stitch it into setup.py, the test directory and the >>libref manual) when I return. > >Okay. Guido's doing the 2.3a2 release today - we're not going to get into a2, so I'm going to wait until he's finished with a2 before posting. I also think we have a few doco and other issues that have been discussed in the last week that need to be tidied up. I'm rather short of time at the moment - any help others can give (going back through the archive and making a TODO list would be valuable) would be appreciated. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From skip at pobox.com Mon Feb 24 02:15:44 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun, 23 Feb 2003 19:15:44 -0600 Subject: Andrew Dalke's space example (was Re: [Csv] csv) In-Reply-To: References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro> <20030216231723.202913CC5C@coffee.object-craft.com.au> <20030217000613.CFE2D3CC5C@coffee.object-craft.com.au> Message-ID: <15961.29248.867924.249023@montanaro.dyndns.org> John> Yes, it should. What is the use case for the registry, anyway? My original thought was that the module itself would grow new dialects over time and that it would be easier for programmers and users to remember and recognize strings like "excel" or "gnumeric" or "appleworks". The biggest use for a registry is probably within GUI apps that need to read/write CSV files. The strings make nice pop-up menu items, then are internally used as keys in the "registry", which is nothing more than a dict. Skip From skip at pobox.com Wed Feb 26 16:09:50 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 26 Feb 2003 09:09:50 -0600 Subject: [Csv] What's our status? Message-ID: <15964.55486.242989.782539@montanaro.dyndns.org> Guys, Are we ready to go? As you can see from the attached PEP 283 checkin message, Guido is hopeful. Skip -------------- next part -------------- An embedded message was scrubbed... From: gvanrossum at users.sourceforge.net Subject: [Python-checkins] python/nondist/peps pep-0283.txt,1.31,1.32 Date: Wed, 26 Feb 2003 06:58:15 -0800 Size: 6093 Url: http://mail.python.org/pipermail/csv/attachments/20030226/f98d19f0/attachment.mht From LogiplexSoftware at earthlink.net Wed Feb 26 18:32:01 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 26 Feb 2003 09:32:01 -0800 Subject: [Csv] What's our status? 
In-Reply-To: <15964.55486.242989.782539@montanaro.dyndns.org> References: <15964.55486.242989.782539@montanaro.dyndns.org> Message-ID: <1046280720.27223.9.camel@software1.logiplex.internal> On Wed, 2003-02-26 at 07:09, Skip Montanaro wrote: > Guys, > > Are we ready to go? As you can see from the attached PEP 283 checkin > message, Guido is hopeful. > > Skip > I'm fairly happy with the state of the csv parser and the PEP. I'm working on csvutils.py right now. The guessDelimiter() function from DSV isn't really the best for our purposes as it expects a fairly fixed number of columns and we're allowing for variable columns per row. Also, allowing spaces around delimiters is going to throw guessQuoteChar(). I've got some ideas for fixing guessQuoteChar() but guessDelimiter is going to need an entirely new approach (which I think I have an idea for =) Sorry for being a slug. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Wed Feb 26 18:42:15 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 26 Feb 2003 11:42:15 -0600 Subject: [Csv] What's our status? In-Reply-To: <1046280720.27223.9.camel@software1.logiplex.internal> References: <15964.55486.242989.782539@montanaro.dyndns.org> <1046280720.27223.9.camel@software1.logiplex.internal> Message-ID: <15964.64631.161113.441183@montanaro.dyndns.org> Cliff> I'm fairly happy with the state of the csv parser and the PEP. Cliff> I'm working on csvutils.py right now. Let's not wait terribly long to get things in the mill. If I remember correctly, 2.3b1 will be out around mid-March. I'd like to ask Guido to pronounce on the PEP and code in the next few days if possible. I will post a note to python-dev asking people to take a look at the code and the PEP. Skip From skip at pobox.com Wed Feb 26 19:03:20 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 26 Feb 2003 12:03:20 -0600 Subject: [Csv] PEP 305 - CSV File API - please have a look Message-ID: <15965.360.659692.788321@montanaro.dyndns.org> Folks, In advance of asking Guido to review and pronounce on PEP 305 and its related code, I'd like to ask you to take a few minutes to review what we've produced. There is the PEP, of course: http://www.python.org/peps/pep-0305.html but there is also source code, a large number of test cases and a libref section available in the CVS sandbox. Cliff Wells is working on a csvutils module which will contain adaptations of the "sniffing" routines from his DSV package. Just do a "csv up -dP ." in your nondist/sandbox directory to get the latest version of everything. Feel free to review and/or comment on any or all of it, but please please post your comments to the csv at mail.mojam.com mailing list. You can review our rather active correspondence at http://manatee.mojam.com/pipermail/csv/ or if you're really excited about CSV files, you can subscribe at http://manatee.mojam.com/mailman/listinfo/csv Thx, Skip From guido at python.org Wed Feb 26 19:10:47 2003 From: guido at python.org (Guido van Rossum) Date: Wed, 26 Feb 2003 13:10:47 -0500 Subject: [Csv] Re: [Python-Dev] PEP 305 - CSV File API - please have a look In-Reply-To: Your message of "Wed, 26 Feb 2003 12:03:20 CST." <15965.360.659692.788321@montanaro.dyndns.org> References: <15965.360.659692.788321@montanaro.dyndns.org> Message-ID: <200302261810.h1QIAmT20744@odiug.zope.com> > Just do a "csv up -dP ." in your nondist/sandbox directory to get the latest You've been typing csv too much. 
:-)

--Guido van Rossum (home page: http://www.python.org/~guido/)

From skip at pobox.com Wed Feb 26 21:27:44 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 14:27:44 -0600
Subject: [Csv] Dialect.validate()
Message-ID: <15965.9024.863156.732714@montanaro.dyndns.org>

If you have a moment, please take a look at the simple-minded validate()
method in the Dialect class.  I'm sure it can be strengthened quite a bit.

Thx,
Skip

From skip at pobox.com Wed Feb 26 21:37:32 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 14:37:32 -0600
Subject: [Csv] Dialect validation errors
Message-ID: <15965.9612.508559.964220@montanaro.dyndns.org>

Any thoughts on the way I'm generating Dialect validation errors as a list
of strings?  I'm starting to write test cases for that stuff and it occurs
to me that checking for specific strings in the validation output is going
to be fragile.

Skip

From skip at pobox.com Wed Feb 26 22:42:49 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 15:42:49 -0600
Subject: [Csv] ignore blank lines?
In-Reply-To: 
References: <15945.28678.631121.25754@montanaro.dyndns.org>
	<15945.37976.592926.369940@montanaro.dyndns.org>
	<15946.25610.202929.623399@montanaro.dyndns.org>
	<15948.38242.974158.425677@montanaro.dyndns.org>
	<15949.16699.749757.280021@montanaro.dyndns.org>
Message-ID: <15965.13529.765871.804925@montanaro.dyndns.org>

Returning to an old pre-vacation topic...

>> Well, I would argue that a row of commas just means a row of empty
>> strings.

John> It can mean that the database has a row with all values NULL, or
John> some other equally disturbing circumstance.

We've already established that there is no way to store NULL/None values
in a CSV file and have them be reliably reconstituted when the file is
read back in.

>> Other than that, I agree, I wouldn't expect blank lines or lines with
>> too few columns from properly functioning programs which are supposed
>> to dump rows with constant numbers of columns.

John> Exactly.  Which makes me wonder why you have implemented defaults
John> for short rows.  Perhaps it's just overkill.

>> I guess my Python aphorism for the day is "Practicality beats
>> purity."

John> I don't understand this comment.  You are advocating (in fact have
John> implemented) hiding disturbing circumstances from the callers.  Do
John> you classify this as practical or pure?

If, for some reason, a row in a CSV file is short or blank, I don't want
the processing to barf.  Most CSV files are program-generated, and in my
opinion the likelihood of a user introducing more problems into the file
by hand editing it is too high.  I'd rather worm around problems in the
files.

On output, I think it would be convenient to not require dictionaries
being dumped to the file to have a full complement of key-value pairs.
Not all such data will be generated by a database which fully populates
all fields.

Skip

From skip at pobox.com Wed Feb 26 22:58:17 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 15:58:17 -0600
Subject: Andrew Dalke's space example (was Re: [Csv] csv)
In-Reply-To: 
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
Message-ID: <15965.14457.888418.349998@montanaro.dyndns.org>

>> (4) Maybe the whole dialect thing is a bit too baroque and Byzantine
>> -- see example 5 in dalke.py.
The **dict_of_arguments gadget offers >> the "don't need to type long list of arguments" advantage claimed for >> dialect classes, and you get the same obscure error message if you >> stuff up the type of an argument (see example 6) -- all of this >> without writing all that register/validate/etc code. Dave> How much clearer would things be if the validation of dialects Dave> were pulled up into the Python? That was my intention all along. The problem I see is that someone might pass an instance as the dialect parameter to csv.reader() or csv.writer() which is not an instance of csv.Dialect. If we can get the Dialect._validate() method right, all the C code would have to do is make sure the object passed to the factory functions as the dialect parameter is an instance of csv.Dialect or that the class passed to register_dialect is a subclass of csv.Dialect. Skip From LogiplexSoftware at earthlink.net Wed Feb 26 23:10:48 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 26 Feb 2003 14:10:48 -0800 Subject: [Csv] What's our status? In-Reply-To: <1046280720.27223.9.camel@software1.logiplex.internal> References: <15964.55486.242989.782539@montanaro.dyndns.org> <1046280720.27223.9.camel@software1.logiplex.internal> Message-ID: <1046297447.27223.19.camel@software1.logiplex.internal> On Wed, 2003-02-26 at 09:32, Cliff Wells wrote: > I'm working on csvutils.py right now. The guessDelimiter() function > from DSV isn't really the best for our purposes as it expects a fairly > fixed number of columns and we're allowing for variable columns per row. > Also, allowing spaces around delimiters is going to throw > guessQuoteChar(). I've got some ideas for fixing guessQuoteChar() but > guessDelimiter is going to need an entirely new approach (which I think > I have an idea for =) Okay, here's my status: 1) I can sniff the quotechar. 2) I can sniff the delimiter IF: a) there is a quotechar [determine delimiter based on relation to quotechar]. or b) the data is regular, that is, the number of columns doesn't vary a lot from record to record [based upon number of occurrences of delimiter in each record, to grossly simplify things]. This is the method DSV uses. However, for the following I am so far unable to come up with a way to determine the delimiter: all,work,and,no,play,makes,jack,a,dull,boy all,work,and,no,play,makes,jack,a,dull boy all,work,and,no,play,makes,jack,a dull,boy all,work,and,no,play,makes,jack a,dull,boy all,work,and,no,play,makes jack,a,dull,boy all,work,and,no,play makes,jack,a,dull,boy all,work,and,no play,makes,jack,a,dull,boy all,work,and no,play,makes,jack,a,dull,boy Anyone have a suggestion? All work and no play makes jack a dull boy. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Wed Feb 26 23:12:47 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 26 Feb 2003 16:12:47 -0600 Subject: Andrew Dalke's space example (was Re: [Csv] csv) In-Reply-To: <20030217000613.CFE2D3CC5C@coffee.object-craft.com.au> References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro> <20030216231723.202913CC5C@coffee.object-craft.com.au> <20030217000613.CFE2D3CC5C@coffee.object-craft.com.au> Message-ID: <15965.15327.198910.214585@montanaro.dyndns.org> Andrew> Note that the registry stuff is entirely optional. You can pass Andrew> a class or instance as the dialect, and it will work as Andrew> expected. The doco should probably be updated to mention this. So noted in the docs. 
Skip From skip at pobox.com Wed Feb 26 23:17:27 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 26 Feb 2003 16:17:27 -0600 Subject: [Csv] What's our status? In-Reply-To: <1046297447.27223.19.camel@software1.logiplex.internal> References: <15964.55486.242989.782539@montanaro.dyndns.org> <1046280720.27223.9.camel@software1.logiplex.internal> <1046297447.27223.19.camel@software1.logiplex.internal> Message-ID: <15965.15607.38055.873934@montanaro.dyndns.org> Cliff> Okay, here's my status: Cliff> 1) I can sniff the quotechar. Cliff> 2) I can sniff the delimiter IF: ... Cliff> However, for the following I am so far unable to come up with a Cliff> way to determine the delimiter: ... Can you check in what you have so we can poke it a bit? Also, I suspect this whole thing should be a package. That is, the csv utils module should be csv.utils not csvutils. Comments? Skip From djc at object-craft.com.au Thu Feb 27 00:03:35 2003 From: djc at object-craft.com.au (Dave Cole) Date: 27 Feb 2003 10:03:35 +1100 Subject: [Csv] ignore blank lines? In-Reply-To: <15965.13529.765871.804925@montanaro.dyndns.org> References: <15945.28678.631121.25754@montanaro.dyndns.org> <15945.37976.592926.369940@montanaro.dyndns.org> <15946.25610.202929.623399@montanaro.dyndns.org> <15948.38242.974158.425677@montanaro.dyndns.org> <15949.16699.749757.280021@montanaro.dyndns.org> <15965.13529.765871.804925@montanaro.dyndns.org> Message-ID: >>>>> "Skip" == Skip Montanaro writes: Skip> Returning to an old pre-vacation topic... >>> Well, I would argue that a row of commas just means a row of empty >>> strings. John> It can mean that the database has a row with all values NULL, or John> some other equally distrubing circumstance. Skip> We've already established that there is no way to store Skip> NULL/None values in a CSV file and have them be reliably Skip> reconstituted when the file is read back in. That is not strictly true. We could come up with a dialect parameter which is unique to the Python csv module which does this: abc,null,def <-> ['abc', None, 'def'] abc,"null",def <-> ['abc', 'null', 'def'] abc,,def <-> ['abc', '', 'def'] This would allow us to provide a format which was even more useful for DB-API users. - Dave -- http://www.object-craft.com.au From skip at pobox.com Thu Feb 27 00:24:22 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed, 26 Feb 2003 17:24:22 -0600 Subject: [Csv] ignore blank lines? In-Reply-To: References: <15945.28678.631121.25754@montanaro.dyndns.org> <15945.37976.592926.369940@montanaro.dyndns.org> <15946.25610.202929.623399@montanaro.dyndns.org> <15948.38242.974158.425677@montanaro.dyndns.org> <15949.16699.749757.280021@montanaro.dyndns.org> <15965.13529.765871.804925@montanaro.dyndns.org> Message-ID: <15965.19622.423269.684185@montanaro.dyndns.org> Dave> That is not strictly true. We could come up with a dialect Dave> parameter which is unique to the Python csv module which does Dave> this: Dave> abc,null,def <-> ['abc', None, 'def'] Dave> abc,"null",def <-> ['abc', 'null', 'def'] Dave> abc,,def <-> ['abc', '', 'def'] -1. Too much chance for confusion and mistakes. Quotes are for quoting, not for data typing. Skip From LogiplexSoftware at earthlink.net Thu Feb 27 02:10:24 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 26 Feb 2003 17:10:24 -0800 Subject: [Csv] What's our status? 
In-Reply-To: <1046297447.27223.19.camel@software1.logiplex.internal>
References: <15964.55486.242989.782539@montanaro.dyndns.org>
	<1046280720.27223.9.camel@software1.logiplex.internal>
	<1046297447.27223.19.camel@software1.logiplex.internal>
Message-ID: <1046308224.27222.68.camel@software1.logiplex.internal>

On Wed, 2003-02-26 at 14:10, Cliff Wells wrote:
> On Wed, 2003-02-26 at 09:32, Cliff Wells wrote:
>
> > I'm working on csvutils.py right now.  The guessDelimiter() function
> > from DSV isn't really the best for our purposes as it expects a fairly
> > fixed number of columns and we're allowing for variable columns per row.
> > Also, allowing spaces around delimiters is going to throw
> > guessQuoteChar().  I've got some ideas for fixing guessQuoteChar() but
> > guessDelimiter is going to need an entirely new approach (which I think
> > I have an idea for =)
>
> Okay, here's my status:
>
> 1) I can sniff the quotechar.
> 2) I can sniff the delimiter IF:
>    a) there is a quotechar [determine delimiter based on relation to
>       quotechar].
>    or
>    b) the data is regular, that is, the number of columns doesn't vary
>       a lot from record to record [based upon number of occurrences of
>       delimiter in each record, to grossly simplify things].  This is
>       the method DSV uses.
>
> However, for the following I am so far unable to come up with a way to
> determine the delimiter:
>
> all,work,and,no,play,makes,jack,a,dull,boy
> all,work,and,no,play,makes,jack,a,dull
> boy
> all,work,and,no,play,makes,jack,a
> dull,boy
> all,work,and,no,play,makes,jack
> a,dull,boy
> all,work,and,no,play,makes
> jack,a,dull,boy
> all,work,and,no,play
> makes,jack,a,dull,boy
> all,work,and,no
> play,makes,jack,a,dull,boy
> all,work,and
> no,play,makes,jack,a,dull,boy

Okay, banging my head against a wall here.  Consider this "CSV" file:

all
work
and
no
play
makes
jack
a
dull
boy

I don't see why this wouldn't be considered valid CSV, yet there is
clearly no delimiter (assuming there would have been one had each row
contained more than one column).  It seems we could just pass ',' as the
delimiter since it won't be used anyway until we encounter:

redrum
redrum
redrum
re,drum

Where "," is actually part of the data (assume for a moment that \t was
the delimiter).  Further, consider that any of the characters ('r', 'e',
'd', 'u', 'm') could possibly be considered a delimiter (not likely
though, and I'd be willing to limit possibilities to string.punctuation +
string.whitespace for these situations if I thought it would really help).

It's becoming clear to me that without the constraints I mentioned earlier
(valid quotechar or the columns are of a mostly fixed length) there is no
good way to sniff the format.  This seems unfortunate because the formats
that are unsniffable are the simplest possible cases.  Sigh.  Will think
about it more but I'm becoming more pessimistic the longer I look at it.

OTOH, I personally don't have a big problem with the constraints [just a
small one].  The DSV sniffers have been used by a lot of people without
complaint and they required fixed column widths regardless of whether
there was a quotechar or not and we're actually doing a bit better than
that right now.

-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308

From skip at pobox.com Thu Feb 27 02:15:07 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 19:15:07 -0600
Subject: [Csv] What's our status?
In-Reply-To: <1046308224.27222.68.camel@software1.logiplex.internal> References: <15964.55486.242989.782539@montanaro.dyndns.org> <1046280720.27223.9.camel@software1.logiplex.internal> <1046297447.27223.19.camel@software1.logiplex.internal> <1046308224.27222.68.camel@software1.logiplex.internal> Message-ID: <15965.26267.85191.93543@montanaro.dyndns.org> Cliff> Okay, banging my head against a wall here. Consider this "CSV" Cliff> file: Cliff> all Cliff> work Cliff> and Cliff> no Cliff> play Cliff> makes Cliff> jack Cliff> a Cliff> dull Cliff> boy Is there something that suggests a sniffer can't fail to decide/guess? Cliff> OTOH, I personally don't have a big problem with the constraints Cliff> [just a small one]. The DSV sniffers have been used by a lot of Cliff> people without complaint and they required fixed column widths Cliff> regardless of whether there was a quotechar or not and we're Cliff> actually doing a bit better than that right now. So maybe we make constant number of columns a constraint? Skip From andrewm at object-craft.com.au Thu Feb 27 02:19:16 2003 From: andrewm at object-craft.com.au (Andrew McNamara) Date: Thu, 27 Feb 2003 12:19:16 +1100 Subject: [Csv] What's our status? In-Reply-To: Message from Skip Montanaro <15965.26267.85191.93543@montanaro.dyndns.org> References: <15964.55486.242989.782539@montanaro.dyndns.org> <1046280720.27223.9.camel@software1.logiplex.internal> <1046297447.27223.19.camel@software1.logiplex.internal> <1046308224.27222.68.camel@software1.logiplex.internal> <15965.26267.85191.93543@montanaro.dyndns.org> Message-ID: <20030227011916.5E0453CC5C@coffee.object-craft.com.au> >Is there something that suggests a sniffer can't fail to decide/guess? That would be better than guessing wrong, I think. > Cliff> OTOH, I personally don't have a big problem with the constraints > Cliff> [just a small one]. The DSV sniffers have been used by a lot of > Cliff> people without complaint and they required fixed column widths > Cliff> regardless of whether there was a quotechar or not and we're > Cliff> actually doing a bit better than that right now. > >So maybe we make constant number of columns a constraint? Or even a hint. Maybe the user of the module can provide some "educated guesses" as to the nature of the file. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ From LogiplexSoftware at earthlink.net Thu Feb 27 02:46:36 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 26 Feb 2003 17:46:36 -0800 Subject: [Csv] What's our status? In-Reply-To: <15965.26267.85191.93543@montanaro.dyndns.org> References: <15964.55486.242989.782539@montanaro.dyndns.org> <1046280720.27223.9.camel@software1.logiplex.internal> <1046297447.27223.19.camel@software1.logiplex.internal> <1046308224.27222.68.camel@software1.logiplex.internal> <15965.26267.85191.93543@montanaro.dyndns.org> Message-ID: <1046310395.27223.91.camel@software1.logiplex.internal> On Wed, 2003-02-26 at 17:15, Skip Montanaro wrote: > Cliff> Okay, banging my head against a wall here. Consider this "CSV" > Cliff> file: > > Cliff> all > Cliff> work > Cliff> and > Cliff> no > Cliff> play > Cliff> makes > Cliff> jack > Cliff> a > Cliff> dull > Cliff> boy > > Is there something that suggests a sniffer can't fail to decide/guess? No, it's just unfortunate that what appears to be the simple cases is where it fails. > > Cliff> OTOH, I personally don't have a big problem with the constraints > Cliff> [just a small one]. 
The DSV sniffers have been used by a lot of > Cliff> people without complaint and they required fixed column widths > Cliff> regardless of whether there was a quotechar or not and we're > Cliff> actually doing a bit better than that right now. > > So maybe we make constant number of columns a constraint? Number of columns or quoted. But perhaps that's confusing? -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From djc at object-craft.com.au Thu Feb 27 09:33:14 2003 From: djc at object-craft.com.au (Dave Cole) Date: 27 Feb 2003 19:33:14 +1100 Subject: [Csv] ignore blank lines? In-Reply-To: <15965.19622.423269.684185@montanaro.dyndns.org> References: <15945.28678.631121.25754@montanaro.dyndns.org> <15945.37976.592926.369940@montanaro.dyndns.org> <15946.25610.202929.623399@montanaro.dyndns.org> <15948.38242.974158.425677@montanaro.dyndns.org> <15949.16699.749757.280021@montanaro.dyndns.org> <15965.13529.765871.804925@montanaro.dyndns.org> <15965.19622.423269.684185@montanaro.dyndns.org> Message-ID: Dave> That is not strictly true. We could come up with a dialect Dave> parameter which is unique to the Python csv module which does Dave> this: Dave> abc,null,def <-> ['abc', None, 'def'] Dave> abc,"null",def <-> ['abc', 'null', 'def'] Dave> abc,,def <-> ['abc', '', 'def'] Skip> -1. Too much chance for confusion and mistakes. Quotes are for Skip> quoting, not for data typing. The point is to provide a round-trip for the DB-API. I think you would have rocks in your head if you tried to use or create this data with anything other than the CSV module and the DB-API. Anyway, it was just a thought. - Dave -- http://www.object-craft.com.au From sjmachin at lexicon.net Thu Feb 27 13:12:16 2003 From: sjmachin at lexicon.net (John Machin) Date: Thu, 27 Feb 2003 23:12:16 +1100 Subject: [Csv] What's our status? In-Reply-To: <1046297447.27223.19.camel@software1.logiplex.internal> References: <15964.55486.242989.782539@montanaro.dyndns.org> <1046280720.27223.9.camel@software1.logiplex.internal> <1046297447.27223.19.camel@software1.logiplex.internal> Message-ID: On 26 Feb 2003 14:10:48 -0800, Cliff Wells wrote: > > However, for the following I am so far unable to come up with a way to > determine the delimiter: > > all,work,and,no,play,makes,jack,a,dull,boy > all,work,and,no,play,makes,jack,a,dull > boy > all,work,and,no,play,makes,jack,a [snip] > > Anyone have a suggestion? All work and no play makes jack a dull boy. [Warning: late at night, OTTOMH, may contain babblings] Errrmmm, maybe I've missed the plot or lost the point or whatever, but a good start would be assuming that only in pathological cases would the delimiter or the quote be an alphanumeric character i.e. the file has been produced by an ordinary user, not a red-team tester. Try the most frequent two non-alphanumeric characters as the candidates for the delimiter and the quotechar? If there's only 1 non-alphanumeric character, then it's the delimiter. If there aren't any non-AN chars [an example in one of your messages], then there's only one field per record. Where there are two or more candidates for the delimiter and quotechar, you could use some plausibility heuristics e.g. " and ' are more likely to be quotes than delimiters however tab, comma, semicolon, colon, vertical bar, and tilde are plausible delimiters. 
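A quick sketch of that candidate-ranking idea (illustrative only -- the
frequency counting and the plausibility tie-break are assumptions, not
agreed design):

    def rank_candidates(sample, plausible='\t,;:|~'):
        # Count non-alphanumeric characters (ignoring line ends) and rank
        # by frequency, breaking ties in favour of plausible delimiters.
        counts = {}
        for ch in sample:
            if ch.isalnum() or ch in '\r\n':
                continue
            counts[ch] = counts.get(ch, 0) + 1
        ranked = [(n, ch in plausible, ch) for ch, n in counts.items()]
        ranked.sort()
        ranked.reverse()
        return [ch for n, plaus, ch in ranked]

rank_candidates('all,work,and;no') returns [',', ';'], the comma winning
purely on frequency.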
Some cautions: (1) "Warning -- Europeans here";1,234;5,678 (2) Joe Blow~'The Vaults',456 Main St,Snowtown,SA,5999~31/12/1999~01/04/2000 # delimiter (tilde) occurs 3 times, no quotechar at all, data characters comma and slash occur 4 times each (more than delimiter). In any case, it appears to me that you can't pronounce on the result until you've parsed a large chunk of the file with each plausible hypothesis, especially if the hypothesis admits (quoted) newlines inside the data. Some possible decision criteria are (1) percentage of syntax errors (2) standard deviation of number of columns ... Hope this helps, John From LogiplexSoftware at earthlink.net Thu Feb 27 18:07:58 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 27 Feb 2003 09:07:58 -0800 Subject: [Csv] What's our status? In-Reply-To: References: <15964.55486.242989.782539@montanaro.dyndns.org> <1046280720.27223.9.camel@software1.logiplex.internal> <1046297447.27223.19.camel@software1.logiplex.internal> Message-ID: <1046365677.27223.119.camel@software1.logiplex.internal> On Thu, 2003-02-27 at 04:12, John Machin wrote: > On 26 Feb 2003 14:10:48 -0800, Cliff Wells > wrote: > > > > > However, for the following I am so far unable to come up with a way to > > determine the delimiter: > > > > all,work,and,no,play,makes,jack,a,dull,boy > > all,work,and,no,play,makes,jack,a,dull > > boy > > all,work,and,no,play,makes,jack,a > [snip] > > > > > Anyone have a suggestion? All work and no play makes jack a dull boy. > > [Warning: late at night, OTTOMH, may contain babblings] I started babbling yesterday while working on this. Luckily the interns came and gave me my injection. However, it's difficult to type with these leather straps on and I can't quite reach the buckles with my teeth... > Errrmmm, maybe I've missed the plot or lost the point or whatever, but a > good start would be assuming that only in pathological cases would the > delimiter or the quote be an alphanumeric character i.e. the file has been > produced by an ordinary user, not a red-team tester. I'm willing to make that assumption for this case, but read on... > Try the most frequent two non-alphanumeric characters as the candidates for > the delimiter and the quotechar? If there's only 1 non-alphanumeric > character, then it's the delimiter. If we have a quotechar, then the problem is solved. Unfortunately the situation I expect here is that there will be more than one non-alphanumeric character per line. It's quite common to see dates/timestamps in *every* row of a csv file: data,2003/02/27,08:51:00 data,2003/02/27,08:52:00 data,2003/02/27,08:53:00 data,2003/02/27,08:54:00 In this case it is difficult to know whether ,/ or : is the delimiter. It's not entirely unreasonable to use a "preferred" list of delimiters but it's not entirely safe either ;) In fact, the current implementation will resort to a preferred list in this example and return , as the delimiter. However, given the following: 2003/02/27,08:51:00 data,2003/02/27,08:52:00 08:53:00 data,2003/02/27,08:54:00 It would most likely (without testing) return ":" as the delimiter as it occurs equally consistently with "/", but is higher in the preferred list. This is wrong as the delimiter is clearly ",". That being said, I would simply consider this file as being unsniffable as it has no real pattern. > If there aren't any non-AN chars [an example in one of your messages], then > there's only one field per record. Hm. That might actually be useful. 
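To see the date/time ambiguity concretely, a sketch of the per-row
frequency test (the candidate set here is picked by hand for
illustration):

    rows = ['data,2003/02/27,08:51:00',
            'data,2003/02/27,08:52:00']

    def per_row_counts(rows, candidates=',/:'):
        # A real delimiter should occur the same number of times in every
        # row, so constant per-row counts are the signal to look for.
        return dict([(ch, [row.count(ch) for row in rows])
                     for ch in candidates])

Here per_row_counts(rows) gives {',': [2, 2], '/': [2, 2], ':': [2, 2]}:
all three candidates look equally consistent, which is exactly the
ambiguity described above.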
> Where there are two or more candidates for the delimiter and quotechar, you > could use some plausibility heuristics e.g. " and ' are more likely to be > quotes than delimiters however tab, comma, semicolon, colon, vertical bar, > and tilde are plausible delimiters. As I mentioned earlier, quotes are already handled. If quotes are present, I think the current implementation is good enough to handle most files. > Some cautions: > > (1) "Warning -- Europeans here";1,234;5,678 So you see my point =) > (2) Joe Blow~'The Vaults',456 Main > St,Snowtown,SA,5999~31/12/1999~01/04/2000 > # delimiter (tilde) occurs 3 times, no quotechar at all, data characters > comma and slash occur 4 times each (more than delimiter). Yes, I've already decided that frequency by itself isn't a useful measurement. This particular example is invalid though, as ~'The Vaults',456 is an error (IMHO). 'The Vaults' appears quoted but isn't followed by a delimiter or a space. > In any case, it appears to me that you can't pronounce on the result until > you've parsed a large chunk of the file with each plausible hypothesis, > especially if the hypothesis admits (quoted) newlines inside the data. Some > possible decision criteria are (1) percentage of syntax errors (2) standard > deviation of number of columns ... Actually, the existing implementation is able to make a pronouncement after sniffing only a small portion of the file. I'm going to get it into the sandbox today so others can take a look at it. The only real snag is the exact scenario I mentioned earlier (no quoted data with varying numbers of fields per row). BTW, I'm +1 on Skip's suggestion to make the utils a package (cvs.utils) and will check it into CVS as such. Anyone object? -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Thu Feb 27 18:15:57 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 27 Feb 2003 11:15:57 -0600 Subject: [Csv] What's our status? In-Reply-To: <1046365677.27223.119.camel@software1.logiplex.internal> References: <15964.55486.242989.782539@montanaro.dyndns.org> <1046280720.27223.9.camel@software1.logiplex.internal> <1046297447.27223.19.camel@software1.logiplex.internal> <1046365677.27223.119.camel@software1.logiplex.internal> Message-ID: <15966.18381.859532.112020@montanaro.dyndns.org> Cliff> data,2003/02/27,08:51:00 Cliff> data,2003/02/27,08:52:00 Cliff> data,2003/02/27,08:53:00 Cliff> data,2003/02/27,08:54:00 Cliff> In this case it is difficult to know whether ,/ or : is the Cliff> delimiter. It's not entirely unreasonable to use a "preferred" Cliff> list of delimiters but it's not entirely safe either ;) In fact, Cliff> the current implementation will resort to a preferred list in Cliff> this example and return , as the delimiter. However, given the Cliff> following: Cliff> 2003/02/27,08:51:00 Cliff> data,2003/02/27,08:52:00 Cliff> 08:53:00 Cliff> data,2003/02/27,08:54:00 Cliff> It would most likely (without testing) return ":" as the Cliff> delimiter as it occurs equally consistently with "/", but is Cliff> higher in the preferred list. This is wrong as the delimiter is Cliff> clearly ",". That being said, I would simply consider this file Cliff> as being unsniffable as it has no real pattern. How about this. A candidate delimiter is preferred if two occurrences of it enclose other candidate delimiters. Conversely, a candidate delimiter in which two occurrences only surround alphanumeric characters is deemed "less worthy". 
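Sketched in Python (illustrative only), with the timestamp sample from
earlier in the thread:

    def preferred(row, delim, candidates):
        # A candidate is preferred if the text between two of its
        # occurrences contains some other candidate; spans that are
        # purely alphanumeric make it "less worthy".
        for span in row.split(delim)[1:-1]:
            for other in candidates:
                if other != delim and other in span:
                    return True
        return False

    >>> row = 'data,2003/02/27,08:51:00'
    >>> [(d, preferred(row, d, ',/:')) for d in ',/:']
    [(',', True), ('/', False), (':', False)]

On that sample the heuristic picks ',' correctly; Cliff's counter-example
below shows where it can still go wrong.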
Cliff> BTW, I'm +1 on Skip's suggestion to make the utils a package Cliff> (cvs.utils) and will check it into CVS as such. Anyone object? Nope, sorry I didn't get around to checking in the version you posted yesterday. Skip From LogiplexSoftware at earthlink.net Thu Feb 27 18:41:35 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 27 Feb 2003 09:41:35 -0800 Subject: [Csv] What's our status? In-Reply-To: <15966.18381.859532.112020@montanaro.dyndns.org> References: <15964.55486.242989.782539@montanaro.dyndns.org> <1046280720.27223.9.camel@software1.logiplex.internal> <1046297447.27223.19.camel@software1.logiplex.internal> <1046365677.27223.119.camel@software1.logiplex.internal> <15966.18381.859532.112020@montanaro.dyndns.org> Message-ID: <1046367694.27222.124.camel@software1.logiplex.internal> On Thu, 2003-02-27 at 09:15, Skip Montanaro wrote: > How about this. A candidate delimiter is preferred if two occurrences of it > enclose other candidate delimiters. Conversely, a candidate delimiter in which > two occurrences only surround alphanumeric characters is deemed "less > worthy". Sounds like a possibility. But what about: $1,234;Wells,Cliff where ; is the delimiter? > > Cliff> BTW, I'm +1 on Skip's suggestion to make the utils a package > Cliff> (cvs.utils) and will check it into CVS as such. Anyone object? > > Nope, sorry I didn't get around to checking in the version you posted > yesterday. No problem. -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308 From skip at pobox.com Thu Feb 27 19:00:36 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu, 27 Feb 2003 12:00:36 -0600 Subject: [Csv] What's our status? In-Reply-To: <1046367694.27222.124.camel@software1.logiplex.internal> References: <15964.55486.242989.782539@montanaro.dyndns.org> <1046280720.27223.9.camel@software1.logiplex.internal> <1046297447.27223.19.camel@software1.logiplex.internal> <1046365677.27223.119.camel@software1.logiplex.internal> <15966.18381.859532.112020@montanaro.dyndns.org> <1046367694.27222.124.camel@software1.logiplex.internal> Message-ID: <15966.21060.408754.315103@montanaro.dyndns.org> Cliff> Sounds like a possibility. But what about: Cliff> $1,234;Wells,Cliff Cliff> where ; is the delimiter? Oh, I'm sure we can always construct perfectly reasonable (that is, not "red team") examples where any of these heuristics fail. That's why it's best to use the sniffers as hints, not the word of God. How about returning a list of candidate delimiters, ordered from most likely to least likely? How about counting the number of cells generated using different candidate delimiters and returning the candidate which creates the most cells or average row lengths with the smallest standard deviation? How about allowing the user to specify a sample cell value which occurs in the data (e.g., sample="benzene" in Andrew's example, which allows you to easily identify SPC as the delimiter)? I've never seen any spreadsheet-like application guess the delimiter without some user input. Importing CSV files in Gnumeric is rather fun. You select the delimiters and watch it split the input on-the-fly. It's cool to see it go from one jumbled column of data to a nicely aligned spreadsheet. Skip From LogiplexSoftware at earthlink.net Fri Feb 28 00:20:17 2003 From: LogiplexSoftware at earthlink.net (Cliff Wells) Date: 27 Feb 2003 15:20:17 -0800 Subject: [Csv] What's our status? 
In-Reply-To: <15966.21060.408754.315103@montanaro.dyndns.org> References: <15964.55486.242989.782539@montanaro.dyndns.org> <1046280720.27223.9.camel@software1.logiplex.internal> <1046297447.27223.19.camel@software1.logiplex.internal> <1046365677.27223.119.camel@software1.logiplex.internal> <15966.18381.859532.112020@montanaro.dyndns.org> <1046367694.27222.124.camel@software1.logiplex.internal> <15966.21060.408754.315103@montanaro.dyndns.org> Message-ID: <1046388017.29491.248.camel@software1.logiplex.internal> On Thu, 2003-02-27 at 10:00, Skip Montanaro wrote: > Cliff> Sounds like a possibility. But what about: > > Cliff> $1,234;Wells,Cliff > > Cliff> where ; is the delimiter? > > Oh, I'm sure we can always construct perfectly reasonable (that is, not "red > team") examples where any of these heuristics fail. That's why it's best to > use the sniffers as hints, not the word of God. Agreed. But I'd still like to think of some clever way of resolving the above. > How about returning a list of candidate delimiters, ordered from most likely > to least likely? How about counting the number of cells generated using > different candidate delimiters and returning the candidate which creates the > most cells or average row lengths with the smallest standard deviation? How This is basically what it does now. Except for the most cells bit, which I consider too unreliable. As long as the number of cells is supposed to be fairly consistent, it should work. > about allowing the user to specify a sample cell value which occurs in the > data (e.g., sample="benzene" in Andrew's example, which allows you to easily > identify SPC as the delimiter)? Returning a list is a possibility. I considered it when developing DSV but couldn't think of a good use for it since the user was going to confirm the selections anyway via the dialog. > I've never seen any spreadsheet-like application guess the delimiter without > some user input. Importing CSV files in Gnumeric is rather fun. You select > the delimiters and watch it split the input on-the-fly. It's cool to see it > go from one jumbled column of data to a nicely aligned spreadsheet. Hmph. And DSV gets no credit for doing the same? Actually, Excel (and DSV) make a pretty good stab at the delimiter and then let you modify their guesses via a preview dialog. That's pretty much how I always intended the sniffer to be used, so I suppose maybe I shouldn't worry about it too much. Can't seem to help it though ;) BTW, as far as making utils a sub-package of csv, do you intend this: csv.utils (contains all utils in csv/utils.py) or do you mean: csv.utils.sniffer (csv/utils/sniffer.py, etc) I personally prefer the latter as I can see utils encompassing a lot of stuff, perhaps not all of it directly related and a utils.py file would become rather large. However, my packaging skills aren't the greatest, so I'm a bit confused as to what __init__.py should contain so that we aren't required to type "from csv import csv" instead of just "import csv" -- Cliff Wells, Software Engineer Logiplex Corporation (www.logiplex.net) (503) 978-6726 x308 (800) 735-0555 x308