From skip at pobox.com Mon Mar 12 03:01:08 2007
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 11 Mar 2007 21:01:08 -0500
Subject: [Csv] These csv test cases seem incorrect to me...
Message-ID: <17908.46180.251850.822433@montanaro.dyndns.org>

I decided it would be worthwhile to have a csv module written in Python (no
C underpinnings) for a number of reasons:

    * It will probably be easier to add Unicode support to a Python version

    * More people will be able to read/grok/modify/fix bugs in a Python
      implementation than in the current mixed Python/C implementation.

    * With alternative implementations of Python available (PyPy,
      IronPython, Jython) it makes sense to have a Python version they can
      use.

I'm far from having anything which will pass the current test suite, but in
diagnosing some of my current failures I noticed a couple test cases which
seem wrong.  In the TestDialectExcel class I see these two questionable
tests:

    def test_quotes_and_more(self):
        self.readerAssertEqual('"a"b', [['ab']])

    def test_quote_and_quote(self):
        self.readerAssertEqual('"a" "b"', [['a "b"']])

It seems to me that if a field starts with a quote it *has* to be a quoted
field.  Any quotes appearing within a quoted field have to be escaped and
the field has to end with a quote.  Both of these test cases fail on one or
the other assumption.  If they are indeed both correct and I'm just looking
at things crosseyed I think they at least deserve comments explaining why
they are correct.

Both test cases date from the first checkin.  I performed the checkin
because, of the group developing the module, I believe I was the only one
with checkin privileges at the time, not because I wrote the test cases.

Any ideas about why these test cases are in there?  I can't imagine Excel
generating either one.
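
For reference, a minimal sketch of what those two tests assert, using only
the standard csv module.  The parse() helper is hypothetical (a wrapper for
illustration only), and the strict=True part assumes a Python whose csv
module accepts that format parameter:

```python
import csv

def parse(line, **fmtparams):
    """Parse one CSV line with the stdlib reader and return its rows."""
    return list(csv.reader([line], **fmtparams))

# Default (Excel) dialect: whatever follows the closing quote is simply
# appended to the field, and later quotes are kept as ordinary characters.
rows_more = parse('"a"b')
rows_quote = parse('"a" "b"')
print(rows_more)   # [['ab']]
print(rows_quote)  # [['a "b"']]

# With strict=True the reader refuses the same input instead of guessing.
try:
    parse('"a"b', strict=True)
    strict_rejected = False
except csv.Error as exc:
    strict_rejected = True
    print("rejected:", exc)
```

So the lenient result is what the default dialect produces, while a strict
reader treats the input as an error.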

Thx,

Skip

From skip at pobox.com Mon Mar 12 03:56:00 2007
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 11 Mar 2007 21:56:00 -0500
Subject: [Csv] [Python-Dev] These csv test cases seem incorrect to me...
In-Reply-To: <20070312024141.0050F1D403E@longblack.object-craft.com.au>
References: <17908.46180.251850.822433@montanaro.dyndns.org> <20070312024141.0050F1D403E@longblack.object-craft.com.au>
Message-ID: <17908.49472.531685.405836@montanaro.dyndns.org>

    >> I'm far from having anything which will pass the current test suite,
    >> but in diagnosing some of my current failures I noticed a couple test
    >> cases which seem wrong.  In the TestDialectExcel class I see these
    >> two questionable tests:
    >>
    >>     def test_quotes_and_more(self):
    >>         self.readerAssertEqual('"a"b', [['ab']])
    >>
    >>     def test_quote_and_quote(self):
    >>         self.readerAssertEqual('"a" "b"', [['a "b"']])

    Andrew> The point was to produce the same results as Excel. Sure, Excel
    Andrew> probably doesn't generate crap like this itself, but 3rd parties
    Andrew> do, and people complain if we don't parse it just like Excel
    Andrew> (sigh).

(sigh) indeed.

Thanks,

Skip

From andrewm at object-craft.com.au Mon Mar 12 03:41:40 2007
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 12 Mar 2007 13:41:40 +1100
Subject: [Csv] [Python-Dev] These csv test cases seem incorrect to me...
In-Reply-To: <17908.46180.251850.822433@montanaro.dyndns.org>
References: <17908.46180.251850.822433@montanaro.dyndns.org>
Message-ID: <20070312024141.0050F1D403E@longblack.object-craft.com.au>

>I decided it would be worthwhile to have a csv module written in Python (no
>C underpinnings) for a number of reasons:

Several other people have already done this. I will forward you their e-mail
addresses in a separate private e-mail.

>I'm far from having anything which will pass the current test suite, but in
>diagnosing some of my current failures I noticed a couple test cases which
>seem wrong.
>In the TestDialectExcel class I see these two questionable
>tests:
>
>    def test_quotes_and_more(self):
>        self.readerAssertEqual('"a"b', [['ab']])
>
>    def test_quote_and_quote(self):
>        self.readerAssertEqual('"a" "b"', [['a "b"']])
[...]
>Any ideas about why these test cases are in there?  I can't imagine Excel
>generating either one.

The point was to produce the same results as Excel. Sure, Excel probably
doesn't generate crap like this itself, but 3rd parties do, and people
complain if we don't parse it just like Excel (sigh).

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com Mon Mar 12 04:28:21 2007
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 11 Mar 2007 22:28:21 -0500
Subject: [Csv] These csv test cases seem incorrect to me...
In-Reply-To: <45F4C451.5070303@lexicon.net>
References: <17908.46180.251850.822433@montanaro.dyndns.org> <45F4C451.5070303@lexicon.net>
Message-ID: <17908.51413.179129.169426@montanaro.dyndns.org>

    John> IMHO these test cases are *WRONG* and it's a worry that they
    John> "work" with the current csv module :-(

That's my take on things as well, though as Andrew pointed out, given those
invalid inputs Excel will produce those wacky outputs.  I verified that on
my Mac a few minutes ago.

I'm inclined to just skip those tests in my Python version, but I can
understand that for backwards compatibility the current module needs to grok
them.

Skip

From sjmachin at lexicon.net Mon Mar 12 04:09:05 2007
From: sjmachin at lexicon.net (John Machin)
Date: Mon, 12 Mar 2007 14:09:05 +1100
Subject: [Csv] These csv test cases seem incorrect to me...
In-Reply-To: <17908.46180.251850.822433@montanaro.dyndns.org>
References: <17908.46180.251850.822433@montanaro.dyndns.org>
Message-ID: <45F4C451.5070303@lexicon.net>

On 12/03/2007 1:01 PM, skip at pobox.com wrote:
> I decided it would be worthwhile to have a csv module written in Python (no
> C underpinnings) for a number of reasons:
>
>     * It will probably be easier to add Unicode support to a Python version
>
>     * More people will be able to read/grok/modify/fix bugs in a Python
>       implementation than in the current mixed Python/C implementation.
>
>     * With alternative implementations of Python available (PyPy,
>       IronPython, Jython) it makes sense to have a Python version they can
>       use.
>
> I'm far from having anything which will pass the current test suite, but in
> diagnosing some of my current failures I noticed a couple test cases which
> seem wrong.  In the TestDialectExcel class I see these two questionable
> tests:
>
>     def test_quotes_and_more(self):
>         self.readerAssertEqual('"a"b', [['ab']])
>
>     def test_quote_and_quote(self):
>         self.readerAssertEqual('"a" "b"', [['a "b"']])
>
> It seems to me that if a field starts with a quote it *has* to be a quoted
> field.  Any quotes appearing within a quoted field have to be escaped and
> the field has to end with a quote.  Both of these test cases fail on one or
> the other assumption.  If they are indeed both correct and I'm just looking
> at things crosseyed I think they at least deserve comments explaining why
> they are correct.
>
> Both test cases date from the first checkin.  I performed the checkin
> because, of the group developing the module, I believe I was the only one
> with checkin privileges at the time, not because I wrote the test cases.
>
> Any ideas about why these test cases are in there?  I can't imagine Excel
> generating either one.
>

Hi Skip,

'"a"b' can't be produced by applying minimalist CSV writing rules to
'ab'. A non-minimalist writer could produce '"ab"', but certainly not
'"a"b'.
The second case is worse -- it's inconsistent; the reader is
supposed to remove the quotes from "a" but not from "b"???

IMHO these test cases are *WRONG* and it's a worry that they "work" with
the current csv module :-(

Regards,
John

From sjmachin at lexicon.net Mon Mar 12 05:13:25 2007
From: sjmachin at lexicon.net (John Machin)
Date: Mon, 12 Mar 2007 15:13:25 +1100
Subject: [Csv] [Python-Dev] These csv test cases seem incorrect to me...
In-Reply-To: <20070312024141.0050F1D403E@longblack.object-craft.com.au>
References: <17908.46180.251850.822433@montanaro.dyndns.org> <20070312024141.0050F1D403E@longblack.object-craft.com.au>
Message-ID: <45F4D365.8090609@lexicon.net>

On 12/03/2007 1:41 PM, Andrew McNamara wrote:
>
> The point was to produce the same results as Excel. Sure, Excel probably
> doesn't generate crap like this itself, but 3rd parties do, and people
> complain if we don't parse it just like Excel (sigh).

Let's put a little flesh on those a's and b's:

A typical example of the first case is where a database address line
contains a quoted house name e.g.

"Dunromin", 123 Main Street

and the producer of the CSV file has not done any quoting at all.

An example of the 2nd case is a database address line like this:

C/o Mrs Jones, "Dunromin", 123 Main Street

and the producer of the CSV file has merely wrapped quotes about it
without doubling the existing quotes, to emit this:

"C/o Mrs Jones, "Dunromin", 123 Main Street"

which Excel and adherents would distort to two fields containing:
'C/o Mrs Jones, Dunromin"' and ' 123 Main Street"' -- aarrgghh!!

People who complain as described are IMHO misguided; they are accepting
crap and losing data (yes, the quotes in the above examples are *DATA*).
Why should we heed their complaints?
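
The address examples above can be checked against the stdlib reader with a
short sketch like this one.  It assumes the default (Excel) dialect, and the
variable names are purely illustrative:

```python
import csv

# Producer did no quoting at all: the house name arrives already in quotes.
unquoted = '"Dunromin", 123 Main Street'
# Producer wrapped the whole line in quotes without doubling the inner ones.
wrapped = '"C/o Mrs Jones, "Dunromin", 123 Main Street"'

unquoted_fields = next(csv.reader([unquoted]))
wrapped_fields = next(csv.reader([wrapped]))

print(unquoted_fields)
# ['Dunromin', ' 123 Main Street']
print(wrapped_fields)
# ['C/o Mrs Jones, Dunromin"', ' 123 Main Street"']
```

In both cases quote characters that were part of the original data are
either dropped or left dangling at the end of a field.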

Perhaps we could consider a non-default "dopey_like_Excel" option for csv :-)

BTW, it is possible to do a reasonable recovery job when the producer's
protocol was to wrap quotes around the data without doubling existing
quotes, providing there were an even number of quotes to start with. It
just requires a quite different finite state machine.

Cheers,
John

From andrewm at object-craft.com.au Mon Mar 12 05:46:04 2007
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 12 Mar 2007 15:46:04 +1100
Subject: [Csv] [Python-Dev] These csv test cases seem incorrect to me...
In-Reply-To: <45F4C451.5070303@lexicon.net>
References: <17908.46180.251850.822433@montanaro.dyndns.org> <45F4C451.5070303@lexicon.net>
Message-ID: <20070312044605.27C481D403E@longblack.object-craft.com.au>

>IMHO these test cases are *WRONG* and it's a worry that they "work" with
>the current csv module :-(

Those tests are not "wrong" - they verify that we produce the same result
as Excel when presented with those inputs, which was one of the design
goals of the module (and is an important consideration for many of its
users).

While you might find the Excel team's choices bizarre, they are stable,
and in the absence of a formal specification for "CSV", Excel's behaviour
is what most users want and expect.

If you feel like extending the parser to optionally accept some other
format, I have no problem. If you want to make this format the default,
make sure you stick around to answer all the angry e-mail from users.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From mal at egenix.com Wed Mar 14 12:11:50 2007
From: mal at egenix.com (M.-A. Lemburg)
Date: Wed, 14 Mar 2007 12:11:50 +0100
Subject: [Csv] [Python-Dev] These csv test cases seem incorrect to me...
In-Reply-To: <17908.46180.251850.822433@montanaro.dyndns.org>
References: <17908.46180.251850.822433@montanaro.dyndns.org>
Message-ID: <45F7D876.3010003@egenix.com>

Hi Skip,

On 2007-03-12 03:01, skip at pobox.com wrote:
> I decided it would be worthwhile to have a csv module written in Python (no
> C underpinnings) for a number of reasons:
>
>     * It will probably be easier to add Unicode support to a Python version
>
>     * More people will be able to read/grok/modify/fix bugs in a Python
>       implementation than in the current mixed Python/C implementation.
>
>     * With alternative implementations of Python available (PyPy,
>       IronPython, Jython) it makes sense to have a Python version they can
>       use.

Lots of good reasons :-)

I've written a Python-only Unicode aware CSV module for a client (mostly
because CSV data tends to be quirky and I needed a quick way of dealing
with corner cases). Perhaps I can get them to donate it to the PSF...

> I'm far from having anything which will pass the current test suite, but in
> diagnosing some of my current failures I noticed a couple test cases which
> seem wrong.  In the TestDialectExcel class I see these two questionable
> tests:
>
>     def test_quotes_and_more(self):
>         self.readerAssertEqual('"a"b', [['ab']])
>
>     def test_quote_and_quote(self):
>         self.readerAssertEqual('"a" "b"', [['a "b"']])
>
> It seems to me that if a field starts with a quote it *has* to be a quoted
> field.  Any quotes appearing within a quoted field have to be escaped and
> the field has to end with a quote.  Both of these test cases fail on one or
> the other assumption.  If they are indeed both correct and I'm just looking
> at things crosseyed I think they at least deserve comments explaining why
> they are correct.
>
> Both test cases date from the first checkin.  I performed the checkin
> because, of the group developing the module, I believe I was the only one
> with checkin privileges at the time, not because I wrote the test cases.
>
> Any ideas about why these test cases are in there?  I can't imagine Excel
> generating either one.

My recommendation: Let the module do whatever Excel does with such data.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Mar 14 2007)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::