From skip at pobox.com  Mon Mar 12 03:01:08 2007
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 11 Mar 2007 21:01:08 -0500
Subject: [Csv] These csv test cases seem incorrect to me...
Message-ID: <17908.46180.251850.822433@montanaro.dyndns.org>


I decided it would be worthwhile to have a csv module written in Python (no
C underpinnings) for a number of reasons:

    * It will probably be easier to add Unicode support to a Python version

    * More people will be able to read/grok/modify/fix bugs in a Python
      implementation than in the current mixed Python/C implementation.

    * With alternative implementations of Python available (PyPy,
      IronPython, Jython) it makes sense to have a Python version they can
      use.

I'm far from having anything which will pass the current test suite, but in
diagnosing some of my current failures I noticed a couple test cases which
seem wrong.  In the TestDialectExcel class I see these two questionable
tests:

    def test_quotes_and_more(self):
        self.readerAssertEqual('"a"b', [['ab']])

    def test_quote_and_quote(self):
        self.readerAssertEqual('"a" "b"', [['a "b"']])

It seems to me that if a field starts with a quote it *has* to be a quoted
field.  Any quotes appearing within a quoted field have to be escaped and
the field has to end with a quote.  Both of these test cases fail on one or
the other assumption.  If they are indeed both correct and I'm just looking at
things crosseyed I think they at least deserve comments explaining why they
are correct.
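
The rule described above can be checked against the standard library. This is
a minimal sketch, assuming a Python 3 interpreter whose csv module supports
the documented `strict` dialect parameter; the helper name `parses_strictly`
is hypothetical and only for illustration.

```python
import csv

def parses_strictly(line):
    """Return True if csv.reader accepts `line` with strict=True.

    `parses_strictly` is an illustrative helper, not part of the csv module.
    """
    try:
        list(csv.reader([line], strict=True))
    except csv.Error:
        return False
    return True

# The two questionable inputs: text follows the closing quote.
print(parses_strictly('"a"b'))
print(parses_strictly('"a" "b"'))

# A field that follows the rule: embedded quotes are doubled and the
# field ends with a quote.
print(parses_strictly('"a ""b"""'))
```

With strict parsing only the last input is accepted, which matches the
assumption that a field starting with a quote must be a properly quoted field.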

Both test cases date from the first checkin.  I performed the checkin
because, of the group developing the module, I believe I was the only one
with checkin privileges at the time, not because I wrote the test cases.

Any ideas about why these test cases are in there?  I can't imagine Excel
generating either one.

Thx,

Skip


From skip at pobox.com  Mon Mar 12 03:56:00 2007
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 11 Mar 2007 21:56:00 -0500
Subject: [Csv] [Python-Dev] These csv test cases seem incorrect to me...
In-Reply-To: <20070312024141.0050F1D403E@longblack.object-craft.com.au>
References: <17908.46180.251850.822433@montanaro.dyndns.org>
	<20070312024141.0050F1D403E@longblack.object-craft.com.au>
Message-ID: <17908.49472.531685.405836@montanaro.dyndns.org>


    >> I'm far from having anything which will pass the current test suite,
    >> but in diagnosing some of my current failures I noticed a couple test
    >> cases which seem wrong.  In the TestDialectExcel class I see these
    >> two questionable tests:
    >> 
    >> def test_quotes_and_more(self):
    >>     self.readerAssertEqual('"a"b', [['ab']])
    >> 
    >> def test_quote_and_quote(self):
    >>     self.readerAssertEqual('"a" "b"', [['a "b"']])

    Andrew> The point was to produce the same results as Excel. Sure, Excel
    Andrew> probably doesn't generate crap like this itself, but 3rd parties
    Andrew> do, and people complain if we don't parse it just like Excel
    Andrew> (sigh).

(sigh) indeed.

Thanks,

Skip

From andrewm at object-craft.com.au  Mon Mar 12 03:41:40 2007
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 12 Mar 2007 13:41:40 +1100
Subject: [Csv] [Python-Dev] These csv test cases seem incorrect to me...
In-Reply-To: <17908.46180.251850.822433@montanaro.dyndns.org> 
References: <17908.46180.251850.822433@montanaro.dyndns.org>
Message-ID: <20070312024141.0050F1D403E@longblack.object-craft.com.au>

>I decided it would be worthwhile to have a csv module written in Python (no
>C underpinnings) for a number of reasons:

Several other people have already done this. I will forward you their
e-mail addresses in a separate private e-mail.

>I'm far from having anything which will pass the current test suite, but in
>diagnosing some of my current failures I noticed a couple test cases which
>seem wrong.  In the TestDialectExcel class I see these two questionable
>tests:
>
>    def test_quotes_and_more(self):
>        self.readerAssertEqual('"a"b', [['ab']])
>
>    def test_quote_and_quote(self):
>        self.readerAssertEqual('"a" "b"', [['a "b"']])
[...]
>Any ideas about why these test cases are in there?  I can't imagine Excel
>generating either one.

The point was to produce the same results as Excel. Sure, Excel probably
doesn't generate crap like this itself, but 3rd parties do, and people
complain if we don't parse it just like Excel (sigh).
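
For reference, a minimal sketch of what the csv module's reader returns for
these two inputs in its default, non-strict "excel" dialect. The helper name
`parse_line` is hypothetical; whether Excel itself gives the same results is
taken from this thread, not verified here.

```python
import csv

def parse_line(line):
    """Parse a single CSV line with the default 'excel' dialect."""
    return next(csv.reader([line]))

# Text after the closing quote is appended to the field as-is.
print(parse_line('"a"b'))      # ['ab']
print(parse_line('"a" "b"'))   # ['a "b"']
```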

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Mon Mar 12 04:28:21 2007
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 11 Mar 2007 22:28:21 -0500
Subject: [Csv] These csv test cases seem incorrect to me...
In-Reply-To: <45F4C451.5070303@lexicon.net>
References: <17908.46180.251850.822433@montanaro.dyndns.org>
	<45F4C451.5070303@lexicon.net>
Message-ID: <17908.51413.179129.169426@montanaro.dyndns.org>


    John> IMHO these test cases are *WRONG* and it's a worry that they
    John> "work" with the current csv module :-(

That's my take on things as well, though as Andrew pointed out, given those
invalid inputs Excel will produce those wacky outputs.  I verified that on
my Mac a few minutes ago.

I'm inclined to just skip those tests in my Python version, but I can
understand that for backwards compatibility the current module needs to grok
them.

Skip

From sjmachin at lexicon.net  Mon Mar 12 04:09:05 2007
From: sjmachin at lexicon.net (John Machin)
Date: Mon, 12 Mar 2007 14:09:05 +1100
Subject: [Csv] These csv test cases seem incorrect to me...
In-Reply-To: <17908.46180.251850.822433@montanaro.dyndns.org>
References: <17908.46180.251850.822433@montanaro.dyndns.org>
Message-ID: <45F4C451.5070303@lexicon.net>

On 12/03/2007 1:01 PM, skip at pobox.com wrote:
> I decided it would be worthwhile to have a csv module written in Python (no
> C underpinnings) for a number of reasons:
> 
>     * It will probably be easier to add Unicode support to a Python version
> 
>     * More people will be able to read/grok/modify/fix bugs in a Python
>       implementation than in the current mixed Python/C implementation.
> 
>     * With alternative implementations of Python available (PyPy,
>       IronPython, Jython) it makes sense to have a Python version they can
>       use.
> 
> I'm far from having anything which will pass the current test suite, but in
> diagnosing some of my current failures I noticed a couple test cases which
> seem wrong.  In the TestDialectExcel class I see these two questionable
> tests:
> 
>     def test_quotes_and_more(self):
>         self.readerAssertEqual('"a"b', [['ab']])
> 
>     def test_quote_and_quote(self):
>         self.readerAssertEqual('"a" "b"', [['a "b"']])
> 
> It seems to me that if a field starts with a quote it *has* to be a quoted
> field.  Any quotes appearing within a quoted field have to be escaped and
> the field has to end with a quote.  Both of these test cases fail on one or
> the other assumption.  If they are indeed both correct and I'm just looking at
> things crosseyed I think they at least deserve comments explaining why they
> are correct.
> 
> Both test cases date from the first checkin.  I performed the checkin
> because of the group developing the module I believe I was the only one with
> checkin privileges at the time, not because I wrote the test cases.
> 
> Any ideas about why these test cases are in there?  I can't imagine Excel
> generating either one.
> 

Hi Skip,

'"a"b' can't be produced by applying minimalist CSV writing rules to 
'ab'. A non-minimalist writer could produce '"ab"', but certainly not 
'"a"b'.
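
A minimal sketch of the writer side, assuming the csv module's
`QUOTE_MINIMAL` and `QUOTE_ALL` constants stand in for the "minimalist" and
"non-minimalist" rules mentioned above. The helper name `write_field` is
hypothetical.

```python
import csv
import io

def write_field(value, quoting):
    """Write a one-field row and return the text without the line ending."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=quoting, lineterminator="\n")
    writer.writerow([value])
    return buf.getvalue().rstrip("\n")

print(write_field('ab', csv.QUOTE_MINIMAL))     # ab
print(write_field('ab', csv.QUOTE_ALL))         # "ab"
# Embedded quotes force quoting and are doubled.
print(write_field('a "b"', csv.QUOTE_MINIMAL))  # "a ""b"""
```

Neither quoting mode emits '"a"b' for the value 'ab'.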

The second case is worse -- it's inconsistent; the reader is supposed to 
remove the quotes from "a" but not from "b"???

IMHO these test cases are *WRONG* and it's a worry that they "work" with 
the current csv module :-(

Regards,

John


From sjmachin at lexicon.net  Mon Mar 12 05:13:25 2007
From: sjmachin at lexicon.net (John Machin)
Date: Mon, 12 Mar 2007 15:13:25 +1100
Subject: [Csv] [Python-Dev] These csv test cases seem incorrect to me...
In-Reply-To: <20070312024141.0050F1D403E@longblack.object-craft.com.au>
References: <17908.46180.251850.822433@montanaro.dyndns.org>
	<20070312024141.0050F1D403E@longblack.object-craft.com.au>
Message-ID: <45F4D365.8090609@lexicon.net>

On 12/03/2007 1:41 PM, Andrew McNamara wrote:
> 
> The point was to produce the same results as Excel. Sure, Excel probably
> doesn't generate crap like this itself, but 3rd parties do, and people
> complain if we don't parse it just like Excel (sigh).

Let's put a little flesh on those a's and b's:

A typical example of the first case is where a database address line 
contains a quoted house name e.g.

"Dunromin", 123 Main Street

and the producer of the CSV file has not done any quoting at all.

An example of the 2nd case is a database address line like this:

C/o Mrs Jones, "Dunromin", 123 Main Street

and the producer of the CSV file has merely wrapped quotes about it 
without doubling the existing quotes, to emit this:

"C/o Mrs Jones, "Dunromin", 123 Main Street"

which Excel and adherents would distort to two fields containing:
'C/o Mrs Jones, Dunromin"' and ' 123 Main Street"' -- aarrgghh!!
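
A minimal sketch of the same two address lines run through the csv module's
default reader, which, per this thread, is meant to match Excel. The variable
names are only for illustration.

```python
import csv

# Case 1: the producer did no quoting at all.
unquoted = '"Dunromin", 123 Main Street'

# Case 2: the producer wrapped the line in quotes without doubling
# the quotes that were already in the data.
wrapped = '"C/o Mrs Jones, "Dunromin", 123 Main Street"'

for line in (unquoted, wrapped):
    print(next(csv.reader([line])))
```

In the first case the quotes around the house name are silently dropped; in
the second the single address line comes back as two fields with stray
quotes, as described above.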

People who complain as described are IMHO misguided; they are accepting 
crap and losing data (yes, the quotes in the above examples are *DATA*). 
Why should we heed their complaints?

Perhaps we could consider a non-default "dopey_like_Excel" option for 
csv :-)

BTW, it is possible to do a reasonable recovery job when the producer's 
protocol was to wrap quotes around the data without doubling existing 
quotes, providing there were an even number of quotes to start with. It 
just requires a quite different finite state machine.
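
The message does not spell out that state machine, so the following is only
one possible sketch of such a recovery, not the author's design. It assumes
that a quote inside a wrapped field closes the field only when followed by
the delimiter or the end of the line, and that any other quote opens an inner
quoted run that lasts until the next quote. The function name
`split_wrapped_line` and the state names are hypothetical.

```python
def split_wrapped_line(line, delimiter=",", quote='"'):
    """Split a line whose fields were wrapped in quotes without doubling.

    Heuristic sketch: relies on inner quotes coming in pairs.
    """
    fields = []
    buf = []
    state = "start"  # start | unquoted | quoted | inner | closed
    for i, c in enumerate(line):
        nxt = line[i + 1] if i + 1 < len(line) else None
        if state == "start":
            if c == quote:
                state = "quoted"
            elif c == delimiter:
                fields.append("")          # empty field
            else:
                buf.append(c)
                state = "unquoted"
        elif state == "unquoted":
            if c == delimiter:
                fields.append("".join(buf))
                buf = []
                state = "start"
            else:
                buf.append(c)
        elif state == "quoted":
            if c == quote:
                if nxt is None or nxt == delimiter:
                    state = "closed"       # wrapper quote: field ends
                else:
                    buf.append(c)          # inner opening quote is data
                    state = "inner"
            else:
                buf.append(c)
        elif state == "inner":
            buf.append(c)                  # everything here is data
            if c == quote:
                state = "quoted"           # inner closing quote
        elif state == "closed":
            # Only reachable when c is the delimiter.
            fields.append("".join(buf))
            buf = []
            state = "start"
    if state in ("quoted", "inner"):
        raise ValueError("unterminated quoted field")
    if line:
        fields.append("".join(buf))
    return fields

print(split_wrapped_line('"C/o Mrs Jones, "Dunromin", 123 Main Street"'))
```

Under these assumptions the address line comes back as a single field with
its inner quotes intact. The heuristic still guesses wrong when an inner
opening quote is immediately followed by the delimiter.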

Cheers,
John


From andrewm at object-craft.com.au  Mon Mar 12 05:46:04 2007
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 12 Mar 2007 15:46:04 +1100
Subject: [Csv] [Python-Dev] These csv test cases seem incorrect to me...
In-Reply-To: <45F4C451.5070303@lexicon.net> 
References: <17908.46180.251850.822433@montanaro.dyndns.org>
	<45F4C451.5070303@lexicon.net>
Message-ID: <20070312044605.27C481D403E@longblack.object-craft.com.au>

>IMHO these test cases are *WRONG* and it's a worry that they "work" with 
>the current csv module :-(

Those tests are not "wrong" - they verify that we produce the same result
as Excel when presented with those inputs, which was one of the design
goals of the module (and is an important consideration for many of its
users).

While you might find the Excel team's choices bizarre, they are stable,
and in the absence of a formal specification for "CSV", Excel's behaviour
is what most users want and expect.

If you feel like extending the parser to optionally accept some other
format, I have no problem. If you want to make this format the default,
make sure you stick around to answer all the angry e-mail from users.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From mal at egenix.com  Wed Mar 14 12:11:50 2007
From: mal at egenix.com (M.-A. Lemburg)
Date: Wed, 14 Mar 2007 12:11:50 +0100
Subject: [Csv] [Python-Dev] These csv test cases seem incorrect to me...
In-Reply-To: <17908.46180.251850.822433@montanaro.dyndns.org>
References: <17908.46180.251850.822433@montanaro.dyndns.org>
Message-ID: <45F7D876.3010003@egenix.com>

Hi Skip,

On 2007-03-12 03:01, skip at pobox.com wrote:
> I decided it would be worthwhile to have a csv module written in Python (no
> C underpinnings) for a number of reasons:
> 
>     * It will probably be easier to add Unicode support to a Python version
> 
>     * More people will be able to read/grok/modify/fix bugs in a Python
>       implementation than in the current mixed Python/C implementation.
> 
>     * With alternative implementations of Python available (PyPy,
>       IronPython, Jython) it makes sense to have a Python version they can
>       use.

Lots of good reasons :-)

I've written a Python-only Unicode aware CSV module for a client (mostly
because CSV data tends to be quirky and I needed a quick way of dealing
with corner cases). Perhaps I can get them to donate it to the PSF...

> I'm far from having anything which will pass the current test suite, but in
> diagnosing some of my current failures I noticed a couple test cases which
> seem wrong.  In the TestDialectExcel class I see these two questionable
> tests:
> 
>     def test_quotes_and_more(self):
>         self.readerAssertEqual('"a"b', [['ab']])
> 
>     def test_quote_and_quote(self):
>         self.readerAssertEqual('"a" "b"', [['a "b"']])
> 
> It seems to me that if a field starts with a quote it *has* to be a quoted
> field.  Any quotes appearing within a quoted field have to be escaped and
> the field has to end with a quote.  Both of these test cases fail on one or
> the other assumption.  If they are indeed both correct and I'm just looking at
> things crosseyed I think they at least deserve comments explaining why they
> are correct.
> 
> Both test cases date from the first checkin.  I performed the checkin
> because of the group developing the module I believe I was the only one with
> checkin privileges at the time, not because I wrote the test cases.
> 
> Any ideas about why these test cases are in there?  I can't imagine Excel
> generating either one.

My recommendation: Let the module do whatever Excel does with such data.

-- 
Marc-Andre Lemburg
eGenix.com
