From gnb at itga.com.au  Mon Jun  7 06:34:05 2004
From: gnb at itga.com.au (Gregory Bond)
Date: Mon, 07 Jun 2004 14:34:05 +1000
Subject: [Csv] PEP 305
Message-ID: <200406070434.OAA25102@lightning.itga.com.au>

I've a problem that I can't make the new CSV module fix - embedded \r's in 
fields.  I'm parsing a format that allows \r and \n to be part of a field, if 
the field is quoted with "".  Looking at Modules/_csv.c, this is probably 
impossible....

(Python 2.3.1)

Take the following:

meldev$ cat tcsv.py

import csv

d = 'fld1,fld2,"fld3 ",fld4\r\n'
d2 = 'fld1,fld2,"fld3 \r",fld4\r\n'

r = csv.reader([d, d2])
for f in r:
        print f

meldev$ python tcsv.py 
['fld1', 'fld2', 'fld3 ', 'fld4']
Traceback (most recent call last):
  File "tcsv.py", line 9, in ?
    for f in r:
_csv.Error: newline inside string
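
For comparison, here is a minimal sketch of the same test, assuming a current Python 3 interpreter rather than the 2.3.1 used above. As far as I know, later versions of the parser accept a CR inside a quoted field, so the second record should parse instead of raising. The variable names simply mirror tcsv.py.

```python
import csv

# Same two records as tcsv.py; the second has a bare CR inside a quoted field.
d = 'fld1,fld2,"fld3 ",fld4\r\n'
d2 = 'fld1,fld2,"fld3 \r",fld4\r\n'

rows = list(csv.reader([d, d2]))
for row in rows:
    print(row)
```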


From andrewm at object-craft.com.au  Mon Jun  7 06:47:58 2004
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 07 Jun 2004 14:47:58 +1000
Subject: [Csv] PEP 305 
In-Reply-To: Message from Gregory Bond <gnb@itga.com.au> 
	<200406070434.OAA25102@lightning.itga.com.au> 
References: <200406070434.OAA25102@lightning.itga.com.au> 
Message-ID: <20040607044758.87F173C1CF@coffee.object-craft.com.au>

>I've a problem that I can't make the new CSV module fix - embedded \r's in 
>fields.  I'm parsing a format that allows \r and \n to be part of a field, if 
>the field is quoted with "".  Looking at Modules/_csv.c, this is probably 
>impossible....

If I remember correctly, you are correct - the current parser won't allow
you to do this.

One thing that became apparent very early on in the life of the
csv parser is that there is no end to the variety of formats that call
themselves CSV!  We settled for something as close as we could make it
to Excel's behaviour, with the odd concession to Access, and any other
formats that were "easy", but that still leaves plenty out in the cold.

Now that it's part of the Python core, it's a royal pain in the arse to
change anything, although your change is probably harmless, and we have
plenty of test cases. 

Dave - any idea why we disallowed CR within a quoted field?

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From gnb at itga.com.au  Mon Jun  7 06:50:27 2004
From: gnb at itga.com.au (Gregory Bond)
Date: Mon, 07 Jun 2004 14:50:27 +1000
Subject: [Csv] PEP 305 
In-Reply-To: Your message of Mon, 07 Jun 2004 14:47:58 +1000.
Message-ID: <200406070450.OAA25860@lightning.itga.com.au>

BTW:  I posted this as sourceforge bug # 967934 


From djc at object-craft.com.au  Mon Jun  7 07:10:34 2004
From: djc at object-craft.com.au (Dave Cole)
Date: Mon, 07 Jun 2004 15:10:34 +1000
Subject: [Csv] PEP 305
In-Reply-To: <20040607044758.87F173C1CF@coffee.object-craft.com.au>
References: <200406070434.OAA25102@lightning.itga.com.au>
	<20040607044758.87F173C1CF@coffee.object-craft.com.au>
Message-ID: <40C3F8CA.1050102@object-craft.com.au>

Andrew McNamara wrote:
>>I've a problem that I can't make the new CSV module fix - embedded \r's in 
>>fields.  I'm parsing a format that allows \r and \n to be part of a field, if 
>>the field is quoted with "".  Looking at Modules/_csv.c, this is probably 
>>impossible....
> 
> 
> If I remember correctly, you are correct - the current parser won't allow
> you to do this.
> 
> One thing that became apparent very early on in the life of the
> csv parser is that there is no end to the variety of formats that call
> themselves CSV!  We settled for something as close as we could make it
> to Excel's behaviour, with the odd concession to Access, and any other
> formats that were "easy", but that still leaves plenty out in the cold.
> 
> Now that it's part of the Python core, it's a royal pain in the arse to
> change anything, although your change is probably harmless, and we have
> plenty of test cases. 
> 
> Dave - any idea why we disallowed CR within a quoted field?

Because I assumed that the only end-of-line related characters were 
actually ends of line.  I then assumed that you would feed the parser 
one line at a time.  I suppose the weak part of this "logic" is when you 
have data with different styles of end-of-line characters.
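
To illustrate the line-at-a-time assumption, here is a sketch, assuming a current Python 3 csv module, in which the reader is fed one physical line per item and a quoted field straddles two of them. The reader keeps pulling lines until the quote closes, so the embedded LF ends up inside the field. The sample data is made up for illustration.

```python
import csv

# Two physical lines, but only one logical record: the quoted field
# opened on the first line is not closed until the second.
lines = ['a,"b\n', 'c",d\n']

reader = csv.reader(lines)
records = list(reader)
print(records)
# reader.line_num counts physical lines consumed, not records returned.
print(reader.line_num)
```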

- Dave

-- 
http://www.object-craft.com.au

From skip at pobox.com  Wed Jun 16 04:14:11 2004
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 15 Jun 2004 21:14:11 -0500
Subject: [Csv] Switch to universal mode?
Message-ID: <16591.44275.393347.582050@montanaro.dyndns.org>


I've been thinking we should enforce universal mode in the csv module.  I
think it could simplify the reader a bit (all EOLs become '\n', right?).
Unfortunately, universal mode is a read-only thing (PEP 278 disallows 'wU'
though the file object doesn't currently enforce that).  Users would still
have to open files for writing in binary mode.

Accordingly, I think we should provide a little help for users in the form
of mode checking and exception raising where possible.  I don't know what
might be possible for file-like objects (e.g., StringIO) that don't have
modes.  Does the attached context diff look reasonable?  All it does is
enforce the relevant modes.  It doesn't attempt to take advantage of the
'rU' assumption to simplify any code.

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: _csv.c.diff
Type: application/octet-stream
Size: 2426 bytes
Desc: not available
Url : http://mail.python.org/pipermail/csv/attachments/20040615/c568dd61/attachment.obj 

From andrewm at object-craft.com.au  Wed Jun 16 04:25:23 2004
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 16 Jun 2004 12:25:23 +1000
Subject: [Csv] Switch to universal mode? 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<16591.44275.393347.582050@montanaro.dyndns.org> 
References: <16591.44275.393347.582050@montanaro.dyndns.org> 
Message-ID: <20040616022523.0DD873C02E@coffee.object-craft.com.au>

>I've been thinking we should enforce universal mode in the csv module.  I
>think it could simplify the reader a bit (all EOLs become '\n', right?).
>Unfortunately, universal mode is a read-only thing (PEP 278 disallows 'wU'
>though the file object doesn't currently enforce that).  Users would still
>have to open files for writing in binary mode.
>
>Accordingly, I think we should provide a little help for users in the form
>of mode checking and exception raising where possible.  I don't know what
>might be possible for file-like objects (e.g., StringIO) that don't have
>modes.  Does the attached context diff look reasonable?  All it does is
>enforce the relevant modes.  It doesn't attempt to take advantage of the
>'rU' assumption to simplify any code.

I'm not convinced this is necessary or desirable - what will the universal
newline code do to a CR or LF embedded in a quoted field (it's important
to preserve these verbatim)? The resulting simplifications to the parser
are relatively minor, I think.
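
A small sketch of that concern, with the caveat that it uses the newline parameter of Python 3's io module rather than the 'rU' mode under discussion. The assumption is that newline=None behaves like universal mode (every line ending becomes '\n') and newline="" leaves the characters alone. The sample bytes are made up.

```python
import csv
import io

# Raw bytes of one record whose quoted field contains a bare CR.
raw = b'a,"x\ry",b\r\n'

def read_rows(newline):
    # newline=None enables universal-newline translation on input;
    # newline="" still splits lines but passes the characters through.
    text = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=newline)
    return list(csv.reader(text))

translated = read_rows(None)   # embedded CR is rewritten before csv sees it
verbatim = read_rows("")       # embedded CR reaches the parser unchanged
print(translated)
print(verbatim)
```

With translation on, the CR inside the quoted field is no longer recoverable from the parsed row, which is the loss of verbatim data described above.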

Certainly the parser needs some tweaking in this area - I just haven't
had time to get back into it. There were also a bunch of issues raised
some time back regarding GC that we should review.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Wed Jun 16 17:51:56 2004
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 16 Jun 2004 10:51:56 -0500
Subject: [Csv] Switch to universal mode? 
In-Reply-To: <20040616022523.0DD873C02E@coffee.object-craft.com.au>
References: <16591.44275.393347.582050@montanaro.dyndns.org>
        <20040616022523.0DD873C02E@coffee.object-craft.com.au>
Message-ID: <16592.27804.745439.587736@montanaro.dyndns.org>


    Andrew> I'm not convinced this is necessary or desirable - what will the
    Andrew> universal newline code do to a CR or LF embedded in a quoted
    Andrew> field (it's important to preserve these verbatim)? The resulting
    Andrew> simplifications to the parser are relatively minor, I think.

You're right.  Universal newline mode would hose those characters in
different ways on different platforms.  That makes binary mode required.

I still think we should enforce what we need in our code instead of relying
on users to get it right.  Most of the problems I've seen people have go
away when they open the files properly.  Opening files with just "r" or "w"
works properly most of the time, but on occasion doesn't (when the file
winds up containing embedded CR or LF characters).
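
For readers on Python 3, where the csv module works on text rather than bytes, the documented counterpart of "open in binary mode" is to pass newline="" to open(). The following is a sketch under that assumption, not the mode check proposed in the diff; the file name is hypothetical.

```python
import csv
import os
import tempfile

# One record with an embedded CRLF in the middle field.
row = ["a", "line1\r\nline2", "b"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.csv")  # hypothetical file name
    # newline="" stops the file layer from translating line endings,
    # so the csv module alone decides what gets written and read.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(row)
    with open(path, newline="") as f:
        result = list(csv.reader(f))

print(result)
```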

Skip

From andrewm at object-craft.com.au  Thu Jun 17 03:19:59 2004
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 17 Jun 2004 11:19:59 +1000
Subject: [Csv] Switch to universal mode? 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<16592.27804.745439.587736@montanaro.dyndns.org> 
References: <16591.44275.393347.582050@montanaro.dyndns.org>
	<20040616022523.0DD873C02E@coffee.object-craft.com.au>
	<16592.27804.745439.587736@montanaro.dyndns.org> 
Message-ID: <20040617011959.CD1213C02E@coffee.object-craft.com.au>

>You're right.  Universal newline mode would hose those characters in
>different ways on different platforms.  That makes binary mode required.
>
>I still think we should enforce what we need in our code instead of relying
>on users to get it right.  Most of the problems I've seen people have go
>away when they open the files properly.  Opening files with just "r" or "w"
>works properly most of the time, but on occasion doesn't (when the file
>winds up containing embedded CR or LF characters).

I would argue that if your data has odd newline conventions and you care,
then you know about binary mode - otherwise you get what you paid for... 8-)

Yes, the newline handling in the csv module is "lumpy" - but that's
because it's a difficult problem (a non-existent spec, and almost infinite
variety in implementations): there is never going to be a single right
answer.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/