From gnb at itga.com.au Mon Jun 7 06:34:05 2004
From: gnb at itga.com.au (Gregory Bond)
Date: Mon, 07 Jun 2004 14:34:05 +1000
Subject: [Csv] PEP 305
Message-ID: <200406070434.OAA25102@lightning.itga.com.au>

I've a problem that I can't make the new CSV module fix - embedded \r's in
fields. I'm parsing a format that allows \r and \n to be part of a field, if
the field is quoted with "". Looking at Modules/_csv.c, this is probably
impossible.... (Python 2.3.1)

Take the following:

meldev$ cat tcsv.py
import csv

d = 'fld1,fld2,"fld3 ",fld4\r\n'
d2 = 'fld1,fld2,"fld3 \r",fld4\r\n'

r = csv.reader([d, d2])
for f in r:
    print f
meldev$ python tcsv.py
['fld1', 'fld2', 'fld3 ', 'fld4']
Traceback (most recent call last):
  File "tcsv.py", line 9, in ?
    for f in r:
_csv.Error: newline inside string

From andrewm at object-craft.com.au Mon Jun 7 06:47:58 2004
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 07 Jun 2004 14:47:58 +1000
Subject: [Csv] PEP 305
In-Reply-To: Message from Gregory Bond <200406070434.OAA25102@lightning.itga.com.au>
References: <200406070434.OAA25102@lightning.itga.com.au>
Message-ID: <20040607044758.87F173C1CF@coffee.object-craft.com.au>

>I've a problem that I can't make the new CSV module fix - embedded \r's in
>fields. I'm parsing a format that allows \r and \n to be part of a field, if
>the field is quoted with "". Looking at Modules/_csv.c, this is probably
>impossible....

If I remember correctly, you are correct - the current parser won't allow
you to do this.

One thing that became apparent very early on in the life of the
csv parser is that there is no end to the variety of formats that call
themselves CSV! We settled for something as close as we could make it
to Excel's behaviour, with the odd concession to Access, and any other
formats that were "easy", but that still leaves plenty out in the cold.

Now that it's part of the Python core, it's a royal pain in the arse to
change anything, although your change is probably harmless, and we have
plenty of test cases.

Dave - any idea why we disallowed CR within a quoted field?

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From gnb at itga.com.au Mon Jun 7 06:50:27 2004
From: gnb at itga.com.au (Gregory Bond)
Date: Mon, 07 Jun 2004 14:50:27 +1000
Subject: [Csv] PEP 305
In-Reply-To: Your message of Mon, 07 Jun 2004 14:47:58 +1000.
Message-ID: <200406070450.OAA25860@lightning.itga.com.au>

BTW: I posted this as sourceforge bug # 967934

From djc at object-craft.com.au Mon Jun 7 07:10:34 2004
From: djc at object-craft.com.au (Dave Cole)
Date: Mon, 07 Jun 2004 15:10:34 +1000
Subject: [Csv] PEP 305
In-Reply-To: <20040607044758.87F173C1CF@coffee.object-craft.com.au>
References: <200406070434.OAA25102@lightning.itga.com.au> <20040607044758.87F173C1CF@coffee.object-craft.com.au>
Message-ID: <40C3F8CA.1050102@object-craft.com.au>

Andrew McNamara wrote:
>>I've a problem that I can't make the new CSV module fix - embedded \r's in
>>fields. I'm parsing a format that allows \r and \n to be part of a field, if
>>the field is quoted with "". Looking at Modules/_csv.c, this is probably
>>impossible....
>
>
> If I remember correctly, you are correct - the current parser won't allow
> you to do this.
>
> One thing that became apparent very early on in the life of the
> csv parser is that there is no end to the variety of formats that call
> themselves CSV! We settled for something as close as we could make it
> to Excel's behaviour, with the odd concession to Access, and any other
> formats that were "easy", but that still leaves plenty out in the cold.
>
> Now that it's part of the Python core, it's a royal pain in the arse to
> change anything, although your change is probably harmless, and we have
> plenty of test cases.
>
> Dave - any idea why we disallowed CR within a quoted field?

Because I assumed that the only end-of-line related characters were
actually ends of line. I then assumed that you would feed the parser
one line at a time.

I suppose the weak part of this "logic" is when you have data with
different styles of end-of-line characters.

- Dave

--
http://www.object-craft.com.au

From skip at pobox.com Wed Jun 16 04:14:11 2004
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 15 Jun 2004 21:14:11 -0500
Subject: [Csv] Switch to universal mode?
Message-ID: <16591.44275.393347.582050@montanaro.dyndns.org>

I've been thinking we should enforce universal mode in the csv module. I
think it could simplify the reader a bit (all EOLs become '\n', right?).
Unfortunately, universal mode is a read-only thing (PEP 278 disallows 'wU'
though the file object doesn't currently enforce that). Users would still
have to open files for writing in binary mode.

Accordingly, I think we should provide a little help for users in the form
of mode checking and exception raising where possible. I don't know what
might be possible for file-like objects (e.g., StringIO) that don't have
modes. Does the attached context diff look reasonable? All it does is
enforce the relevant modes. It doesn't attempt to take advantage of the
'rU' assumption to simplify any code.

Skip
-------------- next part --------------
A non-text attachment was scrubbed...
Name: _csv.c.diff
Type: application/octet-stream
Size: 2426 bytes
Desc: not available
Url : http://mail.python.org/pipermail/csv/attachments/20040615/c568dd61/attachment.obj

From andrewm at object-craft.com.au Wed Jun 16 04:25:23 2004
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 16 Jun 2004 12:25:23 +1000
Subject: [Csv] Switch to universal mode?
In-Reply-To: Message from Skip Montanaro <16591.44275.393347.582050@montanaro.dyndns.org>
References: <16591.44275.393347.582050@montanaro.dyndns.org>
Message-ID: <20040616022523.0DD873C02E@coffee.object-craft.com.au>

>I've been thinking we should enforce universal mode in the csv module. I
>think it could simplify the reader a bit (all EOLs become '\n', right?).
>Unfortunately, universal mode is a read-only thing (PEP 278 disallows 'wU'
>though the file object doesn't currently enforce that). Users would still
>have to open files for writing in binary mode.
>
>Accordingly, I think we should provide a little help for users in the form
>of mode checking and exception raising where possible. I don't know what
>might be possible for file-like objects (e.g., StringIO) that don't have
>modes. Does the attached context diff look reasonable? All it does is
>enforce the relevant modes. It doesn't attempt to take advantage of the
>'rU' assumption to simplify any code.

I'm not convinced this is necessary or desirable - what will the
universal newline code do to a CR or LF embedded in a quoted field (it's
important to preserve these verbatim)? The resulting simplifications to
the parser are relatively minor, I think.

Certainly the parser needs some tweaking in this area - I just haven't
had time to get back into it. There were also a bunch of issues raised
some time back regarding GC that we should review.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com Wed Jun 16 17:51:56 2004
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 16 Jun 2004 10:51:56 -0500
Subject: [Csv] Switch to universal mode?
In-Reply-To: <20040616022523.0DD873C02E@coffee.object-craft.com.au>
References: <16591.44275.393347.582050@montanaro.dyndns.org> <20040616022523.0DD873C02E@coffee.object-craft.com.au>
Message-ID: <16592.27804.745439.587736@montanaro.dyndns.org>

    Andrew> I'm not convinced this is necessary or desirable - what will the
    Andrew> universal newline code do to a CR or LF embedded in a quoted
    Andrew> field (it's important to preserve these verbatim)? The resulting
    Andrew> simplifications to the parser are relatively minor, I think.

You're right. Universal newline mode would hose those characters in
different ways on different platforms. That makes binary mode required.

I still think we should enforce what we need in our code instead of relying
on users to get it right. Most of the problems I've seen people have go
away when they open the files properly. Opening files with just "r" or "w"
works properly most of the time, but on occasion doesn't (when the file
winds up containing embedded CR or LF characters).

Skip

From andrewm at object-craft.com.au Thu Jun 17 03:19:59 2004
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 17 Jun 2004 11:19:59 +1000
Subject: [Csv] Switch to universal mode?
In-Reply-To: Message from Skip Montanaro <16592.27804.745439.587736@montanaro.dyndns.org>
References: <16591.44275.393347.582050@montanaro.dyndns.org> <20040616022523.0DD873C02E@coffee.object-craft.com.au> <16592.27804.745439.587736@montanaro.dyndns.org>
Message-ID: <20040617011959.CD1213C02E@coffee.object-craft.com.au>

>You're right. Universal newline mode would hose those characters in
>different ways on different platforms. That makes binary mode required.
>
>I still think we should enforce what we need in our code instead of relying
>on users to get it right. Most of the problems I've seen people have go
>away when they open the files properly. Opening files with just "r" or "w"
>works properly most of the time, but on occasion doesn't (when the file
>winds up containing embedded CR or LF characters).

I would argue that if your data has odd newline conventions and you care,
then you know about binary mode - otherwise you get what you paid
for... 8-)

Yes, the newline handling in the csv module is "lumpy" - but that's
because it's a difficult problem (a non-existent spec, and almost infinite
variety in implementations): there is never going to be a single right
answer.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/
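
The concern raised in this thread, that newline translation rewrites a CR
embedded in a quoted field, can be illustrated with a minimal sketch. It
assumes a current Python 3 interpreter rather than the Python 2.3.1 under
discussion, and it uses the newline argument of io.TextIOWrapper as a
stand-in for the 'rU' versus binary-mode choice debated above. The names
RAW and read_rows are hypothetical and exist only for this illustration.

```python
import csv
import io

# Same record shape as d2 in the tcsv.py report: a bare CR inside a quoted field.
RAW = b'fld1,fld2,"fld3 \r",fld4\r\n'


def read_rows(raw, newline):
    """Decode raw bytes with the given newline mode and parse them as CSV."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=newline)
    return list(csv.reader(text))


# newline=None: universal newlines, so every CR or CRLF is translated to LF on read.
translated = read_rows(RAW, None)

# newline="": line endings are recognised but passed through untranslated.
verbatim = read_rows(RAW, "")

print(repr(translated[0][2]))  # the embedded CR has been rewritten to LF
print(repr(verbatim[0][2]))    # the embedded CR survives unchanged
```

Under these assumptions both calls return a single four-field row, but only
the untranslated stream keeps the third field byte-for-byte, which is the
property Andrew describes as important to preserve.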