From robonato at tiscali.it Wed Oct 1 12:47:32 2003
From: robonato at tiscali.it (Roberto Bonato)
Date: Wed, 01 Oct 2003 12:47:32 +0200
Subject: [Csv] PEP 305
Message-ID: <3F7AB0C4.4090402@tiscali.it>

Hi all

I'm kind of disappointed by the csv module for Python 2.3. The following
line comes from a .csv file generated by Stockscreener Deluxe
(moneycentral.msn.com)

"INTC.""Intel Corporation"".""1"".""2,07"".""0,22"".""13,00"".""53.669.700"".""28,37"""

the following class:

class deluxe_screener(excel):
    delimiter = '.'
    quotechar = '"'
    doublequote = True

cannot produce as an output anything better than

['INTC', 'Intel Corporation""', '1""', '2,07""', '0,22""', '13,00""', '53', '669', '700""', '28,37""']

I'm disappointed by how the double quotes are dealt with, but above all by
the fact that ""53.669.700"" is split into three separated tokens.

Am I doing something wrong?
Any help is appreciated, thanks

Roberto

From skip at pobox.com Wed Oct 1 18:29:11 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 1 Oct 2003 11:29:11 -0500
Subject: [Csv] PEP 305
In-Reply-To: <3F7AB0C4.4090402@tiscali.it>
References: <3F7AB0C4.4090402@tiscali.it>
Message-ID: <16251.215.52504.461511@montanaro.dyndns.org>

Roberto> I'm kind of disappointed by the csv module for Python 2.3. The
Roberto> following line comes from a .csv file generated by
Roberto> Stockscreener Deluxe (moneycentral.msn.com)

Roberto> "INTC.""Intel Corporation"".""1"".""2,07"".""0,22"".""13,00"".""53.669.700"".""28,37"""

Roberto> the following class:

Roberto> class deluxe_screener(excel):
Roberto>     delimiter = '.'
Roberto>     quotechar = '"'
Roberto>     doublequote = True

Roberto> cannot produce as an output anything better than

Roberto> ['INTC', 'Intel Corporation""', '1""', '2,07""', '0,22""', '13,00""',
Roberto> '53', '669', '700""', '28,37""']

Roberto> I'm disappointed by how the double quotes are dealt with, but
Roberto> above all by the fact that ""53.669.700"" is split into three
Roberto> separated tokens.
Roberto> Am I doing something wrong?

Roberto,

I'm not sure you're doing anything wrong. The CSV file looks invalid to me,
even considering that you are using a European locale. Can you send me
(skip at pobox.com) a CSV file as an attachment so we can be sure it's not
mangled during transmission?

Here's why I think it's invalid. If the quotechar is '"', that means any
time you have a space or the delimiter in a field, the field must be quoted.
Furthermore, if the field contains a literal quotechar, it must be doubled.
Accordingly, as you transmitted that row in your message, I see only a
single field. The first field is opened by the '"' character. All the
other '"' characters except the last are doubled, meaning they are part of
the field. The line is closed with a tripled '"', indicating an embedded
quotation mark followed by a '"' to end the field.

Using the attached CSV file (which I think is correct and uses your
screener object), I get

['INTC', 'Intel Corporation', '1', '2,07', '0,22', '13,00', '53', '669', '700', '28,37']

which looks fine to me.

-- 
Skip Montanaro
Got gigs? http://www.musi-cal.com/ http://www.mojam.com/
Got spam? http://spambayes.sf.net/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: intc.csv
Type: application/octet-stream
Size: 59 bytes
Desc: not available
Url : http://mail.python.org/pipermail/csv/attachments/20031001/01248d04/attachment.obj

From skip at pobox.com Thu Oct 2 15:36:01 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 2 Oct 2003 08:36:01 -0500
Subject: [Csv] PEP 305
In-Reply-To: <3F7BD250.4030609@tiscali.it>
References: <3F7AB0C4.4090402@tiscali.it>
    <16251.215.52504.461511@montanaro.dyndns.org>
    <3F7BD250.4030609@tiscali.it>
Message-ID: <16252.10689.345360.482898@montanaro.dyndns.org>

(Let's keep csv at mail.mojam.com in the loop. This is good input for all of
us.)
>> Using the attached CSV file (which I think is correct and uses your
>> screener object), I get
>>
>> ['INTC', 'Intel Corporation', '1', '2,07', '0,22', '13,00', '53', '669', '700', '28,37']
>>
>> which looks fine to me.
>>

Roberto> but it doesn't to me, because 53, 669, 700 are not three
Roberto> different data, but the single number 53669700, only, as you
Roberto> can see in the following line, is represented with dots as
Roberto> usual in financial conventions.

I understand that it wasn't quite right. I had to guess about the quoting.
It's still all wrong. It's not just that there are extra quotation marks at
the beginning and the end (the ones you stripped), it's that every other
quotation mark is doubled. The parser only supports a single character
quote character, so they are a problem.

One thing you can do to make life easier is to write a generator function
which sits between the file and the parser. It will strip the extra quotes
in each line.

I've attached a simple Python script (which requires Python 2.2 or 2.3) that
seems to work correctly, as well as your longs.csv file (with the extra
leading and trailing triple quotes) so the other developers can see it.

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: longs.csv
Type: application/octet-stream
Size: 7191 bytes
Desc: not available
Url : http://mail.python.org/pipermail/csv/attachments/20031002/ad8c415d/attachment.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: longs.py
Type: application/octet-stream
Size: 658 bytes
Desc: not available
Url : http://mail.python.org/pipermail/csv/attachments/20031002/ad8c415d/attachment-0001.obj

From skip at pobox.com Thu Oct 2 15:57:33 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 2 Oct 2003 08:57:33 -0500
Subject: [Csv] PEP 305
In-Reply-To: <3F7C2B7A.8080509@tiscali.it>
References: <3F7AB0C4.4090402@tiscali.it>
    <16251.215.52504.461511@montanaro.dyndns.org>
    <3F7BD250.4030609@tiscali.it>
    <16252.10689.345360.482898@montanaro.dyndns.org>
    <3F7C2B7A.8080509@tiscali.it>
Message-ID: <16252.11981.109016.372280@montanaro.dyndns.org>

Roberto> One last question: I thought that the "doublequote" flag in the
Roberto> definition of the Dialect class was supposed to deal with
Roberto> "dirty" .csv files like mine (regarding the inner double
Roberto> quotes, not the leading and trailing ones). So what is that
Roberto> flag useful for?

I may be misremembering, but I believe it tells the parser that the quote
character is doubled when embedded inside a field. If that's false, the
escapechar field of the dialect must be set to a single-character string.

Hmmm... Maybe try this:

    class screener_dialect(csv.excel):
        quotechar = '"'
        delimiter = '.'
        doublequote = False
        escapechar = '"'

Weird, but it might also work.

Skip

From robonato at tiscali.it Thu Oct 2 15:43:22 2003
From: robonato at tiscali.it (Roberto Bonato)
Date: Thu, 02 Oct 2003 15:43:22 +0200
Subject: [Csv] PEP 305
References: <3F7AB0C4.4090402@tiscali.it>
    <16251.215.52504.461511@montanaro.dyndns.org>
    <3F7BD250.4030609@tiscali.it>
    <16252.10689.345360.482898@montanaro.dyndns.org>
Message-ID: <3F7C2B7A.8080509@tiscali.it>

Hi Skip

thank you very much for your help, I'll try and use your script. Of course I
had thought about writing that on my own, but this will spare me some work.

One last question: I thought that the "doublequote" flag in the definition
of the Dialect class was supposed to deal with "dirty" .csv files like mine
(regarding the inner double quotes, not the leading and trailing ones). So
what is that flag useful for?

Roberto

Skip Montanaro wrote:

>(Let's keep csv at mail.mojam.com in the loop. This is good input for all of
>us.)
>
> >> Using the attached CSV file (which I think is correct and uses your
> >> screener object), I get
> >>
> >> ['INTC', 'Intel Corporation', '1', '2,07', '0,22', '13,00', '53', '669', '700', '28,37']
> >>
> >> which looks fine to me.
> >>
> Roberto> but it doesn't to me, because 53, 669, 700 are not three
> Roberto> different data, but the single number 53669700, only, as you
> Roberto> can see in the following line, is represented with dots as
> Roberto> usual in financial conventions.
>
>I understand that it wasn't quite right. I had to guess about the quoting.
>It's still all wrong. It's not just that there are extra quotation marks at
>the beginning and the end (the ones you stripped), it's that every other
>quotation mark is doubled. The parser only supports a single character
>quote character, so they are a problem.
>
>One thing you can do to make life easier is to write a generator function
>which sits between the file and the parser. It will strip the extra quotes
>in each line.
>
>I've attached a simple Python script (which requires Python 2.2 or 2.3) that
>seems to work correctly, as well as your longs.csv file (with the extra
>leading and trailing triple quotes) so the other developers can see it.
>
>Skip
>
>
>

From sjmachin at lexicon.net Fri Oct 3 00:45:53 2003
From: sjmachin at lexicon.net (sjmachin at lexicon.net)
Date: Fri, 03 Oct 2003 08:45:53 +1000
Subject: [Csv] PEP 305
In-Reply-To: <16252.10689.345360.482898@montanaro.dyndns.org>
References: <3F7BD250.4030609@tiscali.it>
Message-ID: <3F7D3741.13262.C0E656@localhost>

The data in longs.csv has suffered a triple-witching, and could be recovered
easily by reversing the spells:

(1) remove two instances of " from front and back of string
(2) CSV decoding with quote char of " and delimiter = [anything not in
    string, e.g. TAB character]
(3) normal European CSV decoding with quote char of "" and period/dot as the
    delimiter

Well easily using my homebrew 'delimited' module anyway :-)

>>> import delimited
>>> guff = '"""INTC.""Intel Corporation"".""1"".""2,07"".""0,22"".""13,00"".""53.669.700"".""28,37"""""'
>>> unpk1 = delimited.unpacker(delimiter="\t")
>>> unpk2 = delimited.unpacker(delimiter=".")
>>> guff2 = guff[2:-2]
>>> guff2
'"INTC.""Intel Corporation"".""1"".""2,07"".""0,22"".""13,00"".""53.669.700"".""28,37"""'
>>> guff3 = unpk1(guff2)
>>> guff3
['INTC."Intel Corporation"."1"."2,07"."0,22"."13,00"."53.669.700"."28,37"']
# interesting that the ticker code (INTC) is *not* quoted
>>> guff4 = unpk2(guff3[0])
>>> guff4
['INTC', 'Intel Corporation', '1', '2,07', '0,22', '13,00', '53.669.700', '28,37']

which appears to be what Roberto expected.

> (Let's keep csv at mail.mojam.com in the loop. This is good input for
> all of us.)
>
> >> Using the attached CSV file (which I think is correct and uses your
> >> screener object), I get
> >>
> >> ['INTC', 'Intel Corporation', '1', '2,07', '0,22', '13,00', '53', '669', '700', '28,37']
> >>
> >> which looks fine to me.
>
> Roberto> but it doesn't to me, because 53, 669, 700 are not three
> Roberto> different data, but the single number 53669700, only, as you
> Roberto> can see in the following line, is represented with dots as
> Roberto> usual in financial conventions.
>
> I understand that it wasn't quite right. I had to guess about the
> quoting. It's still all wrong. It's not just that there are extra
> quotation marks at the beginning and the end (the ones you stripped),
> it's that every other quotation mark is doubled. The parser only
> supports a single character quote character, so they are a problem.
>
> One thing you can do to make life easier is to write a generator
> function which sits between the file and the parser. It will strip
> the extra quotes in each line.
>
> I've attached a simple Python script (which requires Python 2.2 or
> 2.3) that seems to work correctly, as well as your longs.csv file
> (with the extra leading and trailing triple quotes) so the other
> developers can see it.
>
> Skip
>
>

From djc at object-craft.com.au Fri Oct 3 02:40:35 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 03 Oct 2003 10:40:35 +1000
Subject: [Csv] PEP 305
In-Reply-To: <16252.11981.109016.372280@montanaro.dyndns.org>
References: <3F7AB0C4.4090402@tiscali.it>
    <16251.215.52504.461511@montanaro.dyndns.org>
    <3F7BD250.4030609@tiscali.it>
    <16252.10689.345360.482898@montanaro.dyndns.org>
    <3F7C2B7A.8080509@tiscali.it>
    <16252.11981.109016.372280@montanaro.dyndns.org>
Message-ID:

> Roberto> One last question: I thought that the "doublequote"
> Roberto> flag in the definition of the Dialect class was
> Roberto> supposed to deal with "dirty" .csv files like mine
> Roberto> (regarding the inner double quotes, not the leading and
> Roberto> trailing ones). So what is that flag useful for?
>
> I may be misremembering, but I believe it tells the parser that the
> quote character is doubled when embedded inside a field. If that's
> false, the escapechar field of the dialect must be set to a
> single-character string.

That is correct.

> Hmmm... Maybe try this:
>
>     class screener_dialect(csv.excel):
>         quotechar = '"'
>         delimiter = '.'
>         doublequote = False
>         escapechar = '"'
>
> Weird, but it might also work.
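
To make the doublequote and escapechar settings discussed in the quoted text
concrete, here is a small sketch using csv.reader from Python 3 (not code
from this thread). The helper name parse and both sample lines are made up
for illustration, and a backslash is assumed as the escape character. Both
calls should yield the same two fields.

```python
import csv


def parse(line, **fmt):
    """Parse one line with '.' as the delimiter and '"' as the quote character."""
    return next(csv.reader([line], delimiter=".", quotechar='"', **fmt))


# doublequote=True: a quote inside a quoted field is written as two quotes.
doubled = parse('"say ""hi""".x', doublequote=True)

# doublequote=False: the embedded quote is marked by the escape character instead.
escaped = parse('"say \\"hi\\"".x', doublequote=False, escapechar="\\")

print(doubled)
print(escaped)
```

In both cases the flag describes how a single-character quote is embedded
in a field. It does not make the parser accept a two-character quote such as
the one in the Stockscreener export.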

Looking at the data I am not sure that you can build a bulletproof
parser using the csv module. The csv parser can only use single
characters for each of the quotechar, delimiter, etc. The input file
is using two double quotes as the quotechar.

This raises the question of how the file format would cope with the
following field value (as a Python string):

    'a field ""."" value'

In the parent example, by removing all double quotes you break fields
that contain embedded double quote characters.

Is there any documentation for the file format that would suggest some
pre-processing that could be performed to transform the two character
"quote char" into a single character?

- Dave

-- 
http://www.object-craft.com.au

From skip at pobox.com Fri Oct 3 15:50:50 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 3 Oct 2003 08:50:50 -0500
Subject: [Csv] PEP 305
In-Reply-To:
References: <3F7AB0C4.4090402@tiscali.it>
    <16251.215.52504.461511@montanaro.dyndns.org>
    <3F7BD250.4030609@tiscali.it>
    <16252.10689.345360.482898@montanaro.dyndns.org>
    <3F7C2B7A.8080509@tiscali.it>
    <16252.11981.109016.372280@montanaro.dyndns.org>
Message-ID: <16253.32442.348861.664072@montanaro.dyndns.org>

Dave> Is there any documentation for the file format that would suggest
Dave> some pre-processing that could be performed to transform the two
Dave> character "quote char" into a single character?

Not seeing any docs, I proposed a guess yesterday in the form of a generator
which sits as a shim between the real data and the csv module. For this
limited example it seems to work, but I agree it's not optimal. For one
thing, it relies on the fact that the data Roberto posted doesn't actually
use quotation marks as data, so it can simply strip them out. I suspect a
more sophisticated generator could be written which performs the necessary
voodoo using regular expressions and so forth.
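
The longs.py attachment was scrubbed from the archive, so its contents are
unknown. As a rough sketch of what a more careful shim might look like, the
following assumes Python 3's standard csv module and that every line carries
exactly two extra quote characters at each end, as in the longs.csv sample
quoted earlier in the thread. Rather than stripping every quote, it decodes
each line twice: once as a single quoted field (with TAB as a delimiter that
is assumed never to occur in the data), then again with '.' as the delimiter.
The names unwrap and decode are hypothetical.

```python
import csv

RAW = '"""INTC.""Intel Corporation"".""1"".""2,07"".""0,22"".""13,00"".""53.669.700"".""28,37"""""'


def unwrap(lines, extra=2):
    """Yield each non-blank line with `extra` characters cut from both ends."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.strip():
            yield line[extra:len(line) - extra]


def decode(lines):
    """Undo both layers of quoting and yield one list of fields per line."""
    # Pass 1: each line is one quoted field; doubled quotes collapse to single ones.
    outer = csv.reader(unwrap(lines), delimiter="\t", quotechar='"', doublequote=True)
    for (inner,) in outer:
        # Pass 2: ordinary parsing of the recovered text with '.' as the delimiter.
        yield next(csv.reader([inner], delimiter=".", quotechar='"', doublequote=True))


rows = list(decode([RAW]))
print(rows[0])
# ['INTC', 'Intel Corporation', '1', '2,07', '0,22', '13,00', '53.669.700', '28,37']
```

Because the quotes are decoded rather than deleted, the dots inside
53.669.700 stay within one field. Whether every row of a real export follows
the same pattern is untested here.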

Skip

From skip at pobox.com Fri Oct 3 16:05:07 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 3 Oct 2003 09:05:07 -0500
Subject: [Csv] fieldnames made option for csv.DictReader
Message-ID: <16253.33299.376352.477227@montanaro.dyndns.org>

I just checked in a change to the csv.DictReader class. The fieldnames
argument to the constructor is now optional. Any time the reader's next()
method is called when self.fieldnames is None, the row read will be assigned
to it and another row returned. This means the programmer doesn't need to
know the fieldnames ahead of time.

Skip

From robonato at tiscali.it Fri Oct 3 16:16:51 2003
From: robonato at tiscali.it (Roberto Bonato)
Date: Fri, 03 Oct 2003 16:16:51 +0200
Subject: [Csv] PEP 305
References: <3F7AB0C4.4090402@tiscali.it>
    <16251.215.52504.461511@montanaro.dyndns.org>
    <3F7BD250.4030609@tiscali.it>
    <16252.10689.345360.482898@montanaro.dyndns.org>
    <3F7C2B7A.8080509@tiscali.it>
    <16252.11981.109016.372280@montanaro.dyndns.org>
Message-ID: <3F7D84D3.8060807@tiscali.it>

Dave Cole wrote:

>Is there any documentation for the file format that would suggest some
>pre-processing that could be performed to transform the two character
>"quote char" into a single character?
>
>- Dave
>

This data was produced by an ActiveX control that you can download (if you
have Internet Explorer) at the following url:

http://moneycentral.msn.com/articles/common/finderpro.asp

It downloads data about stocks that you select according to user criteria,
then I used the "export toward excel" function. The (poor) result is what
I've sent you; I don't think this corresponds to a particular standard. I was
misled into believing that because I wrongly interpreted the meaning of the
"doublequote" flag in the csv module.

Thanks to everybody for your help.

Roberto

From jbauer at rubic.com Wed Oct 22 15:22:26 2003
From: jbauer at rubic.com (Jeff Bauer)
Date: Wed, 22 Oct 2003 08:22:26 -0500
Subject: [Csv] PEP 305
Message-ID: <3F968492.33076D7D@rubic.com>

Hi Skip.

I was reading PEP 305 and noticed that its status was listed as "Draft". It
is also listed in the PEP index as "Open" (under consideration).

Since it is now part of the Python distribution, I would have thought it
finalized, but perhaps there are still open issues?

Regards,
Jeff
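
Regarding the csv.DictReader change announced earlier in this thread, here is
a minimal sketch of reading a file without passing fieldnames. It assumes
Python 3, where rows are fetched by iterating over the reader rather than by
calling a next() method directly. The column names and the single data row
are invented for illustration.

```python
import csv
import io

# Hypothetical two-column file; the first row supplies the field names.
data = io.StringIO("ticker,name\nINTC,Intel Corporation\n")

reader = csv.DictReader(data)  # no fieldnames argument given
rows = list(reader)

print(reader.fieldnames)
print(rows[0]["name"])
```

The header row is consumed as the field names and is not returned as data,
which matches the behavior described in the announcement.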