From robonato at tiscali.it  Wed Oct  1 12:47:32 2003
From: robonato at tiscali.it (Roberto Bonato)
Date: Wed, 01 Oct 2003 12:47:32 +0200
Subject: [Csv] PEP 305
Message-ID: <3F7AB0C4.4090402@tiscali.it>

Hi all


        I'm kind of disappointed by the csv module for Python 2.3.
The following line comes from a .csv file generated by Stockscreener 
Deluxe (moneycentral.msn.com):

"INTC.""Intel 
Corporation"".""1"".""2,07"".""0,22"".""13,00"".""53.669.700"".""28,37"""

the following class:

class deluxe_screener(excel):
    delimiter = '.'
    quotechar = '"'
    doublequote = True

cannot produce as an output anything better than

['INTC', 'Intel Corporation""', '1""', '2,07""', '0,22""', '13,00""', 
'53', '669', '700""', '28,37""']

I'm disappointed by how the double quotes are dealt with, but above all 
by the fact that ""53.669.700"" is split into three separate tokens.

Am I doing something wrong?
Any help is appreciated, thanks

Roberto


From skip at pobox.com  Wed Oct  1 18:29:11 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 1 Oct 2003 11:29:11 -0500
Subject: [Csv] PEP 305
In-Reply-To: <3F7AB0C4.4090402@tiscali.it>
References: <3F7AB0C4.4090402@tiscali.it>
Message-ID: <16251.215.52504.461511@montanaro.dyndns.org>


    Roberto> I'm kind of disappointed by the csv module for Python 2.3.  The
    Roberto> following line comes from a .csv file generated by
    Roberto> Stockscreener Deluxe (moneycentral.msn.com)

    Roberto> "INTC.""Intel Corporation"".""1"".""2,07"".""0,22"".""13,00"".""53.669.700"".""28,37"""

    Roberto> the following class:

    Roberto> class deluxe_screener(excel):
    Roberto>     delimiter = '.'
    Roberto>     quotechar = '"'
    Roberto>     doublequote = True

    Roberto> cannot produce as an output anything better than

    Roberto> ['INTC', 'Intel Corporation""', '1""', '2,07""', '0,22""', '13,00""', 
    Roberto> '53', '669', '700""', '28,37""']

    Roberto> I'm disappointed by how the double quotes are dealt with, but
    Roberto> above all by the fact that ""53.669.700"" is split into three
    Roberto> separate tokens.

    Roberto> Am I doing something wrong?

Roberto,

I'm not sure you're doing anything wrong.  The CSV file looks invalid to me,
even considering that you are using a European locale.  Can you send me
(skip at pobox.com) a CSV file as an attachment so we can be sure it's not
mangled during transmission?

Here's why I think it's invalid.  If the quotechar is '"', that means any
time you have the delimiter or a line break in a field, the field must be quoted.
Furthermore, if the field contains a literal quotechar, it must be doubled.
Accordingly, as you transmitted that row in your message, I see only a
single field.  The first field is opened by the '"' character.  All the
other '"' characters except the last are doubled, meaning they are part of
the field.  The line is closed with a tripled '"', indicating an embedded
quotation mark followed by a '"' to end the field.
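
The single-field reading can be checked with a minimal sketch.  It assumes
Python 3's csv module (the thread itself predates it) and uses the row
exactly as it was transmitted in the message:

```python
import csv

# The row as it appeared in the message.
line = '"INTC.""Intel Corporation"".""1"".""2,07"".""0,22"".""13,00"".""53.669.700"".""28,37"""'

# '.' as delimiter, '"' as quotechar, doubled quotes read as literal quotes.
rows = list(csv.reader([line], delimiter='.', quotechar='"', doublequote=True))

print(len(rows[0]))   # how many fields the parser sees
print(rows[0][0])     # the content of the first field
```

Under these assumptions the parser reports one field, whose content is the
whole row with each doubled quote collapsed to a single one.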

Using the attached CSV file (which I think is correct) and your screener
class, I get

    ['INTC', 'Intel Corporation', '1', '2,07', '0,22', '13,00', '53', '669', '700', '28,37']

which looks fine to me.

-- 
Skip Montanaro
Got gigs? http://www.musi-cal.com/
          http://www.mojam.com/
Got spam? http://spambayes.sf.net/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: intc.csv
Type: application/octet-stream
Size: 59 bytes
Desc: not available
Url : http://mail.python.org/pipermail/csv/attachments/20031001/01248d04/attachment.obj 

From skip at pobox.com  Thu Oct  2 15:36:01 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 2 Oct 2003 08:36:01 -0500
Subject: [Csv] PEP 305
In-Reply-To: <3F7BD250.4030609@tiscali.it>
References: <3F7AB0C4.4090402@tiscali.it>
        <16251.215.52504.461511@montanaro.dyndns.org>
        <3F7BD250.4030609@tiscali.it>
Message-ID: <16252.10689.345360.482898@montanaro.dyndns.org>


(Let's keep csv at mail.mojam.com in the loop.  This is good input for all of
us.) 

    >> Using the attached CSV file (which I think is correct) and your
    >> screener class, I get
    >> 
    >> ['INTC', 'Intel Corporation', '1', '2,07', '0,22', '13,00', '53', '669', '700', '28,37']
    >> 
    >> which looks fine to me.
    >> 
    Roberto> but it doesn't to me, because 53, 669, 700 are not three
    Roberto> different data, but the single number 53669700, only, as you
    Roberto> can see in the following line, is represented with dots as
    Roberto> usual in financial conventions.

I understand that it wasn't quite right.  I had to guess about the quoting.
It's still all wrong.  It's not just that there are extra quotation marks at
the beginning and the end (the ones you stripped), it's that every other
quotation mark is doubled.  The parser only supports a single character
quote character, so they are a problem.

One thing you can do to make life easier is to write a generator function
which sits between the file and the parser.  It will strip the extra quotes
in each line.

I've attached a simple Python script (which requires Python 2.2 or 2.3) that
seems to work correctly, as well as your longs.csv file (with the extra
leading and trailing triple quotes) so the other developers can see it.
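
The attached longs.py is not reproduced in this archive.  As a rough sketch
of the shim idea only (not the attached script), the generator below assumes
Python 3 and assumes every line is a single quoted field wrapped in two extra
quote characters on each side.  The name unwrap_lines and the sample row are
illustrative:

```python
import csv

def unwrap_lines(lines):
    """Yield each line with its outer layer of quoting undone.

    Assumption: every line is one quoted field wrapped in two extra
    quote characters at each end, like the sample row below.
    """
    for line in lines:
        line = line.rstrip('\r\n')
        inner = line[2:-2]              # drop the two extra quotes at each end
        inner = inner[1:-1]             # drop the quotes enclosing the one field
        yield inner.replace('""', '"')  # undo the doubling of embedded quotes

# Illustrative input: one row with the extra leading and trailing quotes.
raw = ['"""INTC.""Intel Corporation"".""1"".""2,07"".""0,22"".""13,00"".""53.669.700"".""28,37"""""\n']

rows = list(csv.reader(unwrap_lines(raw), delimiter='.', quotechar='"'))
print(rows[0])
```

Because the shim only peels off one quoting layer, the inner quotes survive
and the csv module can still use them to keep '53.669.700' together.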

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: longs.csv
Type: application/octet-stream
Size: 7191 bytes
Desc: not available
Url : http://mail.python.org/pipermail/csv/attachments/20031002/ad8c415d/attachment.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: longs.py
Type: application/octet-stream
Size: 658 bytes
Desc: not available
Url : http://mail.python.org/pipermail/csv/attachments/20031002/ad8c415d/attachment-0001.obj 

From skip at pobox.com  Thu Oct  2 15:57:33 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 2 Oct 2003 08:57:33 -0500
Subject: [Csv] PEP 305
In-Reply-To: <3F7C2B7A.8080509@tiscali.it>
References: <3F7AB0C4.4090402@tiscali.it>
        <16251.215.52504.461511@montanaro.dyndns.org>
        <3F7BD250.4030609@tiscali.it>
        <16252.10689.345360.482898@montanaro.dyndns.org>
        <3F7C2B7A.8080509@tiscali.it>
Message-ID: <16252.11981.109016.372280@montanaro.dyndns.org>


    Roberto> One last question: I thought that the "doublequote" flag in the
    Roberto> definition of the Dialect class was supposed to deal with
    Roberto> "dirty" .csv files like mine (regarding the inner double
    Roberto> quotes, not the leading and trailing ones). So what is that
    Roberto> flag useful for?

I may be misremembering, but I believe it tells the parser that the quote
character is doubled when embedded inside a field.  If that's false, the
escapechar field of the dialect must be set to a single-character string.
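
A small sketch of the two conventions, assuming Python 3's csv module.  The
sample rows and the backslash escapechar are illustrative choices, not taken
from Roberto's data:

```python
import csv

# doublequote=True (the default): an embedded quote is written as two quotes.
doubled = ['1,"say ""hi""",2']
row_doubled = next(csv.reader(doubled, doublequote=True))

# doublequote=False: an embedded quote is preceded by the escapechar instead.
escaped = ['1,"say \\"hi\\"",2']
row_escaped = next(csv.reader(escaped, doublequote=False, escapechar='\\'))

print(row_doubled)
print(row_escaped)
```

Both readers should recover the same three fields, with the middle one
containing literal quotation marks around hi.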

Hmmm...  Maybe try this:

    class screener_dialect(csv.excel):
        quotechar = '"'
        delimiter = '.'
        doublequote = False
        escapechar = '"'

Weird, but it might also work.

Skip

From robonato at tiscali.it  Thu Oct  2 15:43:22 2003
From: robonato at tiscali.it (Roberto Bonato)
Date: Thu, 02 Oct 2003 15:43:22 +0200
Subject: [Csv] PEP 305
References: <3F7AB0C4.4090402@tiscali.it>
	<16251.215.52504.461511@montanaro.dyndns.org>
	<3F7BD250.4030609@tiscali.it> <16252.10689.345360.482898@montanaro.dyndns.org>
Message-ID: <3F7C2B7A.8080509@tiscali.it>

Hi Skip

        Thank you very much for your help. I'll try your script; of 
course I had thought about writing that on my own, but this will 
spare me some work.

        One last question: I thought that the "doublequote" flag in the 
definition of the Dialect class was supposed to deal with "dirty" .csv 
files like mine (regarding the inner double quotes, not the leading and 
trailing ones). So what is that flag useful for?

Roberto



From sjmachin at lexicon.net  Fri Oct  3 00:45:53 2003
From: sjmachin at lexicon.net (sjmachin at lexicon.net)
Date: Fri, 03 Oct 2003 08:45:53 +1000
Subject: [Csv] PEP 305
In-Reply-To: <16252.10689.345360.482898@montanaro.dyndns.org>
References: <3F7BD250.4030609@tiscali.it>
Message-ID: <3F7D3741.13262.C0E656@localhost>

The data in longs.csv has suffered a triple-witching, and could be recovered easily by 
reversing the spells:
(1) remove two instances of " from front and back of string
(2) CSV decoding with quote char of " and delimiter = [anything not in string, e.g. TAB character]
(3) normal European CSV decoding with quote char of "" and period/dot as the delimiter

Well easily using my homebrew 'delimited' module anyway :-)

>>> import delimited
>>> guff = '"""INTC.""Intel Corporation"".""1"".""2,07"".""0,22"".""13,00"".""53.669.700"".""28,37"""""'
>>> unpk1 = delimited.unpacker(delimiter="\t")
>>> unpk2 = delimited.unpacker(delimiter=".")
>>> guff2 = guff[2:-2]
>>> guff2
'"INTC.""Intel Corporation"".""1"".""2,07"".""0,22"".""13,00"".""53.669.700"".""28,37"""'
>>> guff3 = unpk1(guff2)
>>> guff3
['INTC."Intel Corporation"."1"."2,07"."0,22"."13,00"."53.669.700"."28,37"']
# interesting that the ticker code (INTC) is *not* quoted
>>> guff4 = unpk2(guff3[0])
>>> guff4
['INTC', 'Intel Corporation', '1', '2,07', '0,22', '13,00', '53.669.700', '28,37']

which appears to be what Roberto expected.
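
The same three steps can be sketched with the standard csv module in place
of the homebrew 'delimited' module.  This assumes Python 3 and reuses the
variable names from the session above:

```python
import csv

guff = '"""INTC.""Intel Corporation"".""1"".""2,07"".""0,22"".""13,00"".""53.669.700"".""28,37"""""'

# (1) remove two quote characters from the front and back of the string
guff2 = guff[2:-2]

# (2) decode once with a delimiter that does not occur in the data (TAB)
guff3 = next(csv.reader([guff2], delimiter='\t', quotechar='"'))

# (3) decode the single resulting field with '.' as the delimiter
guff4 = next(csv.reader([guff3[0]], delimiter='.', quotechar='"'))

print(guff4)
```

Step (2) yields one field, and step (3) splits that field into eight values
with '53.669.700' intact.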



From djc at object-craft.com.au  Fri Oct  3 02:40:35 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 03 Oct 2003 10:40:35 +1000
Subject: [Csv] PEP 305
In-Reply-To: <16252.11981.109016.372280@montanaro.dyndns.org>
References: <3F7AB0C4.4090402@tiscali.it>
	<16251.215.52504.461511@montanaro.dyndns.org>
	<3F7BD250.4030609@tiscali.it>
	<16252.10689.345360.482898@montanaro.dyndns.org>
	<3F7C2B7A.8080509@tiscali.it>
	<16252.11981.109016.372280@montanaro.dyndns.org>
Message-ID: <m3k77nnofw.fsf@echidna.object-craft.com.au>


>     Roberto> One last question: I thought that the "doublequote"
>     Roberto> flag in the definition of the Dialect class was
>     Roberto> supposed to deal with "dirty" .csv files like mine
>     Roberto> (regarding the inner double quotes, not the leading and
>     Roberto> trailing ones). So what is that flag useful for?
> 
> I may be misremembering, but I believe it tells the parser that the
> quote character is doubled when embedded inside a field.  If that's
> false, the escapechar field of the dialect must be set to a
> single-character string.

That is correct.

> Hmmm...  Maybe try this:
> 
>     class screener_dialect(csv.excel):
>         quotechar = '"'
>         delimiter = '.'
>         doublequote = False
>         escapechar = '"'
> 
> Weird, but it might also work.

Looking at the data I am not sure that you can build a bulletproof
parser using the csv module.  The csv parser can only use single
characters for each of the quotechar, delimiter, etc.  The input file
is using two double quotes as the quotechar.  This raises the question
of how the file format would cope with the following field value (as a
Python string):

        'a field ""."" value'

In the parent example, by removing all double quotes you break fields
that contain embedded double quote characters.
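
A related effect shows up in the sample row itself.  The sketch below
assumes Python 3 and a deliberately naive shim that discards every quote
character before parsing; it is not necessarily what the attached script
does:

```python
import csv

line = '"INTC.""Intel Corporation"".""1"".""2,07"".""0,22"".""13,00"".""53.669.700"".""28,37"""'

# Naive shim: throw away every quote character before parsing.
stripped = line.replace('"', '')

# Without quotes, nothing protects a '.' that is part of a value.
fields = next(csv.reader([stripped], delimiter='.'))
print(fields)
```

With the quotes gone, '53.669.700' comes back as three fields, and any
quotation mark that was real data would be lost in the same way.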

Is there any documentation for the file format that would suggest some
pre-processing that could be performed to transform the two character
"quote char" into a single character?

- Dave

-- 
http://www.object-craft.com.au


From skip at pobox.com  Fri Oct  3 15:50:50 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 3 Oct 2003 08:50:50 -0500
Subject: [Csv] PEP 305
In-Reply-To: <m3k77nnofw.fsf@echidna.object-craft.com.au>
References: <3F7AB0C4.4090402@tiscali.it>
        <16251.215.52504.461511@montanaro.dyndns.org>
        <3F7BD250.4030609@tiscali.it>
        <16252.10689.345360.482898@montanaro.dyndns.org>
        <3F7C2B7A.8080509@tiscali.it>
        <16252.11981.109016.372280@montanaro.dyndns.org>
        <m3k77nnofw.fsf@echidna.object-craft.com.au>
Message-ID: <16253.32442.348861.664072@montanaro.dyndns.org>


    Dave> Is there any documentation for the file format that would suggest
    Dave> some pre-processing that could be performed to transform the two
    Dave> character "quote char" into a single character?

Not seeing any docs, I proposed a guess yesterday in the form of a generator
which sits as a shim between the real data and the csv module.  For this
limited example it seems to work, but I agree it's not optimal.  For one
thing, it relies on the fact that the data Roberto posted doesn't actually
use quotation marks as data, so it can simply strip them out.  I suspect a
more sophisticated generator could be written which performs the necessary
voodoo using regular expressions and so forth.

Skip


From skip at pobox.com  Fri Oct  3 16:05:07 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 3 Oct 2003 09:05:07 -0500
Subject: [Csv] fieldnames made optional for csv.DictReader
Message-ID: <16253.33299.376352.477227@montanaro.dyndns.org>

I just checked in a change to the csv.DictReader class.  The fieldnames
argument to the constructor is now optional.  Any time the reader's next()
method is called when self.fieldnames is None, the row read will be assigned
to it and another row returned.  This means the programmer doesn't need to
know the fieldnames ahead of time.
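
A minimal usage sketch, assuming Python 3 (where the built-in next() takes
the place of calling the reader's next() method directly).  The sample data
is hypothetical:

```python
import csv
import io

# Hypothetical sample data; the first row holds the column names.
data = io.StringIO("ticker,name,price\nINTC,Intel Corporation,28.37\n")

reader = csv.DictReader(data)   # no fieldnames argument given
first = next(reader)            # the header row is consumed automatically

print(reader.fieldnames)
print(dict(first))
```

The first call to next() reads the header row into reader.fieldnames and
then returns the first data row as a mapping keyed by those names.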

Skip

From robonato at tiscali.it  Fri Oct  3 16:16:51 2003
From: robonato at tiscali.it (Roberto Bonato)
Date: Fri, 03 Oct 2003 16:16:51 +0200
Subject: [Csv] PEP 305
References: <3F7AB0C4.4090402@tiscali.it>
	<16251.215.52504.461511@montanaro.dyndns.org>	<3F7BD250.4030609@tiscali.it>
	<16252.10689.345360.482898@montanaro.dyndns.org>	<3F7C2B7A.8080509@tiscali.it>
	<16252.11981.109016.372280@montanaro.dyndns.org>
	<m3k77nnofw.fsf@echidna.object-craft.com.au>
Message-ID: <3F7D84D3.8060807@tiscali.it>


Dave Cole wrote:

>Is there any documentation for the file format that would suggest some
>pre-processing that could be performed to transform the two character
>"quote char" into a single character?
>
>- Dave
>
This data was produced by an ActiveX control that you can download (if 
you have Internet Explorer) at the following url:

http://moneycentral.msn.com/articles/common/finderpro.asp

It downloads data about stocks that you select according to user 
criteria; I then used the "export toward excel" function. The (poor) 
result is what I've sent you. I don't think this corresponds to any 
particular standard. I was misled into believing it did because I wrongly 
interpreted the meaning of the "doublequote" flag in the csv module.

Thanks to everybody for your help.

Roberto


From jbauer at rubic.com  Wed Oct 22 15:22:26 2003
From: jbauer at rubic.com (Jeff Bauer)
Date: Wed, 22 Oct 2003 08:22:26 -0500
Subject: [Csv] PEP 305
Message-ID: <3F968492.33076D7D@rubic.com>

Hi Skip.

I was reading PEP 305 and noticed that its status
was listed as "Draft".  It is also listed in the PEP
index as "Open" (under consideration).

Since it is now part of the Python distribution, I
would have thought it finalized, but perhaps there
are still open issues?

Regards,

Jeff